rule refinement in inductive knowledge based systems


Published on

rule refinement in knowledge bases created through inductive methods

Published in: Education, Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

rule refinement in inductive knowledge based systems

  1. 1. RULE REFINEMENT IN INDUCTIVE KNOWLEDGE-BASED SYSTEMS by Mohamed Arteimi Department of Computer Science Witwatersrand University Johannesburg, 1999
  2. 2. Abstract This paper presents empirical methods for enhancing the accuracy of inductive learning systems. It addresses the problems of: learning propositional production rules in multi-class classification tasks in noisy domains, maintaining continuous learning when confronted with new situations after the learning phase is completed, and classifying an object when no rule is satisfied for it. It is shown that interleaving the learning and performance-evaluation processes, allows accurate classifications to be made on real-world data sets. The paper presents the system ARIS which implements this approach, and it is shown that the resulting classifications are often more accurate than those made by the non-refined knowledge bases. The core design decision that lies behind ARIS is that it employs an ordering of the rules according to their weight. A rule’s weight is learned by using Bayes’ theorem to calculate weights for the rule’s conditions and to combine them. This model focuses the analysis of the knowledge base and assists the refinement process significantly. The system is non-interactive, it relies on heuristics to focus the refinement on those experiments that appear to be most consistent with the refinement data set. The design framework of ARIS consists of a tabular model for expressing rule weights, and the relationship between refinement cases and the rules satisfied for each case to focus the refinement process. The system has been used to refine knowledge bases created by ARIS itself, as well as to refine knowledge bases created by the RIPPER and C4.5 systems [6,25] in ten selected domains. Two main advantages have been observed. First, the ability to gradually improve the knowledge base as the refinement proceeds. Second, is the ability to learn strong rules utilising condition’s weight using minimum covering algorithm. Thus, by taking account of this information we improve the applicability and quality of refinement. Keywords: knowledge-base refinement; inductive learning. 1. Introduction The growth in database technology and computer communication have resulted in the creation of huge, efficient data stores in almost every domain. For example, credit-card transactions, medical images, and large-scale sky surveys are all stored in massive and ever growing databases. These streams of data need to be analysed to extract higher-level information that could be useful for decision making and for understanding the process generating the data. For large systems of this type, we need automated methods that acquire decision-making knowledge. Humans attempt to understand their environment by using a simplification of this environment (called a model). This model represents events in the environment, and similarities among objects [12]. Similar objects are grouped in classes and rules are constructed to predict the behaviour of new objects of such a 1
  3. 3. class. In machine learning we try to automate this process and construct class descriptions (models) such as a decision tree or a set of production rules using an iterative search strategy on a library of examples. This is known as inductive learning. There have been various methods for learning rules inductively by analysing a set of training examples to extract interesting and relevant information from the databases. Among these are RIPPER [6], and C4.5 [25]. This information is higher-level knowledge in the form of class descriptions that are generalisations about what it means to be a member of a class or a category. This gives us the ability to make inferences that are true for most or all members of the class. In this paper we present a method for improving or refining existing knowledge bases by using a refinement library of examples. Section 2 gives an overview of two existing refinement systems, and inductive learning systems. Section 3 describes the ARIS knowledge-base refinement system. The application of ARIS to knowledge bases generated by C4.5 and RIPPER systems is described in Section 4. Section 5 compares inductive refinement and incremental learning. Section 6 concludes with discussion and future enhancements. 2 Related Work 2.1 Inductive systems There has been a relatively long history of induction to solve complex problems, by creating a model of the problem domain, and inferring from that model instead of from the domain itself. The constructed model is a particularly useful tool if it is possible to perform classification by simple, easy-to-understand tests, whose semantics are intuitively clear, such as decision trees [3,5,9] and production rules [1,6,25]. Learning algorithms vary in: - the way in which they search the training data set, - the way in which they generalise, - the way they represent class descriptions, and - the way in which they cope with errors and noise in the training data. Generally, we can classify concept learners into two categories depending on the learning strategy involved. S Simultaneous covering algorithms At each step in the search, the algorithm selects among alternative attributes by comparing the partitions of the data that each attribute generates. Decision-tree algorithms belong to this category, examples include ID3 [24], C4.5 [25], and CART [3]. These algorithms use an overfit-and-simplify learning strategy involving post pruning of the generated trees. 2
  4. 4. S Sequential covering algorithms Here, the algorithm chooses among alternative attribute-value pairs to include in rules by computing the subsets of data they cover. Direct rule learning algorithms such as RIPPER [6], and HYDRA [1] belong to this category. Two variations on this approach have been explored: 1. General-to-specific search An approach to learning one rule at a time is to organise the search in the hypothesis space in the same way as in simultaneous covering algorithms, but to follow only the most promising branch in the tree at each step. The search starts by considering the most general condition possible (i.e. the empty test that matches every instance), then adds the attribute test that most improves rule performance measured over the training data. Once this test has been added, the process is repeated by adding a second attribute test, and so on. This is different from simultaneous covering approach since at each step only a single descendent is followed, rather than growing a subtree that covers all possible values of the selected attributes. Some learning systems such as BEXA [27] extend this approach to maintain a list of the k best attribute tests at each step, rather than a single best attribute test. 2. Specific-to-general search Another way to organise the search space is to start the search process beginning with the most specific rule and gradually generalise to cover more positive cases [18]. One way to see the significance of the difference between the above two approaches is to compare the number of distinct choices made by the two algorithms in order to learn the same set of rules. To learn a set of n rules, each containing k attribute-value tests in their conditions, sequential covering algorithms will perform nk primitive search steps, making an independent decision to select each condition of each rule. In contrast, simultaneous covering algorithms will make many fewer independent choices, because each choice of a decision node in the decision tree corresponds to choosing the condition for the multiple rules associated with that node. In other words, if the decision node tests an attribute that has m possible values, the choice of the decision node corresponds to choosing a condition for each of the m corresponding rules. Thus, sequential covering algorithms make a larger number of independent choices than simultaneous covering algorithms. The other difference is that in simultaneous covering algorithms different rules test the same attribute while in sequential covering algorithms this behaviour is not required. 3
  5. 5. 2.2 Refinement systems Previous knowledge-base refinement systems use knowledge bases elicited from experts in a fairly well understood domains. The knowledge base is believed to be complete and only minor tweaking is required. This section describes two such refinement systems SEEK [22] and KRUST [7]. They are the forerunners of the ARIS refinement algorithm. The statistical approach of SEEK is the basis for ARIS’s rating of rules and imbedded hypothesis. SEEK uses statistical evidence to apply suitable refinements. The refinements are: S Generalise a rule which will fire if it is enabled. S Specialise the error-causing rule to allow an enabled correct rule to fire instead. The rules in SEEK are represented in the EXPERT language but expressed in a tabular format that divides the findings associated with a conclusion into two categories; major and minor, and the level of confidence associated with a conclusion into three categories: definite, probable, and possible. A list of major and minor symptoms is defined for each conclusion that the system can reach. For example, major symptoms for connective tissue disease (one of the rheumatic diseases) include swollen hands and severe myositis, while examples of minor symptoms include anemia and mild myositis. The rule changes include the following operators: 1) Change the certainty of a diagnosis, 2) Increase/decrease the number of majors or minors (i.e. suggestive symptoms), 3) Add/delete an exclusion (i.e. a symptom that should not be present), 4) Add/delete a requirement (i.e. a necessary symptom) and, 5) Change the conclusion (RHS) of a rule. The rule changes do not include numerical range changes. SEEK requires the involvement of an expert to decide on whether or not a suggested change by SEEK can be implemented. The main criticism of SEEK is that it uses a criterion table for representing the hypotheses. This works well in the domain of Rheumatoid diseases, but may be difficult to implement in other domains. SEEK is also unable to add new rules. Politakis [22] gives a method to integrate condition weights and rule weights for classification. The method used relies on a very specific set of weights regarding the relationship between a set of conditions and a conclusion. The condition weight can either be minor or major while the weight of a rule can have a value in the set {possible, probable, definite} depending on the combination of conditions used in a rule. This technique is strongly connected to our implementation of condition and rule weights as will be discussed in Section 3. The proposed scheme in this paper is dynamically adjusted with respect to a probabilistic estimate (depending on the data distribution) and it is defined on a continuous scale, while in SEEK system an expert is required to assign weights for both conditions and rules. 4
  6. 6. The use of an exclusion component in SEEK is another important aspect that is similar to the proposed approach. We use an UNLESS clause to rule out a conclusion for a particular situation taking care of special cases in the domain of interest. The KRUST system [7] refines a knowledge base when there is a conflict between the expert’s conclusion and the system’s conclusion for a particular test case. This test case is one of a bank of tasksolution pairs used by KRUST. KRUST normally outputs a single knowledge base, based on the performance of a set of proposed refined knowledge bases on all the task-solution pairs. In the event of a tie for best knowledge base, or if the original knowledge base is judged best, the expert is consulted. However, unlike SEEK, KRUST does not require the involvement of an expert during refinement. KRUST follows the “first enabled is fired” conflict resolution strategy, and its refinement operations are: S Generalise a rule, S Specialise a rule, S Change the priority of a rule, S Add a new rule for the misdiagnosed case. KRUST’s generation of a new rule is naïve [7], and needs to be improved. Again KRUST believes that the original knowledge base is complete, and no redundant rules exist. Hence, there is no redundancy checking to remove undesirable rules. Moreover, there is no way to handle or exclude special cases as is the case with SEEK. KRUST’s approach in generating many refinements is based on the realisation that the application of a variety of refinement operations could lead to knowledge bases that have different performances. The research in this paper emphasises the need for trying all possible refinement operations that are applicable for the observed misdiagnoses and to choose one that best improves the knowledge-base performance. 3. The ARIS inductive refinement system This section is devoted to a description of the mechanisms involved in representing/creating the knowledge base, improving the knowledge base, and the associated weighting schemes. This is implemented in a system called ARIS (standing for Automated Refinement of Inductive Systems) which refines propositional production rules. The architecture of ARIS is shown in Figure 1, and the main features are: • a rule weighting scheme for providing rule priorities and ordering, • ranking refinement of cases by combining the weights of rules satisfied for each case, • ability to classify an object, when no rule is satisfied, through partially satisfied rules, and • ability to perform all refinement operations on continuous and discrete attributes. 5
  7. 7. Refinement data Test data Rule generator Rule classifier Rule Tree generator Training data (inference engine) base Ordered rules Refinement generator Prediction result Classifier Refined rule set Refined knowledge base Refinement module Architecture Figure 1. The ARIS architecture ARIS initially generates a knowledge base using induction on a set of training examples, it then proceeds to test the knowledge base on a separate set of testing data for refinement purposes called the refinement data set. It is only after this testing, and if some cases in the refinement set are mis-classified, that the refinement subsystem is invoked. Finally, the system is tested on a separate testing data set for evaluation of the refinement process. The refinement subsystem identifies possible errors in the knowledge base, and calls on a library of operators to develop possible refinements, guided by a global heuristic. The best refinement is implemented, and the process repeats until no further refinements are possible. A knowledge-based refinement system is successful to the extent that it can improve the accuracy of its initial knowledge base on new and unseen cases and produce a more accurate and comprehensible knowledge base than the initial knowledge base. ARIS performs a search through a space of both specialisation and generalisation operators in an attempt to find a minimal refinement to a knowledge base. Conceptually, the refinement subsystem has three main phases, two of them are executed for every hypothesis present in the knowledge base, while keeping the rules ordered with respect to their weights: Phase 1: (Localisation process) During the first phase, all misdiagnosed cases from the refinement set belonging to a particular hypothesis (class) are identified. Each case receives a weight from the rules satisfied for the case. This indicates the rule overlapping at this point (case) in hypothesis space. The case that has the highest absolute weight among the misdiagnosed cases is selected, since it points to the strongest rule from the set of erroneous rules (i.e. with highest weight) and frequently many rules overlap at this point. 6
  8. 8. Phase 2: (Refinement generation, verification and selection) During this phase, the rule responsible for the misdiagnoses is determined, and all possible refinement operations applicable are tried. Namely; the erroneous rule is specialised, a similar rule reaching the intended class is generalised, and a new rule is created. The resultant performance is stored for each refinement operator and finally the refinement operator or group of operators that produce the best performance is chosen. The process is repeated until no further improvement is possible. Phase 3: (Check completeness and remove superfluous rules) Finally, the knowledge base is checked for incompleteness. Each case must be covered by at least one rule. If there are cases that are not covered by the existing rules, new rules are created. Moreover, any superfluous rules are removed. Figure 2 demonstrates the algorithm. The main components of ARIS are: a tree generator, a rule generator, a refinement generator, a judgement module, and an inference engine. The refinement generator is responsible for applying all possible refinements to remedy the misdiagnoses. The set of refinements reflect how the rules interact. Rules can be changed by allowing them to fire (called enabling), or preventing them from firing (called disabling). The judgement module selects the best refinement operator or group of operators which result in the best improvement of knowledge-base performance while correcting the misdiagnosed cases. 7
  9. 9. Algorithm : ARIS Inputs: A set of training data A set of refinement data, A set of test data, Optionally, a set of rules of the form: IF conditions THEN hypothesis Output: Refined rule set of either IF conditions THEN hypothesis Or IF conditions THEN hypothesis UNLESS Conditions First: If no rules are given generate a set of production rules by processing the training data, Second: 1. for hypothesis ←each hypothesis in the hypothesis space {Identify rules that are not consistent with refinement data set. These may at later stages be specialised but will not be generalised. } • Rank rules with respect to their weights. • For each misdiagnosed case in the refinement set, calculate combined weight of rules satisfied for the case (see Section 3.1). • Select a case that has the highest absolute weight among the misdiagnosed cases. • Refine the knowledge base for the selected case (i.e. missed case) by generating all possible refinements. • Select the best refinement, and implement it. End for 2. • Maintain knowledge-base completeness, by guaranteeing that each case is covered by at least one rule, use a minimum covering algorithm to generate missing rules (see Section 3.4). • Prune all superfluous rules. Figure 2: The ARIS algorithm 3.1 Learning condition weights Feature weights have been used in case-based learning [14,19] to denote the relevance of features in similarity computations and to guide the selection of relevant features. The basic approach involves storing cases and their associated classes in memory and then, when given a test instance, the task 8
  10. 10. becomes one of finding the training cases nearest to that test case and using them to predict the class of the test instance. Several methods for constructing feature weights have been proposed, based on either a naïve distance metric that weighs attributes equally [28] or a weighted Euclidean metric in which the weight for each feature comes from a small set of options, for example [0,0.1,0.2,…,1], where 0 means that the feature is irrelevant, and higher weight denotes greater relevance such as in [19]. Some methods cannot handle numeric features directly, and have to discretise numeric values by classification; others have to convert multi-valued symbolic features to a set of binary features with respect to the number of its values (i.e. a symbolic feature with N values is mapped into N-1 binary features) such as quantification method II [19]. Politakis [22] used a domain specific scale for weighing conditions and rules. His approach can be summarised as follows: The conditions in a rule are grouped under the following categories: majors considered to be very important conditions, minors less specific but suggestive for reaching a particular conclusion, exclusions conditions that must not be satisfied in order to reach a given conclusion. The certainty in the conclusion of a rule is measured by the confidence level it reaches. There are three levels of confidence: (1) Possible, (2) Probable, and (3) Definite. Different combinations of conditions could result in reaching the same conclusion but with different certainty. In contrast, the only source of information we have is the training data. In this research, conditional probability theory is used to take advantage of the data distribution in the training set. The intention is to measure the importance of a condition to a particular class, through the measure of change in belief in the class with respect to that condition. This poses the following question. For a particular hypothesis h, if given a piece of evidence e, how will the relation between h and e change? To formalise the concept, let us consider a sample space that is partitioned by events E1,E2,…En. Let H+ be an event in the space denoting a certain class or a concept with probability of P(H+)>0; Then, by Bayes’ Theorem + P( E j | H ) = P( H + | E j )P( E j ) n ∑ P( H + | E i )P( E i ) i =1 For a simple problem, with two conditions, and hypotheses H+ and HP( E 1 | H + ) = P( H + | E 1 )P( E 1 ) P( H + | E 1 )P( E 1 ) + P( H + | E 2 )P( E 2 ) 9
  11. 11. = = P( H+ | E1 )P( E1 ) P( H+ | E1 )P( E1 ) + P( H+ |~ E1 )P(~ E1 ) P( H+ | E1 )P( E1 ) P( H+ | E1 )P( E1 ) + P( H+ |~ E1 )(1 − P( E1 )) We can rewrite this as P( E 1 | H + ) = QA × P( E 1 ) (1) Where QA = P( H + | E1 ) P( H + | E1 )P( E1 ) + P( H + |~ E1 )( 1 − P( E1 )) Or using the terminology of Table 2. QA = X 1 X 3 X + X 1 2 (1 − X 3 ) If we take the logarithm of QA, the incremental nature of the updating process of the evidence on the condition of the hypothesis becomes apparent. In other words, the posterior probability P(E1|H+) in Equation 1 above, is dependent on the factor QA. If QA is greater than one, the posterior probability will be greater than the prior probability P(E1); otherwise, if QA is less than one, it would result in P(E1|H+) less than P(E1). We can thus view this quantity as a weight, carried by the evidence E1, which sways the belief one way or the other. The weight QA calculated as above ranges from zero to infinity; however we would like to map the values of the evidence weight within the interval [-1,+1] with a value of 1 for QA mapping to 0 (central), so that it is easier to distinguish whether the given evidence is in favour of the particular hypothesis or not. In other words positive weight denotes supporting evidence, and negative weight indicates opposing evidence. Therefore the following function is used to map the value to the desired range W=F(QA) Where 2 QA −1 1 + QA Another candidate for this transformation [23] is: F ( QA ) = F (QA) = 1 e −QA +1 −1 Some other possible forms have been suggested in the literature for this mapping [13]. 10
  12. 12. Positive weight indicates a supporting evidence for the hypothesis, and negative weight indicates opposing evidence against the hypothesis. Symbol Description Ei Indicates an event or evidence (eg. Age<=20) ~Ei Indicates the complement of Ei (eg. Age>20) H+ Denotes space of positive instances of hypothesis (eg. Healthy) H - Denotes space of negative instances of hypothesis (eg. Sick) P(Ei) Represents the prior distribution of objects(cases) within the scope of the condition relative to the total number of examples covered by the entire population (e.g. P(Age>20) X1 = P(H|Ei) Fraction of positive instances of hypothesis covered by the condition Ei (i.e. True positives TP) X2 = P(H|~Ei) Fraction of positive instances that are not covered by the condition Ei, relative to the complement of Ei (i.e. False negatives FN) X3=P(Ei) Fraction of all instances covered by the evidence Ei, relative to the entire population Table 2: Terminology for weight model We emphasise that this weighting scheme defines class-specific weights, depending on class frequency and according to the degree to which its distribution of values (for a class) differs from its distribution for other classes. 3.2 Rule weights and ordering In what follows, we discuss the rationale for weighting and ordering the rules in an inductive knowledge-based system. The natural properties of production rules allow many ways for rules to interact and overlap. In other words, a rule may collide with or subsume another rule, because the available conditions are not usually independent. This requires some process to use the available information to resolve any existing conflict and generalise well over the training data, while keeping the rules as simple as possible. Knowledge-based systems are characterised by the separation of rules, data, and control scheme [8], and the task execution is accomplished through the following phases: Phase1: Matching phase in which the knowledge-based system matches its rules against a given instance, to determine the conflict set (the set of rules satisfied), Phase2: Conflict resolution. Here the system selects a rule to fire from the conflict set. Phase3: Rule application, which entails execution of the selected rule by following its condition part. 11
  13. 13. The matching phase is known to be a time consuming operation, but the conflict resolution is particularly important as it determines which rule will fire and hence how the case will be classified. Several general schemes of conflict resolution have appeared in the literature utilising different kinds of heuristics, for example: 1. Rule specificity was used to favour specific rules [11]. 2. Connection strength [8] which measures rule strength that codes its “relatedness” to other rules in the knowledge base for selecting the next rule to fire on the basis of experience. Their scheme imposes an explicit priority ordering given the last rule satisfied during task execution. The following heuristic is adopted: Wherever rule A fires, then based on experience the most likely candidate rule to be fired and eventually lead to success is rule B. This idea is very different from our approach. 3 Logical sufficiency [2] is a probabilistic estimate of rule reliability, that measures the proportion of the training examples of class c, for which the rule was true divided by the proportion of the training examples of other classes for which the rule was true. 4 Belief certainty such as certainty factor model used in MYCIN [4] is used to represent uncertainty to rules conclusion. This is based on expert estimates which are the integration of many factors including probabilistic knowledge and attitude. Traditionally, conflict resolution, requires computing the entire conflict set (except scheme 2 above), and then selecting a rule according to certain property (like frequency of rule success). We propose the addition of a different mechanism to the inference engine for ordering the rules with respect to a strength of the association between rule’s conditions and the intended hypothesis (i.e. right hand side of the rule). This is similar to Politakis’ approach [22] preferring rules with strongest confidence over rules with less confidence. The goal of refinement is to improve knowledge-base efficiency with as few modifications as possible. This entails reducing the number of rules it has to consider for the purpose of improving the prediction quality. For the proposed method we maintain a simple matcher that matches rules individually. Selecting the next rule to match depends on its location (position in a list) and on its weight, thus avoiding the expensive computation of the entire conflict set. This means that the first rule to match will be fired, so there is never a conflict set. The rule weight can also be used for purging low utility rules from the system. Changing the priority of a rule affects its ability to fire, and it is the easiest operational change to perform [7]. Several knowledge-based systems use a rule priority change as a plausible action in their conflict resolution schemes [7,16]. Craw [8] advocates changing priority of a rule while rejecting the change if it proves to be ineffective in later stages. Laird [16] supports the notion of changing control 12
  14. 14. rather than the conditions of the rules in his system Soar (i.e. modifying the decisions in which the knowledge is used instead of modifying the rules that encode the knowledge) to recover from incorrect decisions. In his system the decisions made on the basis of preferences are categorised into three classes (acceptability, necessity, and desirability) to select a rule for firing. Other authors [29] took the opposite extreme view point, namely, not to change the rule order under any circumstances; but there is no clear evidence for keeping a specific ordering of the rules, if resorting to a small change like reordering could improve knowledge-base efficiency. ARIS starts with an unordered set of rules, and imposes an order required to keep the rules in line with their utility to the system, in the sense that: (1) Provides as few matching operations as possible to produce the desired action without explicit conflict resolution computation. (2) Improves refinement efficiency by supplying a process for locating the rule responsible for misdiagnoses directly. For example, if we have a knowledge base containing rules that help making decisions about outdoor activities for children, such as: R1: IF Today’s temperature < 22 THEN Play-Soccer R2: IF Today’s temperature < -10 THEN Build Snowman As you can see, for a temperature of -12 degrees, the knowledge-base system will suggest playing soccer, which is not correct for this situation. While there are several ways to amend this problem, the simplest one is to reorder the rules; here we must use some criterion that allows the learning system to perform the action (evaluation then ordering) automatically. We can provide the learning system with semantic ability to relate each class with particular valid condition, so that the learning system can distinguish irrelevant conditions, and order the rules with the proper context. The other way is to assign a weight to a rule which relates the strength of the conditions to the conclusion of the rule. For the example above, the condition “Today’s temperature<22”, covers some false positive cases from the second rule’s class. Now suppose that the rule R2 is complete and consistent. The intended scheme of rule weighting would result in the weight of rule R2 being higher than the weight of rule R1. As a result R2 will be placed in the first location. This yields better prediction performance. In Section 3.1, we described the weighting technique for rule conditions. We must admit the fact that there is no consensus on how best to measure the combined effect of the conditions on a conclusion in a rule, because in real-world applications, the conditions are not independent, and the relation between conditions themselves is not clear. However, we use a heuristic for combining condition weights to produce a weight for the rule. 13
  15. 15. The following combination function [4,13] will be used to calculate the weight of a rule Rule' weight= w1 + w2 (1− w ) + w3[1−[w + w2 (1− w )]]+ ... s 1 1 1 This computes the impact of the tests in a rule’s antecedent, where wi is the weight of condition i. Since the weights of conditions can be either positive or negative evidence (for or against) the hypothesis, we need to split the equation above as follows: 0 if no supporting evidence  MSE =  for positive conditions  w1 + w 2 (1 − w1 ) + w 3 [1 − [ w1 + w 2 (1 − w1 )]] + ... 0 if no opposing vidence  MOE =  for negative conditions  w1 + w2 (1 − w1 ) + w3 [1 − [ w1 + w2 (1 − w1 )]] + ... The weight of the rule is then MSE-MOE where MSE= Measure of supporting evidence, MOE= Measure of opposing evidence, wi = Weight for evidence i. 3.3 Improving an existing knowledge base It is our hypothesis that tackling the most difficult part of the knowledge base is a good approach, and we feel that other misdiagnoses could be solved as a by-product of the process. To test this hypothesis, we maintain a list of the available rules in the knowledge base, against a set of refinement cases selected independently from training data set as shown for example in Table 3 below: CASE RULE R2 R3 + WEIGHT R1 - + REFINEMENT CASE 1+ * 2 3+ * * * . . Table 3: Generating refinement locations 14
  16. 16. For a particular class in the given domain, each rule maintains its weight and a label (+ or -) with respect to the class under investigation. Each case is labelled + if it belongs to the class of interest or otherwise. A ‘*’ means that the case is covered by the rule. All the rules satisfied for a case in the refinement data set collectively produce a weight for that case. To determine this weight, we follow the formulation given in [4] which defines evidence combination as a special case of the Dempster-Shafer theory of evidence, that is: n F ( w1 ,..., w n ) = 1 − ∏(1− w j ), n >= 1 (2) j =1 Where w is weight of rule J. Since the rules in our approach are ordered with respect to their weight, we reformulate equation (2) as follows ∏ (1 − w )+ Max[w], n >= 1 n F ( w) = F ( w1 ,..., w n ) = 1 − j j =1 Where Max[w] is the highest weight of a rule among all the satisfied rules for a particular case. This combined weight F(w) is an indication of rule overlapping. The case is misdiagnosed by the system if a satisfied rule with the highest weight belongs to a different class (i.e. has a different sign). We would like to find the points that have strongest overlapping (i.e. interaction) of rules where misdiagnoses exist. The algorithm in Figure 5 describes how to assign case weights based on the rules satisfied. While the weight of the rules is used as an ordering criterion to order the rules in the knowledge base, for aiding classification, it also forms a basis for enhancing the refinement process. Refinement locations are meant to be places in the knowledge base where misclassification may occur. We identify refinement locations by a negative weight of misclassified instances. The number of different rules which satisfy a particular refinement case represents its potential, i.e. the degree of overlap in the knowledge base, which must be resolved. Our method of analysis calculates the performance of rules in the knowledge base with respect to the refinement data set. For each stored case, all the satisfied rules are identified. It is common to find multiple rules satisfied for a given case. The problem of misclassification appears when rules of incorrect conclusions are satisfied for the case. Because of this, we separate these rules into two groups one group corresponds to the active hypothesis (the class of the misdiagnosed case) for the refinement location, and the second group contains rules concluding other classes. We would like to determine the point at which the refinement would produce the optimal result with respect to increased performance of the knowledge base, and least refinement effort with respect to number of refinement operators implemented. Finding such an optimal refinement is not possible, but the case weight as calculated in Figure 5 is a useful heuristic indicating areas of high interest for possible refinements. Figure 3 summarises the refinement process. 15
  17. 17. Evaluate KB performance Stored refinement cases KB system conclusion conclusion Verify KB completeness Incorrect knowledge base diagnoses ? No Yes Verify KB redundancy Summarize rules performance on each refinement case Return Instantiate heuristics to locate the most critical part of the rule overlapping Perform refinement operations and select the best Apply refinement operator Evaluate the KB Update the best refinement Implement the best refinement Figure 3: Refinement process 16
  18. 18. Refine (erroneous, missed case) {The target rule is a rule satisfying missed case and concluding the correct hypothesis for the missed case} 1. Specialise the erroneous rule S Select best specialisation S If there is a target rule, accept the best specialisation 2. If there is no target rule: { • Find closest rule to generalise, then {Two successive scans of the rule are performed in order to generalise it } (1) Generalise by deleting a condition from the rule, such that the deletion does not degrade the performance of the rule, so as to cover the missed case (2) Generalise by expanding numerical scope of a condition if it is continuous, so that the missed case will be correctly covered. • If the weight of closest rule after generalisation > weight of the erroneous rule, then adopt the generalisation operation only, Else { • Augment erroneous rule with UNLESS clause • Apply both generalisation (to closest rule) and specialisation (to erroneous rule) operators • Evaluate the knowledge-base performance using each of the above refinements, and accept the one that better enhances the knowledge-base performance. } } 3. If there is no rule to generalise, then • Create a new rule using a minimum covering algorithm • If the weight of the created rule > weight of erroneous rule, then add new rule to the knowledge base; Else { • Specialise the erroneous rule, so it no longer covers the missed case, and • Add the newly created rule. } Figure 4: The Refine algorithm 17
  19. 19. Combined weight (missed case) Fp←Weight of rules supporting the hypothesis h, and satisfying missed case. Fn←Weight of rules that do not support the hypothesis h, and satisfy the missed case. Maxwp←Maximum weight carried by a rule, among those supporting the hypothesis h. Maxwn←Maximum weight carried by a rule, among those satisfying the missed case but do not conclude hypothesis h. Wi←Weight of rule(i) { Calculated using the approach discussed in Section 3.2 } • Calculate combined weight of rules satisfying the missed case and concluding h k Fp ← 1− ∏ (1 − w )+ Max i wp i =1 • Calculate combined weight of rules satisfying the missed case, and do not conclude h ∏ (1 − w )+ Max R Fn ← 1 − j wn j =1 • Compute the combined weight for the case, to indicate whether the case is misdiagnosed or not. If the combined weight of rules in favour of the hypothesis under consideration, is greater than the combined weight of the other rules, then the case will be diagnosed correctly; otherwise it will be misdiagnosed. Case weight = Fp − Fn • Return case weight. Figure 5: The combined weight algorithm Example: Table 4, describes a small classification system, to demonstrate our approach. It contains two final hypotheses (+) and (−). There are 16 training cases C1,C2,…,C16. The cases are marked as (+) if they belong to the hypothesis (+), or (−) otherwise. The rules are shown with their weights. The marks indicate the cases to which the rules R1 to R4 are applicable (i.e. the conditions of the rule are satisfied for the case). The rules are assigned (+) sign if they conclude hypothesis (+) or (−) otherwise 18
  20. 20. Case Combined R3 – R2+ R5 – R6 + R1 + R4 + case weight 0.40386 0.28568 0.225048 0.000778 -0.0003322 -0.256594 1+ 0.6864 * 2+ 0.6864 * 3+ 0.6864 * 4+ 0.1709 * 5+ 0.1709 * 6+ 7− 8+
  21. 21. * * * * -0.5148 * -0.5148 * -0.5148 *
  22. 22. * * * * * * 9+ 0.5155 10− -1.3064 * 11− -1.3064 * 12− -1.3064 * 13− -1.3048 * 14+ 0.2582 * 15+ 0.2582 * 16− 0.2582 *
  23. 23. * * * *
  24. 24. * *
  25. 25. * * *
  26. 26. Table 4: Refinement locations * means the case is covered by the corresponding rule means case is incorrectly covered
  27. 27. means case is correctly covered The assignment of the weights to the rules is done using the approach discussed in Section 3.2. The application of rule ordering with respect to the rule weight (R3,R2,R5,R6,R1,R4) lets us directly conclude that case 6 and case 8 will be misdiagnosed by the inference engine as they have a negative combined weight, while they belong to hypothesis (+), this in turn will lead to specialising rule 5, for its overly general. Notice that if the rules were in a different order, for example, if rule 5 preceded rule 2, other cases (i.e. case 4 and case 5) will be misclassified, but the ordering scheme adopted avoided this problem. 3.4 Minimum covering algorithm Many algorithms for concept learning search hypothesis space by considering a general-to-specific ordering of hypotheses. However, a minimum covering algorithm organises the search for a hypothesis beginning with the most specific possible hypothesis, then generalises this hypothesis at each step to cover an observed positive example. To be more specific, the minimum covering algorithm specifies a set of conditions that describe some cases centred around a particular instance which is normally a 19
  28. 28. misdiagnosed case; thus providing an approach that is constrained to focus on a subset of a particular concept by only a subset of the relevant instance features observed. It represents an inductive generalisation of consistent production rules, by adopting a least generalisation strategy. First, we select the most central case to be covered by the initial rule, this is normally the misdiagnosed case, and form the most specific rule that covers that case. Depending upon the condition values used in the case, an over-specific (OSpec) rule is developed, such that the OSpec rule covers only the central case. All example cases that belong to that class of the central case are considered to be positive examples and all cases that do not belong to the class are considered to be negative examples. Next, we generalise against the positive cases for each condition. The newly generated rule is a generalisation of OSpec. Then a new rule (Min_Cover_Rule) is created from this rule. The technique used to do this generalisation is the following: 1. Min_Cover_Rule ←Ø {Make an empty rule} 2. Create OSpec rule . { This is an overly specific rule which covers only the misdiagnosed case} 3. Generalise OSpec rule S Ordinal range expansion All the positive cases are ordered for the particular condition to facilitate expanding ordinal ranges which will result in a rule covering more cases. Each time the next value for the condition in the range is examined. For the less than () operator the next larger value on the range is tried. While the next smaller value is used to examine the greater than () operator. This results in the range of values for the condition being expanded in both directions. S Condition deletion For each discrete condition in OSpec rule, if deleting the condition improves quality of the rule then execute the deletion operation. 4. Create Min_Cover_Rule This involves a sequential scan through the conditions of the OSpec rule, after ordering them with respect to their weights. As each condition is examined it is added to Min_Cover_Rule. If the resulting rule is inferior to the original then the condition is removed from Min_Cover_Rule. The following rule evaluation function is employed to test the rule efficiency: Efficiency ( Min _ Cover _ Rule, α ) = (1 − α + consistency ) completeness + (α + consistency ) consistency Completeness = # correctly covered cases # cases of the same class # correctly covered cases # covered cases Where α is a weighting parameter. Consistency = In our experiments, we set the value of the variable α to 0.8. The use of this function results in the development of rules that are complete and consistent with regard to the data set, whenever this is possible. A formal description of the above algorithm is provided in Figure 7. This process is illustrated 20
  29. 29. in Figure 6. In this figure, example cases are presented as points in a two dimensional space representing a case’s age and weight. Positive cases are represented by (+) sign. Negative cases are represented by (–) sign. Each rule can be viewed as defining a rectangle that includes those points that the rule assigns to the positive class. The thick line indicates the rule to be refined. IF Age=8 Age1 Weight=8.2 Weight3 THEN class= Negative This rule covers eight negative and three positive cases. Assuming that the misdiagnosed case determined by the refinement subsystem is case A, then the OSpec rule that covers case A and only A is: IF Age=6 Age 5.59 Weight=6.2 Weight 6.19 THEN class= Positive we then generalise against the case B and C. This results in the final rule: IF Age =6.5 Age3 Weight=7.5 Weight5 THEN class= Positive weight 9 + _ 8 7 + _ _ 5 A+ _ C+ + _ 4 2 _ B+ _ 6 3 _ _ + + _ _ + + 1 1 2 3 4 5 6 7 8 9 10 11 12 13 age Figure 6: Example of minimum covering algorithm The minimum covering algorithm is incorporated to help determine a missing rule for a particular misdiagnosed case; and helps determine which condition to add to a rule for specialising it; as well as to decide the UNLESS clause for imbedded hypotheses. The algorithm performs a specific-to-general 21
  30. 30. search, through a space of hypotheses, beginning with the most specific rule, that initially covers only one case (namely the missed case) and terminating when the rule is sufficiently general. The algorithm employs a performance measure for rule quality to guide selecting the most promising conditions. At each stage, the rule is evaluated by estimating the utility of removing a condition or changing a numerical range, by considering the number of positive and negative cases covered. One might ask why the learning system did not find the newly learned rule when the knowledge base was initially constructed. In the first place, the specified case may be an exceptional one and the rules that are consistent with it can cover only a small number of other positive instances; thus those rules dealing with this exception may not be found during the initial knowledge-base construction. Secondly, the initial training set may not be a representative sample of cases in the category of interest. Thirdly, since ARIS can be used to refine any knowledge base these rules might not have been discovered by the original technique. Minimum covering algorithm (missed case) {Returns a rule that covers some examples belonging to case's hypothesis, including the missed case itself. Conducts a specific-to-general search for the best rule, guided by a utility metric.} Min_Cover_Rule ← Ø Rule ←Ø 1. Initialise Rule to the most specific hypothesis covering the given missed case 2. While Rule is overly specific Do • Generate the next more general Rule: • For each continuous condition {-Expand the range of the condition to cover the closest uncovered case - If PERFORMANCE increases, then accept the modifications Else Refrain from implementing the last changes } • For each condition If removing the condition improves PERFORMANCE, execute the deletion operation 3. Sort conditions of Rule with respect to their weight 4. {From Rule, add to Min_Cover_Rule, one condition at a time} I ←1 Do { Min_Cove_Rule ← Min_Cover_Rule + condition I If (Efficiency (Min_Cover_Rule) does not improve) Remove condition I I++ }Until (no more conditions are available) 5. Return Min_Cover_Rule of the form IF conditions THEN hypothesis Figure 7: The minimum covering algorithm 22
  31. 31. 4. Experimental evaluation 4.1 Effects of new information The research presented in this paper attempts to find better way of utilising information in domains where available data is large and ever growing, particularly in automated data collection environments. The ARIS algorithms were applied to problems from several different domains [20]. The aim was to deliver an improvement on existing inductive learning systems. All the executed experiments employ data sets which are randomly divided into a training data set, a refinement data set, and a test data set. The training and testing data sets each comprises 40% of the available data, while the remaining 20% is assigned to the refinement data set. Table 5 presents the results of the performance of ARIS before and after refinement, where the rules are generated using consistency and completeness criteria. Table 6 demonstrates the performance of ARIS on knowledge bases created by C4.5 system. Table 7 shows the effect of refinement on knowledge bases created by RIPPER system. The experiments show that the main operators used in knowledge-base refinement by the ARIS system are rule creation and rule specialisation. The biggest influence is the number of error-causing rules because each of them must be prohibited from firing for the misdiagnosed cases, to allow for the correct rule to be fired, or a new rule can be created. They can be specialised either by: strengthening the left hand side of the error causing rule which may be achieved by adding another condition or shrinking a numerical range of a condition, or appending a negation of a conjunction in a form of UNLESS clause. The experiments also showed that reduction in number of rules produced is vital. In practice, a knowledge base is created so that the rules have the maximum generality and least specificity possible, in other words that the rules must be general enough to accommodate future situations and commit the fewest possible misclassifications. The experiments indicate that the effect of generalisation operation is rare relative to the use of other refinement operations (i.e. approximately 3% of the total number of refinement operations) while specialisation was 13%. The most common operations were rule creation (43%), UNLESS clause (21%) and rule deletion (20%). We would expect the refinement system to follow this trend, because the initial knowledge base is susceptible to incompleteness and inconsistency. 23
  32. 32. Domain Audiology Flag Mushroom Heart Wine Hepatitis Iris Hypothyroid Adult Artificial data Before refinement #Rules Accuracy #Rules Accuracy After refinement # of #of Gens Specs # of UNLESS # of new rules # of rules deleted 22.9 31.2 22.7 26.8 3.1 3 2.6 3.4 160.8 19.4 16.1 15.9 9.8 26.2 4.1 4.5 3 4.5 57.1 20.3 58.20 58.20 99.85 49.58 88.12 79.52 95.67 98.20 71.44 98.11 3.9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4.3 1.5 0.9 2.8 1.3 1.7 0.7 1.6 7.5 1 11.1 16.8 13.8 3.4 0.3 0.2 0.3 0.5 11.01 0.1 43.04 56.39 99.69 48.93 87.36 78.55 93.17 97.89 71.43 98.09 0 0 0 0 0 0 0 0 0 0 Table 5: ARIS performance on knowledge bases created by ARIS using completeness and consistency criteria on ten selected domains and averaged on ten trials per domain Domain Audiology Flag Mushroom Heart Wine Hepatitis Iris Hypothyroid Adult Artificial data Before refinement #Rules Accuracy #Rules Accuracy 23.2 33 34.3 38.9 5.7 5.9 4.6 7.6 228 47.9 12.6 14.2 13.6 12.3 4.3 3.7 2.5 5.1 36.4 33.2 49.6 60.24 98.50 50.66 89.72 81.61 93.5 97.67 78.61 93.6 44.02 55.3 98.47 49.26 89.72 80.32 93.5 96.99 77.62 92.0 After refinement #of Gens # of Specs 0.1 0.4 0.3 0.2 0.1 0.1 0.1 0.4 # of UNLESS # of new rules # of rules deleted 0.2 0.1 0.2 0.1 0.2 0.8 0.1 0.1 4 3 0.2 0.6 14.6 21.8 20.5 27.2 1.5 2.7 2.4 4.4 192 17 0.5 0.2 0.3 1.9 0.4 2.3 Table 6: ARIS performance on knowledge bases created by C4.5 system and averaged on ten trials per domain Domain Audiology Flag Mushroom Heart Wine Hepatitis Iris Hypothyroid Adult Artificial data Before refinement #Rules Accuracy #Rules Accuracy 12.8 8.6 7.5 2.9 2.8 1.3 2.6 2.5 4 15.6 12.8 13.7 7.5 9.5 3.8 3.4 3.1 2.6 7.5 17.9 68.04 55.66 99.86 53.03 87.64 77.58 90.99 98.4 81.27 96.35 66.56 52.89 99.84 52.2 86.25 77.26 90.67 98.4 81.27 96.34 After refinement # of #of Gens Specs 0.1 0.5 0.1 0.5 # of UNLESS # of new rules # of rules deleted 0.3 0.5 2.9 6.1 0.4 7.1 1.6 2.2 0.8 0.3 3.8 2.7 2.8 1 0.4 0.3 0.5 0.1 0.3 0.2 0.7 0.3 0.2 Table 7: ARIS performance on knowledge bases created by RIPPER system and averaged on ten trials per domain 24 0.4
  33. 33. 4.2 Fair comparison between ARIS and other systems We have shown that the refinement approach improved concept-description accuracy of both C4.5 and RIPPER systems, and reduced knowledge-base complexity (in terms of number of rules) whenever possible for several data bases, using a separate set of data for refinement. In this section we compare the approach of learning plus refinement to simply learning rules from all the data. Our approach in the comparison involves the following strategy: S Induce a knowledge base by training the learning system with 40% of the available data and then refine it using 20% of the available data, S Induce a knowledge base by training the learning system using 60% of the available data, S Compare the performance of the generated knowledge bases by testing both approaches on the remaining 40% of the data. Table 8 is another comparison of three systems, namely ARIS, C4.5, and RIPPER on the selected test domains. The refinement results were compared with knowledge bases induced by incorporating both the training data and refinement data sets as a unified training set. This gives a fairer comparison between ARIS, C4.5, and RIPPER. The first column names the domain used. The second column shows the performance of the ARIS system on the test data used when trained on 40% of the available data. The third column gives the performance of the ARIS system on the same test data when trained on the combined training data (i.e. training and refinement data sets together). The fourth column gives the performance of the knowledge base on the test data after refinement. Column five gives the performance of the C4.5 system on the test data when trained on 40% of the available data. Column six shows the performance of C4.5 on the test data when trained on the combined data. Column seven indicates the performance of the refined knowledge base induced by C4.5 system on the same test data. Column eight shows the performance of RIPPER system on the same test data when trained on 40% of the available data. Column nine gives the performance of RIPPER on the test data when trained on the combined data. Column ten gives the performance of the RIPPER’s refined knowledge-base on the same test data set. The tick marks indicate cases where training followed by refinement yielded better results than simply training on all the data. The experiments display the striking difference in rule induction between C4.5 and RIPPER. Specifically, the C4.5 system generates many rules, some of which have the potential of causing rule contradiction. The ARIS refinement process demonstrated that deleting such superfluous rules often increases knowledge-base accuracy. On the other hand, RIPPER’s approach for inducing rules creates fewer rules. Therefore, during refinement, ARIS performs more rule creations, and few rule deletions on the RIPPER knowledge bases. 25
  34. 34. Finally, let us analyse the causes of an increase in the accuracy in inductive learning systems. In general, it can be expected that concept-description accuracy will increase as the size of the database increases. This general feeling is illustrated by the accuracy for artificial data set (last row in Table 8), where training a learning system on the large non-noisy data produced better results than training on small portion and then refining the induced knowledge base. This will then begin to deteriorate as noise increases such as in hepatitis and hypothyroid data. Comparing the use of ARIS on the combined data set (60% of the data) just to generate the rules without refinement, with training ARIS on 40% of the data and refining on 20%, we have observed that on 7 out of 10 of the domains, the results of refinement approach is better, and was inferior on 3 domains (wine, artificial data, and flag). Refining C4.5 rules produced with 40% of the available data was better than training C4.5 system on the combined data set in 7 out of 10 domains. This indicates that a refinement approach is useful when knowledge bases are created without testing their predictive abilities as in the case with C4.5 system. In RIPPER, the rules are generated by one training data set and post-pruned using another separate pruning-data-set. This is clearly a limited refinement scheme which is the advantage of RIPPER over other learning systems such as C4.5, HYDRA, and AQ15. This is the reason behind the good quality results of RIPPER when trained on the combined data set as can be seen from Table 8. However, it is still error prone due to noise, for example, on the Hypothyroid data set, the RIPPER system produced results (when trained on combined data) which are inferior to the results obtained by training the system on small data (40%) and then using ARIS to refine the rules. The hepatitis data is the smallest noisy data set (135 instances) we have. The attributes available (19 attributes) enabled the refinement system to draw the class boundaries better than when the learning system was trained on the combined data. The Heart data set has similar characteristics to the hepatitis data, and similar results are obtained. On the mushroom data set, the same accuracy was reached using both approaches. The remaining six domains expressed better results for RIPPER when trained on the combined data set. In summary, refinement mechanism improved concept-description quality for all the algorithms in the three medical domains (i.e. Hepatitis, Hypothyroid, and Heart) characterised by noise and small disjuncts problems. Moreover, improvement on several other domains was obtained with C4.5 and ARIS. It is therefore, recommended that learning systems use a refinement mechanism on a data set which is separate from the training set used to induce the knowledge base such as that of RIPPER and ARIS to obtain good quality concept descriptions. 26
  35. 35. Domain Iris ARIS using completeness and consistency Trained on Trained on Refined KB 40% combined data 93.17 93.67 95.67 Trained on 40% 93.5 C4.5 system Trained on combined data 94.0 93.5 Trained on 40% 90.67 RIPPER system Trained on Refined KB combined data 93.33 90.99 Wine 87.36 89.72 88.12 89.72 90.0 89.72 86.25 90.14 87.64 Hepatitis 78.55 78.39 79.52 80.32 77.96 81.61 77.26 76.94 77.58 Hypothyroid 97.89 98.14 98.2 96.99 97.74 97.67 98.4 98.20 98.4 Heart 48.93 47.05 49.59 49.26 50.19 50.66 52.2 52.21 53.03 Flag 56.39 59.64 58.2 55.3 55.54 60.24 52.89 57.23 55.66 Audiology 43.04 40.20 49.02 44.02 43.12 49.12 66.56 70.69 68.04 Mushroom 99.69 99.77 99.85 98.47 98.37 98.5 99.84 99.87 99.86 Adult 71.43 70.97 71.44 77.62 78.07 78.61 81.27 82.22 81.27 Artificial data 98.09 98.86 98.11 92.0 98.16 96.6 96.34 97.53 96.35 27 Refined KB
  36. 36. 4.3 Statistical comparisons To test the results statistically, we used a two-way analysis of variance to compare the means with a significance level of 5%. The raw data from the ten cross-validation experiments was used. The data set was one feature and the type of analysis the other. The six analysis types used corresponded to the results shown in Table 8, comparing using 60% of the data for training to using 40% for training and 20% for refinement. The following results show the ANOVA table and the Duncan groupings for the type of analysis. Source DF Type I SS Mean Square F Value PrF Analysis type 5 1567.54 312.3083 6.77 0.0001 DAT 9 192014.52 21334.9474 462.53 0.0001 Table 9: ANOVA table for experimental results Duncan Grouping Mean N EFF A 82.42 100 RIPPER trained on 60% A 81.47 100 RIPPER trained on 40% and using 20% for refinement B 79.57 100 ARIS trained on 40% and Using 20% for refinement B 79.42 100 C4.5 trained on 40% and using 20% for refinement B 78.37 100 C4.5 trained on 60% B 77.86 100 ARIS trained on 60% Table 10 Duncan grouping for overall mean values using 3 inductive learning systems with refinement Refinement on the results from the C4.5 system resulted in better performance than training on 60% of the available data, by more than 1%. Training ARIS (just to create a knowledge base) using 40% then refine it resulted in better performance than training ARIS on 60% by more than 1.7%. However, training RIPPER on the combined data (i.e. 60%) is better than training on 40% and then refining using 20%, by less than 1%. It is important to note again that RIPPER includes a refinement scheme during the creation of a knowledge base. Therefore, exposing the system to more data for refinement would often produce better results than refining on only 20%. Moreover, the lack of some refinement operators within RIPPER lead to inferior results than using our refinement approach in noisy domains with small disjuncts problem as was discussed previously. 28
  37. 37. In a second analysis we confine our comparison to C4.5 and ARIS systems and exclude RIPPER as it includes an internal refinement approach. The following tables show the output with 5% significance level. Source DF Type I SS Mean Square F Value PrF Analysis type 3 203.9567 67.9856 1.6 0.188 DAT 9 150542.5780 16726.9531 394.24 0.000 Table 11: ANOVA table for experimental results for two inductive systems Duncan Grouping N EFF 79.57 A Mean 100 ARIS trained on 40% and used 20% for refinement BA 79.42 100 C4.5 trained on 40% and used 20% for refinement BA 78.37 100 C4.5 trained on 60% B 77.86 100 ARIS trained on 60% Table 12: ANOVA comparison of results with and without refinement Table 12 shows that training ARIS system on 40% and then refining using 20% produces better results than training using the combined data (i.e. 60% of the available data) and the difference is statistically significant. The table also shows that training C4.5 system on 40% and refining using 20% is better than training on 60%. 5. Differences between inductive refinement and incremental learning The top level goal of both subareas of incremental learning and inductive refinement is to produce learning systems that are able to modify a knowledge base as a result of experience. Both subareas interleave learning and performance (making learning systems themselves adaptable). However, each subarea has its own characteristic strengths and weaknesses. We note some of them below: S Source of knowledge Incremental learning systems modify their own inferred knowledge base. However, inductive refinement has the power to revise knowledge bases inferred by other batch learning systems and hence can be valuable for improving inductive learning. S Nature of refinement operators Inductive refinement systems generally have the ability to refine a knowledge base by generalising/specialising their rules, creating new rules, or removing superfluous rules. In contrast, incremental learning specialises or generalises only its current hypothesis. 29
  38. 38. S Input schema used Whereas incremental learning systems accept their own input one object at a time; inductive refinement systems receive the output of a batch learning system and a set of refinement data before commencing the refinement process. S Evaluation strategy Refinement systems require access to an additional view of the environment (i.e. refinement data set) to perform refinement to the knowledge base leading to better performance with regard to all the refinement set while correcting the misdiagnosed cases. In contrast incremental learning processes its input one object at a time to guarantee covering the given object correctly. 6. Conclusions 6.1 Relationship between ARIS and other systems The main limitation of inductive learning systems is that the results from the testing phase have no role in improving the learning process, but are merely used for validation purposes. After the learning process is completed, the generated concept descriptions are used over and over for prediction. If new information is observed the learning system cannot utilise it. In such a situation the previously learned knowledge base will be discarded and all the data (old and new) has to be supplied to the learning system to create the knowledge base anew. In ARIS the results obtained are analysed and fed back automatically to the learning mechanism. This causes the learning system to refine the knowledge base, add missing rules, and delete superfluous rules, producing a more comprehensive and accurate knowledge base. While retaining the fundamental ideas behind inductive learning ARIS is more powerful and capable of performing many additional actions to improve its knowledge base. The SEEK system [22,23] is able to help an expert focus on a problem within a knowledge base, but it is then up to the expert to suggest whether to accept it or not. One feature of SEEK is the use of exclusions to rule out a decision in some situations (which is a common phenomenon in the medical domain). But because the rules were provided in advance by experts in the field (Rheumatic diseases), and are expected to be complete which is a strong restriction, SEEK was unable to produce new exclusions to its rules neither was it able to add new rules. Specialisation was limited to adding conditions and numerical ranges cannot be altered. ARIS takes this further, suggesting new possible exclusions, as well as adding or altering conditions (both discrete and ordinal) allowing slight modifications to the existing rules, in contrast to radical changes of deleting and adding conditions. ARIS uses a minimum covering approach to reach the exclusion in a form of a additional clause (i.e. UNLESS) appended to the original rule which states that a combination of facts rules out a particular decision in certain situations. The 30
  39. 39. minimum covering approach facilitates the process of reaching what condition to add to a rule for specialisation. The other feature of SEEK is the distinguishing between conditions by defining a condition as minor or major. This is a realisation of the different effect of conditions to conclusions of rules, and it is a form of weighing the conditions. ARIS weighs conditions in rules using a probabilistic estimate with respect to data distribution in training set and the weight is continuous in the range between -1 and +1. SEEK2 [10] an extension to SEEK, learns new rules by processing some examples. By using a set of examples for which the hypothesis failed, SEEK2 finds a set of common features which are then used to form a new rule. A complete refined knowledge base is presented to the expert (SEEK2 also works on a knowledge base predefined by an expert). ARIS finds a set of features by using a minimum covering algorithm. It starts by using a misdiagnosed example and gradually allows more facts to be incorporated. This gives more flexibility and allows ARIS to succeed in many situations requiring rule modifications. EITHER [21] uses multiple approaches to refine an existing knowledge base, guided by a pre-classified test examples. Missing rules can be generated, however, conditions can only be added or removed. Ordinal tests cannot be modified, and numerical ranges cannot be expanded or reduced, for example IF age=14 THEN teen-ager cannot be modified to IF age=18 THEN teen-ager KRUST [7] shows that generating many refinements to a knowledge base, with the help of a set of preclassified examples selected with precision (called chestnuts) and without considering statistical measures, is another alternative for refinement. Different versions of refinements are generated, each leading to a whole new knowledge base, then one or more refined knowledge bases which are believed to be good are suggested to the collaborating expert. However ARIS is designed with a different idea in mind. It tries to learn general rules using a heuristic search on a library of cases, and gradually extends or modifies its knowledge as required. The lesson learned from KRUST is that applying all possible refinement operators and selecting the best performing operator or a group of operators to remedy the misdiagnoses is better than accepting the first successful refinement. Table 13 presents and compares the following important characteristics for refinement systems: S Knowledge-base source Previous knowledge-base refinement systems rely on domain experts to construct rule sets that are complete with respect to the domain of interest. However, our approach depends on heuristics to generate/refine knowledge. S Knowledge-base representation 31
  40. 40. Several different representations have been used to facilitate knowledge transfer from domain experts to computers. Some representations are domain-specific, for example criteria table in medical domain (used by SEEK) and cannot be easily used in other domains. A propositional production rule language is used in ARIS because it is easy to comprehend and expresses knowledge efficiently. S Conflict resolution strategy The conflict resolution scheme provides the framework for pin-pointing the conclusion among various possible conclusions that can be reached. Some systems accept the conclusion of the first rule satisfied such as KRUST and EITHER. Other systems accept a conclusion that is reached with a highest certainty among several reached conclusions such as in SEEK and SEEK2. Our approach orders rules according to their weight which is a probabilistic estimate of rule quality. S Pre-assumptions All other knowledge-base refinement systems assume that the knowledge base is complete and only minor tweaking is required. This view assumes the availability of experts and that the domain is fairly well understood. Inductive concept-learning deals with more general environments that are characterised by: - lack of understanding of the particular domain, - availability of data describing instances of the environment behaviour, - regular change of environment such as is the case with medical images and credit-card transactions These characteristics often make it impossible to guarantee completeness of the initial knowledge base. S Missing refinement operators Early refinement systems such as SEEK and SEEK2 were developed for specific problem-domain, and some refinement operators were missing. For example, as a result of the completeness assumption, creating new rules was not available in SEEK. We implemented all the refinement operators applicable for propositional production rules. S User interaction Interaction of domain-expert/knowledge-engineer is a common characteristic of previous knowledge-base refinement systems. Our approach is non-interactive. S Knowledge-base evaluation approach Reaching a good quality refined knowledge base can be viewed as a search through a space of possible knowledge bases. The EITHER system accepts the first successful refinement; while SEEK relies on local statistics that favours a refinement which is deemed to correct more misdiagnoses. KRUST on the other hand generates many refinements then select best one. ARIS evaluates the quality of a knowledge base after application of every possible refinement operation for the misclassifications, and chooses best refinement resulting in the highest knowledge-base quality which is measured by its completeness and consistency conditions. 32
  41. 41. In summary, ARIS has improved on inductive learning systems, by using performance analyses as a feed-back to the learning process. The results for several knowledge bases are encouraging. However there are extensions that would make it a more powerful learning tool as discussed in Section 6.3. 33
  42. 42. Characteristic Knowledge base source Knowledge base representation ARIS Inductively generated by analysis of pre-classified data set Propositional production rules Conflict resolution IF conditions THEN conclusion UNLESS conditions First rule satisfied is fired SEEK Expert originated KRUST Expert originated EITHER Expert originated Criteria tables Backward chaining production rules, using semantic nets Horn clause rules A rule with the highest certainty is fired It is assumed that the knowledge base is complete, and only minor tweaking is required S No rule creation S No KB redundancy check S Refinement of continuous conditions is missing First rule satisfied is fired First rule satisfied is fired It is assumed that the knowledge base is complete, and only minor tweaking is required S No KB redundancy check S New rule creation is not efficient S Does not handle exceptional cases Yes It is assumed that the knowledge base is complete, and only minor tweaking is required S No KB redundancy check S Refinement of continuous conditions is missing S Does not handle exceptional cases yes Generate many refinements, and then select one. The KB's are tested on a special data set called chestnuts, that are selected precisely for the given domain, so that a good KB does not misclassify it. First successful refinement is accepted. Pre-assumptions Training data is an adequate representation of the domain Missing refinement operators All refinement operators are implemented Expert interaction No interaction with the system Yes Knowledge base evaluation technique Completeness and consistency factors are used to evaluate the system Local statistical measure, and first successful refinement is accepted 34
  43. 43. 6.2 Conclusion An inductive refinement model has been developed that is capable of creating a knowledge base from a library of pre-classified cases, and continuously updating it to accommodate new facts. This model is of particular importance in regularly changing and noisy domains such as credit-card transactions and medical images. We have developed a method of learning rule weights based on an estimate of the relationship of rule conditions to a conclusion. The rules are ranked according to their weight to determine the misdiagnosed cases more easily. We have further developed a method for learning rules centred around a misdiagnosed case, that is what we call the “minimum covering algorithm” which help determining missing conditions in erroneous rules for specialisation, as well as adding new rules to the knowledge base. We have also developed an algorithm for inducing rules from trees utilising completeness and consistency criteria for rules. Specific areas of contribution include: S Scaling learning algorithms to large and ever growing databases Databases are large and growing larger, so learning systems must be able to handle enormous amounts of data. Inductive refinement introduces a method for inductive systems to cope with future changes in the data without forgetting past experience. This is useful since it is easier to only investigate the recently observed data. For example, it is easier to just process the last thousand credit-card transactions than a full database of millions. S Focus of refinement process Learning weights for the rules and conditions makes it possible to focus the refinement process and resolve conflicts more easily. S Implementation that demonstrates goals The ideas put forth in this paper have been implemented, and the implementation has demonstrated the feasibility and effectiveness of the approach including: (a) More accurate than C4.5 ARIS’s refinement approach proved to be more accurate than training C4.5 on the same data all at once (combined training) on many noisy data sets (see Table 8). C4.5 is more accurate on one large noise-free data set (i.e. artificial data) and equal on two other small noise-free data sets (i.e. iris and wine data sets). (b) Occasionally equal or more accurate than RIPPER Table 8 shows that ARIS’s refinement approach is more accurate than RIPPER on three test domains (i.e. heart, hypothyroid, and hepatitis) that are characterised by noise and small disjuncts problems (i.e. rules covering few cases). When RIPPER creates new rules it ignores any rule that has accuracy less than 50% on the pruning data set (part of the 35
  44. 44. training data provided). In presence of small disjuncts problems combined with noise RIPPER misses some rules that prove to be useful in future. The refinement mechanism corrects such problems when observed. On one data set (i.e. mushroom data set) where the majority class forms 51.81% of the domain the two approaches are equal. However, RIPPER is better when trained on all the data combined on the rest of test domains. It is important to note that RIPPER includes an internal limited form of refinement scheme. Therefore, it can be expected that exposing a learning system to a larger view of the environment will often perform better especially in noise-free domains such as the artificial data set used in the experiments. Therefore, improving on RIPPER on some domains is good. This is a benefit of the increase of search effort and the more refinement operators available that can modify the class boundaries gradually resulting in improvements of accuracy of the knowledge-based system. The goal of this research has been to provide an effective approach to inductive learning, based on a batch-learning strategy making learning systems responsive to domain changes. The experiments with the ARIS system, and the lessons learned during refining knowledge bases induced by other learning systems such as C4.5 and RIPPER serve as a good demonstration of the significance behind refinement in inductive learning to improve system’s performance. Moreover, RIPPER can be improved if it: (1) performs more refinement operators such as expanding and shrinking numeric ranges rather than adding/deleting conditions only, (2) Allows the use of negation of conditions in the developed rules, as used in ARIS. 36
  45. 45. 6.3 Limitations and further work This research has pointed out several potential areas for further research and investigation: S Learning intermediate concepts In order to get better benefit from concept induction, we need to relax the mutual exclusivity of the classes. This would permit a class to be named as a condition in a rule, resulting in more compact concept descriptions. S Parallel implementation More work needs to be done in order to speed up the refinement process. Since refinement operators (specialisation/generalisation) are performed independently, this makes the refinement process suitable for parallel implementation. The problem would be to design an optimal framework to distribute and control the workload among a fixed set of processors (equivalent to the number of refinement operations). This will lead to an increase of the speed of the refinement system. S Constructing new features In many learning problems, there are hundreds or thousands of features describing each input case. Current concept-learning algorithms such as C4.5 [25] do not scale well when there are many features. Moreover, it is easy for learning algorithms to be confused by noisy features and construct poor classifiers. Constructing new features from the given data enables learning systems to change their representation to redescribe the given domain. New features can be constructed by assigning names to groups of best-performing features for each class and/or combining feature values into larger units. Constructing new features during the refinement cycle will produce more compact concept descriptions, and might lead to increase in performance similar to that reported by some constructive induction research in the DNA domain [17]. Acknowledgements I gratefully acknowledge the intellectual support of my supervisor Professor David Lubinsky, during my research work, and for his helpful comments on this paper. REFERENCES 1. Ali K. (1996). Learning probabilistic relational concept description. Ph.D. thesis, University of California. 37
  46. 46. 2. Ali K., Pazzani, M. (1993). HYDRA: A noise- tolerant relational concept learning algorithm. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Chamberg, France. 3. Breiman L., Friedman J., Olshen R., Stone C. (1984). Classification and regression trees. Wadsworth, pacific Grove, CA. 4. Buchanan B., Shortliffe E. (1984). Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley Publishing Company. 5. Cloet I., Theron H. (1991). CID3: an extension of ID3 for attributes with ordered domains. South African Computer Journal, 10-16. 6. Cohen W. (1995). Fast effective rule induction. In: Proceedings of the Twelfth International Conference on Machine Leaning. San Francisco, CA. 7. Craw S. (1991). Automating the refinement of knowledge-based systems. Ph.D. thesis, University of Aberdeen. 8. Fayad U., et al. (1988). Learning control information in rule-based systems: a weak method. In: Proceedings of the Fourth Conference on AI applications, 14-19. 9. Friedman J., Kohavi R., Yun Y. (1996). Lazy decision trees. In: Proceedings of the National Conference on Artificial Intelligence.Portland. 10. Ginsberg A. (1986). Refinement of expert systems: A metalinguistic framework for heuristic analysis. Ph.D. thesis, Rutgers University. 11. Holland J. (1986). Escaping brittleness: The possibilities of general purpose learning algorithms applied to parallel rule-based systems. In: Michalski et al. (editors), Machine Learning II. Morgan Kaufmann publishers Los Altos, California. 12. Holsheimer M., Siebes A. (1991). Data Mining: The search for knowledge in databases. Technical Report CS-R9406, Amsterdam, The Netherlands. 13. Kanal L., Lemmer j. (1986). Uncertainty in Artificial Intelligence. North-Holland, Amsterdam. 14. Kohavi R., et al. (1997). The utility of feature weighting in nearest neighbour algorithms. In: Van Someren M. and Widermer G., (editors) , Poster paper: Ninth European Conference on Machine Learning. Czech Republic. 15. Kohavi R., Kunz C. (1997). Option decision trees with majority votes. In: Proceedings of the International Conference on Machine Learning. Tahoe City, CA. 16. Laird J., (1988). Recovery from incorrect knowledge in Soar. In: Proceedings of the National Conference on Artificial Intelligence, Saint Paul, Minnesota, 618-623. 17. Mark W., Jude W. (1993). Learning to represent codons: A challenge problem for constructive induction. In: Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Chamberg, France, 1319-1324. 18. Michell T. (1997). Machine Learning, McGraw Hill. 19. Mohri T., Tanaka H. (1994). An optimal weighting criterion of case indexing for both numeric and symbolic attributes. In: D.W. Aha (Editors), Case-based reasoning: papers from the 1994 workshop (TR WS-94-01). Menlo Park, CA. 20. Murphy p. UCI repository of machine learning databases, 38
  47. 47. [ http:/ html]. 21. Ourston D., Mooney R. (1994). Theory refinement combining analytical and empirical methods. Artificial Intelligence, 273-309. 22. Politakis P. (1983). Using empirical analysis to refine expert system knowledge bases. Ph.D. thesis, Rutgers University. 23. Politakis P., Weiss S. (1984). Using empirical analysis to refine expert system knowledge bases. Artificial Intelligence, 23-48. 24. Quinlan R. (1988). Induction of decision trees. Machine Leaning, 81-108. 25. Quinlan R. (1993). C4.5, Programs for machine learning. Morgan Kaufman. 26. Raymond J., Mooney R., Ourston D. (1994). A multi-strategy approach to theory refinement. In: Michalski et al. (editors) Machine Learning IV. Morgan Kaufmann publishers Los Altos, California. 27. Theron H. (1994). Specialisation by exclusion: An approach to concept learning. Ph.D. thesis, University of Stellenbosch, South Africa. 28. Wettschereck D., Aha D. (1995). Weighting features. In: Proceedings of the First International Conference on Case-Based Reasoning. Sesimbra, Portugal. 29. Wilkens D. Ma Y. (1994). The refinement of probabilistic rule sets: sociopathic interactions. Artificial Intelligence, 1-32. 39