
RuleML2015: Rule Generalization Strategies in Incremental Learning of Disjunctive Concepts


Symbolic Machine Learning systems and applications, especially
when applied to real-world domains, must face the problem of
concepts that cannot be captured by a single definition, but require several
alternate definitions, each of which covers part of the full concept
extension. This problem is particularly relevant for incremental systems,
where progressive covering approaches are not applicable, and the learning
and refinement of the various definitions is interleaved during the
learning phase. In these systems, the learned model depends not only
on the order in which the examples are provided, but also on the
choice of the specific definition to be refined. This paper proposes
different strategies for determining the order in which the alternate definitions
of a concept should be considered in a generalization step, and
evaluates their performance on a real-world domain dataset.

  1. 1. Rule Generalization Strategies in Incremental Learning of Disjunctive Concepts Stefano Ferilli, Andrea Pazienza, Floriana Esposito Dipartimento di Informatica Centro Interdipartimentale per la Logica e sue Applicazioni Università di Bari 9th International Web Rule Symposium (RuleML) August 2-5, 2015 – Berlin, Germany
  2. 2. Overview ● Introduction & Motivation ● Generalization Algorithm ● Selection Strategies ● Evaluation ● Conclusions & Future Work
  3. 3. Introduction ● Symbolic knowledge representations ● Mandatory for applications that – Reproduce the human inferential behavior – May be required to explain their decisions in human-understandable terms ● 2 kinds of concept definitions – Conjunctive: a single definition accounts for all instances of the concept – Disjunctive: several alternate conjunctive definitions (components) ● Each covers part of the full concept extension ● Psychological studies have established that capturing and dealing with the latter is much harder for humans ● Pervasive and fundamental in most real-world domains
  4. 4. Introduction ● Knowledge acquisition bottleneck ● (Symbolic) Machine Learning (ML) systems – Supervised setting: concept definitions inferred from descriptions of valid (positive) or invalid (negative) instances (examples) ● Batch: whole set of examples available (classical setting in ML) – Components learned by progressive coverage strategies – Definitions are immutable ● If additional examples are provided, learning must start from scratch considering the extended set of examples ● Incremental: new examples may be provided after a tentative definition is already available – Components emerge as long as they are found – Definitions may be changed/refined if wrong
  5. 5. Motivations ● Incremental approach – If the available concept definition cannot properly account for a new example, it must be refined (revised, changed, modified) so that the new version properly accounts for both the old and the new examples – Progressive covering strategy not applicable ● Issue of disjunctive definitions becomes particularly relevant – When many components can be refined, there is no unique way for determining which one is most profitably refined – Refining different components results in different updated definitions, that become implicit constraints on how the definition itself may evolve when additional examples will become available in the future ● The learned model depends – on the order in which the examples are provided – on the choice of the component to be refined at each step
  6. 6. Motivations ● Abstract Diagnosis: Identification of the part of the theory that misclassifies the example – Tricky in the case of disjunctive concepts ● Several alternate conjunctive definitions are available – If a positive example is not covered, no component of the current definition accounts for it (omission error) ● All are candidates for generalization – Generalizing all components, each component would account alone for all positive examples ● Contradiction: the concept would be conjunctive ● Over-generalization: the theory would be more prone to covering forthcoming negative examples – Problem not present in batch learning
  7. 7. Motivations ● Problem – A single component of a disjunctive concept is to be generalized ● Guided solutions may improve effectiveness and efficiency of the overall outcome compared to a random strategy ● Objective – Propose and evaluate different strategies for determining the order in which the components should be considered for generalization ● If the generalization of a component fails, generalization of the next component in the ordering is attempted
  8. 8. Motivations ● Questions: – what sensible strategies can be defined? – what are their expected pros and cons? – what is their effect on the quality of the theory? – what about their consequences on the effectiveness and efficiency of the learning process? ● Starting point – InTheLEx (Incremental Theory Learner from Examples) ● fully incremental ● can learn disjunctive concept definitions ● refinement strategy can be tuned to suitably adapt its behavior
  9. 9. InTheLEx ● Learns hierarchical theories from positive and negative examples ● Fully incremental – May start from an empty theory and from the first available example ● Necessary in most real-world application domains ● DatalogOI – Concept definitions and examples expressed as rules ● Example: example :- observation ● Rule: concept :- conjunctive definition – Disjunctive concepts: several rules for the same concept
  10. 10. InTheLEx: Representation – Theory that defines a disjunctive concept (4 components) ● ball(A) :- weight_medium(A), air_filled(A), has_patches(A), horizontal_diameter(A,B), vertical_diameter(A,C), equal(B,C). %generic ● ball(A) :- weight_medium(A), air_filled(A), has_patches(A,B), horizontal_diameter(A,B), vertical_diameter(A,C), larger(B,C). %rugby ● ball(A) :- weight_heavy(A), has_holes(A), horizontal_diameter(A,B), vertical_diameter(A,C), equal(B,C). %bowling ● ball(A) :- weight_light(A), regular_shape(A), horizontal_diameter(A,B), vertical_diameter(A,C), equal(B,C). %small – Examples: ● ball(e) :- weight_medium(e), has_patches(e), air_filled(e), made_of_leather(e), horizontal_diameter(e,he), vertical_diameter(e,ve), equal(he,ve). %soccer Etrusco FIFA World Cup 1990 ● neg(ball(s1)) :- weight_light(s1), made_of_snow(s1), irregular_shape(s1), horizontal_diameter(s1,hs1), vertical_diameter(s1,vs1), smaller(hs1,vs1). %snowball ● neg(ball(s2)) :- weight_light(s2), made_of_paper(s2), horizontal_diameter(s2,hs2), vertical_diameter(s2,vs2), larger(hs2,vs2). %spitball
  11. 11. InTheLEx: Theory Revision ● Logic theory revision process – Given a new example ● No effect on the theory if – negative and not covered ● not predicted by the theory to belong to the concept – positive and covered ● predicted by the theory to belong to the concept ● In all the other cases, the theory needs to be revised – Positive example not covered → generalization of the theory – Negative example covered → specialization of the theory – Refinements (generalizations or specializations) must preserve correctness with respect to the entire set of currently available examples ● If no candidate refinement fulfills this requirement, the specific problematic example is stored as an exception
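The revision logic above reduces to a simple four-way dispatch. The following is a minimal sketch of that decision (function and return-value names are illustrative, not the actual InTheLEx API):

```python
# Sketch of the theory revision dispatch: given an example's label and
# whether the current theory covers it, decide which revision it triggers.
def revise_action(is_positive, covered):
    """Return the revision a new example triggers, if any."""
    if is_positive and covered:
        return "no_change"    # correctly predicted to belong to the concept
    if not is_positive and not covered:
        return "no_change"    # correctly predicted not to belong
    if is_positive:
        return "generalize"   # omission error: positive but not covered
    return "specialize"       # commission error: negative but covered
```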
  12. 12. InTheLEx: Generalization – Procedure Generalize (E: positive example, T: theory, M: negative examples); ● L := list of the rules in the definition of E's concept ● while not generalized and L ≠ ∅ do – Select from L a rule C for generalization – L' := generalize(C,E) (* list of generalizations *) – while not generalized and L' ≠ ∅ do ● Select next best generalization C' from L' ● if (T \ {C}) ∪ {C'} is consistent wrt M then ● Implement C' in T – Remove C from L ● if not generalized then – C' := E with constants turned into variables – if T ∪ {C'} is consistent wrt M then ● Implement C' in T – else ● Implement E in T as an exception
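In Python-like form, the procedure above might be sketched as follows. This is a hedged illustration: `select_order`, `generalize`, `consistent`, and `variabilize` are placeholders for the learner's actual operators, not InTheLEx's implementation:

```python
def generalize_theory(theory, rules, example, negatives,
                      select_order, generalize, consistent, variabilize):
    """Try to generalize one rule among `rules` (the components of the
    example's concept); fall back to a new rule, then to an exception."""
    for rule in select_order(rules, example):       # clause selection strategy
        for candidate in generalize(rule, example): # possibly incomparable generalizations
            revised = [r for r in theory if r != rule] + [candidate]
            if consistent(revised, negatives):      # must preserve correctness wrt M
                return revised
    new_rule = variabilize(example)                 # constants turned into variables
    if consistent(theory + [new_rule], negatives):
        return theory + [new_rule]                  # new disjunctive component
    return theory + [("exception", example)]        # store the example as an exception
```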
  13. 13. InTheLEx: Generalization – Comments ● Due to theoretical and implementation details, the generalization operator used might return several incomparable generalizations ● Implementation of the theoretical generalization operator would be computationally infeasible, even for relatively small rules – Similarity-based approximation ● Experiments have shown that it comes very close to, and often coincides with, least general generalizations ● First example of a new concept – First rule added for that concept ● Initial tentative conjunctive definition of the concept ● Conjunctive definition turns out to be insufficient – Second rule added for that concept ● The concept becomes disjunctive ● Subsequent addition of rules for that concept – Extend the ‘disjunctiveness’ of the concept
  14. 14. Clause Selection Strategy ● 5 strategies for determining the order in which the components of a disjunctive concept definition are to be considered for generalization – Each component in the ordering considered only after generalization attempts have failed on all previous components ● Initial components have more chances to be generalized – No direct connection between age and length of a rule ● Older rules might have had more chances of refinement ● Whether this means that they are also shorter (i.e., more general) mainly depends on – Ranking strategy – Specific examples that are encountered and on their order ● Not controllable in a real-world setting
  15. 15. Clause Selection Strategy ● O: Older elements first – Same order as they were added to the theory ● The most straightforward. A sort of baseline ● Static: position of each component in the processing order fixed when the component is added – Generalizations are monotonic (progressively remove constraints) ● Strategy expected to yield very refined (short, more human-readable and understandable) initial components and very raw (long) final ones – After several refinements, it is likely that initial components have reached a nearly final and quite stable form ● All attempts to further refine them are likely to fail ● The computational time spent in these attempts, albeit presumably not huge, will be wasted ● Runtime expected to grow as the life of the theory proceeds
  16. 16. Clause Selection Strategy ● N: Newer elements first – Reverse of the order in which they were added to the theory ● Also quite straightforward ● Static, but not obvious to foresee what the shape and evolution of the components will be – Immediately after the addition of a new component, it will undergo a generalization attempt at the first non-covered positive example ● ‘average’ level of generality in the definition expected to be less than in the previous option ● There should be no completely raw components in the definition ● Many chances that such an attempt is successful for any example, but the resulting generalization might leverage features that are not very related to the correct concept definition
  17. 17. Clause Selection Strategy ● L: Longer elements first – Decreasing number of conjuncts ● Specifically considers the level of refinement – Low variance in degree of refinement among components expected ● Can be considered as an evolution of N – Not just the most recently added component is favored for generalization ● The more conjuncts in a component, the more specialized the component – More room for generalization ● Avoid waste of time trying to generalize very refined rules that would hardly yield consistent generalizations ● On the other hand, generalizing a longer rule is expected to take more time than generalizing shorter ones
  18. 18. Clause Selection Strategy ● S: Shorter elements first – Increasing number of conjuncts ● Specifically considers the level of refinement – Opposite behavior than L ● May confirm the possible advantage of spending time in trying harder but more promising generalizations versus spending time in trying easier but less promising ones first ● Can be considered as an evolution of O – Tries to generalize first more refined components ● Largest variance in degree of refinement (number of conjuncts in rule premises) among components expected
  19. 19. Clause Selection Strategy ● ~: More similar elements first – Decreasing similarity with the uncovered example ● Only content-based strategy – Same similarity as in InTheLEx’s generalization operator ● Disjunctive components → different actualizations of a concept – Small intersection expected between the sets of examples covered by different components ● Similarity assessment may help identify the appropriate component to be generalized for a given example – Odd generalizations for mismatched component-example pairs ● Coverage and generalization problems ● Bad theory, inefficient refinements – One might expect that over-generalization is avoided – Generalization more easily computed, but overhead to compute the similarity ● Does the improvement compensate for the overhead?
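The five orderings (O, N, L, S, ~) can be summarized as sort keys over the concept's components. This is an illustrative sketch: the component fields (`age`, `conjuncts`) and the `similarity` callable are assumptions of mine, not the paper's data structures:

```python
def rank_components(components, strategy, example=None, similarity=None):
    """Order a concept's components for generalization attempts.
    Each component is a dict with 'age' (insertion order) and 'conjuncts'
    (the literals in the rule premise)."""
    keys = {
        "O": lambda c: c["age"],                  # older elements first
        "N": lambda c: -c["age"],                 # newer elements first
        "L": lambda c: -len(c["conjuncts"]),      # longer (more specialized) first
        "S": lambda c: len(c["conjuncts"]),       # shorter (more refined) first
        "~": lambda c: -similarity(c, example),   # more similar to the example first
    }
    return sorted(components, key=keys[strategy])
```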
  20. 20. Evaluation ● Real-world dataset: Scientific papers – 353 layout descriptions of first pages – 4 classes: Elsevier journals, SVLN, JMLR, MLJ ● Classification: learn definitions for these classes ● Understanding: the significant components in the papers – Title, Author, Abstract, Keywords ● First-Order Logic representation needed to express spatial relationships among the page components ● Complex dataset – Some layout styles are quite similar – 67920 atoms in observations ● avg per observation >192 atoms, some >400 atoms – Much indeterminacy ● Membership relation of layout components to pages
  21. 21. Evaluation: Parameters ● Qualitative – # comp in the disjunctive concept definition ● Fewer components yield a more compact theory, which does not necessarily provide for greater accuracy – avg length of components ● More conjuncts, more specific (and less refined) concept – # exc negative exceptions ● More exceptions, worse theory ● Quantitative – acc (accuracy) ● Prediction capabilities of the theory on test examples – time needed to carry out the learning task ● Efficiency (computational cost) for the different ranking strategies
  22. 22. Evaluation ● Incremental approach justified – New instances of documents continuously available in time ● Comparison of all strategies – Random not tried ● O or N are somehow random – Just append definitions as they are generated, without any insight ● 10-fold cross-validation procedure – Classification task used to check in detail the behavior of the different strategies – Understanding task used to assess the statistical significance of the difference in performance between different strategies
  23. 23. Evaluation: Classification ● Useful results only for MLJ and SVLN – Elsevier and JMLR: always single-rule definitions ● Ranking approach not applicable – Examples for Elsevier and JMLR still played a role as negative examples for the other classes – Some expectations confirmed ● Runtime: N always best (often the first generalization attempt succeeds); ~ worst (need for computing the similarity of each component and the example) ● Number of exceptions: ~ always best (improved selection of components for generalization); S also good ● Aggregated indicator: N always wins – Sum of ranking positions for the different parameters (the smaller, the better)
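The aggregated indicator mentioned above sums each strategy's ranking position over the individual parameters, the smaller total winning. A minimal sketch (the rankings in the test are made up for illustration, not the paper's measured results):

```python
def aggregate_rank(rankings):
    """rankings: parameter -> list of strategies ordered best to worst.
    Returns strategy -> sum of ranking positions (smaller is better)."""
    totals = {}
    for order in rankings.values():
        for position, strategy in enumerate(order, start=1):
            totals[strategy] = totals.get(strategy, 0) + position
    return totals
```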
  24. 24. Evaluation: Classification
  25. 25. Evaluation: Classification – Rest of the behavior somewhat mixed ● SVLN: much less variance than in MLJ – Definition quite clear and examples quite significant? ● MLJ: Impact of the ranking strategies much clearer – Substantial agreement between quality-related indicators: for each approach, either all tend to be good (as in ~) or all tend to be bad (as in S and O) ● Interesting indications from single folds – 3: peak in runtime and number of exceptions for S and O ● runtime a consequence of the unsuccessful search for specializations, which in turn may have some connection with the quality of the theory ● S is a kind of evolution of O – 8: accuracy increases from 82% to 94% for ~ and L ● Content-based approach improves quality of the theory in difficult situations
  26. 26. Evaluation: Understanding ● Strategy on the row equal (=), better (+) or worse (–) than the one on the column ● Best strategy in bold
  27. 27. Evaluation: Understanding ● Quite different, albeit run on the same data – When the difference is significant, L and ~ in general behave analogously for accuracy (which is greater), number of components and number of negative exceptions. As expected, longer runtime for ~ – Also O and S have in general an analogous behavior, but do not reach as good results as L and ~ – N not outstanding for performance, but better than O (the baseline) for all parameters except runtime – On the classification task, average length of components using L significantly larger than all the others ● Returns more balanced components, with greater accuracy and fewer negative exceptions
  28. 28. Conclusions & Future Work ● Disjunctive concept definitions tricky – Each component covers a subset of positive examples, ensuring consistency with all negative examples ● In incremental learning, when a new positive example is not recognized by the current theory, one component must be generalized – Omission error (no specific component responsible) ● The system must decide the order in which the elements are to be considered for trying a generalization – 5 strategies proposed ● The outcomes confirm some of the expectations for the various strategies, but we need – More extensive experimentation to have confirmations and additional details – Identification of further strategies and refinement of the proposed ones