Issues in Learning an Ontology from Text


Talk at bio-ontologies SIG at ISMB Toronto, 2008

  • copulation --> grandfather copulation, cannibalism copulation, harassment copulation, inferred copulation, long copulation, palp copulation, elements copulation, behavioural elements copulation, face to copulation
  • Issues in Learning an Ontology from Text

    1. Issues in Learning an Ontology from Text • Christopher Brewster, Simon Jupp, Joanne Luciano, David Shotton, Robert Stevens, and Ziqi Zhang
    2. The Use Case: Animal Behaviour • Animal behaviour community recognises the need for an ontology, e.g. for video annotation/retrieval • The community created an “Animal Behaviour Ontology” - 339 terms • Can we (semi-)automatically build from text?
    3. Some Questions • Do we get a “good ontology”? • If not, is it useful? • Is it low-effort? • Should the result be “tidied up” or used as a donor?
    4. Methodology: Dataset • Journal “Animal Behaviour” from Elsevier • 623 articles from Vol 71 (2006) - Vol 74 (2007) • 2.2 million words • Various formats - most usefully XML
    5. We Want an Ontology of Green • An ontology of “animal behaviours” • Not an ontology of the corpus • We want the green terms in the ontology
    6. Processing Steps (1) 1. Text extracted from XML - excluding affiliations, acknowledgements, bibliography except for title etc. 2. Noise removed - person names, animal names, place names 3. Lemmatiser used to reduce data sparsity 4. Term extraction applied
    7. Processing Steps (2) 5. Term selection: regular expression used to select terms ending in behaviour, display, construction, inspection plus generic -ing, -ism, etc. 6a. Build hierarchies using string inclusion 6b. Top-level terms filtered using “Hearst Patterns” to test if X ISA behaviour/activity/etc. • Example candidate terms: Walking, Running, Jumping, Hunting, Pecking, Reed Bunting, Corn Bunting, Herring, Courtship, Studentship, Cannibalism, Dimorphism
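    8. Applying String Inclusion Rules to Terms • Schema: C is the parent of BC and AC; BC is the parent of ABC • Example: Selection is the parent of Mate Selection and Natural Selection; Mate Selection is the parent of Female Mate Selection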
    9. Lexico-Syntactic Patterns • X such as P, Q, R; X is a Y • Grooming is a behaviour • Copulation is an activity • Dimorphism is a behaviour • Calls such as trills, whistles, grunts
    10. Results • 64,000 terms extracted • The regexp selected 10,335 terms • Step 6a resulted in an ontology with 17,776 classes and 1,295 top-level classes • Step 6b resulted in an ontology with 13,058 classes and 912 top-level classes
    11. Results (2) - Copulation Sub-tree
    12. Results (3) • Evaluation of terms excluded by regexp: • 56,000 terms excluded • Random sample of 3,140 terms evaluated by hand • 7 verbs and 42 nouns should not have been excluded • E.g., “interaction” • A recall of .905
    13. Discussion: The problem of focus
    14. Other Issues • More a vocabulary than an ontology • SKOS-like rather than OWL-like • Can deal with “selection”, “mate selection” and “natural selection” • Highly compositional terms, e.g. “Adult male grooming behaviour” • Cleanish list of top-level terms: cannibalism, copulation, eating, foraging, fighting, grooming
    15. Discussion: Is it useful? • Answers: No, yes, yes, donor • Useful ontological fragments • Bringing ontology to ontology learning is the research challenge • Limitations: noise; the problem of focus; only taxonomic relations • Advantages: speed; ease; a step towards formal ontologies
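
The term selection, string inclusion and lexico-syntactic filtering described on slides 7 to 9 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the slides do not give the actual regular expression, the full suffix list, or the exact Hearst patterns, so the pattern strings, the function names (select_terms, build_hierarchy, is_behaviour, filter_top_level) and the sample sentences below are all assumptions made for illustration.

```python
import re

# Suffixes named on slide 7. The real expression is not shown in the talk,
# so this pattern is an assumption.
SUFFIX_RE = re.compile(r"(?:behaviour|display|construction|inspection|ing|ism)$")


def select_terms(terms):
    """Term selection: keep terms that end in one of the listed suffixes."""
    return [t for t in terms if SUFFIX_RE.search(t.lower())]


def build_hierarchy(terms):
    """String inclusion: map each term to its parent, or None if top level.

    The parent is the longest known term obtained by dropping leading words,
    e.g. "female mate selection" -> "mate selection" -> "selection".
    """
    term_set = set(terms)
    parent = {}
    for term in term_set:
        words = term.split()
        parent[term] = None
        # Drop leading words one at a time; the first hit is the longest suffix.
        for i in range(1, len(words)):
            candidate = " ".join(words[i:])
            if candidate in term_set:
                parent[term] = candidate
                break
    return parent


def is_behaviour(term, sentences):
    """Hearst-style test: does the corpus say the term ISA behaviour/activity?"""
    t = re.escape(term)
    is_a = re.compile(rf"\b{t} is an? (?:behaviour|activity)\b", re.IGNORECASE)
    such_as = re.compile(
        rf"\b(?:behaviours?|activity|activities) such as [^.;]*\b{t}\b",
        re.IGNORECASE,
    )
    return any(is_a.search(s) or such_as.search(s) for s in sentences)


def filter_top_level(parent, sentences):
    """Keep only the top-level terms that pass the Hearst-style test."""
    top_level = [t for t, p in parent.items() if p is None]
    return sorted(t for t in top_level if is_behaviour(t, sentences))


if __name__ == "__main__":
    # Toy input; the sentences are invented for the example.
    terms = ["grooming", "male grooming", "adult male grooming",
             "reed bunting", "cannibalism", "nest site"]
    sentences = [
        "Grooming is a behaviour.",
        "Behaviours such as cannibalism and fighting were observed.",
        "The reed bunting is a small bird.",
    ]
    selected = select_terms(terms)
    hierarchy = build_hierarchy(selected)
    print(hierarchy)
    print(filter_top_level(hierarchy, sentences))
```

In this toy run "reed bunting" passes the suffix test because it ends in -ing, but it is dropped at the filtering stage because no sentence asserts that it is a behaviour or an activity. That is the kind of noise the talk describes the Hearst-pattern step as targeting.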