Slides were presented at Terminology and Knowledge Engineering (TKE) 2010.

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,389
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
28
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Generating Lexical Information for Terminology in a Bioinformatics Ontology

  1. Generating Lexical Information for Terminology in a Bioinformatics Ontology
     Hammad Afzal (1,3), Paul Buitelaar (1), Philipp Cimiano (2), John McCrae (2), Tobias Wunner (1)
     (1) Unit for Natural Language Processing, Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
     (2) Semantic Computing Group, Center of Excellence (CITEC), Bielefeld University, Bielefeld, Germany
     (3) Department of Computer Science, College of Telecommunication Engineering, National University of Sciences and Technology, Pakistan
  2. Motivation
     • Lack of linguistic expressiveness in formally specified ontologies
       – Typically developed to provide a shared view of a domain's knowledge.
       – Do not necessarily support natural language processing (NLP) tasks.
     • Solutions:
       – Terminologies that include linguistic information to facilitate using ontologies for text processing, e.g. the Specialist Lexicon contains lexical variants of many terms used in the biomedical domain.
       – The Simple Knowledge Organization System (SKOS) format provides a standard way to represent knowledge organization systems using the Resource Description Framework (RDF).
     • Limitations:
       – SKOS provides a data model for representing classification schemes such as thesauri by introducing a further typology of labels (preferred, alternative, hidden, etc.); it is not intended to associate more sophisticated lexical and linguistic information with an arbitrary ontology.
  3. Desiderata for Ontology-Lexicon model
     • Separation between linguistic and ontological level
       – Develop lexica independently of specific ontologies for the same domain
       – Allow different lexica for each ontology
     • Independence between linguistic and ontological level
       – No mutual constraints
       – Ontological structures/concepts do not need to have a corresponding representation of linguistic structure and vice versa
     • Detailed information on linguistic realization
       – Part of speech, morphology (inflection, decomposition), syntactic structure (subcategorization frames), etc.
     • Support for multi-linguality
  4. Towards our approach: LexInfo
     • Recent principled approaches to associate linguistic information to an arbitrary ontology:
       – LingInfo: modeling morpho-syntactic decomposition of (complex) terms [Buitelaar et al. 2006]
       – LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]
       – Lexical Markup Framework (LMF): ISO standardized model for representing machine readable lexica (agnostic about connection with ontology) [Francopoulo et al. 2007]
       – LexInfo: building on LMF as a core, develop a model which “subsumes” LingInfo and LexOnto for flexibly associating linguistic information to ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]
  5. Case Study: Lexicalizing a bioinformatics ontology
     • Creating a LexInfo-based lexicon for lexical enrichment of a bioinformatics ontology, i.e. the myGrid ontology (Wolstencroft et al., 2007).
     • Lexical information is derived from semantic lexicons such as WordNet (Fellbaum, 1998) and a domain-related corpus.
     • Key points:
       – The capture of morpho-syntactic behaviour such as part-of-speech (POS), decomposition, lemmatization and subcategorization behaviour of lexical elements.
       – The lexicalized terms, along with their linguistic information, are added to the OWL-based lexicon based on the LexInfo model.
  6. Case Study: Lexicalizing a bioinformatics ontology
     • myGrid Ontology
       – Supports service description of bioinformatics resources through service annotation.
       – Manual annotation is a slow process: e.g. in Taverna/Feta only ~15-20% of services are functionally described. The result is a growing backlog of un-annotated services.
       – Several NLP-based attempts to automate service descriptions using the myGrid ontology have been reported.
       – Lexicalization of the myGrid ontology can improve the performance of such approaches.
  7. Case Study: Lexicalizing a bioinformatics ontology
     • LexInfo
       – A principled way to enrich ontologies with linguistic information.
       – Provides a framework for automatic construction of 'lexicalized ontologies' on top of existing ontologies and lexical resources (Buitelaar et al., 2009).
     • Main characteristics:
       – Two separate domains of discourse, kept apart by using different namespaces: the domain ontology and the LexInfo model.
       – The domain ontology defines the classes, properties and individuals in that domain.
       – The main entities in the lexical domain of discourse are instances of the class LexicalEntry.
       – LexInfo attaches lexical information (e.g. part-of-speech, morphology, subcategorization) to lexical entries.
  8. Rest of the talk
     • Methodology
       – Dual approach towards lexicalization of myGrid ontology
       – Collection of Bioinformatics Corpus
       – Lexicalization of Class Labels
       – Lexicalization of Property Labels
     • Statistics, Experiments and Results
       – Semi-automatically created lexicon
       – Automatically generated lexicon
     • What’s Next
  9. Methodology - I
     • Dual approach towards lexicalization of myGrid ontology
       – Semi-automatically created LexInfo-based lexicon.
       – Automatically created lexicon using the LexInfo ontology lexicalization service.
     • Difference:
       – In the semi-automatically created lexicon, the linguistic information is mainly derived from the domain corpus and manually analyzed to verify correctness.
       – In automatic generation, a generic POS tagger and domain-independent lexical resources are used to derive morpho-syntactic behaviour on the basis of an automatic analysis of the labels of the concepts, properties and individuals in the ontology.
  10. Methodology - II
     • Collection of Bioinformatics Corpus
       – Domain-specific behaviour (linguistic information) of the lexical entries is derived from 2691 full-text journal articles of BMC Bioinformatics.
       – The GeniaTagger is used to get POS information; the tags of interest are Nouns, Proper Nouns, Verbs and Adjectives.
       – Syntactic information is derived using the Stanford parser.
       – Currently, we have worked only on the syntactic behaviour of properties (owl:ObjectProperty and owl:DatatypeProperty in particular) and not of classes.
  11. Methodology - III
     • Lexicalization of Class Labels (Step-wise approach)
       1. A LexicalEntry is created for each Class (in the domain ontology) and is linked to the Class through the hasSense property.
       2. The LexicalEntry is initialized as one of its sub-classes (e.g. Noun, Verb, Adjective, etc.).
       3. The POS tag is derived from a semantic lexicon such as WordNet and further supported by the associated domain corpus.
       4. The lexical form (Lemma, WordForm, etc.) is attached to the lexical entries through the corresponding relation: hasLemma or hasWordForm.
  12. Methodology - III
     • Lexicalization of Class Labels (Single Word)
       – The linking of a LexicalEntry with a domain Class, and the attachment of grammatical information and a lemma to the LexicalEntry.
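The four steps for single-word class labels can be sketched in a few lines of Python. This is a minimal illustration only, not the LexInfo API: the `LexicalEntry` dataclass, the `TOY_POS` lookup (a stand-in for WordNet plus the domain corpus) and the fallback to Noun are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """Hypothetical in-memory stand-in for a LexInfo LexicalEntry."""
    pos: str      # the LexicalEntry sub-class, e.g. "Noun"
    lemma: str    # value attached via hasLemma
    sense: str    # domain Class linked via hasSense
    word_forms: list = field(default_factory=list)  # attached via hasWordForm


# Toy POS lookup standing in for WordNet and the domain corpus (assumption).
TOY_POS = {"alignment": "Noun", "multiple": "Adjective", "perform": "Verb"}


def lexicalize_single_word(class_name, pos_lookup=TOY_POS):
    """Create a LexicalEntry for a single-word class label."""
    lemma = class_name.lower()
    # Assumption: fall back to Noun when the lookup has no entry.
    pos = pos_lookup.get(lemma, "Noun")
    return LexicalEntry(pos=pos, lemma=lemma, sense=class_name,
                        word_forms=[class_name])


entry = lexicalize_single_word("Alignment")
print(entry)
```

In the real model these links are OWL property assertions between individuals in two namespaces; the dataclass only mirrors their shape.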
  13. Methodology - III
     • Lexicalization of Class Labels (Multi-Word)
       – LexInfo associates a LexicalEntry with a ListOfComponents, which holds an ordered list of Components; its size is given as a DataProperty of the ListOfComponents.
       – Each of the Components is linked with a LexicalEntry.
       – A Component counts as a legitimate LexicalEntry if it is present in the myGrid ontology as a separate entity or occurs substantively in the domain corpus.
  14. Methodology - III
     • Lexicalization of Class Labels (Multi-Word)
       – An example of morphological decomposition of a multi-word class label (from the myGrid ontology).
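The decomposition of a multi-word label into an ordered ListOfComponents might look roughly like the sketch below. It assumes labels are underscore-separated (as in Sequence_similarity_Search) and uses a plain set, `known_entries`, as a hypothetical stand-in for the check against the ontology and the domain corpus; the class and function names are illustrative, not taken from the LexInfo API.

```python
from dataclasses import dataclass


@dataclass
class Component:
    position: int   # 1-based order within the label
    entry: str      # lemma of the LexicalEntry this component links to


@dataclass
class ListOfComponents:
    components: list
    size: int       # mirrors the size DataProperty described in the slides


def decompose(label, known_entries):
    """Split an underscore-separated label into ordered components.

    known_entries is a hypothetical set of lemmas that are accepted as
    legitimate LexicalEntries (found in the ontology or the corpus).
    """
    tokens = [tok for tok in label.split("_") if tok]
    components = []
    for position, tok in enumerate(tokens, start=1):
        lemma = tok.lower()
        if lemma not in known_entries:
            raise ValueError(f"no LexicalEntry evidence for component {tok!r}")
        components.append(Component(position, lemma))
    return ListOfComponents(components, len(components))


known = {"sequence", "similarity", "search"}
loc = decompose("Sequence_similarity_Search", known)
print(loc.size, [c.entry for c in loc.components])
```

Raising on an unknown component is one possible design choice; a semi-automatic workflow could instead queue such labels for manual review.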
  15. Methodology - IV
     • Lexicalization of Property Labels (Steps)
       – Morphological decomposition as well as syntactic analysis of the property label is performed.
       – The property labels are automatically tokenized, and the tokens are then linked with LexicalEntries (as for Classes).
       – On the syntactic level, the tokens are analyzed to attach their respective syntactic behaviour, which is then linked with subcategorization frames.
       – The LexInfo model provides various specializations of subcategorization frames such as Transitive, TransitivePP, IntransitivePP, AdjectiveNP, NounPP and Noun2PP.
       – The syntactic arguments (Subject, Object, PObject, etc.) linked with the LexicalEntry are mapped to the semantic arguments (Domain, Range, RangeOfProperty) of the corresponding object property.
  16. Methodology - IV
     • Lexicalization of Property Labels
       – In automatic lexicon generation, the lexical entries are derived automatically by processing the labels in the ontology using the LILAC grammar.
       – LILAC production rules state part-of-speech patterns that apply to the label. For example, a label with the structure “N Prep” gives rise to a lexicon entry of type “NounPP”.
       – Currently, LexInfo uses 73 rules to generate lexicons automatically (further details on the LexInfo homepage).
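The idea of mapping a part-of-speech pattern to a subcategorization frame can be illustrated with a small lookup table. This is not LILAC syntax and not the actual rule set: only the “N Prep” to NounPP rule comes from the slides; the other two rules, the toy tag dictionary and the decision to skip auxiliaries such as "is" are assumptions made for the sketch.

```python
# Pattern -> frame. Only ("N", "Prep") -> "NounPP" is stated in the slides;
# the remaining rules are illustrative assumptions.
RULES = {
    ("N", "Prep"): "NounPP",
    ("V",): "Transitive",
    ("V", "Prep"): "IntransitivePP",
}

# Toy tagger output standing in for a real POS tagger (assumption).
TOY_TAGS = {
    "is": "Aux", "has": "Aux",
    "part": "N", "identifier": "N",
    "of": "Prep", "by": "Prep",
    "produces": "V", "produced": "V",
}


def frame_for(label):
    """Return the frame whose POS pattern matches the label, or None."""
    tags = []
    for tok in label.lower().split("_"):
        tag = TOY_TAGS.get(tok)
        if tag is None:
            return None          # unknown token: no rule can apply
        if tag != "Aux":         # assumption: auxiliaries are ignored
            tags.append(tag)
    return RULES.get(tuple(tags))


for label in ("is_part_of", "produces", "produced_by"):
    print(label, "->", frame_for(label))
```

With these toy rules, produced_by comes out as IntransitivePP, which matches the behaviour the later discussion slide reports as an error of the automatic approach.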
  17. Methodology - IV
     • Lexicalization of Property Labels
       – Lexicalization of the ObjectProperty produces.
  18. Statistics - I
     • Some statistics about the myGrid ontology

       Ontology Constructs                       Total Number of Occurrences
       Classes                                   475
         Single word class labels                 88
         Two word class labels                   200
         Three or more word class labels         187
       ObjectProperties                            8
         Single word property labels               1
         Two word property labels                  4
         Three or more word property labels        3
       DataProperties                              0
       Individuals                                 0
  19. Statistics - II
     • Semi-automatically generated LexInfo-based lexicon of the myGrid ontology.

       LexInfo Constructs    Specialized Constructs   Example Labels                   Number of Entries in ‘myGrid Lexicon’
       LexicalEntries        Adjective                Multiple                           21
                             Noun                     Alignment                         752
                             Proper Noun              Medline                           253
                             Verb                     Perform                             4
                             NounPhrase               Sequence_similarity_Search        369
                             AdjectivePhrase          Tertiary_Structure_Prediction      16
                             VerbPhrase               Performs_task                       1
       Written-Form                                                                    1044
       List-of-Components                                                               387
       Syntactic-Behaviour   Transitive               produces                            4
                             NounPP                   is_part_of                          4
  20. Statistics - III
     • Statistics about the automatically generated LexInfo-based lexicon of the myGrid ontology using the LexInfo lexicon generation service.

       LexInfo Constructs    Specialized Constructs   Example Labels                   # of Entries in ‘myGrid Lexicon’
       LexicalEntry          Adjective                local                             131
                             Noun                     Record                            973
                             Proper Noun              Maize                              15
                             Verb                     Perform                            19
                             NounPhrase               Genotype-phenotype-database      1069
                             ProperNounPhrase         UniProt                             1
                             VerbPhrase                                                   0
       List-of-Components                                                              1071
       Syntactic-Behaviour   Transitive               produces                            3
                             NounPP                   is_part_of                          4
                             IntransitivePP           produced_by                         1
  21. Discussion
     Semi-automatically created lexicon: lexicalization of Classes
     • Most of the LexicalEntries are of type Noun, NounPhrase and ProperNoun.
     • Not many Verb occurrences.
       – Class labels are mostly named using nouns, whereas object properties are typically named using verbs.
       – The small number of ObjectProperties (8 properties) resulted in a small number of verbs in the lexicon.
     • The number of Proper Nouns is 253, 32 of which are created from single-word Class names.
     • 387 ListOfComponents are created from the 387 multi-word class names in the myGrid ontology; 371 of them correspond to NounPhrases and 16 are AdjectivePhrases.
  22. Discussion
     Semi-automatically created lexicon: lexicalization of ObjectProperties
     • is_identifier_of and is_part_of are lexicalized as Nouns (identifier and part).
       – The SyntacticBehavior is linked to a subcategorization frame of type NounPP (Noun: identifier, Prep: of and Noun: part, Prep: of).
     • performs_task and task_performed_by are lexicalized as Verb (perform).
       – The SyntacticBehavior is linked to a subcategorization frame of type Transitive.
       – The two properties are inverses of each other and are lexicalized using the same verb; however, the mapping of syntactic arguments to domain and range is inverted between the two cases.
     • produces and produced_by are lexicalized as Verb (produce).
     • performs_task is recognized as a VerbPhrase with performs as a Verb and a Transitive subcategorization frame linked with it.
     • The syntactic behaviours of has_identifier and has_part are also modeled as NounPP.
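The inverted argument mapping for a pair of inverse properties can be made concrete with a short sketch. The slides only say that the mapping is inverted between performs_task and task_performed_by; which syntactic argument goes to Domain in the base case is an assumption here (Subject to Domain, Object to Range), and the dictionary layout is purely illustrative.

```python
# Assumed base mapping for a Transitive frame: Subject -> Domain, Object -> Range.
BASE_MAPPING = {"Subject": "Domain", "Object": "Range"}


def argument_mapping(inverse=False):
    """Map syntactic arguments to semantic arguments.

    For the inverse property, Domain and Range are swapped while the
    verb and the Transitive frame stay the same.
    """
    if not inverse:
        return dict(BASE_MAPPING)
    flip = {"Domain": "Range", "Range": "Domain"}
    return {syn: flip[sem] for syn, sem in BASE_MAPPING.items()}


performs_task = {"verb": "perform", "frame": "Transitive",
                 "args": argument_mapping()}
task_performed_by = {"verb": "perform", "frame": "Transitive",
                     "args": argument_mapping(inverse=True)}

print(performs_task["args"])
print(task_performed_by["args"])
```

Both entries share one verb and one frame, so only the argument mapping distinguishes the two properties.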
  23. Discussion
     Automatically generated lexicon using the LexInfo service: lexicalization of Classes (differences from the semi-automatically created lexicon)
     • The number of Adjectives has increased significantly to 131, and the number of ProperNouns has dropped sharply to 15.
       – The reason is that ProperNouns are incorrectly identified as Adjectives by our POS tagger (Stanford Tagger), e.g. DDBJ in DDBJ_Amino_Acid_Database and PIRSF in PIRSF_report are recognized as Adjectives.
       – This problem can be resolved by using domain corpora, or by consulting a domain thesaurus or dictionary.
     • The number of Verbs has increased to 19.
       – Again due to a POS tag error: gerunds such as “manipulating” and “predicting” are incorrectly identified as Verbs.
     • The identification of a ProperNounPhrase is incorrect due to a tokenization error.
       – “UniProt” is tokenized as two proper nouns, “uni” and “prot”, although it is a single word, i.e. the name of a bioinformatics database.
       – This can also be resolved using a domain corpus or thesaurus.
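One way to read the suggested fix is as a post-processing step that overrides the generic tagger's output with a domain name list. The sketch below assumes such a list exists; `DOMAIN_PROPER_NOUNS` is a hypothetical stand-in filled with the three names mentioned in the slides, and the tag strings are illustrative rather than the Stanford Tagger's actual tag set.

```python
# Hypothetical domain name list; a real one would come from a domain
# thesaurus, dictionary or corpus.
DOMAIN_PROPER_NOUNS = {"DDBJ", "PIRSF", "UniProt"}


def correct_tags(tagged_tokens, domain_names=DOMAIN_PROPER_NOUNS):
    """Override the tag of any token that is a known domain name."""
    return [
        (tok, "ProperNoun" if tok in domain_names else tag)
        for tok, tag in tagged_tokens
    ]


# Toy tagger output reproducing the kind of error reported in the slides.
tagged = [("DDBJ", "Adjective"), ("Amino", "Noun"),
          ("Acid", "Noun"), ("Database", "Noun")]
print(correct_tags(tagged))
```

The same lookup could be applied before tokenization so that a name like UniProt is kept as a single token.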
  24. Discussion
     Automatically generated lexicon using the LexInfo service: lexicalization of ObjectProperties
     • ObjectProperties are mostly lexicalized correctly.
     • The only error is in the lexicalization of “produced_by”, which is recognized as IntransitivePP. This is caused by an error in the ontology lexicalization (LILAC) rules, which treat a past-participle verb followed by “by” as an occurrence of IntransitivePP.
  25. Implementation
     • Initial implementation of the LexInfo model as an API
       – Univ. Bielefeld
       – DERI, National Univ. of Ireland, Galway
       – https://lexinfo.googlecode.com/svn
  26. Future Work
     • Linguistically enriched ontology for improvement of service annotation
       – The linguistically enriched lexicon associated with the myGrid ontology can improve the performance of literature-based approaches for automatic annotation of bioinformatics web services.
     • Optimization of the LexInfo model by including WordNet etc.
       – Generate all possible lexicalizations of given ontological constructs by utilizing Synsets from WordNet, and extract semantically similar verbs from VerbNet and FrameNet.
     • The LexInfo API is currently under development
       – It allows the creation, management and serialization of ontology lexica according to the LexInfo model. An early prototype of a lexicon generation service based on the LexInfo model is also available at: http://code.google.com/p/lexinfo/
  27. Acknowledgments
     • Supported in part by the European Union under Grant No. 248458 for the Monnet project as well as by the Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
     • Thanks to Thomas Wangler, Michael Sintek and Matthias Mantel for their valuable contributions in designing the LexInfo model and developing the LexInfo API.
  28. References
     • Afzal, H., Stevens, R. and Nenadic, G. Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature. In Proceedings of the 6th European Semantic Web Conference (ESWC 2009), LNCS 5554, Springer-Verlag: 535-549.
     • Afzal, H., Stevens, R. and Nenadic, G. Towards Semantic Annotation of Bioinformatics Services: Building a Controlled Vocabulary. In Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008): 5-12.
     • Buitelaar, P., Declerck, T., Frank, A., Racioppa, S., Kiesel, M., Sintek, M., Engel, R., Romanelli, M., Sonntag, D., Loos, B., Micelli, V., Porzel, R. and Cimiano, P. LingInfo: Design and Applications of a Model for the Integration of Linguistic Information in Ontologies. In Proceedings of OntoLex06, a workshop at LREC, Genoa, Italy.
     • Buitelaar, P., Cimiano, P., Haase, P. and Sintek, M. Towards Linguistically Grounded Ontologies. In Proceedings of the 6th European Semantic Web Conference (ESWC 2009), Lecture Notes in Computer Science, Springer 2009.
     • Cimiano, P., Haase, P., Herold, M., Mantel, M. and Buitelaar, P. LexOnto: A model for ontology lexicons for ontology-based NLP. In Proceedings of the OntoLex (From Text to Knowledge: The Lexicon/Ontology Interface) workshop at ISWC07 (International Semantic Web Conference).
     • Francopoulo, G., Bel, N., George, M., Calzolari, N., Monachini, M., Pet, M. and Soria, C. Lexical markup framework: ISO standard for semantic information in NLP lexicons. In Proceedings of the Workshop of the GLDV Working Group on Lexicography at the Biennial Spring Conference of the GLDV.
  29. Resources Used
     • BMC Bioinformatics: http://www.biomedcentral.com/bmcbioinformatics/
     • Genia Tagger: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
     • Stanford Parser: http://nlp.stanford.edu/downloads/lex-parser.shtml
     • Stanford Tagger: http://nlp.stanford.edu/software/tagger.shtml
     • TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/