EFO tools - the good, the great, and the evil


An overview of some of the tooling used in maintaining the Experimental Factor Ontology

Published in: Technology, Education
  • Experimental Factor Ontology is a great application ontology, hugely popular among internal and external collaborators and featured among the top 10 most accessed ontologies within NCBO BioPortal, which provides access to hundreds of different ontology resources. It is a pleasure to be involved in this project.
  • I joined the EFO team around January 2008, working in parallel on GEN2PHEN, to which some of this work was fed back. My first task was designing and implementing a workflow for pulling in metadata (synonyms and definitions) for xrefed terms from external ontologies. We now have nearly 5,000 classes and 20,000 synonyms, and there is steady, continuing growth.
  • Venn diagram representing who edited or added which class. Where the regions overlap, the same class was touched by more than one person. Three people interact directly with the ontology: Helen, James and I. Ele and Jie would submit large term requests, so they added those classes indirectly through any of us.
  • This is how we’re leveraging all the rich metadata within the ontology. Here is an example of querying ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) for CML, and getting back all experiments also annotated with chronic myeloid leukemia and chronic myelogenous leukemia. Querying for leukemia or blood cancer would also give you these results. Anything inconsistent in the ontology would negatively influence this outcome.
  • Here’s a typical workflow. Annotations unmapped to EFO in the Gene Expression Atlas (http://www.ebi.ac.uk/gxa/) are discovered by Zooma (zooma.sf.net). Zooma first verifies whether a pre-existing mapping already exists within the Atlas; if not, it tries to map the annotation to EFO or to other ontologies in OLS and BioPortal via OntoCAT. The output is then fed into the similarity_match.pl script to double-check that no similar terms are already in EFO (as Zooma performs only exact matching), and the vetted terms are finally added to EFO via James’ tab_to_owl script or manually. Another source of new terms is external user requests. Users usually supply a flat list of terms they would like to see in the ontology; these are mapped via similarity_match.pl to check whether they are already in EFO, and then added. similarity_match.pl has dedicated custom dependencies for parsing OWL ontologies and MeSH.
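The exact-match-then-fuzzy ordering of this pipeline can be sketched as follows. This is purely illustrative: the function name `map_annotation`, the dictionary shapes, and the use of Python's difflib in place of the real Zooma/similarity_match.pl matching are all assumptions, not the actual implementation.

```python
# Illustrative sketch of the mapping workflow; names and data shapes are hypothetical.
import difflib

def map_annotation(annotation, existing_mappings, efo_labels, cutoff=0.8):
    """Exact match first (as Zooma does), then a fuzzy pass so near-identical
    terms already in EFO are not re-added as duplicates."""
    key = annotation.strip().lower()
    if key in existing_mappings:                  # pre-existing Atlas mapping
        return existing_mappings[key]
    close = difflib.get_close_matches(key, efo_labels, n=1, cutoff=cutoff)
    if close:                                     # a similar term is already in EFO
        return close[0]
    return None                                   # candidate for a new EFO term

mappings = {"cml": "chronic myeloid leukemia"}
labels = ["chronic myeloid leukemia", "breast carcinoma"]
print(map_annotation("CML", mappings, labels))                # exact hit
print(map_annotation("breast carcinomas", mappings, labels))  # fuzzy hit
```

A `None` result corresponds to the vetted terms that finally get added to EFO via tab_to_owl or manually.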
  • Before metadata from external resources can be imported into EFO we need to add appropriate xrefs. These are stored in a dedicated annotation ‘definition_citation’ on the mapped term within EFO. The xrefs are discovered by using similarity_match.pl to align other ontologies (e.g. MeSH, OMIM, NCI Thesaurus, Brenda, Cell Type, etc.) lexically to EFO. Note that other tools exist in this domain that rely on information content to align ontologies. As far as I know they use exact matching only, so our approach could in fact be more efficient, and in my experience the information-content approach does not add much value to the alignment.
  • Once we have the xrefs in, we can use a separate application, BioportalImporter, which follows all the xrefs to the respective external terms via BioPortal and imports any missing synonyms and definitions into EFO, recording the source in a dedicated ‘bioportal_provenance’ annotation. With OWL 2 it would also be possible to annotate the annotation directly.
  • Part of the BioportalImporter code base is consistency checking, which performs 13 different tests once the import is completed. Most importantly, it reports any changes in external resources by cross-referencing provenance information between two versions of the import, and also alerts on potentially duplicated terms by checking for shared metadata between two distinct terms within EFO. BioportalImporter is not in the public domain as it is tied quite heavily to EFO specifics, but most of the ontology-handling code is actually in OntoCAT. Overview of the tests: malformed EFO URIs; changed ontology annotations; changed classes; obsoleted classes; renamed classes; duplicated xrefs; duplicated synonyms or labels; duplicated xrefs same as URI; local EFO URIs on external classes; changed features; changed external classes; circular references; non-English characters in annotations.
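One of those tests, flagging potentially duplicated terms via shared metadata, can be sketched in a few lines. The function name and the input shape (class URI mapped to a set of labels/synonyms) are hypothetical; the real checks live inside BioportalImporter.

```python
# Simplified sketch of the duplicated-synonyms-or-labels consistency test.
from collections import defaultdict

def duplicated_metadata(classes):
    """Flag any label or synonym shared by more than one class.
    `classes` maps a class URI to its set of labels/synonyms (hypothetical shape)."""
    seen = defaultdict(set)
    for uri, terms in classes.items():
        for term in terms:
            seen[term.lower()].add(uri)
    # keep only terms attached to two or more distinct classes
    return {term: uris for term, uris in seen.items() if len(uris) > 1}

classes = {
    "EFO_0000001": {"leukemia", "leukaemia"},
    "EFO_0000002": {"blood cancer", "leukemia"},
}
print(sorted(duplicated_metadata(classes)))  # → ['leukemia']
```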
  • clean_ontology_terms.pl relies on the Metaphone and Double Metaphone algorithms. Metaphone was developed by Lawrence Philips in response to deficiencies in the Soundex algorithm; it uses a larger set of rules for English pronunciation. The aim of Metaphone is to match words or names that are pronounced similarly, according to a criterion of similarity which ignores any non-initial vowels and treats voiced and unvoiced versions of consonants as the same. Its latest version, Metaphone 3, achieves an unparalleled level of accuracy in producing correct lookup keys for English words, non-English words familiar to English speakers, and names commonly found in the United States, within the criterion of similarity defined above, but it is not designed to match words which are clearly pronounced differently. The recently published article ‘Anatomy ontologies and potential users: bridging the gap’ (Ravensara S Travillian, Tomasz Adamusiak, Tony Burdett, Michael Gruenberger, John Hancock, Ann-Marie Mallon, James Malone, Paul Schofield and Helen Parkinson) originally aimed to show how difficult it is to align two anatomy ontologies, FMA and Uberon, but another conclusion that can be drawn from it is that metaphone algorithms are inapplicable to this particular use case. Most importantly, clean_ontology_terms.pl performed only marginally better than Zooma doing exact matching, with an enormous hit to precision (~0.07), as the script, for lack of better matches, would present all the phrases simply starting with the same word (a side effect of Double Metaphone misapplied to a whole phrase rather than to individual words; this is different behaviour from classic Metaphone).
  • Our input data rarely differs only in spelling, such as British tumour vs. American tumor, but rather in grammatical number (cell vs. cells), digits, typos, and differently ordered words in similar phrases. Here the left column shows example unmapped annotations from the Atlas; the right-hand column shows existing terms in EFO that we would like to map semi-automatically. The ontology is too big to handle manually, and it is impossible to remember any more whether a particular term has already been added; that is why we need to automate this.
  • First of all, clean_ontology_terms.pl is not that fuzzy at all. Tim Rayner, the original developer of clean_ontology_terms.pl, had already considered a fuzzier approach, and there is a comment in the code suggesting the use of Levenshtein distance. Rather than extending the script further, I rewrote it from scratch as similarity_match.pl. The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965. The algorithm is an example of bottom-up dynamic programming, a method for solving complex problems by breaking them down into simpler subproblems. Similar approaches have been extensively studied in DNA sequence alignment, and the edit-distance approach is further generalised by the local and global alignment algorithms Smith–Waterman and Needleman–Wunsch, but they do not offer much improvement for transpositions, i.e. different ordering of words in a phrase. And this is where n-grams excel.
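The bottom-up dynamic programming just described can be sketched as follows. This Python version is illustrative only (similarity_match.pl itself is Perl); it keeps a single rolling row of the DP table rather than the full matrix.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
print(levenshtein("tumour", "tumor"))    # → 1
```

Note that spelling variants like tumour/tumor score well here, but swapping whole words in a phrase racks up a large distance, which is exactly the weakness discussed above.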
  • An n-gram is simply a fragment of length n from a given sequence. The idea can be traced back to Claude Shannon’s work in information theory in the 1940s, but it was Gravano et al. who first suggested it for string querying in database applications.
  • N-grams work particularly well for transpositions. This surprisingly simple and easy-to-implement approach allows some powerful fuzzy matching. The general idea is that you split the two strings in question into all possible 2-character fragments (2-grams) and treat the number of shared n-grams as their similarity metric. This can easily be normalised by dividing the number of shared n-grams by the total number of n-grams in the longer string. Here we have three strings, 19 characters long. The two surprising things about using Levenshtein distance in this case are that not only do both strings score quite low on similarity, but the completely different one actually scores as more similar. N-grams, on the other hand, deliver exactly the result we expect, with sentence A being the most similar to the template, almost identical, sharing 18 out of 20 possible 2-grams. Note there is a variation of Levenshtein distance called Damerau–Levenshtein, but it only allows for transposition of two adjacent characters.
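A minimal sketch of this 2-gram similarity, again in illustrative Python rather than the actual Perl. It counts shared n-grams with multiplicity and normalises by the longer string's n-gram count, as described above (this plain version does not pad the string ends with boundary markers, so the exact percentages can differ slightly from the slide).

```python
from collections import Counter

def ngrams(s: str, n: int = 2) -> list:
    """All overlapping character fragments of length n."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Fraction of n-grams shared between a and b, counted with
    multiplicity and normalised by the longer string's n-gram count."""
    ca, cb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    shared = sum((ca & cb).values())  # multiset intersection
    denom = max(len(ngrams(a, n)), len(ngrams(b, n)))
    return shared / denom if denom else 0.0

print(round(ngram_similarity("night", "nacht"), 2))  # → 0.25 (only "ht" shared)
template = "The quick brown fox"
# word transposition stays highly similar; an unrelated phrase does not
print(ngram_similarity(template, "brown quick The fox"))
print(ngram_similarity(template, "The quiet swine flu"))
```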
  • clean_ontology_terms.pl is being retired in favour of similarity_match.pl. Emma (emma@ebi.ac.uk) refactored all the code and repackaged it, for easier integration and reuse, into a dedicated set of modules, EBI::FGPT::FuzzyRecogniser (http://search.cpan.org/dist/EBI-FGPT-FuzzyRecogniser/), available on CPAN.
  • Blowing my own trumpet here: the OntoCAT article was featured among the top 10 most accessed articles at BMC Bioinformatics a few months ago. The website (http://www.ontocat.org) sees about 1,000 pageviews monthly.
  • But it was Natalja and Misha who stole the show with the ontocat R package included in Bioconductor. Googling for ‘ontology R’ returns the wiki page for the package as the first hit, and the actual article as the fourth. This is no small feat considering the prevalence of the dedicated Gene Ontology R packages that otherwise dominate this space.
  • An example of a directed acyclic graph representing all the relationships in the ontology for a particular EFO term, ‘EFO_0000815’ (heart). Edges are labelled according to the relationship. Organism-part classes are represented as ellipses and disease classes as rectangles. The ontoCAT package was used to compute the relationships, which were later processed in Cytoscape (Cline et al., 2007). Converting the whole ontology to what is effectively RDF triples is a computationally intensive task, and takes about 30 minutes when run on 200 cluster nodes, parallelised by multiprocessing. It is demonstrated in Example 16 in the online documentation (http://www.ontocat.org/browser/trunk/ontoCAT/src/uk/ac/ebi/ontocat/examples/Example16.java).
    1. EFO tools – the good, the great, and the evil
       Tomasz Adamusiak MD PhD
    2. Huge ontology developed by a tiny team
    3. We have means to assign blame when things go wrong (definition_editor)
    4. We need richness and consistency for EFO based query expansion
    5. New terms come from GXA and external users
       GXA
       Zooma
       OLS
       BioPortal
       similarity_match.pl
       OWL::Simple::Parser
       MeSH::Parser::ASCII
    6. Xrefs are acquired by lexical cross-match to other ontologies
       similarity_match.pl
       OWL::Simple::Parser
       MeSH::Parser::ASCII
    7. Definitions and synonyms are pulled in from external ontologies via NCBO BioPortal
       + provenance
       BioPortal
       metadata
       xrefs
       BioportalImporter
    8. Regression testing is essential as these are massive updates
    9. We need better concept recognition because clean_ontology_terms.pl is evil
    10. We need fuzziness, because input data is extremely dirty
    11. There are different levels of fuzziness
        similarity_match.pl
        metaphone & double metaphone
        Levenshtein distance
        n-grams
        clean_ontology_terms.pl
    12. N-grams is a simple and relatively unknown method of string approximation
    13. N-grams are extremely effective in practice
        The quick brown fox
        A. brown quick The fox
        B. The quiet swine flu
        18% 90%
        19% 40%
    14. The King is dead. Long live the Queen.
    15. OntoCAT is a great success and generated a lot of interest within the community
    16. Natalja & Misha hit the mother lode
    17. Which diseases affect heart components?
        Kurbatova N et al. Bioinformatics 2011;27:2468-2470
    18. Acknowledgments
        Morris A. Swertz’s group at the Genomics Coordination Center (GCC), University of Groningen
        K Joeri van der Velde
        Despoina Antonakaki
        Dasha Zhernakova
        James Malone
        Helen Parkinson
        Emma Hastings
        Niran Abeygunawardena
        Ele Holloway
        Tim Rayner
        Zooma: Tony Burdett
        Bioconductor/R package: Natalja Kurbatova, Pavel Kurnosov, Misha Kapushesky
        This work was supported by the European Community’s Seventh Framework Programmes GEN2PHEN [grant number 200754], SLING [grant number 226073], and SYBARIS [grant number 242220], the European Molecular Biology Laboratory, the Netherlands Organisation for Scientific Research [NWO/Rubicon grant number 825.09.008], and the Netherlands Bioinformatics Centre [BioAssist/Biobanking platform and BioRange grant SP1.2.3]
        OntoCAT logo courtesy of Eamonn Maguire
        Special thanks go to NCBO BioPortal and EBI OLS support teams for all the comprehensive help they provide