Semantic tools for aggregation of morphological characters across studies


Presented by Hilmar Lapp at the TDWG 2013 conference in Florence, Italy, on Nov 1, 2013.

Published in: Technology, Education

  1. 1. Semantic tools for aggregation of morphological characters across studies. James Balhoff, Alex Dececchi, Paula Mabee, Hilmar Lapp, & Phenoscape team
  2. 2. Rich body of morphological observations – mostly locked up. Zebrafish Model of Human Ectodermal Dysplasia. Figure 2. The dominant gene Nkt is phenotypically similar, however complements fls mutants. Nkt homozygotes show complete loss of scales, teeth and gill rakers resembling the fls phenotype (A–C). Heterozygous Nkt zebrafish show an intermediate phenotype of scale loss and patterning defect (arrows) while no effect on fin development is seen (D). Heterozygous Nkt also show a dominant effect on the number of teeth (arrows, E) and gill rakers (F), showing deficiencies along the posterior branchial arches and formation of rudimentary rakers along ceratobranchial 1 and 2 (arrows, F). Cb1-5, ceratobranchial bones. doi:10.1371/journal.pgen.1000206.g002. Table 1. Quantitative effect of fls on scale number and shape and the effect of background modifiers in Danio rerio strains on fls^dt3Tpl. [...] and a cytoplasmic terminal death domain essential for protein interactions with signaling adaptor complexes. The fls^te370f mutation is an A to T transversion at a splice acceptor site, [...]
  3. 3. Free text is a barrier to machine-based integration. Phylogenetic systematics: Lundberg & Akama 2005. Human genetics: OMIM query (# of records): “large bone” (1083), “enlarged bone” (224), “big bones” (21), “huge bones” (4), “massive bones” (41), “hyperplastic bones” (12), “hyperplastic bone” (45), “bone hyperplasia” (181), “increased bone growth” (879)
  4. 4. Integration is key for knowledge synthesis. “The Tree of Life and a New Classification of Bony Fishes” (Betancur-R. et al. 2013. PLoS Currents Tree of Life)
  5. 5. Integration is key for discovery
  6. 6. Phenoscape: making evolutionary morphology computable. Comparative studies + Model organism datasets = Phenoscape Knowledgebase
  7. 7. How it works: shared ontologies, rich semantics, OWL reasoning
  8. 8. Phenoscape KB content: 16,000 character states from >120 comparative morphological datasets, linked to 4,000 vertebrate taxa. Imported genetic phenotype and expression data from ZFIN, Xenbase, MGI, and Human Phenotype project. Shared semantics: Uberon (anatomy), PATO (phenotypic qualities), Entity–Quality (EQ) OWL axioms (phenotype observations), plus a dozen other ontologies ...
  9. 9. Integrative querying with the Phenoscape KB: scale, absent. Ictalurus punctatus: “body: naked” (Kailola, P. J. 2004. A phylogenetic exploration of the catfish family Ariidae (Otophysi; Siluriformes). The Beagle, Records of the Museums and Art Galleries of the Northern Territory 20:87-166). eda gene in Danio rerio: eda^dt3S243X/dt3S243X (Harris, M.P., Rohner, N., Schwarz, H., Perathoner, S., Konstantinidis, P., and Nüsslein-Volhard, C. 2008. Zebrafish eda and edar mutants reveal conserved and ancestral roles of ectodysplasin signaling in vertebrates. PLoS Genetics 4(10):e1000206).
  10. 10. Integrating phylogenetic studies. Can we use reasoning to integrate character matrices across studies? This would make the wealth of single-study character analysis methods applicable to any integrated matrix, including tree-based comparative phylogenetic methods.
  11. 11. Evolution of Sarcopterygian Limb/Fin. Combined matrix of any character states related to presence/absence of limb/fin structures from studies in Phenoscape KB. Clack, J. A. (2009). The Fin to Limb Transition: New Data, Interpretations, and Hypotheses from Paleontology and Developmental Biology. Annual Review of Earth and Planetary Sciences, 37(1), 163-179
  12. 12. EQ supermatrix synthesis: workflow. 1. Use OWL reasoner to group character states by anatomy and quality axes, based on EQ annotations. 2. Export groupings as character matrix, with taxon assignments to states from original data. 3. Supplement presence/absence character state assertions with reasoner-inferred information. 4. Use Phenex data editor to manually consolidate character states where appropriate.
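
Steps 1 and 2 of the workflow above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's Scala/OWL API code: the records, taxon assignments, and the function name `synthesize_matrix` are all hypothetical, and grouping by an exact (entity, attribute) key stands in for the grouping that the OWL reasoner performs over EQ annotations.

```python
from collections import defaultdict

# Hypothetical EQ-annotated character states:
# (study, taxon, anatomical entity, quality attribute, state label).
annotations = [
    ("Study A", "Acanthostega", "humerus", "shape", "slender"),
    ("Study B", "Acanthostega", "humerus", "shape", "elongate"),
    ("Study B", "Eusthenopteron", "humerus", "shape", "robust"),
    ("Study A", "Eusthenopteron", "pelvic fin", "presence", "present"),
]

def synthesize_matrix(records):
    """Group states into characters keyed by (entity, attribute), per taxon."""
    matrix = defaultdict(lambda: defaultdict(set))
    for _study, taxon, entity, attribute, state in records:
        matrix[taxon][(entity, attribute)].add(state)
    return matrix

matrix = synthesize_matrix(annotations)
# The set of characters (matrix columns) is the union over all taxa.
characters = sorted({c for row in matrix.values() for c in row})
for taxon in sorted(matrix):
    # Cells with no state for a taxon are shown as "?" (missing data).
    cells = ["/".join(sorted(matrix[taxon][c])) or "?" for c in characters]
    print(taxon, cells)
```

Cells holding more than one state, such as the two humerus shape states for the same taxon here, are the kind of case that step 4 (manual consolidation) would address.
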
  13. 13. EQ supermatrix synthesis: results. Synthesized limb/fin character matrix: 1055 Sarcopterygian taxa, 494 characters, 2-7 states per character, from 55 original studies. Developed several tools for automated character matrix synthesis to make this happen.
  14. 14. Technology stack. Ontologies and phenotype observation data in OWL. ELK, an OWL-EL reasoner (OWL-DL reasoners are too slow for this). OWL API (Java), programmed primarily using Scala. Bigdata™ RDF triplestore (~25 million triples).
  15. 15. Using reasoning to group character states. For every pair of anatomical term X and quality attribute Y, generate a “character expression” OWL class: (involves some X and involves some Y). Done programmatically via property chain axioms and OWL reasoning (ELK). Classify character states to the most relevant character expression. Done by OWL reasoner (ELK). Inferred relationships are materialized to the triple store.
  16. 16. Challenge: scalable reasoning. Anatomy ontologies and EQ annotation employ rich OWL semantics → best used with a DL reasoner. Classifying and querying over a large dataset (~25 million RDF triples) does not scale well. Presently, the only feasible OWL reasoner is ELK: constrained to the OWL EL profile → limits the kinds of expressions we use; best performance over class axioms only → data must be modeled so as to avoid the need for classifying instances.
  17. 17. Challenge: querying complex expressions. Want to allow arbitrary selection of structures of interest, using rich semantics: (part_of some (limb/fin or girdle skeleton)) or (connected_to some girdle skeleton). RDF triplestores provide very limited reasoning expressivity, and scale poorly with large ontologies. However, ELK can answer class expression queries within seconds.
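
The idea of generating one character expression per (X, Y) pair and assigning each state to the most relevant one can be illustrated with a toy classifier. This is only a sketch under simplifying assumptions: the hierarchies below are invented single-inheritance chains, subsumption is checked by walking ancestors instead of running ELK, and "most relevant" is taken to mean most specific, with anatomy compared before quality.

```python
# Toy is_a hierarchies (child -> parent); all terms are illustrative.
ANATOMY = {"humerus": "forelimb bone", "forelimb bone": "bone", "bone": None}
QUALITY = {"slender": "shape", "shape": "morphology", "morphology": None}

def ancestors(term, hierarchy):
    """Return the term plus all of its superclasses, most specific first."""
    chain = []
    while term is not None:
        chain.append(term)
        term = hierarchy[term]
    return chain

def character_expressions(anatomy_terms, quality_attributes):
    """One 'involves some X and involves some Y' expression per (X, Y) pair."""
    return [(x, y) for x in anatomy_terms for y in quality_attributes]

def most_relevant(state, expressions):
    """Pick the most specific expression whose X and Y both subsume the state."""
    entity, quality = state
    e_chain = ancestors(entity, ANATOMY)
    q_chain = ancestors(quality, QUALITY)
    matches = [(x, y) for (x, y) in expressions if x in e_chain and y in q_chain]
    # Lower index in a chain means closer to the annotated term (more specific).
    return min(
        matches,
        key=lambda xy: (e_chain.index(xy[0]), q_chain.index(xy[1])),
        default=None,
    )

expressions = character_expressions(
    ["bone", "forelimb bone", "humerus"], ["shape", "morphology"]
)
print(most_relevant(("humerus", "slender"), expressions))
```
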
  18. 18. Instead of something like this (*):
      PREFIX rdf: <-rdf-syntax-ns#>
      PREFIX rdfs: <-schema#>
      PREFIX ao: <-anatomy-ontology/>
      PREFIX owl: <>
      SELECT DISTINCT ?gene
      WHERE {
        ?gene ao:expressed_in ?structure .
        ?structure rdf:type ?structure_class .
        # Triple pattern selecting structure:
        ?structure_class rdfs:subClassOf ao:muscle .
        ?structure_class rdfs:subClassOf ?restriction .
        ?restriction owl:onProperty ao:part_of .
        ?restriction owl:someValuesFrom ao:head .
      }
      We would really like to do this:
      PREFIX rdf: <-rdf-syntax-ns#>
      PREFIX rdfs: <-schema#>
      PREFIX ao: <-anatomy-ontology/>
      PREFIX ow: <>
      SELECT DISTINCT ?gene
      WHERE {
        ?gene ao:expressed_in ?structure .
        ?structure rdf:type ?structure_class .
        # Triple pattern containing an OWL expression:
        ?structure_class rdfs:subClassOf "ao:muscle and (ao:part_of some ao:head)"^^ow:omn .
      }
  19. 19. owlet: SPARQL query expansion with in-memory OWL reasoner. owlet interprets OWL class expressions embedded within SPARQL queries. Uses any OWL API-based reasoner to preprocess the query; we use ELK, which holds the terminology in memory. Replaces each OWL expression with a FILTER statement listing matching terms.
  20. 20. Query before expansion:
      PREFIX rdf: <-rdf-syntax-ns#>
      PREFIX rdfs: <-schema#>
      PREFIX ao: <-anatomy-ontology/>
      PREFIX ow: <>
      SELECT DISTINCT ?gene
      WHERE {
        ?gene ao:expressed_in ?structure .
        ?structure rdf:type ?structure_class .
        # Triple pattern containing an OWL expression:
        ?structure_class rdfs:subClassOf "ao:muscle and (ao:part_of some ao:head)"^^ow:omn .
      }
      ➡︎ owlet ➡︎
      PREFIX rdf: <-rdf-syntax-ns#>
      PREFIX rdfs: <-schema#>
      PREFIX ao: <-anatomy-ontology/>
      PREFIX ow: <>
      SELECT DISTINCT ?gene
      WHERE {
        ?gene ao:expressed_in ?structure .
        ?structure rdf:type ?structure_class .
        # Filter constraining ?structure_class to the terms returned by the OWL query:
        FILTER(?structure_class IN (ao:adductor_mandibulae, ao:constrictor_dorsalis, ...))
      }
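
The expansion step shown on this slide can be mimicked in a few lines. The following is a hedged sketch of the general technique only, not owlet's actual implementation or API: `TOY_REASONER` is a hypothetical lookup table standing in for an in-memory reasoner such as ELK, and the regular expression assumes the embedded expression is written exactly as a `"..."^^ow:omn` literal after `rdfs:subClassOf`.

```python
import re

# Hypothetical stand-in for an OWL reasoner: maps a class expression
# to the named classes it subsumes.
TOY_REASONER = {
    "ao:muscle and (ao:part_of some ao:head)": [
        "ao:adductor_mandibulae",
        "ao:constrictor_dorsalis",
    ],
}

# Matches triple patterns of the form: ?var rdfs:subClassOf "expression"^^ow:omn .
PATTERN = re.compile(r'(\?\w+)\s+rdfs:subClassOf\s+"([^"]+)"\^\^ow:omn\s*\.')

def expand(query, reasoner=TOY_REASONER):
    """Replace each embedded OWL expression with a FILTER over matching terms."""
    def to_filter(match):
        variable, expression = match.group(1), match.group(2)
        terms = ", ".join(reasoner[expression])
        return f"FILTER({variable} IN ({terms}))"
    return PATTERN.sub(to_filter, query)

query = '''SELECT DISTINCT ?gene WHERE {
  ?gene ao:expressed_in ?structure .
  ?structure rdf:type ?structure_class .
  ?structure_class rdfs:subClassOf "ao:muscle and (ao:part_of some ao:head)"^^ow:omn .
}'''
print(expand(query))
```

Because the rewritten query contains only plain SPARQL, it can be sent to a triplestore that has no OWL reasoning support of its own.
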
  21. 21. Inferring presence/absence. Character states often do not directly assert, but imply, presence or absence. Most phenotypic descriptions of some feature of a structure imply its presence or absence: “Humerus slender and elongate: with length more than three times the diameter of its distal end” → humerus must be present. Partonomy axioms in the ontology allow inferring presence or absence: ‘all humerus part_of some forelimb’ → forelimb must be present if humerus is; humerus must be absent if forelimb is absent.
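
The two inference rules on this slide (presence propagates from part to whole, absence from whole to part) can be sketched as below. This is a toy illustration, not the reasoner-based approach used for the Knowledgebase: the `PART_OF` table is an invented single-parent partonomy, and statuses are assumed to be either "present" or "absent".

```python
# Toy partonomy (part -> whole), e.g. 'all humerus part_of some forelimb'.
PART_OF = {"humerus": "forelimb", "forelimb": "appendage"}

def wholes(structure):
    """All structures that the given structure is (transitively) part of."""
    result = []
    while structure in PART_OF:
        structure = PART_OF[structure]
        result.append(structure)
    return result

def parts(structure):
    """All structures that are (transitively) part of the given structure."""
    return [p for p in PART_OF if structure in wholes(p)]

def infer(asserted):
    """Propagate presence up to wholes and absence down to parts."""
    inferred = dict(asserted)
    for structure, status in asserted.items():
        targets = wholes(structure) if status == "present" else parts(structure)
        for target in targets:
            # Asserted values are never overwritten by inferred ones.
            inferred.setdefault(target, status)
    return inferred

print(infer({"humerus": "present"}))
print(infer({"forelimb": "absent"}))
```
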
  22. 22. Challenge: absence reasoning with OWL EL. Absence is typically modeled using negation → not (has_part some forelimb). Negation is not part of OWL EL (and thus not supported by the ELK reasoner). Solution: programmatic assertion of an “absence hierarchy” via classification of negated expressions. Diagram: A = has_part some forelimb, B = has_part some limb, C = has_part some appendage; reversed: absentA = not A, absentB = not B, absentC = not C. Requires precomputation; constraints for on-the-fly use.
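
The reversal behind the absence hierarchy follows from contraposition: if A is a subclass of B, then (not B) is a subclass of (not A). A minimal sketch, assuming the presence classes have already been classified and using invented names (`PRESENCE_EDGES`, `absence_hierarchy`) that do not come from the Phenoscape code:

```python
# Subsumptions among presence classes as (subclass, superclass) pairs,
# as they might come out of an EL reasoner. Corresponds to A, B, C on the slide.
PRESENCE_EDGES = [
    ("has_part some forelimb", "has_part some limb"),
    ("has_part some limb", "has_part some appendage"),
]

def absence_hierarchy(edges):
    """Reverse every presence edge to obtain the edges among negated classes."""
    return [(f"not ({sup})", f"not ({sub})") for sub, sup in edges]

for sub, sup in absence_hierarchy(PRESENCE_EDGES):
    print(f"{sub}  SubClassOf  {sup}")
```

The resulting pairs can then be asserted as ordinary named-class subsumptions, which keeps the negation itself out of the axioms the EL reasoner has to process.
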
  23. 23. Challenge: Character state consolidation
  24. 24. Challenge: Character state consolidation. Reduced 1-297 states per character to 2-7.
  25. 25. Result: Reasoning fills in many missing character states. Asserted presence/absence compared with inference added (Mesquite “bird's-eye view”).
  26. 26. Unified matrix enables candidate gene view. Linking evolutionary phenotypes to genes through ontologies, via Phenoscape KB or similarity.
  27. 27. Integrated data highlight conflict and gaps. Conflicting interpretations in studies: supinator process of humerus is both absent & present in Strepsodus (Zhu et al. 1999 vs. Ruta 2011); figure from Parker et al., 2005. Gaps in knowledge: acetabulum present or absent? (Acetabulum of pelvic girdle: present/absent.) Same term, different meaning? Acanthostega: “radials, jointed” (Swartz 2012), but doesn’t have radials... Uneven taxon sampling.
  28. 28. Phenoscape software: owlet (SPARQL processor), Phenex (semantic data editor), phenoscape-owl-tools (KB build), others
  29. 29. Phenoscape project team. National Evolutionary Synthesis Center (NESCent): Todd Vision (also University of North Carolina at Chapel Hill), Hilmar Lapp, Jim Balhoff, Prashanti Manda. University of South Dakota: Paula Mabee, Alex Dececchi, Wasila Dahdul. University of Oregon (Zebrafish Information Network): Monte Westerfield, Ceri Van Slyke, Yvonne Bradford. Cincinnati Children's Hospital (Xenbase): Aaron Zorn, Christina James-Zorn, Virgilio Ponferrada. Mouse Genome Informatics: Judith Blake, Terry Hayamizu. California Academy of Sciences: David Blackburn. University of Chicago: Paul Sereno, Nizar Ibrahim. University of Arizona: Hong Cui. Oregon Health & Science University: Melissa Haendel. Lawrence Berkeley National Labs: Chris Mungall.