Ontology-based Data Integration

Data integration is a perennial challenge for data scientists working at large scale. Bio-ontologies help in this endeavour both as sources of synonyms and as the basis for rules-based fuzzy integration pipelines.

Published in: Technology, Education

  1. Industry Programme Workshop: Data Integration, 18-19 September 2013. Ontology-based data integration. Janna Hastings
  2. Data integration is hard: Technology, Syntax, Semantics, Content
  3. Different data resources, different needs: "Why can't they all just use the same schema, measurement accuracy, units, labels, content?"
  4. Standards are the solution… (?) Source: http://xkcd.com/927/
  5. Ontology-based data integration. Ontologies can help with the semantic and the content aspects of data integration:
     • Semantic: definition for schemas
     • OWL is a good language for defining schemas
     • See RDF and Semantic Web presentations, today
     • Content: definition of the entities referred to by data
     • Ontologies embedded into a data integration workflow help facilitate content-aware data integration
  6. Core challenge: labelling. Multiple labels can mean the same thing; one label can mean multiple things.
  7. Semantics-free identifiers, multiple synonyms. CHEBI:27732: A trimethylxanthine in which the three methyl groups are located at positions 1, 3, and 7. Synonyms: guaranine, methyltheobromine, 1,3,7-trimethylxanthine, Koffein, caféine.
  8. Core challenge: biological knowledge. The answer to the question "Is Entity A from Data Source 1 the same thing as Entity B from Data Source 2?" often depends on who is asking and who is answering! Examples: left lung vs. lung; hippocampus vs. brain; dopamine vs. L-dopamine; in vitro vs. in vivo cells of type X; gene Y and post-translationally modified form Y'; gene Z in mouse, gene Z in human.
  9. Hierarchy: left lung is a lung; lung is an organ. Generalise to the nearest common ancestor, i.e. if you are integrating data about tissue samples annotated to 'lung' in one dataset and 'left lung' in the other, the ontology can compute 'lung' as the nearest common ancestor. The same holds for 'left lung' and 'right lung'.
  10. Other relationships. Relationships encode biological knowledge. Rules specify which relationships can be traversed for data integration purposes, e.g. for tissue samples and part_of: sample_from ∘ part_of => sample_from. A sample from a part of the brain (e.g. the hippocampus) is a sample from the brain (quite aside from the 'is a' hierarchy!). Diagram: hippocampus part of brain.
  11. Core challenge: flexibility. Fixed-depth hierarchies force some classes to be too big … (>150 members), with the lowest level collapsing the biological hierarchy, and others to be too small … (<1 member).
  12. Ontologies in content integration: (1) schema mappings; (2) ontology-provided synonyms; (3) hierarchy and relationship rules for integration. OWL language and tools: web-embedded (but whole-ontology rule reasoning may be slow).
  13. Is ontology integration just another type of data integration? Which ontology(-ies) to use? How to use them together? How to plug the gaps? Why should I (as a user) have to do this integration over and over?
  14. Desiderata for ontologies for data integration:
     • Ontologies should be neutral and shared community-wide
     • Users should be able to directly and rapidly extend the ontology where there are gaps (responsiveness)
     • The ontology should use semantics-free identifiers and at the same time energetically annotate synonyms
     • When necessary, ontologies should take care of ontology integration to provide the community with a one-stop service and appropriate cross-references
     • The ontologies should be used in data annotation
     See http://www.obofoundry.org/
  15. Questions?
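
The labelling and hierarchy slides (7 and 9) describe two steps that can be sketched in a few lines of Python: resolving a free-text label to a semantics-free identifier through its synonyms, and generalising two annotations to their nearest common ancestor in the "is a" hierarchy. This is a minimal illustration, not the speaker's implementation. The synonym list and the lung/organ hierarchy are taken from the slides; the function names, the use of term names as stand-in identifiers, and the breadth-first search are assumptions made for the example.

```python
from collections import deque

# Synonyms copied from slide 7, keyed by the semantics-free identifier.
SYNONYMS = {
    "CHEBI:27732": ["guaranine", "methyltheobromine",
                    "1,3,7-trimethylxanthine", "Koffein", "caféine"],
}

# Toy "is a" hierarchy from slide 9 (child -> parents).
# Term names stand in for real ontology identifiers (an assumption).
IS_A = {
    "left lung": ["lung"],
    "right lung": ["lung"],
    "lung": ["organ"],
    "organ": [],
}


def build_label_index(synonyms):
    """Map every case-folded synonym (and the ID itself) to its identifier.

    Note: if one label belonged to several identifiers, the last one
    would win here; a real pipeline would have to handle that ambiguity.
    """
    index = {}
    for identifier, labels in synonyms.items():
        index[identifier.casefold()] = identifier
        for label in labels:
            index[label.casefold()] = identifier
    return index


def resolve(label, index):
    """Return the identifier for a label, or None if it is unknown."""
    return index.get(label.strip().casefold())


def ancestor_distances(term, is_a):
    """Breadth-first walk up the hierarchy: {ancestor: distance from term}."""
    distances = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in is_a.get(current, []):
            if parent not in distances:
                distances[parent] = distances[current] + 1
                queue.append(parent)
    return distances


def nearest_common_ancestor(a, b, is_a):
    """Return the shared ancestor with the smallest combined distance."""
    dist_a = ancestor_distances(a, is_a)
    dist_b = ancestor_distances(b, is_a)
    common = set(dist_a) & set(dist_b)
    if not common:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    return min(common, key=lambda t: (dist_a[t] + dist_b[t], t))


index = build_label_index(SYNONYMS)
caffeine_id = resolve("Koffein", index)
shared = nearest_common_ancestor("left lung", "right lung", IS_A)
```

With this toy data, `resolve` maps any of the listed synonyms to `CHEBI:27732`, and both the 'lung' / 'left lung' pair and the 'left lung' / 'right lung' pair generalise to 'lung', as described on slide 9.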

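The relationship rule on slide 10 (a sample from a part of the brain is a sample from the brain) can be sketched as a traversal over only those relationships the rule allows. The example below is a hedged illustration in plain Python, not an OWL reasoner. The hippocampus/brain pair comes from the slides; the sample identifiers, the `TRAVERSABLE` set, and the function names are hypothetical.

```python
# Relationship assertions, grouped by relationship name (source -> targets).
# Only "hippocampus part_of brain" is from the slides.
RELATIONS = {
    "part_of": {"hippocampus": ["brain"]},
    "is_a": {"brain": ["organ"]},
}

# Hypothetical annotations: sample ID -> entity it was taken from.
SAMPLE_FROM = {
    "sample-1": "hippocampus",
    "sample-2": "brain",
}

# The integration rule: sample_from may be extended along part_of only.
TRAVERSABLE = {"part_of"}


def expand_sample_from(sample_from, relations, traversable):
    """For each sample, collect every entity reachable from its source
    by following only the relationships named in `traversable`."""
    expanded = {}
    for sample, source in sample_from.items():
        reached = {source}
        stack = [source]
        while stack:
            current = stack.pop()
            for relation in traversable:
                for target in relations.get(relation, {}).get(current, []):
                    if target not in reached:
                        reached.add(target)
                        stack.append(target)
        expanded[sample] = reached
    return expanded


def samples_from(entity, expanded):
    """Return the sorted IDs of all samples that count as 'from' entity."""
    return sorted(s for s, entities in expanded.items() if entity in entities)


expanded = expand_sample_from(SAMPLE_FROM, RELATIONS, TRAVERSABLE)
brain_samples = samples_from("brain", expanded)
```

Here the hippocampus sample is grouped with the brain sample when querying for 'brain', while 'organ' is not reached because "is a" is deliberately left out of the traversable set. Which relationships to allow is the policy decision the slide refers to.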