Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Resource Description Framework Approach to Data Publication and Federation


Published on

Bob Stanley, CEO, IO Informatics, explains the utility to RDF as a standard way of defining and redefining data that could have utility in managing life science information.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Resource Description Framework Approach to Data Publication and Federation

  1. 1. Resource Description Framework Approach to Data Publication and Federation Robert Stanley IO Informatics, Inc. IO Informatics © 2011 June 6 2011 Pistoia Alliance – Open Technical Committee Meeting
  2. 2. Agenda <ul><li>Introduction: </li></ul><ul><ul><li>Requirements, Background, Use Cases </li></ul></ul><ul><li>Technical Example / Use Case: </li></ul><ul><ul><li>Requirements for creating a data service and invoking a query </li></ul></ul><ul><li>Q & A </li></ul>
  3. 3. Recap - ELN Query Functional Requirements <ul><li>The ELN Query Services team have produced a rich set of functional requirements/user stories representing common questions scientists have that can be satisfied with an ELN query. Most, if not all, of those requirements have the conceptual form; </li></ul><ul><li>SELECT <selection> FROM Experiment WHERE <constraints> </li></ul><ul><li>… A common approach to the problem would be preferable , ideally leveraging existing standards, rather than building a solution that works just for Experiment. </li></ul><ul><li>… the Pistoia Technical Committee may wish to constrain such query services, by aligning to existing standards, to ensure consistency in approach. </li></ul>
  4. 4. Example ELN Query Workflow <ul><li>Scientist researching a class of Agents </li></ul><ul><ul><li>small molecules (or biologics) intended to hit a target or targets </li></ul></ul><ul><ul><li>links to… </li></ul></ul><ul><li>Assays </li></ul><ul><ul><li>test to determine activity, affinity, binding, promiscuity </li></ul></ul><ul><ul><li>determine potential toxicity, adverse events, etc. </li></ul></ul><ul><ul><li>links to… </li></ul></ul><ul><li>Targets </li></ul><ul><ul><li>sites where compounds bind -- can be locations on a protein, locations on a gene, active centers on an enzyme, etc. </li></ul></ul><ul><ul><li>links to… </li></ul></ul><ul><li>Disease/ Gene relationships </li></ul><ul><ul><li>e.g. biology, can be from TMO / LODD resources </li></ul></ul><ul><ul><li>pathways, proteins, catalysts, immunology defense mechanisms, potential for adverse events, etc. can be included </li></ul></ul>
  5. 5. Recap - Query Service API <ul><li>As part of the phase two deliverable, the ELN Query Services team produced a prototype SOAP-based service outlining the methods that would be required to support the query service. </li></ul><ul><li>There are existing protocols and standards for querying structured data over the web. Aligning to an existing approach will prevent re-inventing a query language and will provide confidence in the stability of the query interface. Examples; </li></ul><ul><li>OData ( is a RESTful query protocol … </li></ul><ul><li>GData ( Very similar to OData, but published by Google… </li></ul><ul><li>SPARQL ( is a RDF query protocol published by W3C for querying linked data. A triple store is [not ] required and there may be synergies with existing Pistoia projects (e.g. VSI, SESL) by adopting SPARQL. </li></ul>
  6. 6. Questions for Tech Committee <ul><li>Ontology content and format </li></ul><ul><ul><li>Regarding reference content </li></ul></ul><ul><ul><ul><li>How comprehensive and refined does it need to be? </li></ul></ul></ul><ul><ul><ul><li>How can it relate to records and different detail? </li></ul></ul></ul><ul><ul><ul><li>Who’s going to maintain the ontology? </li></ul></ul></ul><ul><ul><ul><li>Is there a practical, agile approach to creating, managing, extending? </li></ul></ul></ul><ul><ul><li>Standards - driven? (UML, OWL, or ERD, etc.) </li></ul></ul><ul><ul><li>Investigate service versioning and how backward compatibility might be implemented </li></ul></ul><ul><li>API Mechanism (OData, Gdata, SPARQL?) </li></ul><ul><ul><li>Is there a standards-based approach? </li></ul></ul>
  7. 7. Moving Forward? Resource Description Framework <ul><li>Semantic Technologies provide a unified, standard framework for: </li></ul><ul><ul><li>Ontology / API representation, modification, merging </li></ul></ul><ul><ul><li>Query framework / SOA SPARQL API </li></ul></ul><ul><ul><li>Rich, extensible Query Federation </li></ul></ul><ul><ul><ul><li>(Does not require ETL) </li></ul></ul></ul><ul><li>Agile ontology development & maintenance… </li></ul><ul><ul><ul><li>Customers can extend the ontologies themselves </li></ul></ul></ul><ul><li>Broadly – these are standards-driven methods that will make the ELN Query Federation lifecycle much easier </li></ul>
  8. 8. What is RDF? <ul><li>Resource Description Framework (“RDF”) defines and links data for more effective discovery, federation, integration and re-use across applications </li></ul><ul><li>RDF is a fundamentally simple, standard and extensible way of specifying data and data relationships </li></ul><ul><li>Ontologies describe resources and relationships according to their explicit meaning </li></ul><ul><li>SPARQL Protocol and RDF Query Language (“SPARQL”) supports federated queries w’ SPARQL API for publication </li></ul>
  9. 9. <ul><li>RDF graphs are collections of triples </li></ul><ul><li>Triples are made up of a subject , a predicate , and an object </li></ul><ul><li>Resources and relationships (metadata)are stored </li></ul>Resource Description Framework (RDF) is… A labeled, directed graph of relations between resources and literal values. Confidential IO Informatics © 2011 subject object predicate
  10. 10. <ul><li>“ TP53 encodes Human p53” </li></ul><ul><li>“ p53 is a tumor suppressor protein” </li></ul><ul><li>“ TP53 gene is located on the short arm of chromosome 17” </li></ul>Example RDF Triples Confidential TP53 p53 encodes p53 tumor suppressor protein is a TP53 located chromosome 17
  11. 11. Triples Connect to Form Graphs TP53 Confidential p53 encodes tumor suppressor protein is a located chromosome 17 part of Human part of
  12. 12. Why RDF? What’s Different Here? <ul><li>Triples act as a human and machine readable least common denominator for expressing data and relationships </li></ul><ul><li>Ontologies organize data according to their human readable meaning, to make defining and merging data intuitive… </li></ul><ul><li>RDF supports inference and disambiguation , so merging data and adding new data and relationships without shared identifiers becomes possible… </li></ul>Confidential IO Informatics © 2011
  13. 13. Maturation of a Standard Framework for Resource Description <ul><li>This standard method and framework for extensible open data publication is maturing: </li></ul><ul><ul><li>Standards and practices, ontologies, query and data model specification: </li></ul></ul><ul><ul><ul><li>W3C, ISCB, IEEE, NCBO, OBO </li></ul></ul></ul><ul><ul><ul><li>SKOS, LOD / LODD and related resources </li></ul></ul></ul><ul><ul><li>Federation methods: </li></ul></ul><ul><ul><ul><li>Ontology Resources, URIs, SPARQL endpoints, ontology alignment, inference, D2R, SWObjects, … </li></ul></ul></ul><ul><ul><li>Scalability: </li></ul></ul><ul><ul><ul><li>Security, support for transactional processing </li></ul></ul></ul><ul><ul><li>Practice: </li></ul></ul><ul><ul><ul><li>Expertise, training, larger projects (FDA, DOD, NASA, World Cup,, Chevron, etc.) </li></ul></ul></ul>IO Informatics © 2011
  14. 14. Exponentially Growing Resources 2007 2008 2009 2011 IO Informatics © 2011
  15. 15. Healthcare / Life Sciences remains the largest data sector for SPARQL APIs
  16. 16. Integration / Federation Options Data Sources RDBMS / REST / Web Services Basic Transformation Query-Based Transformation SPARQL 2 SQL (D2R, SWObjects) Federated Semantic Access / Datastore(s) SPARQL APIs / SOAP / REST Provenance Versioning Governance Meaning
  17. 17. RDF / SPARQL Solution Stack DBs, Services SPARQL API, SPARQL conversion, ETL=>RDF Extensible Standards-based Framework Federated Access / Integrated Semantic Datastore(s) Prediction and Simulation Apps Query by Meaning Rich Browsing Other Data Sources Public Data, Services Other Apps
  18. 18. Example Use Cases <ul><li>Manufacturing: Link data sources across imprecise connections to verify reports </li></ul><ul><li>Animal Safety: Knowledge Network for discovery, qualification and validation of cross-species biomarkers </li></ul><ul><li>Personalized Medicine: Knowledge Network and screening application for personalized medicine </li></ul>IO Informatics © 2011
  19. 19. Example Questions <ul><li>What data sources support this specific manufacturing report about product purity and shelf-life? </li></ul><ul><li>What toxicity biomarkers are common to most animals? </li></ul><ul><li>What patients are showing combinations of indicators that predict risk? </li></ul>IO Informatics © 2011
  20. 20. Qualitative Benefits <ul><li>Link internal data with public resources </li></ul><ul><ul><li>E.G. – “out of box” linking of ELN data with LODD sources </li></ul></ul><ul><ul><li>Growing public resources provide cost-effective enrichment and hypothesis generation </li></ul></ul><ul><ul><li>Reduce dependencies on expensive commercial databases </li></ul></ul><ul><li>Emergent Properties </li></ul><ul><ul><li>ELN data enrichment and knowledge building made easy! </li></ul></ul><ul><ul><li>Supports inference, rich interrogation, serendipitous discovery </li></ul></ul><ul><li>R&D concepts can now be translated to immediate consumer benefits </li></ul><ul><ul><li>Previous “out of reach” integration and applications become practical </li></ul></ul>IO Informatics © 2011
  21. 21. Quantitative Benefits <ul><li>Reduced effort, time and cost to federate, update, extend to new datasets </li></ul><ul><li>Start sooner - reduce initial design and deployment burden </li></ul><ul><ul><li>Ontologies are explicit, can be engineered from or decoupled from resources, can be altered without refactoring </li></ul></ul><ul><li>Finish sooner - reduced time to create and test integrations </li></ul><ul><ul><li>Agile modeling, integration and testing </li></ul></ul><ul><li>Extend more easily - add new data sources and applications </li></ul><ul><ul><li>Building blocks to add new data, create new applications </li></ul></ul><ul><ul><li>End Users can adapt, modify, extend ontologies </li></ul></ul>IO Informatics © 2011
  22. 22. UBC: Knowledge Network for Organ Failure; “ASK” for Personalized Medicine IO Informatics © 2011
  23. 23. UBC: Knowledge Network for Organ Failure; “ASK” for Personalized Medicine <ul><li>Outcome </li></ul><ul><ul><li>Knowledge network for enrichment, visualization and qualification of patterns indicating risk of organ failure </li></ul></ul><ul><ul><li>Web-based deployment of SPARQL-based screening patterns across multiple data sources, indicating patients-at-risk </li></ul></ul>IO Informatics © 2011
  24. 24. <ul><li>Screening of transplant patients for likelihood of transplant failure, based on combined biomarker patterns </li></ul>Personalized Medicine Knowledge Network IO Informatics © 2011 <ul><li>Web-based Knowledge Application </li></ul><ul><li>Applies patterns for predictive screening </li></ul><ul><li>Weighing, scoring of results </li></ul><ul><li>Bring “hits” back into Knowledge Network for validation of hypotheses and algorithms </li></ul>
  25. 25. ASK™ <ul><li>Extended SPARQL-based Knowledge Application </li></ul><ul><li>Visualization, Testing, Validation of Systems-oriented Hypotheses </li></ul><ul><li>Automation, Dashboards </li></ul>Confidential IO Informatics © 2011
  26. 26. UBC / PROOF: Quantitative… <ul><ul><li>Integration and analysis time reduced from estimated 2 years to about 8 months FTE equivalent </li></ul></ul><ul><ul><li>Time to capture and apply patterns reduced from days to hours </li></ul></ul><ul><ul><li>Knowledge base can be / has been extended to include new public sources in hours (days / weeks with curation) </li></ul></ul><ul><ul><ul><li>Expensive commercial database no longer needed due to ease of integrating public resources </li></ul></ul></ul>IO Informatics © 2011
  27. 27. UBC / PROOF: Qualitative… <ul><ul><li>Visual SPARQL presents queries as hypotheses </li></ul></ul><ul><ul><ul><li>Make it possible for researchers to iteratively create, test and refine hypotheses </li></ul></ul></ul><ul><ul><ul><li>Research queries were published to web service for easy scale-up </li></ul></ul></ul><ul><ul><ul><li>Extended SPARQL delivers practically useful classifiers with scoring </li></ul></ul></ul><ul><ul><li>Use of RDF makes enrichment with public sources practical </li></ul></ul><ul><ul><li>Provenance, reference annotation, original data accessible for review </li></ul></ul><ul><li>“ [The] ability to consume and intuitively represent a wide variety of data-types - from images to quantitative data - and more importantly, display that data in ways that make the significant features immediately obvious to our biologist end-users, has allowed us to move to a completely new level of data analysis….” </li></ul>IO Informatics © 2011
  28. 28. Recap – Core Benefit Drivers <ul><li>Semantic technologies provide a standard method and framework for data publication and interoperability… not a new “data standard”! </li></ul><ul><li>Low barrier to entry for data publishing – extensible building blocks </li></ul><ul><li>for agile, growing integrations </li></ul><ul><li>Reduced effort, time and cost to deliver and maintain data </li></ul><ul><li>definitions and applications, particularly those that depend on </li></ul><ul><li>federation </li></ul><ul><li>Growing public resources are a catalyst and value-add </li></ul><ul><li>Projects that were impractical become practical to achieve and </li></ul><ul><li>maintain </li></ul>IO Informatics © 2011
  29. 29. Thank you! Next, W3C… For questions: <ul><li>Email: [email_address] </li></ul><ul><li>Website: </li></ul>IO Informatics © 2011