Resource Description Framework Approach to Data Publication and Federation


Published on

Bob Stanley, CEO, IO Informatics, explains the utility to RDF as a standard way of defining and redefining data that could have utility in managing life science information.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • RDF is very simple. It provides a framework for describing things according to their relationships. A key value proposition for semantic technologies is to make it easy to describe and connect related data sources, in order to search them as deeply as we care to. Although there is much useful effort in the formal terminology space, the broad focus is on practical outcomes – specifically, to provide a standard set of methods and framework for describing and linking data. This is NOT a standard for what the minimum or perfect data definition should be! – it is a standard way of defining and redefining data…
  • Lowering barriers to interoperability using standards does NOT mean using standardized, approved terms in an ontology or API. It means using a common framework for resource description, in which terms can easily be surfaced, visualized, tested, merged – due to the common approach.
  • SPARQL is an open standard Query Service API. It supports a query language with a variety of visual query interfaces and takes advantage of growing linked open data resources . SPARQL APIs can utilize SOAP and REST.
  • Agile resource description is a key benefit of RDF and SPARQL. Although it can support highly formal resource description, RDF and SPARQL lowers barrier to interoperability caused by excessive concern about perfect specification of reference content. RDF opens up this conversation providing standard framework and practices for mixing, mapping and updating resource descriptions via data model providers AND consumers. RDF also can manage provenance and links to original data, full experiment records, etc. --------- SPARQL endpoints provide visible data models that can be visually managed by data providers. They can also be rapidly adapted by local users mapping to more or less well-formed APIs. Depending on user need, these can be narrowly practical application ontologies or formal OWL models. Application ontologies can be rapidly mapped to OWL ontologies and vice versa.
  • Broadly, semantic technologies provide a standard framework and methods for describing and querying data resources and connections between data elements. The technology space has been slowing growing and maturing over the past decade. People were writing papers and doing work in the area in the late ‘90s. Tim Berners-Lee wrote an article in 2001 that was published in Scientific American and began popularizing the concepts. These methods make it possible to visualize and connect data in a way that makes sense both to humans and computers. Using RDF and SPARQL, data descriptions can be modified by end users without refactoring the database or databases.
  • RDF is very simple. It provides a framework for describing things according to their relationships. A key value proposition for semantic technologies is to make it easy to describe and connect related data sources, in order to search them as deeply as we care to. Although there is much useful effort in the formal terminology space, the broad focus is on practical outcomes – specifically, to provide a standard set of methods and framework for describing and linking data. This is NOT a standard for what the minimum or perfect data definition should be! – it is a standard way of defining and redefining data…
  • Clockwise from 6 oclock > Species, (protein description), Protein, Gene, Gene Location Terminology around new methods can cause confusion. For ease, I find it useful at times to refer to “assertions” instead of “triples”, and to “data structure” instead “ontologies”.
  • These methods have been developed by W3C, MIT, Stanford, Dublin Core, DERI and elsewhere for just about a decade. Vendor and consumer adoption and technical maturity has grown over the past 8 years or so, with exponential growth over the past 4 years. This compares very favorably to “proprietary” methods and standards invented by single vendors.
  • Agile method; human readable ontologies, no more over-specification, easy to manage and update, end users can extend the API
  • First we linked all of the data of interest to help the customer understand what happens biologically in organ transplant rejection scenarios. Next we apply SPARQL to run searches across clinical, protein and gene expression databases, to screen for patients at risk.
  • Not a new data standard, a standard way of publishing and linking data Extensible so you can start and get to a ‘good enough’ state quickly Created with federation in mind Out-of-box benefits for project completion, maintenance and extension to new data sources
  • Resource Description Framework Approach to Data Publication and Federation

    1. 1. Resource Description Framework Approach to Data Publication and Federation Robert Stanley IO Informatics, Inc. IO Informatics © 2011 June 6 2011 Pistoia Alliance – Open Technical Committee Meeting
    2. 2. Agenda <ul><li>Introduction: </li></ul><ul><ul><li>Requirements, Background, Use Cases </li></ul></ul><ul><li>Technical Example / Use Case: </li></ul><ul><ul><li>Requirements for creating a data service and invoking a query </li></ul></ul><ul><li>Q & A </li></ul>
    3. 3. Recap - ELN Query Functional Requirements <ul><li>The ELN Query Services team have produced a rich set of functional requirements/user stories representing common questions scientists have that can be satisfied with an ELN query. Most, if not all, of those requirements have the conceptual form; </li></ul><ul><li>SELECT <selection> FROM Experiment WHERE <constraints> </li></ul><ul><li>… A common approach to the problem would be preferable , ideally leveraging existing standards, rather than building a solution that works just for Experiment. </li></ul><ul><li>… the Pistoia Technical Committee may wish to constrain such query services, by aligning to existing standards, to ensure consistency in approach. </li></ul>
    4. 4. Example ELN Query Workflow <ul><li>Scientist researching a class of Agents </li></ul><ul><ul><li>small molecules (or biologics) intended to hit a target or targets </li></ul></ul><ul><ul><li>links to… </li></ul></ul><ul><li>Assays </li></ul><ul><ul><li>test to determine activity, affinity, binding, promiscuity </li></ul></ul><ul><ul><li>determine potential toxicity, adverse events, etc. </li></ul></ul><ul><ul><li>links to… </li></ul></ul><ul><li>Targets </li></ul><ul><ul><li>sites where compounds bind -- can be locations on a protein, locations on a gene, active centers on an enzyme, etc. </li></ul></ul><ul><ul><li>links to… </li></ul></ul><ul><li>Disease/ Gene relationships </li></ul><ul><ul><li>e.g. biology, can be from TMO / LODD resources </li></ul></ul><ul><ul><li>pathways, proteins, catalysts, immunology defense mechanisms, potential for adverse events, etc. can be included </li></ul></ul>
    5. 5. Recap - Query Service API <ul><li>As part of the phase two deliverable, the ELN Query Services team produced a prototype SOAP-based service outlining the methods that would be required to support the query service. </li></ul><ul><li>There are existing protocols and standards for querying structured data over the web. Aligning to an existing approach will prevent re-inventing a query language and will provide confidence in the stability of the query interface. Examples; </li></ul><ul><li>OData ( is a RESTful query protocol … </li></ul><ul><li>GData ( Very similar to OData, but published by Google… </li></ul><ul><li>SPARQL ( is a RDF query protocol published by W3C for querying linked data. A triple store is [not ] required and there may be synergies with existing Pistoia projects (e.g. VSI, SESL) by adopting SPARQL. </li></ul>
    6. 6. Questions for Tech Committee <ul><li>Ontology content and format </li></ul><ul><ul><li>Regarding reference content </li></ul></ul><ul><ul><ul><li>How comprehensive and refined does it need to be? </li></ul></ul></ul><ul><ul><ul><li>How can it relate to records and different detail? </li></ul></ul></ul><ul><ul><ul><li>Who’s going to maintain the ontology? </li></ul></ul></ul><ul><ul><ul><li>Is there a practical, agile approach to creating, managing, extending? </li></ul></ul></ul><ul><ul><li>Standards - driven? (UML, OWL, or ERD, etc.) </li></ul></ul><ul><ul><li>Investigate service versioning and how backward compatibility might be implemented </li></ul></ul><ul><li>API Mechanism (OData, Gdata, SPARQL?) </li></ul><ul><ul><li>Is there a standards-based approach? </li></ul></ul>
    7. 7. Moving Forward? Resource Description Framework <ul><li>Semantic Technologies provide a unified, standard framework for: </li></ul><ul><ul><li>Ontology / API representation, modification, merging </li></ul></ul><ul><ul><li>Query framework / SOA SPARQL API </li></ul></ul><ul><ul><li>Rich, extensible Query Federation </li></ul></ul><ul><ul><ul><li>(Does not require ETL) </li></ul></ul></ul><ul><li>Agile ontology development & maintenance… </li></ul><ul><ul><ul><li>Customers can extend the ontologies themselves </li></ul></ul></ul><ul><li>Broadly – these are standards-driven methods that will make the ELN Query Federation lifecycle much easier </li></ul>
    8. 8. What is RDF? <ul><li>Resource Description Framework (“RDF”) defines and links data for more effective discovery, federation, integration and re-use across applications </li></ul><ul><li>RDF is a fundamentally simple, standard and extensible way of specifying data and data relationships </li></ul><ul><li>Ontologies describe resources and relationships according to their explicit meaning </li></ul><ul><li>SPARQL Protocol and RDF Query Language (“SPARQL”) supports federated queries w’ SPARQL API for publication </li></ul>
    9. 9. <ul><li>RDF graphs are collections of triples </li></ul><ul><li>Triples are made up of a subject , a predicate , and an object </li></ul><ul><li>Resources and relationships (metadata)are stored </li></ul>Resource Description Framework (RDF) is… A labeled, directed graph of relations between resources and literal values. Confidential IO Informatics © 2011 subject object predicate
    10. 10. <ul><li>“ TP53 encodes Human p53” </li></ul><ul><li>“ p53 is a tumor suppressor protein” </li></ul><ul><li>“ TP53 gene is located on the short arm of chromosome 17” </li></ul>Example RDF Triples Confidential TP53 p53 encodes p53 tumor suppressor protein is a TP53 located chromosome 17
    11. 11. Triples Connect to Form Graphs TP53 Confidential p53 encodes tumor suppressor protein is a located chromosome 17 part of Human part of
    12. 12. Why RDF? What’s Different Here? <ul><li>Triples act as a human and machine readable least common denominator for expressing data and relationships </li></ul><ul><li>Ontologies organize data according to their human readable meaning, to make defining and merging data intuitive… </li></ul><ul><li>RDF supports inference and disambiguation , so merging data and adding new data and relationships without shared identifiers becomes possible… </li></ul>Confidential IO Informatics © 2011
    13. 13. Maturation of a Standard Framework for Resource Description <ul><li>This standard method and framework for extensible open data publication is maturing: </li></ul><ul><ul><li>Standards and practices, ontologies, query and data model specification: </li></ul></ul><ul><ul><ul><li>W3C, ISCB, IEEE, NCBO, OBO </li></ul></ul></ul><ul><ul><ul><li>SKOS, LOD / LODD and related resources </li></ul></ul></ul><ul><ul><li>Federation methods: </li></ul></ul><ul><ul><ul><li>Ontology Resources, URIs, SPARQL endpoints, ontology alignment, inference, D2R, SWObjects, … </li></ul></ul></ul><ul><ul><li>Scalability: </li></ul></ul><ul><ul><ul><li>Security, support for transactional processing </li></ul></ul></ul><ul><ul><li>Practice: </li></ul></ul><ul><ul><ul><li>Expertise, training, larger projects (FDA, DOD, NASA, World Cup,, Chevron, etc.) </li></ul></ul></ul>IO Informatics © 2011
    14. 14. Exponentially Growing Resources 2007 2008 2009 2011 IO Informatics © 2011
    15. 15. Healthcare / Life Sciences remains the largest data sector for SPARQL APIs
    16. 16. Integration / Federation Options Data Sources RDBMS / REST / Web Services Basic Transformation Query-Based Transformation SPARQL 2 SQL (D2R, SWObjects) Federated Semantic Access / Datastore(s) SPARQL APIs / SOAP / REST Provenance Versioning Governance Meaning
    17. 17. RDF / SPARQL Solution Stack DBs, Services SPARQL API, SPARQL conversion, ETL=>RDF Extensible Standards-based Framework Federated Access / Integrated Semantic Datastore(s) Prediction and Simulation Apps Query by Meaning Rich Browsing Other Data Sources Public Data, Services Other Apps
    18. 18. Example Use Cases <ul><li>Manufacturing: Link data sources across imprecise connections to verify reports </li></ul><ul><li>Animal Safety: Knowledge Network for discovery, qualification and validation of cross-species biomarkers </li></ul><ul><li>Personalized Medicine: Knowledge Network and screening application for personalized medicine </li></ul>IO Informatics © 2011
    19. 19. Example Questions <ul><li>What data sources support this specific manufacturing report about product purity and shelf-life? </li></ul><ul><li>What toxicity biomarkers are common to most animals? </li></ul><ul><li>What patients are showing combinations of indicators that predict risk? </li></ul>IO Informatics © 2011
    20. 20. Qualitative Benefits <ul><li>Link internal data with public resources </li></ul><ul><ul><li>E.G. – “out of box” linking of ELN data with LODD sources </li></ul></ul><ul><ul><li>Growing public resources provide cost-effective enrichment and hypothesis generation </li></ul></ul><ul><ul><li>Reduce dependencies on expensive commercial databases </li></ul></ul><ul><li>Emergent Properties </li></ul><ul><ul><li>ELN data enrichment and knowledge building made easy! </li></ul></ul><ul><ul><li>Supports inference, rich interrogation, serendipitous discovery </li></ul></ul><ul><li>R&D concepts can now be translated to immediate consumer benefits </li></ul><ul><ul><li>Previous “out of reach” integration and applications become practical </li></ul></ul>IO Informatics © 2011
    21. 21. Quantitative Benefits <ul><li>Reduced effort, time and cost to federate, update, extend to new datasets </li></ul><ul><li>Start sooner - reduce initial design and deployment burden </li></ul><ul><ul><li>Ontologies are explicit, can be engineered from or decoupled from resources, can be altered without refactoring </li></ul></ul><ul><li>Finish sooner - reduced time to create and test integrations </li></ul><ul><ul><li>Agile modeling, integration and testing </li></ul></ul><ul><li>Extend more easily - add new data sources and applications </li></ul><ul><ul><li>Building blocks to add new data, create new applications </li></ul></ul><ul><ul><li>End Users can adapt, modify, extend ontologies </li></ul></ul>IO Informatics © 2011
    22. 22. UBC: Knowledge Network for Organ Failure; “ASK” for Personalized Medicine IO Informatics © 2011
    23. 23. UBC: Knowledge Network for Organ Failure; “ASK” for Personalized Medicine <ul><li>Outcome </li></ul><ul><ul><li>Knowledge network for enrichment, visualization and qualification of patterns indicating risk of organ failure </li></ul></ul><ul><ul><li>Web-based deployment of SPARQL-based screening patterns across multiple data sources, indicating patients-at-risk </li></ul></ul>IO Informatics © 2011
    24. 24. <ul><li>Screening of transplant patients for likelihood of transplant failure, based on combined biomarker patterns </li></ul>Personalized Medicine Knowledge Network IO Informatics © 2011 <ul><li>Web-based Knowledge Application </li></ul><ul><li>Applies patterns for predictive screening </li></ul><ul><li>Weighing, scoring of results </li></ul><ul><li>Bring “hits” back into Knowledge Network for validation of hypotheses and algorithms </li></ul>
    25. 25. ASK™ <ul><li>Extended SPARQL-based Knowledge Application </li></ul><ul><li>Visualization, Testing, Validation of Systems-oriented Hypotheses </li></ul><ul><li>Automation, Dashboards </li></ul>Confidential IO Informatics © 2011
    26. 26. UBC / PROOF: Quantitative… <ul><ul><li>Integration and analysis time reduced from estimated 2 years to about 8 months FTE equivalent </li></ul></ul><ul><ul><li>Time to capture and apply patterns reduced from days to hours </li></ul></ul><ul><ul><li>Knowledge base can be / has been extended to include new public sources in hours (days / weeks with curation) </li></ul></ul><ul><ul><ul><li>Expensive commercial database no longer needed due to ease of integrating public resources </li></ul></ul></ul>IO Informatics © 2011
    27. 27. UBC / PROOF: Qualitative… <ul><ul><li>Visual SPARQL presents queries as hypotheses </li></ul></ul><ul><ul><ul><li>Make it possible for researchers to iteratively create, test and refine hypotheses </li></ul></ul></ul><ul><ul><ul><li>Research queries were published to web service for easy scale-up </li></ul></ul></ul><ul><ul><ul><li>Extended SPARQL delivers practically useful classifiers with scoring </li></ul></ul></ul><ul><ul><li>Use of RDF makes enrichment with public sources practical </li></ul></ul><ul><ul><li>Provenance, reference annotation, original data accessible for review </li></ul></ul><ul><li>“ [The] ability to consume and intuitively represent a wide variety of data-types - from images to quantitative data - and more importantly, display that data in ways that make the significant features immediately obvious to our biologist end-users, has allowed us to move to a completely new level of data analysis….” </li></ul>IO Informatics © 2011
    28. 28. Recap – Core Benefit Drivers <ul><li>Semantic technologies provide a standard method and framework for data publication and interoperability… not a new “data standard”! </li></ul><ul><li>Low barrier to entry for data publishing – extensible building blocks </li></ul><ul><li>for agile, growing integrations </li></ul><ul><li>Reduced effort, time and cost to deliver and maintain data </li></ul><ul><li>definitions and applications, particularly those that depend on </li></ul><ul><li>federation </li></ul><ul><li>Growing public resources are a catalyst and value-add </li></ul><ul><li>Projects that were impractical become practical to achieve and </li></ul><ul><li>maintain </li></ul>IO Informatics © 2011
    29. 29. Thank you! Next, W3C… For questions: <ul><li>Email: [email_address] </li></ul><ul><li>Website: </li></ul>IO Informatics © 2011