Bio2RDF and Beyond!
 


The Bio2RDF project aims to transform silos of bioinformatics data into a distributed platform for biological knowledge discovery. Initial work focused on building a public database of open linked data with web-resolvable identifiers that provides information about named entities. This involved a syntactic normalization to convert open data represented in a variety of formats (flat file, tab-delimited, XML, web services) to RDF-based linked data with normalized names (HTTP URIs) and basic typing from source databases. Bio2RDF entities also refer to other open linked data networks (e.g. DBpedia), thus facilitating traversal across information spaces. However, a significant problem arises when attempting more sophisticated knowledge discovery approaches such as question answering or symbolic data mining. This is because knowledge is represented in fundamentally different ways across sources, requiring one to know each underlying data model and to reconcile the artefactual differences when they arise. In this talk, we describe our data integration strategy, which uses both syntactic and semantic normalization to consistently marshal knowledge to a common data model while leveraging explicit logic-based mappings with community ontologies to further enhance the biological knowledgescope.

  • Can’t answer questions that require background knowledge
  • But don’t have the flexibility to ask sophisticated questions
  • Can’t answer questions that require background knowledge
  • Research – that’s what brought you here
    Skills – marketable in whatever you choose to do thereafter
    Knowledgeable – where the field has been and where it is going
    Improve oral and written scientific communication skills
    Research – tell people what you’ve been doing
    Track progress – develop a sense of progress

Bio2RDF and Beyond! Presentation Transcript

  • Bio2RDF and Beyond! Large Scale, Distributed Biological Knowledge Discovery
    1
    EBI : 14-01-10
    Michel Dumontier, Ph.D.
    Associate Professor of Bioinformatics
    Carleton University
    Department of Biology
    School of Computer Science
    Institute of Biochemistry
    Ottawa Institute of Systems Biology
    Ottawa-Carleton Institute of Biomedical Engineering
  • Web-based Knowledge Discovery
    a very painful process
    Carole Goble (ISWC 2005)
  • Syntactic Web…
    It takes a lot of digging to get answers
  • Portals provide structured information
    and give better results
  • We need to expose the deep web
    Surface web: 167 terabytes
    Deep web: 91,000 terabytes
    545-to-one
  • Data silos – not made for sharing
  • How do we integrate these resources?
  • We want to simultaneously query the 1000+ biological databases
  • The Semantic Web is a web of knowledge.
    It is about standards for publishing, sharing and querying
    knowledge drawn from diverse sources
    It enables the answering of
    sophisticated questions
  • A growing web of linked data
  • Life Science Data Contributors
    HCLS (LODD)
    Neurocommons
    Bio2RDF
  • Resource Description Framework (RDF)
    Allows one to talk about anything
    Uniform Resource Identifier (URI) can be used as entity names
    Bio2RDF specifies the naming convention
    http://bio2rdf.org/uniprot:P05067
    is a name for Amyloid precursor protein
    http://bio2rdf.org/omim:104300
    is a name for Alzheimer disease
    uniprot:P05067
    omim:104300
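The naming convention on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming only the pattern visible in the two example URIs (`http://bio2rdf.org/<namespace>:<identifier>`); the helper names `split_name`, `bio2rdf_uri`, and `expand` are hypothetical and not part of any Bio2RDF software.

```python
from typing import Tuple

# Base taken from the example URIs on the slide.
BIO2RDF_BASE = "http://bio2rdf.org/"


def split_name(name: str) -> Tuple[str, str]:
    """Split a compact name such as 'uniprot:P05067' into (namespace, identifier)."""
    namespace, _, identifier = name.partition(":")
    if not namespace or not identifier:
        raise ValueError(f"expected namespace:identifier, got {name!r}")
    return namespace, identifier


def bio2rdf_uri(namespace: str, identifier: str) -> str:
    """Build a URI following the pattern shown on the slide."""
    return f"{BIO2RDF_BASE}{namespace}:{identifier}"


def expand(name: str) -> str:
    """Expand a compact name into its full, web-resolvable form."""
    return bio2rdf_uri(*split_name(name))


print(expand("uniprot:P05067"))  # a name for Amyloid precursor protein
print(expand("omim:104300"))     # a name for Alzheimer disease
```

Because every dataset is named with the same pattern, two sources that mention `uniprot:P05067` end up talking about the same resource.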
  • Resource Description Framework (RDF)
    Allows one to express statements
    An RDF statement consists of:
    • Subject: resource identified by a URI
    • Predicate: resource identified by a URI
    • Object: resource or literal
    uniprot:P05067
    is a
    uniprot:Protein
  • Multi-Source Data Integration
    depends on consistent naming
    [Figure: statements about uniprot:P05067 from three sources merge into a unified view; edge labels: is a, has name, located in, interacts with]
    UniProt: uniprot:P05067 is a uniprot:Protein
    Gene Ontology: uniprot:P05067 located in go:Membrane
    iRefIndex: uniprot:P05067 interacts with uniprot:P05067
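A small sketch of why consistent naming matters for integration: if each source is modelled as a set of (subject, predicate, object) tuples, the unified view is just a set union, and statements line up because they use the same names. The tuple representation and the function names `unified_view` and `describe` are assumptions made for illustration; the statements are the ones listed on the slide.

```python
# Each statement is a (subject, predicate, object) tuple of compact names.
uniprot = {("uniprot:P05067", "is a", "uniprot:Protein")}
gene_ontology = {("uniprot:P05067", "located in", "go:Membrane")}
irefindex = {("uniprot:P05067", "interacts with", "uniprot:P05067")}


def unified_view(*sources):
    """Union the statement sets from every source into one graph."""
    merged = set()
    for source in sources:
        merged |= source
    return merged


def describe(graph, subject):
    """Return the sorted (predicate, object) pairs recorded for a subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


graph = unified_view(uniprot, gene_ontology, irefindex)
for predicate, obj in describe(graph, "uniprot:P05067"):
    print("uniprot:P05067", predicate, obj)
```

If one source had used a different name for the same protein, its statements would remain disconnected from the rest of the graph.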
  • Building statements creates knowledge
    [Figure: graph of statements]
    uniprot:P05067 label Amyloid precursor protein
    uniprot:P05067 is a Protein
    omim:104300 label Alzheimer Disease
    omim:104300 is a Disease
    uniprot:P05067 is involved in omim:104300
  • RDF has multiple representations
    RDF/XML
    <?xml version="1.0"?>
    <!DOCTYPE rdf:RDF [<!ENTITY u "http://bio2rdf.org/uniprot:">]>
    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:u="http://bio2rdf.org/uniprot:">
      <rdf:Description rdf:about="&u;Q16665">
        <rdf:type rdf:resource="&u;Protein"/>
      </rdf:Description>
    </rdf:RDF>
    RDF/N3
    @prefix u: <http://bio2rdf.org/uniprot:> .
    u:Q16665 a u:Protein .
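To make the point about multiple representations concrete, the sketch below writes the same single statement in two textual forms using only the Python standard library. N-Triples is a line-based RDF syntax that does not appear on the slide and is used here as an assumption; the function names are hypothetical, and a real project would normally rely on an RDF library instead of hand-built strings.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
UNIPROT = "http://bio2rdf.org/uniprot:"

# One statement: Q16665 is a Protein, with every term given as a full URI.
statements = [(UNIPROT + "Q16665", RDF_TYPE, UNIPROT + "Protein")]


def to_ntriples(triples):
    """Write each statement as '<s> <p> <o> .' with full URIs."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def shorten(uri, prefix, namespace):
    """Abbreviate a URI to prefix:local when it falls in the namespace."""
    if uri.startswith(namespace):
        return f"{prefix}:{uri[len(namespace):]}"
    return f"<{uri}>"


def to_n3(triples, prefix="u", namespace=UNIPROT):
    """Write the statements in N3 style, using 'a' for rdf:type."""
    lines = [f"@prefix {prefix}: <{namespace}> ."]
    for s, p, o in triples:
        predicate = "a" if p == RDF_TYPE else shorten(p, prefix, namespace)
        subject = shorten(s, prefix, namespace)
        obj = shorten(o, prefix, namespace)
        lines.append(f"{subject} {predicate} {obj} .")
    return "\n".join(lines)


print(to_ntriples(statements))
print(to_n3(statements))
```

Both outputs describe the same graph; only the surface syntax differs.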
  • Bio2RDF is a framework to create and provision linked data networks
    Francois Belleau, Laval University
    Marc-Alexandre Nolin, Laval University
    Peter Ansell, Queensland University of Technology
    Michel Dumontier, Carleton University
  • Bio2RDF’s RDFized data fits together
  • Bio2RDF now serving over 5 / 15 billion triples of linked biological data
  • Bio2RDF linked data
  • Bioinformatics Discovery Registry
    SharedName initiative to provide stable URI patterns for data records.
    We added the relationship between entities and records
    Directory Service
    ~1700 datasets & dozens of resolvers.
    Discovery Service
    Registry links entities to data records, their formats (RDF/XML, HTML, etc) and provider (Bio2RDF, Uniprot)
    Redirection Service
    Automatic redirection to data provider document
  • something you can look up or search for with rich descriptions
  • Bio2RDF: Raw Data!
  • SPARQL is the new cool kid on the query block
    SQL
    SPARQL
  • Bio2RDF’s describe service uses SPARQL
    CONSTRUCT {
    ?s ?p ?o .
    }
    WHERE {
    ?s ?p ?o .
    FILTER(?s = <http://bio2rdf.org/ns:id>).
    }
    Sent to http://ns.bio2rdf.org/sparql?query=...
    http://bio2rdf.org/ns:id
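The describe pattern above can be sketched as a small URL builder. This is an illustration only: it assumes that `ns` in `http://ns.bio2rdf.org/sparql` stands for the dataset namespace and that the query is passed as a URL-encoded `query` parameter, as the slide suggests. The function names are hypothetical, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def describe_query(uri):
    """Build the CONSTRUCT query from the slide for one resource URI."""
    return (
        "CONSTRUCT { ?s ?p ?o . } "
        "WHERE { ?s ?p ?o . FILTER(?s = <" + uri + ">) }"
    )


def describe_url(namespace, identifier):
    """Build the request URL for a namespace-specific SPARQL endpoint."""
    uri = f"http://bio2rdf.org/{namespace}:{identifier}"
    endpoint = f"http://{namespace}.bio2rdf.org/sparql"  # assumed pattern
    return endpoint + "?" + urlencode({"query": describe_query(uri)})


url = describe_url("uniprot", "P05067")
print(url)

# Decoding the query string recovers the original SPARQL text.
print(parse_qs(urlsplit(url).query)["query"][0])
```

Keeping the query construction separate from the endpoint choice mirrors the idea that the same describe request can be routed to different providers.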
  • Bio2RDF’s search service uses SPARQL
    http://bio2rdf.org/search/hexokinase
    bio2rdf.org
    kegg
    gene
    uniprot
  • Bio2RDF
    Scalable, Decentralized Data Provision
    Globally Mirrored and Point Provision
    Customizable Query Resolution
  • Customizable Configuration (in N3)
    Single Query, Single Provider
  • Query Resolution
  • 700,000 queries in November 2009
  • Yay for data!
    But how do we discover more than what was in the data?
  • Ontology as Strategy
  • Reasoning and Inference through Semantics
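    [Figure: a fact combined with an ontology yields a knowledge base]
    fact: uniprot:P05067 is a uniprot:Protein
    ontology: uniprot:Protein is a chebi:PolyatomicEntity
    knowledge base: uniprot:P05067 is a chebi:PolyatomicEntity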
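The inference on this slide can be sketched as a tiny forward-chaining loop. This is a simplification under stated assumptions: "is a" is treated as a single transitive relation, exactly as drawn on the slide, whereas OWL distinguishes class membership from subclass relations and a real reasoner does far more. The function name `infer_is_a` is hypothetical.

```python
facts = {("uniprot:P05067", "is a", "uniprot:Protein")}
ontology = {("uniprot:Protein", "is a", "chebi:PolyatomicEntity")}


def infer_is_a(statements):
    """Add implied 'is a' statements until nothing new can be derived."""
    closure = set(statements)
    changed = True
    while changed:
        changed = False
        is_a = [(s, o) for s, p, o in closure if p == "is a"]
        for subject, middle in is_a:
            for other, target in is_a:
                implied = (subject, "is a", target)
                if middle == other and implied not in closure:
                    closure.add(implied)
                    changed = True
    return closure


knowledge_base = infer_is_a(facts | ontology)
for statement in sorted(knowledge_base):
    print(statement)
```

The third statement, that uniprot:P05067 is a chebi:PolyatomicEntity, appears in neither input set; it exists only because the fact and the ontology were combined.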
  • The Web Ontology Language (OWL) Has Explicit Semantics
    Can therefore be used to capture knowledge in a machine-understandable way
  • Over 170 bio-ontologies
  • From linked data to linked knowledge through syntactic and semantic normalization.
  • Multiple Ways To Represent Knowledge
    Three ways to model the relationship between a protein and the volume it occupies.
  • Web-based Knowledge Discovery
    Some of our queries need services
  • The Holy Grail:
    Align the promoters of all serine/threonine kinases involved exclusively in the regulation of cell sorting during wound healing in blood vessels.
    Retrieve and align 2000nt 5' from every serine/threonine kinase in Mus musculus expressed exclusively in the tunica [I | M | A] whose expression increases 5X or more within 5 hours of wounding but is not activated during the normal development of blood vessels, and is <40% similar in the active site to kinases known to be involved in cell-cycle regulation in any other species.
  • Semantic Automated Discovery and Integration
    http://sadiframework.org
    Mark Wilkinson, UBC
    Michel Dumontier, Carleton University
    Christopher Baker, UNB
  • SADI – described oriented service matching based on registered predicates
  • What pathways does UniProt protein P47989 belong to?
    PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
    PREFIX ont: <http://ontology.dumontierlab.com/>
    PREFIX uniprot: <http://lsrn.org/UniProt:>
    SELECT ?gene ?pathway
    WHERE {
    uniprot:P47989 pred:isEncodedBy ?gene .
    ?gene ont:isParticipantIn ?pathway .
    }
  • SADI
    • Describe the input and output using OWL-DL classes
    • Subject of input and output must be the same
    • Web services correspond to predicates
    • Biocatalogue to register SADI-compliant services
    • Simplified migration path for existing web services (java, perl)
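The idea that web services correspond to predicates can be sketched with a plain dictionary acting as the registry. Everything here is hypothetical: the two service functions stand in for remote SADI services, the gene and pathway values are placeholders rather than real results, and the code does not use the actual SADI API. It only shows how a query like "What pathways does UniProt protein P47989 belong to?" could be answered by looking up one service per predicate.

```python
def find_gene(protein):
    """Hypothetical service for pred:isEncodedBy (placeholder data)."""
    return {"uniprot:P47989": ["example:gene1"]}.get(protein, [])


def find_pathways(gene):
    """Hypothetical service for ont:isParticipantIn (placeholder data)."""
    known = {"example:gene1": ["example:pathwayA", "example:pathwayB"]}
    return known.get(gene, [])


# Registry: each predicate is served by exactly one service in this sketch.
REGISTRY = {
    "pred:isEncodedBy": find_gene,
    "ont:isParticipantIn": find_pathways,
}


def follow(subject, predicate):
    """Call the service registered for a predicate.

    The output statements keep the input subject, matching the rule that
    the subject of input and output must be the same.
    """
    service = REGISTRY[predicate]
    return [(subject, predicate, obj) for obj in service(subject)]


def pathways_for(protein):
    """Resolve the two triple patterns of the query one after the other."""
    rows = []
    for _, _, gene in follow(protein, "pred:isEncodedBy"):
        for _, _, pathway in follow(gene, "ont:isParticipantIn"):
            rows.append((gene, pathway))
    return rows


print(pathways_for("uniprot:P47989"))
```

Each call returns new statements about its input, so the answers accumulate into a small graph as the query is resolved.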
  • Build a knowledge base from a series of questions
  • You want to join the knowledge web
  • Share your data
  • Build semantic web services
  • Get to where you want to be … faster!
  • Next Steps
    Service and Data Buildout
    Formal Partnerships
    Applications
  • dumontierlab.com
    michel_dumontier@carleton.ca
  • We’re interested in Personalized Medicine
    The ability to offer
    The Right Drug
    To The Right Patient
    For The Right Disease
    At The Right Time
    With The Right Dosage
    Genetic and metabolic data will allow drugs to be tailored to patient subgroups
  • PHARMGKB
    is an emerging resource for pharmacogenomics
    + Role of genes, gene variants, drugs
    + pharmacokinetics
    + pharmacodynamics
    + clinical outcomes.
    + Links to publications
    - Natural language descriptions
    - Variant details in publications
  • Pharmacogenomics of Depression KNOWLEDGE BASE
    contains statements from 11/40 relevant publications involving 45 genes / gene variants, 57 drugs annotated with 19 classes of antidepressants, 45 drug treatments, 47 drug-gene interactions, 29 clinical outcomes, 10 drug-induced side-effects, and 8 gene-disease interactions.
  • Protégé 4, FaCT++, DL Query Tab
    Querying the PDKB
    Nortriptyline induced side effects for ABCB1 gene variants
    ‘side effect’ that
    ‘is realized by’ some
    (‘drug treatment’ that
    ‘involves’ some ‘nortriptyline’ and
    ‘involves’ some (‘variant of’ some ‘ABCB1’))
    postural hypotension is a side effect of nortriptyline treatment of depression for individuals presenting the 3435C>T genotype