Bio2RDF : A biological knowledge base for the Semantic Web

Michel Dumontier, François Belleau,
Marc-Alexandre Nolin, Peter Ansell

Web search for biological information
is hit or miss

Introducing...

something you can look up and
search for with rich descriptions

Surface web:
167 terabytes

Deep web:
91,000 terabytes

545-to-one

Bio-portals provide database access
and give better results

We want to
simultaneously
query the 1000+
biological databases

Data silos – not made for sharing

How do we integrate these resources?

Bio2RDF provides the
methodology to create and
glue these different
networks.

Bio2RDF is building the linked data web for
biological data

Contributing to a growing linked data web

What is the semantic web?
The Semantic Web is a web of knowledge.

It is about standard formats for
representing and querying
knowledge drawn from
diverse sources and
making statements
about real
objects.

Goals for the Semantic Web
• Provide a common knowledge representation
• syntax & semantics
• Facilitate publishing, data integration and
information retrieval
• Make possible semantically interoperable web
applications and services
• Enable the answering of questions across global
repositories of knowledge

Resource Description Framework (RDF)

• Allows one to express propositions, and reason
about them
• Uniform Resource Identifiers (URIs) are entity names
• e.g. http://purl.uniprot.org/uniprot/Q16665
• An RDF statement consists of:
– Subject: resource identified by a URI (u:Q16665)
– Predicate: resource identified by a URI (rdf:type)
– Object: resource or literal (Protein)

Semantic Knowledge Base
[Figure: the knowledge base combines a fact (Q16665 rdf:type Protein) with an ontology (Protein rdfs:subClassOf Molecule).]
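
The split between fact and ontology can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not Bio2RDF code: triples are plain tuples using the abbreviated names from the slide, and `infer_types` is a hypothetical helper that applies just the RDFS subclass rule.

```python
# Minimal sketch: a knowledge base as a set of (subject, predicate, object) tuples.
# Names are abbreviated as on the slide; infer_types is a hypothetical helper.

FACTS = {("u:Q16665", "rdf:type", "Protein")}
ONTOLOGY = {("Protein", "rdfs:subClassOf", "Molecule")}


def infer_types(triples):
    """Return every rdf:type triple entailed by following rdfs:subClassOf links."""
    # Index direct superclasses: class -> set of parent classes.
    superclasses = {}
    for s, p, o in triples:
        if p == "rdfs:subClassOf":
            superclasses.setdefault(s, set()).add(o)

    # Start from the asserted types and walk up the class hierarchy.
    inferred = {t for t in triples if t[1] == "rdf:type"}
    frontier = list(inferred)
    while frontier:
        s, _, cls = frontier.pop()
        for parent in superclasses.get(cls, ()):
            new_triple = (s, "rdf:type", parent)
            if new_triple not in inferred:
                inferred.add(new_triple)
                frontier.append(new_triple)
    return inferred


knowledge_base = FACTS | ONTOLOGY
types = infer_types(knowledge_base)
```

Here the fact alone says Q16665 is a Protein; only together with the ontology does the knowledge base also entail that it is a Molecule.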

RDF/XML
<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [
  <!ENTITY u "http://purl.uniprot.org/uniprot/">
]>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:u="http://purl.uniprot.org/uniprot/">

  <rdf:Description rdf:about="&u;Q16665">
    <rdf:type rdf:resource="&u;Protein"/>
  </rdf:Description>
</rdf:RDF>

N3
@prefix u: <http://purl.uniprot.org/uniprot/> .

u:Q16665 a u:Protein .
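
Both serializations encode the same single statement. As a rough sketch, the snippet below writes that statement out as one N-Triples line with every URI fully expanded. The helper name `to_ntriple` is hypothetical, it handles URI objects only (no literals), and `Protein` under the UniProt prefix is the slide's illustrative class rather than a confirmed UniProt URI.

```python
# Minimal sketch: expand the slide's statement into a single N-Triples line.
# to_ntriple is a hypothetical helper; it only handles URI objects, not literals.

UNIPROT = "http://purl.uniprot.org/uniprot/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # what "a" abbreviates


def to_ntriple(subject, predicate, obj):
    """Format three absolute URIs as one N-Triples statement."""
    return f"<{subject}> <{predicate}> <{obj}> ."


line = to_ntriple(UNIPROT + "Q16665", RDF_TYPE, UNIPROT + "Protein")
print(line)
```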


Syntactic Data Integration
depends on consistent naming

[Figure: three sources describe the same resource. UniProt: u:Q16665 has name HIF1-alpha. Gene Ontology: u:Q16665 located in go:nucleus. BIND: u:Q16665 interacts with u:vhl. Because all three use the same URI, the statements merge into a unified view.]
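
A small sketch of why consistent naming matters: when every source uses the same identifier for the protein, integration is just a set union. The three source sets and the `describe` helper below are illustrative assumptions built from the labels on the slide, not real database exports.

```python
# Minimal sketch: syntactic integration as a set union over shared identifiers.
# The source contents mirror the slide; describe is a hypothetical helper.

UNIPROT = {("u:Q16665", "has name", "HIF1-alpha")}
GENE_ONTOLOGY = {("u:Q16665", "located in", "go:nucleus")}
BIND = {("u:Q16665", "interacts with", "u:vhl")}


def describe(subject, *sources):
    """Merge all sources and list the (predicate, object) pairs for one subject."""
    merged = set().union(*sources)
    return sorted((p, o) for s, p, o in merged if s == subject)


unified = describe("u:Q16665", UNIPROT, GENE_ONTOLOGY, BIND)

# A source that names the same protein differently contributes nothing here,
# which is the failure mode that consistent naming avoids.
mismatched = describe("uniprot:Q16665", UNIPROT, GENE_ONTOLOGY, BIND)
```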

Semantic Data Integration
depends on accurate typing

[Figure: u:Q16665 and u:vhl are both rdf:type Protein.]

Linked Data

http://www.w3.org/DesignIssues/LinkedData

Bio2RDF Design Principles

http://bio2rdf.wiki.sourceforge.net/Banff+Manifesto

Over 1800 namespaces

Compiled From: NAR, BioMoby, UniProt, NCBI, SRS

Naming Convention
http://bio2rdf.org/namespace:identifier

http://bio2rdf.org/pdb:1AM0

http://bio2rdf.org/gi:99
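
The convention is simple enough to capture in two small functions. This is a sketch under the assumption that the pattern is exactly `http://bio2rdf.org/namespace:identifier` as shown above; the function names are hypothetical and no validation against the real namespace list is attempted.

```python
# Minimal sketch of the naming convention; function names are hypothetical.

BASE = "http://bio2rdf.org/"


def bio2rdf_uri(namespace, identifier):
    """Build a URI of the form http://bio2rdf.org/namespace:identifier."""
    return f"{BASE}{namespace}:{identifier}"


def split_bio2rdf_uri(uri):
    """Recover (namespace, identifier) from a URI that follows the convention."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a Bio2RDF URI: {uri}")
    # Split on the first colon only, so identifiers may themselves contain colons.
    namespace, sep, identifier = uri[len(BASE):].partition(":")
    if not (sep and namespace and identifier):
        raise ValueError(f"expected namespace:identifier in {uri}")
    return namespace, identifier


print(bio2rdf_uri("pdb", "1AM0"))
print(split_bio2rdf_uri("http://bio2rdf.org/gi:99"))
```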

Namespace           Domain               Updated     Triples            Topics   Namespaces  SPARQL
Affymetrix          Probeset             loading     45560115           1708777  20          affymetrix
BIND                Network information  09/04/1930  -                  -        -           bind
BioCYC              Pathway/BioPAX       -           4418699622 + xref  -        -           biocyc
ChEBI@EBI           Chemistry            09/03/2025  4764030            50377    25          chebi
CPD@KEGG            Chemistry            09/04/2014  177199             14071    10          kegg
cPath               Pathway/BioPAX       09/04/2007  28052098           -        51          cpath
DBpedia             Encyclopedia         09/03/2023  190790             0        21          dbpedia
DR@KEGG             Drug                 09/04/2014  116822             8117     8           dr
EC@KEGG             Enzyme               09/04/2014  556888             4245     4           ec
EC@UniProt          Enzyme               09/04/2014  36109              -        -           enzyme
GeneID@NCBI         Gene                 loading     1.73E+08           -        86          geneid
GL@KEGG             Chemistry            09/04/2014  94148              10965    2           kegg
GO                  Ontology             09/03/2015  8188649            804979   144         go
HGNC                Genome               09/03/2025  1085662            125256   14          hgnc
HomoloGene@NCBI     Homolog              09/03/1931  6598206            -        7           homologene
IProClass@PIR       Protein              loading     1.92E+08           -        19          iproclass
MGI                 Genome               09/03/2025  3089976            -        12          mgi
OBO                 Ontology             09/03/2027  4507016            4954332  165         obo
OMIM@NCBI           Disease              09/03/2024  1048053            32102    7           omim
Path@KEGG           Pathway              09/03/2028  50793314           -        -           kegg
PDB                 Protein              09/03/2021  1215254            44569    2           pdb
Pubmed@UniProt      Article              09/03/1931  -                  -        -           pubmed
Pubmed@NCBI         Article              09/03/1931  -                  -        -           pubmed
Reactome            Pathway/BioPAX       09/04/2015  57527092           -        22          reactome
RN@KEGG             Pathway              09/04/2015  110971             7755     5           kegg
SGD                 Genome               09/04/2015  1437648            -        13          sgd
Taxonomy@UniProt    Taxon                09/04/2014  3230933            -        -           taxonomy
UniParc@UniProt     Sequence             09/04/2009  5.59E+08           -        53          uniparc
UniPathway@UniProt  Pathway              09/04/2014  8508               -        -           unipathway
UniProtKB@UniProt   Protein              09/04/2016  4.56E+08           -        135         uniprot
UniRef@UniProt      Homolog              09/04/2008  3.9E+08            -        5           uniref
UniSTS@NCBI         Marker               09/03/1931  7542235            -        7           unists

Bio2RDF Software
• http://sourceforge.net/projects/bio2rdf/
• Virtuoso Triple Store gives SPARQL endpoint
• Bio2RDF software transforms URIs to SPARQL
queries directed to one or more endpoints
• RDFizers – transform legacy data into RDF
– OMIM, KEGG
• SW DBs – rules to create Bio2RDF URIs
– DBpedia, BioPAX

SPARQL Endpoints
http://ns.bio2rdf.org/sparql

http://atlas.bio2rdf.org/sparql

Services
• Describe a resource
– http://bio2rdf.org/ns:id
• Global services over federated endpoints
– http://bio2rdf.org/links/ns:id
– http://bio2rdf.org/search/term
• Targeted services to a specific endpoint
– http://bio2rdf.org/linksns/ns/ns2:id
– http://bio2rdf.org/searchns/ns/term
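
The URL patterns above can be written down as a small table of templates. The sketch below assumes the five patterns are exactly as listed on the slide; the `SERVICES` table, the `service_url` helper and the percent-encoding of search terms are illustrative choices, not documented Bio2RDF behaviour.

```python
# Minimal sketch: the service URL patterns from the slide as string templates.
# SERVICES and service_url are hypothetical names; encoding the term is an assumption.
from urllib.parse import quote

BASE = "http://bio2rdf.org"

SERVICES = {
    "describe": "{base}/{ns}:{id}",
    "links": "{base}/links/{ns}:{id}",
    "search": "{base}/search/{term}",
    "linksns": "{base}/linksns/{endpoint}/{ns}:{id}",
    "searchns": "{base}/searchns/{endpoint}/{term}",
}


def service_url(service, **params):
    """Fill in one of the URL templates; search terms are percent-encoded."""
    if "term" in params:
        params["term"] = quote(params["term"])
    return SERVICES[service].format(base=BASE, **params)


print(service_url("describe", ns="pdb", id="1AM0"))
print(service_url("search", term="hexokinase"))
print(service_url("linksns", endpoint="uniprot", ns="pdb", id="1AM0"))
```

The first two kinds of call go to the federated services; `linksns` and `searchns` take an extra endpoint namespace to target a single endpoint.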

Describe service
http://bio2rdf.org/ns:id

Corresponding SPARQL query:
CONSTRUCT {
?s ?p ?o .
}
WHERE {
?s ?p ?o .
FILTER(?s = <http://bio2rdf.org/ns:id>).
}

Sent to http://ns.bio2rdf.org/sparql?query=...
DNS subdomain resolution service
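
As a rough illustration of this step, the snippet below turns a namespace and identifier into the CONSTRUCT query shown above and into a request URL on the namespace's subdomain. It assumes the endpoint pattern `http://ns.bio2rdf.org/sparql?query=...` from the slide and nothing else; the function names are hypothetical and no request is actually sent.

```python
# Minimal sketch: from ns:id to a SPARQL request URL. Function names are hypothetical;
# only the query text and the ns.bio2rdf.org/sparql pattern come from the slide.
from urllib.parse import urlencode


def describe_query(namespace, identifier):
    """Build the CONSTRUCT query for one Bio2RDF resource."""
    uri = f"http://bio2rdf.org/{namespace}:{identifier}"
    return (
        "CONSTRUCT { ?s ?p ?o . }\n"
        "WHERE {\n"
        "  ?s ?p ?o .\n"
        f"  FILTER(?s = <{uri}>) .\n"
        "}"
    )


def endpoint_request_url(namespace, identifier):
    """Route the query to the endpoint named after the namespace (subdomain)."""
    endpoint = f"http://{namespace}.bio2rdf.org/sparql"
    return endpoint + "?" + urlencode({"query": describe_query(namespace, identifier)})


print(describe_query("pdb", "1AM0"))
print(endpoint_request_url("pdb", "1AM0"))
```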

Search Service
http://bio2rdf.org/search/hexokinase

Virtuoso 6.0 Facet Browsing
http://lod.openlinksw.com/

Multiple Ways To Represent Knowledge

Fig. 2. Three ways to model the relationship between a protein and the volume it occupies.

Fig. 1. From linked data to linked knowledge through syntactic and semantic normalization.

OWL Has Explicit Semantics

Can therefore be used to capture knowledge in
a machine-understandable way

A generalized Biological Data Model

Semantic normalization will improve facet browsing
and question answering

You want to join the knowledge web

Bridge your data
with others in
semantic
communities
(data networks).

Time-sensitive or frequently updated data is
one way to encourage more visits.

Bioinformatics Discovery Registry
• Part of the SharedName initiative to provide stable URI
patterns for data records.
• We add the relationship between entities and records

Discovery Service
• Registry links entities to data records, their formats
(RDF/XML, HTML, etc.) and providers (Bio2RDF, UniProt)
http://registry.semanticscience.org/ns:id

Redirection Service
• Automatic redirection to data provider document
http://registry.semanticscience.org/doc/provider/format/ns:id

Build a
knowledge base
from a series of questions

Web-based Knowledge Discovery
a very painful process

Carole Goble (ISWC 2005)

The Knowledge Web
• Merging data & services
• Reasoning & question answering
• Persistent (RESTful)
• Trust & Security

Data consumers must be able
to rely upon your data to use it
as a foundation for their own
applications.

2009 Goals
• Add more data!
– Standardize RDFizers
– Enrichment from small producer data!
• Design more RESTful services (Workflow)
• Start using Virtuoso 6 cluster
• Add mirrors
• Approval from data providers to distribute RDF
dump and publish SPARQL endpoints
– Confirmed: UniProt, BioCyc, Pathway Commons, BIND

Triplified Data and Virtuoso DB

http://quebec.bio2rdf.org/download

RDFizer Cookbook

http://bio2rdf.wiki.sourceforge.net/

Thanks To:
• The Bio2RDF community
• Dumontier Lab
– Alex De Leon, Jose Cruz, Natalia Villanueva-Rosales
• Quebec Researchers
– François Belleau, Marc-Alexandre Nolin
• Australian Researchers
– Peter Ansell
• OpenLink Virtuoso Team

dumontierlab.com
michel_dumontier@carleton.ca
