Bio2RDF @ DILS 2008

Bio2RDF: A Semantic Web Atlas of
post genomic knowledge about
Human and Mouse
François Belleau, Nicole Tourigny,
Benjamin Good and Jean Morissette
● Centre de Recherche du CHUL, Université Laval
● Département d'informatique et de génie logiciel, Université Laval

Vaugondy, geographer to Louis XV: view of
the world in the 18th
century

Google Maps view of the world in the
21st
century

Evry, June 27, 2008, CHUL Research Center, Laval University
Outline
● Introduction
− Problem definition
− Proposed approach
− The 4 rules of linked data
− Related Work
● Results
− Bio2RDF first knowledge map
− Semantic ranking
● Paget query demo with SPARQL
● Future work and Conclusion

Problem definition
● The objective of data integration is to
make data distributed over a number of
distinct, heterogeneous databases
accessible via a single interface
[Davidson 1995].
● We already use global text search engines
on the web (Google, Yahoo).
● There are many specialized integrated
search tools in bioinformatics (NCBI
Entrez, EBI search, KEGG GenomeNet).

What is known about
"Paget disease"?
but first ...
What is known about the
mouse and human
genomes?

Popular web search engines
without semantics

Some Bioinformatics integrated
search tools
● EMBL-EBI EB-eye search
● KEGG GenomeNet
● NCBI Entrez

EMBL-EBI search

NCBI Entrez life science search
across databases

KEGG GenomeNet search

Bio2RDF search
What is known about Paget
disease in the mouse and
human genomes?

Proposed approach
● Apply the semantic web model to data
integration in bioinformatics;
● Use LinkRank, a PageRank [Brin 1998] variation
adapted to semantic graphs, a method
analogous to the work of the Aleman-Meza
group;
● Adopt standards (RDF, OWL) and use
existing software (Sesame, Virtuoso,
PiggyBank).

The 4 rules of linked data
http://www.w3.org/DesignIssues/LinkedData

Rule #1: Use URIs as names for
things.
● Using normalized identifiers to name
concepts is already a reality in
biology.
● Hexokinase is GO:0004396
● Definition :
− Catalysis of the reaction: ATP + D-hexose =
ADP + D-hexose 6-phosphate.
● Synonym of EC:2.7.1.1

Rule #2: Use HTTP URIs so that
people can look up those names.
● Dereferenceable URL
● The Banff Manifesto rule for URN
− urn:bm:public_namespace:private_identifier
● Normalized URL according to Banff
Manifesto:
http://bio2rdf.org/public_namespace:private_identifier
● http://bio2rdf.org/go:0004396
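
The naming convention above can be sketched in a few lines of Python. This is only an illustration: it assumes that the public namespace is lowercased (as the go:0004396 example suggests) and that the private identifier is used verbatim. The function names are hypothetical and are not part of the Bio2RDF software.

```python
BASE = "http://bio2rdf.org/"  # base URL shown on the slide


def split_identifier(identifier):
    """Split 'GO:0004396' into ('go', '0004396') at the first colon."""
    namespace, sep, private_id = identifier.partition(":")
    if not sep or not namespace or not private_id:
        raise ValueError(f"expected namespace:identifier, got {identifier!r}")
    # Assumption: namespaces are lowercased, identifiers are kept as-is.
    return namespace.lower(), private_id


def banff_urn(identifier):
    """urn:bm:public_namespace:private_identifier"""
    namespace, private_id = split_identifier(identifier)
    return f"urn:bm:{namespace}:{private_id}"


def bio2rdf_url(identifier):
    """http://bio2rdf.org/public_namespace:private_identifier"""
    namespace, private_id = split_identifier(identifier)
    return f"{BASE}{namespace}:{private_id}"


print(banff_urn("GO:0004396"))    # urn:bm:go:0004396
print(bio2rdf_url("GO:0004396"))  # http://bio2rdf.org/go:0004396
```

Splitting at the first colon only keeps identifiers such as EC:2.7.1.1 intact, since the dots and any later characters stay in the private identifier.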

Rule #3: When someone looks up a
URI, provide useful information.
● http://bio2rdf.org/go:0004396 returns the
RDF graph of this topic
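
A minimal sketch of what looking up such a URI could look like from a client, using only the Python standard library. It assumes the server answers a plain HTTP GET and honours an Accept header for RDF/XML; neither behaviour is confirmed by the slides, and the helper names are hypothetical. The request is only built here, not sent.

```python
import urllib.request


def rdf_request(uri, media_type="application/rdf+xml"):
    """Build (but do not send) a GET request asking for an RDF serialization."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def fetch_rdf(uri, timeout=30):
    """Send the request and return the response body as text.

    Assumption: the service is reachable and returns UTF-8 text.
    """
    with urllib.request.urlopen(rdf_request(uri), timeout=timeout) as response:
        return response.read().decode("utf-8")


request = rdf_request("http://bio2rdf.org/go:0004396")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Calling fetch_rdf("http://bio2rdf.org/go:0004396") would perform the actual lookup, provided the service is online.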

Rule #4: Include links to other URIs so
that they can discover more things.
● Openness Ratio > 0 (to be defined)

Related work
● DBPedia
● YeastHub
● UniProt
● HCLS linked data
● Bio2RDF architecture

Related work – Linked data map
http://wiki.dbpedia.org/Interlinking

Related work – Linked data map
● If we were to draw a map of the existing
relations between linked data from
bioinformatics database providers, what
would it look like?
● Could we measure the amount of post
genomic knowledge available related to a
mouse or human genome sequence?
● Could it help answer the "what is known"
question?

Related work – YeastHub

Related work – UniProt beta

Related work – HCLS demo

Bio2RDF architecture

Bio2RDF datasources currently loaded
in the Atlas graph

What is known about the human
and mouse genomes in 2008?

What is Bioinformatics linked data?

http://bio2rdf.org/map
The Bio2RDF linked data map is a first
attempt at an answer

Semantic Web Ranking
● Openness Ratio
● Average Link Rank
● Semantic weight

The semantic mashup effect
OR = 0
ALR = 2
MeSH
OR = 1
ALR = 1
GeneID
OR = 0.5
ALR = 1.5
PubMed
mean OR = 0.5
mean ALR = 1.5
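
The two means on this slide are plain arithmetic averages of the three per-datasource values. A short check in Python, with the values listed in the order they appear on the slide (which value pair belongs to which datasource is not needed for the average):

```python
from statistics import mean

# (OR, ALR) pairs in the order they appear on the slide
pairs = [(0.0, 2.0), (1.0, 1.0), (0.5, 1.5)]

mean_or = mean([openness for openness, _ in pairs])
mean_alr = mean([link_rank for _, link_rank in pairs])

print(mean_or, mean_alr)  # 0.5 1.5
```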

The semantic mashup effect
OR = 0
ALR = 2.3
MeSH
OR = 1
ALR = 1
GeneID
OR = 0.5
ALR = 1.5
PubMed
mean OR = 0.4
mean ALR = 1.6

Bio2RDF statistics by
datasource

Bio2RDF: OR = 0.6
30 datasources, 225 namespaces

Knowledge gain of 0.19
From 0.77
to 0.58

Bio2RDF Semantic Web Atlas
in numbers
● 30 different datasources, 30 different
namespaces
− go, geneid, uniprot, pubmed, pdb, reactome, omim,
etc.
● 195 namespaces referencing non-rdfized
datasources
− cog, genethon, tigr, cath, goa, etc.
● 8 million topics
● 65 million triples
● 973 MB of compressed data in N3 format
− http://bio2rdf.org/download/bio2rdf-atlas-080414.n3.gz

Bio2RDF Semantic Web Atlas
in statistics
● Openness Ratio (OR) of 0.58
● Average Link Rank (ALR) of 4.7
● 8 million topics are connected by 19 million
relations within the graph
● 58% of URIs reference the open world
outside the graph
● 19% knowledge gain because of the mashup
effect
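
The slides leave the Openness Ratio "to be defined". One possible reading, based only on the statement that it is the share of URIs referencing the world outside the graph, is sketched below on a toy graph. The definition used here (referenced URIs with no description in the graph, divided by all referenced URIs) is an assumption and not necessarily the authors' formula. The identifiers and the xRef predicate are made up for illustration.

```python
def openness_ratio(triples):
    """Share of distinct referenced URIs that are not described in the graph.

    Assumption: a URI is "described" when it appears as the subject of at
    least one triple, and "referenced" when it appears as an object.
    """
    described = {subject for subject, _, _ in triples}
    referenced = {obj for _, _, obj in triples}
    if not referenced:
        return 0.0
    outside = referenced - described
    return len(outside) / len(referenced)


# Toy graph with hypothetical identifiers: geneid:1 and pubmed:1 are
# described, mesh:1 and omim:1 are only pointed to.
toy_graph = [
    ("geneid:1", "xRef", "pubmed:1"),
    ("geneid:1", "xRef", "omim:1"),
    ("pubmed:1", "xRef", "mesh:1"),
    ("pubmed:1", "xRef", "geneid:1"),
]

print(openness_ratio(toy_graph))  # 0.5
```

Under this reading, loading a new datasource that describes mesh:1 would lower the ratio, since one more referenced URI would then resolve inside the graph.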

Bio2RDF search demo with
SPARQL
What is known about Paget
disease in the mouse and
human genomes?
Submitted at
http://bio2rdf.org:8890/sparql

Submit the SPARQL query to Virtuoso

SPARQL query in a URL
http://bio2rdf.org:8890/sparql?default-graph-
uri=&query=CONSTRUCT+%7B%0D%0A%3Fs1+%3Fp1+%3Fo1+.%0D%0A
%3Fs1+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-
syntax-ns%23type%3E+%3Ftype+.+%0D%0A%3Fs1+%3Chttp%3A%2F
%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23label%3E+
%3Flabel.+%0D%0A%3Fs1+%3Chttp%3A%2F%2Fbio2rdf.org
%2Fbio2rdf%23linkRank%3E+%3FlinkRank.+%0D%0A%7D%0D
%0AWHERE+%7B%0D%0A%3Fs1+%3Fp1+%3Fo1+.+%0D%0A%3Fo1+bif
%3Acontains+%22paget%22+.%0D%0A%3Fs1+%3Chttp%3A%2F
%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23type%3E+
%3Ftype+.+%0D%0A%3Fs1+%3Chttp%3A%2F%2Fwww.w3.org
%2F2000%2F01%2Frdf-schema%23label%3E+%3Flabel.+%0D%0A
%3Fs1+%3Chttp%3A%2F%2Fbio2rdf.org%2Fbio2rdf%23linkRank
%3E+%3FlinkRank.+%0D%0A%7D%0D%0A%0D%0A%0D%0A%0D
%0A&format=application%2Frdf%2Bxml&debug=on
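
Decoded, the URL above carries a CONSTRUCT query that returns every triple whose object matches the keyword "paget", together with the type, label and linkRank of its subject. The sketch below rebuilds an equivalent URL with the Python standard library. It writes the rdf:type and rdfs:label URIs and the default-graph-uri parameter in their standard hyphenated forms, uses plain newlines instead of the CRLF pairs of the original, and does not escape the keyword; the function names are hypothetical. bif:contains is Virtuoso's full-text search extension, so the query is specific to that server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://bio2rdf.org:8890/sparql"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
LINK_RANK = "http://bio2rdf.org/bio2rdf#linkRank"


def build_query(keyword):
    """Return the CONSTRUCT query for a full-text keyword (not escaped)."""
    shared = [
        f"?s1 <{RDF_TYPE}> ?type .",
        f"?s1 <{RDFS_LABEL}> ?label .",
        f"?s1 <{LINK_RANK}> ?linkRank .",
    ]
    lines = (
        ["CONSTRUCT {", "?s1 ?p1 ?o1 ."] + shared + ["}"]
        + ["WHERE {", "?s1 ?p1 ?o1 .", f'?o1 bif:contains "{keyword}" .']
        + shared + ["}"]
    )
    return "\n".join(lines)


def build_url(keyword="paget"):
    """Encode the query and the other parameters into a GET URL."""
    params = {
        "default-graph-uri": "",
        "query": build_query(keyword),
        "format": "application/rdf+xml",
        "debug": "on",
    }
    return ENDPOINT + "?" + urlencode(params)


url = build_url()
print(url)

# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == build_query("paget"))  # True
```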

View results in HTML

View results with Sesame

View results with Piggy Bank

Future work
● Create new rdfizers for public data sources;
● Build a community of users around the
Bio2RDF project (visit the Google group);
● Connect more datasources to Bio2RDF by
building collaborations between research
groups;
● Offer a public SPARQL endpoint based on
a Virtuoso server:
− http://bio2rdf.org:8890/sparql

Conclusion
Those devices in the hands of scientists have
forged our understanding of nature.

Conclusion
We have started to map the knowledge
space of biology. We have a first
impression of what the bioinformatics
nation looks like. The time has come to
explore it; the time has come to build
the knowledgescope.

Acknowledgments
Jean Morissette
Nicole Tourigny
Benjamin Good
Bioinformatics lab team at the CHUL Research Center:
Philippe Rigault
Marc-Alexandre Nolin
Thanks to the essential annotators and data providers,
and to the developers of the open source projects
Sesame, Virtuoso and PiggyBank.
François Belleau was a recipient of a studentship from Génome Québec.
This work has been financed in part by the Atlas of Genomic Profiles of Steroid
Action, a Genome Canada project. BMG is funded by Pacific Century
and University of British Columbia Graduate Fellowships.

http://bio2rdf.org
Query the graph with SPARQL
http://bio2rdf.org:8890/sparql
Download our software
http://sourceforge.net/projects/bio2rdf/
Download the Atlas data in N3 format
http://bio2rdf.org/download
Join our group
http://groups.google.ca/group/bio2rdf
Contact us at bio2rdf@gmail.com
