Linked Data for integrating life-science databases


[Figure: growth chart; vertical axis from 0 to 150B, horizontal axis from 1982 to 2010]

ID

Gene Ontology, EC etc

RDF

• UniProt

• PDBJ, DDBJ

• Bio2RDF, BioGateway
RDF

UniProt RDF

• UniProt

• UniProt RDF

UniProt

| Name | Description | Source | File size | #triples |
|---|---|---|---|---|
| uniprot | Protein annotation data | UniProt consortium | 14G | 3.3B |
| uniref | Clusters of proteins with similar sequences | UniProt consortium | 7G | 900M |
| uniparc | Non-redundant archive of UniProt sequences | UniProt consortium | 65G | 1B |
| citations | Literature citations | UniProt consortium | 1355M | 10,177,308 |
| taxonomy | Classification of organisms | UniProt consortium | 421M | 5,041,437 |
| journals | Journals | UniProt consortium | 3M | 34,850 |
| pathways | Pathways | UniProt consortium | 1000K | 8,865 |
| keywords | Keywords | UniProt consortium | 940K | 8,449 |
| locations | Subcellular locations | UniProt consortium | 468K | 4,476 |
| tissues | Tissues | UniProt consortium | 572K | 7,439 |
| components | Cellular components (organelles) | UniProt consortium | 6K | 43 |
| go | Gene ontology | SBI | 25M | 263,944 |
| enzymes | Classification of enzymes | GO consortium | 4M | 4,476 |
| core.owl | Classes and properties for UniProt RDF | UniProt consortium | 152K | |

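| Triple store | Language | #triples |
|---|---|---|
| Sesame | Java | 70M |
| 4store | C | 15B |
| 5store | C | |
| Virtuoso | C | 15.4B |
| Jena | Java | 1.7B |
| Bigdata | Java | 12.7B |
| ARC | PHP | |
| AllegroGraph | Lisp | 1B |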
http://esw.w3.org/LargeTripleStores

Protein UniProt
Components encodedIn
core.owl

<owl:ObjectProperty rdf:about="encodedIn">
  <rdfs:label rdf:datatype="&xsd;string">encoded in</rdfs:label>
  <rdfs:comment rdf:datatype="&xsd;string">The subcellular location where a protein is encoded.</rdfs:comment>
  <rdfs:domain rdf:resource="Protein"/>
  <rdfs:range rdf:resource="Subcellular_Location"/>
</owl:ObjectProperty>

RDF purl
http://purl.uniprot.org/{database}/{identifier}

UniProt

http://purl.uniprot.org/core/

Gene URI

http://purl.uniprot.org/core/Gene

type
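The URI template above can be filled in mechanically. The following is a minimal Python sketch, assuming nothing beyond the `http://purl.uniprot.org/{database}/{identifier}` pattern shown on the slide; the helper name `uniprot_purl` and the `taxonomy`/`9606` arguments are illustrative choices, not part of the original slides.

```python
from urllib.parse import quote

# Base of the PURL scheme shown on the slide.
PURL_BASE = "http://purl.uniprot.org"


def uniprot_purl(database: str, identifier: str) -> str:
    """Fill the {database}/{identifier} template, percent-encoding both parts."""
    return f"{PURL_BASE}/{quote(database, safe='')}/{quote(identifier, safe='')}"


# The class URI from the slide: database "core", identifier "Gene".
print(uniprot_purl("core", "Gene"))      # http://purl.uniprot.org/core/Gene

# Illustrative arguments (an assumption for this sketch, not from the slides).
print(uniprot_purl("taxonomy", "9606"))  # http://purl.uniprot.org/taxonomy/9606
```

Percent-encoding each path segment is a defensive choice for this sketch; identifiers that contain only letters and digits pass through unchanged.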

PDBJ, DDBJ RDF

• PDBJ
47 4.7B

• http://www.pdbj.org/rdf ID

• DDBJ INSD: International Nucleotide
Sequence Database 1.2 76
7.6B

• mulgara (http://mulgara.org/)

RDF

KEGG Taxonomy 23,238
KEGG GENES Cyanobacteria 708,745
KEGG OC 10,384,602
hmmer Pfam-A vs Cyano 11,881,212
hmmer Pfam-B vs Cyano 7,007,154
Kazusa Annotation 2,807,879

1

•

• Synechococcus

• 1.0e-20

• Pfam

1 SPARQL
SPARQL
PREFIX hmmer: <http://hmmer.janelia.org/>
PREFIX kegg: <http://www.kegg.jp/>
PREFIX kg: <http://www.kegg.jp/entry/>
PREFIX pfam: <http://pfam.sanger.ac.uk/>
PREFIX kt: <http://www.kegg.jp/taxon/>
# For each pair of distinct Pfam domains hitting the same gene,
# count the Synechococcus organisms in which the pair occurs.
SELECT ?pfam1 ?pfam2 (COUNT(DISTINCT ?org) AS ?n_org)
WHERE {
  GRAPH <hmmer_pfam_a_cyano> {
    ?gene hmmer:hit ?n1 .
    ?gene hmmer:hit ?n2 .
    ?n1 pfam:pfam_id ?pfam1 .
    ?n1 hmmer:i-evalue ?eval1 .
    ?n2 pfam:pfam_id ?pfam2 .
    ?n2 hmmer:i-evalue ?eval2 .
  }
  GRAPH <http://www.kegg.jp/genes> {
    ?gene kegg:belongs_to ?org .
  }
  GRAPH <http://www.kegg.jp/taxonomy> {
    ?org kegg:belongs_to kt:Synechococcus .
  }
  FILTER (?eval1 < 1.0e-10 && ?eval2 < 1.0e-10 && ?pfam1 != ?pfam2)
}
GROUP BY ?pfam1 ?pfam2
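To make the join, filter and aggregation in this query concrete, here is a minimal Python sketch that performs the same steps over a small in-memory data set. All gene names, organism codes and e-values are made up for illustration, and the function name `domain_pair_species` is hypothetical; only the 1.0e-10 cutoff and the "two different domains on one gene" condition come from the query.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical toy data standing in for the HMMER graph: (gene, pfam_id, i_evalue).
hits = [
    ("geneA", "G6PD_N", 1e-30),
    ("geneA", "G6PD_C", 1e-25),
    ("geneB", "G6PD_N", 1e-40),
    ("geneB", "G6PD_C", 1e-5),   # fails the e-value cutoff
    ("geneC", "G6PD_N", 1e-50),
    ("geneC", "G6PD_C", 1e-45),
]
# Hypothetical stand-ins for the genes graph and the taxonomy graph.
gene_org = {"geneA": "org1", "geneB": "org2", "geneC": "org3"}
genus_members = {"org1", "org2", "org3"}


def domain_pair_species(hits, gene_org, genus_members, cutoff=1.0e-10):
    """Count distinct organisms per ordered pair of co-occurring domains."""
    # Keep only significant hits on genes whose organism is in the genus.
    domains_by_gene = defaultdict(set)
    for gene, pfam_id, evalue in hits:
        if evalue < cutoff and gene_org.get(gene) in genus_members:
            domains_by_gene[gene].add(pfam_id)

    # Like the query, emit both (a, b) and (b, a) for every distinct pair.
    orgs_by_pair = defaultdict(set)
    for gene, domains in domains_by_gene.items():
        for pair in permutations(sorted(domains), 2):
            orgs_by_pair[pair].add(gene_org[gene])

    return {pair: len(orgs) for pair, orgs in orgs_by_pair.items()}


counts = domain_pair_species(hits, gene_org, genus_members)
print(counts[("G6PD_N", "G6PD_C")])  # 2 (geneB drops out at the cutoff)
```

Because the query only requires `?pfam1 != ?pfam2`, each unordered pair appears twice in the result, once per ordering; the sketch keeps that behaviour rather than deduplicating.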

10

| Domain I | Domain II | #genes | #species |
|---|---|---|---|
| RNA_pol_Rpb2_3 | RNA_pol_Rpb2_1 | 9 | 9 |
| G6PD_N | G6PD_C | 9 | 9 |
| 5_3_exonuc_N | 5_3_exonuc | 9 | 9 |
| HIT | DcpS_C | 9 | 9 |
| Glyco_hydro_38 | Glyco_hydro_38C | 9 | 9 |
| RNA_pol_Rpb2_6 | RNA_pol_Rpb2_3 | 9 | 9 |
| GARS_N | GARS_C | 9 | 9 |
| DSHCT | DEAD | 9 | 9 |
| adh_short | KR | 12 | 9 |
| EFG_C | EFG_IV | 10 | 9 |

.... 171 9 Synechococcus

2

• KEGG OC

• Cyanobacteria

• Kazusa Annotation, PubMed

• KO (KEGG Orthology)

2 SPARQL
SPARQL
PREFIX kegg: <http://www.kegg.jp/>
PREFIX kg: <http://www.kegg.jp/entry/>
PREFIX kt: <http://www.kegg.jp/taxon/>
PREFIX kns: <http://a.kazusa.or.jp/ns/>
# For each cyanobacterial gene in an ortholog cluster (OC),
# count the distinct values reached two hops from its CyanoBase entry.
SELECT ?oc ?gene ?ko (COUNT(DISTINCT ?pm) AS ?n_pm)
WHERE {
  GRAPH <http://www.kegg.jp/oc> {
    ?gene kegg:belongs_to ?oc .
  }
  GRAPH <http://www.kegg.jp/genes> {
    ?gene kegg:belongs_to ?taxon .
    ?gene kegg:linked_to ?cb_gene .
    OPTIONAL {
      ?gene kg:ortholog ?ko .
    }
  }
  GRAPH <http://www.kegg.jp/taxonomy> {
    ?taxon kegg:belongs_to kt:Cyanobacteria .
  }
  GRAPH <http://kazusa.or.jp/cyanobase> {
    ?cb_gene ?p1 ?bm .
    ?bm ?p2 ?pm .
  }
}
GROUP BY ?oc ?gene ?ko
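The shape of this query (a chain of joins across graphs, an OPTIONAL match, and a two-hop traversal with unconstrained predicates) can be sketched in plain Python. The data below is entirely made up, the names `pubmed_counts`, `gene_link` and so on are hypothetical, and each gene is simplified to have exactly one cluster, taxon and link; it is an illustration of the logic, not the original pipeline.

```python
from collections import defaultdict

# Hypothetical toy data, one value per gene for simplicity.
gene_oc = {"g1": "Genes_1", "g2": "Genes_1"}      # gene -> ortholog cluster
gene_taxon = {"g1": "t1", "g2": "t2"}            # gene -> taxon
cyanobacteria = {"t1", "t2"}                      # taxa under Cyanobacteria
gene_link = {"g1": "cb1", "g2": "cb2"}           # gene -> CyanoBase gene
gene_ko = {"g1": "ko_demo"}                       # g2 has no KO (the OPTIONAL part)

# CyanoBase graph as (subject, predicate, object) triples.
cyanobase = [
    ("cb1", "annotation", "note1"),
    ("note1", "pmid", "111"),
    ("cb1", "annotation", "note2"),
    ("note2", "pmid", "222"),
    ("note2", "pmid", "111"),   # duplicate PMID, counted once
    ("cb2", "annotation", "note3"),
    ("note3", "pmid", "333"),
]


def pubmed_counts(gene_oc, gene_taxon, taxa, gene_link, gene_ko, triples):
    """Count distinct two-hop objects per (cluster, gene, KO) row."""
    # Index objects by subject; predicates are ignored, as with ?p1 and ?p2.
    objects = defaultdict(set)
    for subject, _predicate, obj in triples:
        objects[subject].add(obj)

    result = {}
    for gene, oc in gene_oc.items():
        cb_gene = gene_link.get(gene)
        if gene_taxon.get(gene) not in taxa or cb_gene is None:
            continue
        two_hop = {pm for bm in objects.get(cb_gene, ())
                   for pm in objects.get(bm, ())}
        if two_hop:
            # A missing KO stays None, mirroring an unbound OPTIONAL variable.
            result[(oc, gene, gene_ko.get(gene))] = len(two_hop)
    return result


rows = pubmed_counts(gene_oc, gene_taxon, cyanobacteria,
                     gene_link, gene_ko, cyanobase)
print(rows)
```

Since both predicates are variables, anything reachable in two hops is counted, not only PubMed IDs; the counts are literature counts only to the extent that the CyanoBase graph stores nothing else at that depth.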

PubMed ID 10

| OC | #genes with PMID | #PMID |
|---|---|---|
| Genes_537709 | 3 | 1296 |
| Genes_565278 | 3 | 761 |
| Genes_710476 | 2 | 527 |
| Genes_189668 | 1 | 497 |
| Genes_710587 | 1 | 479 |
| Genes_710480 | 1 | 416 |
| Genes_711471 | 1 | 407 |
| Genes_71824 | 1 | 393 |
| Genes_75617 | 5 | 381 |
| Genes_711511 | 1 | 376 |

Semantic Web

• URI

• W3C

Semantic Web

• SPARQL
