Linked Data for integrating life-science databases

Shuichi Kawashima
Researcher at DBCLS
RDF
[Figure: chart with y-axis from 0 to 150B and x-axis from 1982 to 2010; the definition of B was lost in extraction]
ID

Gene Ontology, EC, etc.
RDF

• UniProt
• PDBJ, DDBJ
• Bio2RDF, BioGateway: RDF
UniProt RDF

• UniProt
• UniProt RDF
UniProt

Name        Description                                   Source              File size  #triples
uniprot     Protein annotation data                       UniProt consortium  14G        3.3B
uniref      Clusters of proteins with similar sequences   UniProt consortium  7G         900M
uniparc     Non-redundant archive of UniProt sequences    UniProt consortium  65G        1B
citations   Literature citations                          UniProt consortium  1355M      10,177,308
taxonomy    Classification of organisms                   UniProt consortium  421M       5,041,437
journals    Journals                                      UniProt consortium  3M         34,850
pathways    Pathways                                      UniProt consortium  1000K      8,865
keywords    Keywords                                      UniProt consortium  940K       8,449
locations   Subcellular locations                         UniProt consortium  468K       4,476
tissues     Tissues                                       UniProt consortium  572K       7,439
components  Cellular components (Organelles)              UniProt consortium  6K         43
go          Gene ontology                                 GO consortium       25M        263,944
enzymes     Classification of enzymes                     SIB                 4M         4,476
core.owl    Classes and properties for UniProt RDF        UniProt consortium  152K
Name          Language  #triples
Sesame        Java      70M
4store        C         15B
5store        C
Virtuoso      C         15.4B
Jena          Java      1.7B
Bigdata       Java      12.7B
ARC           PHP
AllegroGraph  Lisp      1B

Source: http://esw.w3.org/LargeTripleStores
Protein, Components, encodedIn (UniProt core.owl)

<owl:ObjectProperty rdf:about="encodedIn">
    <rdfs:label rdf:datatype="&xsd;string">encoded in</rdfs:label>
    <rdfs:comment rdf:datatype="&xsd;string"
        >The subcellular location where a protein is encoded.</rdfs:comment>
    <rdfs:domain rdf:resource="Protein"/>
    <rdfs:range rdf:resource="Subcellular_Location"/>
</owl:ObjectProperty>
RDF: purl URIs

  http://purl.uniprot.org/{database}/{identifier}

UniProt core namespace

  http://purl.uniprot.org/core/

Gene URI

  http://purl.uniprot.org/core/Gene

type
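The URI pattern above can be filled in mechanically. The following is a minimal sketch, assuming only the `{database}/{identifier}` template shown on the slide; the function name `uniprot_purl` and the accession `P12345` are hypothetical placeholders for illustration, not part of the original deck.

```python
from urllib.parse import quote

# Base of the purl scheme shown on the slide.
PURL_BASE = "http://purl.uniprot.org"


def uniprot_purl(database: str, identifier: str) -> str:
    """Fill the {database}/{identifier} template (hypothetical helper)."""
    if not database or not identifier:
        raise ValueError("database and identifier must be non-empty")
    # Percent-encode each path segment so unusual identifiers stay valid URIs.
    return f"{PURL_BASE}/{quote(database, safe='')}/{quote(identifier, safe='')}"


# The Gene class URI from the slide.
print(uniprot_purl("core", "Gene"))
# An illustrative entry URI; "P12345" is a placeholder accession.
print(uniprot_purl("uniprot", "P12345"))
```

Keeping URI construction in one place makes it easier to join records from different sources on the same identifier.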
PDBJ, DDBJ RDF

• PDBJ: 47, 4.7B
• http://www.pdbj.org/rdf (ID)
• DDBJ (INSD: International Nucleotide Sequence Database): 1.2, 76, 7.6B
• mulgara (http://mulgara.org/)
RDF

Dataset                    Count
KEGG Taxonomy              23,238
KEGG GENES Cyanobacteria   708,745
KEGG OC                    10,384,602
hmmer Pfam-A vs Cyano      11,881,212
hmmer Pfam-B vs Cyano      7,007,154
Kazusa Annotation          2,807,879
1

• Synechococcus
• 1.0e-20
• Pfam
1     SPARQL

# For each pair of distinct Pfam domains hit on the same gene,
# count the distinct organisms under kt:Synechococcus.
# Written in standard SPARQL 1.1 form: no commas in SELECT,
# an aliased aggregate, and an explicit GROUP BY.
PREFIX hmmer: <http://hmmer.janelia.org/>
PREFIX kegg:  <http://www.kegg.jp/>
PREFIX kg:    <http://www.kegg.jp/entry/>
PREFIX pfam:  <http://pfam.sanger.ac.uk/>
PREFIX kt:    <http://www.kegg.jp/taxon/>
SELECT ?pfam1 ?pfam2 (COUNT(DISTINCT ?org) AS ?species)
WHERE {
  GRAPH <hmmer_pfam_a_cyano> {
    ?gene hmmer:hit       ?n1 .
    ?gene hmmer:hit       ?n2 .
    ?n1   pfam:pfam_id    ?pfam1 .
    ?n1   hmmer:i-evalue  ?eval1 .
    ?n2   pfam:pfam_id    ?pfam2 .
    ?n2   hmmer:i-evalue  ?eval2 .
  }
  GRAPH <http://www.kegg.jp/genes> {
    ?gene kegg:belongs_to ?org .
  }
  GRAPH <http://www.kegg.jp/taxonomy> {
    ?org kegg:belongs_to kt:Synechococcus .
  }
  FILTER (?eval1 < 1.0e-10 && ?eval2 < 1.0e-10 && ?pfam1 != ?pfam2)
}
GROUP BY ?pfam1 ?pfam2
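As a rough illustration of what the query above computes, the same aggregation can be sketched over in-memory rows. This is not the author's pipeline: the rows, gene names, and organism names below are hypothetical toy data, `gene_org` is assumed to contain only organisms that already belong to Synechococcus, and the function name `cooccurring_domains` is made up for this sketch.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical toy rows: (gene, pfam_id, i_evalue).
hits = [
    ("g1", "G6PD_N", 1e-30), ("g1", "G6PD_C", 1e-25),
    ("g2", "G6PD_N", 1e-40), ("g2", "G6PD_C", 1e-5),
    ("g3", "G6PD_N", 1e-12), ("g3", "G6PD_C", 1e-18),
]
# Hypothetical gene -> organism mapping (all assumed to be Synechococcus).
gene_org = {"g1": "orgA", "g2": "orgB", "g3": "orgC"}


def cooccurring_domains(hits, gene_org, max_evalue=1.0e-10):
    """Count distinct organisms per ordered pair of domains on one gene."""
    # Keep only hits that pass the e-value cutoff, grouped by gene.
    domains_by_gene = defaultdict(set)
    for gene, pfam_id, evalue in hits:
        if evalue < max_evalue:
            domains_by_gene[gene].add(pfam_id)

    # For every ordered pair of distinct domains on a gene, record its organism.
    orgs_by_pair = defaultdict(set)
    for gene, domains in domains_by_gene.items():
        org = gene_org.get(gene)
        if org is None:
            continue
        for pair in permutations(sorted(domains), 2):
            orgs_by_pair[pair].add(org)

    return {pair: len(orgs) for pair, orgs in orgs_by_pair.items()}


counts = cooccurring_domains(hits, gene_org)
for (domain1, domain2), n_species in sorted(counts.items()):
    print(domain1, domain2, n_species)
```

Like the `?pfam1 != ?pfam2` filter, the sketch reports each pair in both orders; gene `g2` drops out because one of its hits fails the cutoff.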
10

Domain I         Domain II         #genes  #species
RNA_pol_Rpb2_3   RNA_pol_Rpb2_1    9       9
G6PD_N           G6PD_C            9       9
5_3_exonuc_N     5_3_exonuc        9       9
HIT              DcpS_C            9       9
Glyco_hydro_38   Glyco_hydro_38C   9       9
RNA_pol_Rpb2_6   RNA_pol_Rpb2_3    9       9
GARS_N           GARS_C            9       9
DSHCT            DEAD              9       9
adh_short        KR                12      9
EFG_C            EFG_IV            10      9

....   171   9   Synechococcus
2

• KEGG OC
• Cyanobacteria
• Kazusa Annotation, PubMed
• KO: KEGG Orthology
2     SPARQL

# For each ortholog cluster (OC), gene, and optional KO assignment,
# count the distinct literature nodes reached through the linked CyanoBase gene.
# Written in standard SPARQL 1.1 form: no commas in SELECT,
# an aliased aggregate, and an explicit GROUP BY.
PREFIX kegg: <http://www.kegg.jp/>
PREFIX kg:   <http://www.kegg.jp/entry/>
PREFIX kt:   <http://www.kegg.jp/taxon/>
PREFIX kns:  <http://a.kazusa.or.jp/ns/>
SELECT ?oc ?gene ?ko (COUNT(DISTINCT ?pm) AS ?pmids)
WHERE {
  GRAPH <http://www.kegg.jp/oc> {
    ?gene kegg:belongs_to ?oc .
  }
  GRAPH <http://www.kegg.jp/genes> {
    ?gene kegg:belongs_to ?taxon .
    ?gene kegg:linked_to  ?cb_gene .
    OPTIONAL {
      ?gene kg:ortholog ?ko .
    }
  }
  GRAPH <http://www.kegg.jp/taxonomy> {
    ?taxon kegg:belongs_to kt:Cyanobacteria .
  }
  GRAPH <http://kazusa.or.jp/cyanobase> {
    ?cb_gene ?p1 ?bm .
    ?bm      ?p2 ?pm .
  }
}
GROUP BY ?oc ?gene ?ko
PubMed ID                         10

      OC        #gene with PMID        #PMID
 Genes_537709          3                1296
 Genes_565278          3                761
 Genes_710476          2                527
 Genes_189668          1                497
 Genes_710587          1                479
 Genes_710480          1                416
 Genes_711471          1                407
 Genes_71824           1                393
 Genes_75617           5                381
 Genes_711511          1                376
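The table above can be read as a per-cluster roll-up: the number of genes with at least one PMID, and the number of distinct PMIDs across those genes. The following is a minimal sketch of that roll-up under stated assumptions: the gene names, cluster names, and PMIDs are hypothetical toy values, and `literature_per_cluster` is an illustrative helper rather than anything taken from the slides.

```python
from collections import defaultdict

# Hypothetical toy data: gene -> ortholog cluster, gene -> set of PMIDs.
gene_cluster = {"geneA": "OC_1", "geneB": "OC_1", "geneC": "OC_2"}
gene_pmids = {"geneA": {"111", "222"}, "geneB": {"222", "333"}, "geneC": set()}


def literature_per_cluster(gene_cluster, gene_pmids):
    """Return (cluster, #genes with PMID, #distinct PMIDs), most cited first."""
    genes = defaultdict(set)
    pmids = defaultdict(set)
    for gene, cluster in gene_cluster.items():
        found = gene_pmids.get(gene, set())
        if found:  # only genes with at least one PMID are counted
            genes[cluster].add(gene)
            pmids[cluster] |= found
    rows = [(cluster, len(genes[cluster]), len(pmids[cluster])) for cluster in pmids]
    return sorted(rows, key=lambda row: row[2], reverse=True)


for cluster, n_genes, n_pmids in literature_per_cluster(gene_cluster, gene_pmids):
    print(cluster, n_genes, n_pmids)
```

A PMID shared by two genes in the same cluster is counted once, which is why sets are used instead of running totals.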
Semantic Web

• URI
• W3C

Semantic Web

• SPARQL