Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Bio2RDF presentation at WWW2007 HCLS Workshop.

http://bio2rdf.org/www2007/

  1. Towards A Mashup To Build Bioinformatics Knowledge System. François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, Jean Morissette. Département d'informatique et de génie logiciel, Université Laval
  2. Presentation plan: ● Knowledge integration vision ● Bio2RDF architecture ● RDFization of knowledge ● Normalization of URIs ● Parkinson example demo ● Conclusion. Banff, May 8, 2007, CHUL research center, Laval University
  3. From the RDF inventor: "Wouldn't it be great if you were able to organize all this information based on your own terms, instead of based on the application you use to access the information?" (Ramanathan V. Guha, 1999). From Wikipedia, Mashup (web application hybrid): "A mashup is a website or application that combines content from more than one source into an integrated experience." (2007)
  4. Sir Tim Berners-Lee’s vision of the semantic web: "The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation." (Tim Berners-Lee, Scientific American, 2001) http://www.w3.org/2006/Talks/0404-mit-tbl/
  5. Bio2RDF starting vision at ISMB 2005: ● Too many knowledge sources available for life scientists ● Too many formats (text, XML, HTML) ● New sources each day, each with a specialized tool or web interface ● Integration problem recognized by the global community. Thanks to Christopher Baker, Eric Neumann, Kei Cheung and Joanne Luciano for their ideas.
  6. The knowledge integration problem in bioinformatics. From the BioPAX group (2004). From Carol Goble at ISWC 2005.
  7. Integration methods in bioinformatics: 1) Davidson 1995: “Transform data to the federated database on demand”; 2) Köhler 2003: “In different databases the same things can be given different names”; 3) Stein 2003: “link integration, view integration and data warehousing”.
  8. Data warehouse approaches. URLs: http://www.ncbi.nlm.nih.gov/Database/ and http://www.genome.jp/dbget/dbget.links.html
  9. Bio2RDF’s approach to knowledge integration: “Solve the problem of knowledge integration in biology by applying a semantic web approach.”
  10. Other semantic web projects
  11. Bio2RDF’s design rules: 2. Convert documents to RDF format; 3. Use a triplestore technology (Sesame, Virtuoso, Oracle); 4. Normalize URIs; 5. Build a mashup as needed to answer a specific question (Elmo); 6. Query the mashup with SeRQL or SPARQL.
  12. Bio2RDF’s architecture (diagram with parts labelled #1 to #6)
  13. Bio2RDF’s knowledge sources
  14. RDF conversion statistics (columns: data source, LSID example, number of RDF documents converted, size of data): go go:0000001 22 961 507 963 321 kegg path:aae00010 35 257 1 038 593 137 14 292 8 902 205 kegg cpd:c00001 438 724 210 458 897 mgi mgi:96103 17 359 573 639 380 ncbi omim:100050 ncbi geneid:1 2 744 786 67 225 535 082 obo obo's 59 namespaces 279 720 216 007 267 pdb pdb:100d 34 421 16 309 651 935 4 177 176 29 453 203 064 uniprot uniprot:A0A000 5 020 2 844 058 uniprot enzyme:1.-.-.- 191 664 364 728 083 uniprot pubmed:100133 uniprot taxonomy:10 337 564 125 630 659 uniprot uniref:UniRef100_A0A000 7 990 452 14 865 490 144 …
  15. OpenRDF’s software: http://www.openrdf.org/
  16. RDF of geneid:15275, properties used: ● rdf:about ● rdfs:label ● dc:identifier, dc:title, dc:created ● bio2rdf:lsid ● bio2rdf:url ● bio2rdf:synonym ● bio2rdf:xRef
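As a rough illustration of the property list on this slide, the sketch below assembles N-Triples lines for geneid:15275 in plain Python. The `bio2rdf#` namespace URI is the one that appears in the SeRQL query later in the talk, and the Dublin Core namespace is the standard one. Every literal value (label, synonym, cross-reference) is a placeholder rather than data from the real record, and the helper names `literal` and `describe` are hypothetical. The remaining properties on the slide would follow the same pattern.

```python
# Namespaces: rdfs and dc are the standard ones; bio2rdf# is taken from the
# SeRQL query shown later in the talk.
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DC = "http://purl.org/dc/elements/1.1/"
BIO2RDF = "http://bio2rdf.org/bio2rdf#"


def literal(value):
    """Format a plain N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def describe(namespace, identifier, label, synonyms=(), xrefs=()):
    """Return N-Triples lines describing one record (illustrative helper)."""
    subject = f"<http://bio2rdf.org/{namespace}:{identifier}>"
    triples = [
        (RDFS + "label", literal(label)),
        (DC + "identifier", literal(f"{namespace}:{identifier}")),
    ]
    triples += [(BIO2RDF + "synonym", literal(s)) for s in synonyms]
    # Cross-references point at other normalized Bio2RDF URIs.
    triples += [(BIO2RDF + "xRef", f"<http://bio2rdf.org/{x}>") for x in xrefs]
    return [f"{subject} <{predicate}> {obj} ." for predicate, obj in triples]


lines = describe(
    "geneid", "15275",
    label="placeholder label",         # placeholder, not the real label
    synonyms=["placeholder synonym"],  # placeholder
    xrefs=["uniprot:p00000"],          # placeholder cross-reference
)
print("\n".join(lines))
```

Keeping cross-references as URIs rather than literals is what lets records from different sources join up once they are loaded into the same triplestore.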
  17. RDFizer. To rdfize: to convert an existing document into RDF format. (Diagram labels: efetch, rdfizer)
  18. How to rdfize: ● From HTML pages (prosite:ps00101) ● From XML documents using XSLT (path:mmu00010) ● From XML documents using XPath and JSTL (geneid:15275) ● From direct SQL access (ensembl:ensmusg00000025875) ● From RDF documents (uniprot:p26838) ● From text files (cpd:c00001)
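One of the routes listed on this slide, XML plus XPath, can be sketched in a few lines. The talk uses XPath with JSTL in a JSP; the version below swaps in Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset. The XML record is a hypothetical, heavily simplified stand-in for the real NCBI output, and `rdfize_gene` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified record; the real NCBI XML is far richer.
SAMPLE_XML = """
<gene id="15275">
  <name>placeholder name</name>
  <synonym>alias-1</synonym>
  <synonym>alias-2</synonym>
</gene>
"""

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BIO2RDF_SYNONYM = "http://bio2rdf.org/bio2rdf#synonym"


def rdfize_gene(xml_text):
    """Turn the sample XML into (subject, predicate, object) tuples."""
    root = ET.fromstring(xml_text)
    subject = f"http://bio2rdf.org/geneid:{root.get('id')}"
    triples = [(subject, RDFS_LABEL, root.findtext("name"))]
    # "./synonym" selects the direct <synonym> children of the root element.
    for node in root.findall("./synonym"):
        triples.append((subject, BIO2RDF_SYNONYM, node.text))
    return triples


gene_triples = rdfize_gene(SAMPLE_XML)
for triple in gene_triples:
    print(triple)
```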
  19. 1) prosite:ps00101 from HTML using a regex
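The slide image is not part of this transcript, so the following is only a guess at the general shape of a regex-based rdfizer. The HTML fragment, the two patterns and the function name `rdfize_prosite` are all assumptions made for illustration; the real PROSITE page layout is different and would need its own patterns.

```python
import re

# Hypothetical HTML fragment; the real PROSITE page layout is not shown here.
SAMPLE_HTML = """
<html><body>
<h1>PS00101</h1>
<td class="desc">placeholder description</td>
</body></html>
"""

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
ACCESSION = re.compile(r"<h1>(PS\d{5})</h1>")
DESCRIPTION = re.compile(r'<td class="desc">(.*?)</td>', re.DOTALL)


def rdfize_prosite(html):
    """Extract the accession and description, return them as triples."""
    accession = ACCESSION.search(html)
    description = DESCRIPTION.search(html)
    if accession is None or description is None:
        raise ValueError("page does not match the expected layout")
    # Lowercase the identifier, in line with the URI normalization rules.
    subject = f"http://bio2rdf.org/prosite:{accession.group(1).lower()}"
    return [(subject, RDFS_LABEL, description.group(1).strip())]


prosite_triples = rdfize_prosite(SAMPLE_HTML)
print(prosite_triples)
```

Regex scraping breaks as soon as the page layout changes, which is presumably why the talk lists it as one option among several rather than the preferred one.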
  20. 2) KEGG’s path:mmu00010 from XML using XSL
  21. 3) ensembl:ensmusg00000025875 from SQL
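A minimal sketch of the SQL route, assuming a single simplified `gene` table held in an in-memory SQLite database. The table layout, the description text and the function name `rdfize_genes` are hypothetical; the real Ensembl schema and the query used in the talk are not shown in this transcript.

```python
import sqlite3

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical one-table stand-in for a source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (stable_id TEXT, description TEXT)")
conn.execute(
    "INSERT INTO gene VALUES (?, ?)",
    ("ENSMUSG00000025875", "placeholder description"),
)


def rdfize_genes(connection):
    """Emit one label triple per row, using lowercase Bio2RDF-style URIs."""
    rows = connection.execute(
        "SELECT stable_id, description FROM gene ORDER BY stable_id"
    )
    return [
        (f"http://bio2rdf.org/ensembl:{stable_id.lower()}", RDFS_LABEL, description)
        for stable_id, description in rows
    ]


ensembl_triples = rdfize_genes(conn)
print(ensembl_triples)
```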
  22. 4) uniprot:p26838 from RDF using SeRQL
  23. One reality, many names: ● Different namespace identifier: pubmed:11992264 vs pmid:11992264 ● Uppercase and lowercase: uniprot:p26838 vs uniprot:P26838 ● Version number: genbank:ac008393 vs genbank:ac008393.7 ● Total id length: go:0032283 vs go:32283
  24. RDFizing documents is not enough; we also need normalized URIs of the form http://bio2rdf.org/namespace:id. Examples: http://bio2rdf.org/pubmed:11992264, http://bio2rdf.org/uniprot:p26838, http://bio2rdf.org/genbank:ac008393, http://bio2rdf.org/go:0032283
  25. URI normalization rules: ● Different namespace identifier: we resolve namespace synonymy with a urlrewrite rule, for example pubmed and pmid. ● Uppercase and lowercase: we write every URI in lowercase. ● Version number: an owl:sameAs predicate is used to link the different versions of a document. ● Total id length: a fixed length is determined for each id.
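The four rules on this slide can be sketched as a single function. This is an illustration under stated assumptions, not the project's code: the synonym table contains only the pmid/pubmed pair from the slides, the 7-digit width for GO is inferred from the go:0032283 example, version stripping is limited to the genbank namespace because that is the only versioned example given, and synonyms are mapped directly here, whereas the talk links them through a rewrite rule and owl:sameAs. The name `normalize` is hypothetical.

```python
import re

BASE = "http://bio2rdf.org/"

# Namespace synonyms; only the pmid/pubmed pair comes from the slides.
NAMESPACE_SYNONYMS = {"pmid": "pubmed"}

# Fixed identifier widths; go:0032283 in the slides suggests 7 digits for GO.
FIXED_WIDTH = {"go": 7}

# Namespaces whose identifiers may carry a ".N" version suffix (assumption).
VERSIONED_NAMESPACES = {"genbank"}
VERSION_SUFFIX = re.compile(r"\.\d+$")


def normalize(curie):
    """Return (normalized_uri, versioned_uri_or_None) for 'namespace:id'."""
    namespace, _, identifier = curie.strip().lower().partition(":")
    namespace = NAMESPACE_SYNONYMS.get(namespace, namespace)

    versioned = None
    if namespace in VERSIONED_NAMESPACES and VERSION_SUFFIX.search(identifier):
        # Keep the versioned form so the caller can link it with owl:sameAs.
        versioned = f"{BASE}{namespace}:{identifier}"
        identifier = VERSION_SUFFIX.sub("", identifier)

    width = FIXED_WIDTH.get(namespace)
    if width and identifier.isdigit():
        identifier = identifier.zfill(width)

    return f"{BASE}{namespace}:{identifier}", versioned


for example in ["pmid:11992264", "uniprot:P26838", "genbank:ac008393.7", "go:32283"]:
    print(example, "->", normalize(example))
```

Restricting version stripping to named namespaces matters because other identifiers legitimately contain dots, for example the enzyme:1.-.-.- entry in the statistics slide.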
  26. Url Rewrite Filter (http://tuckey.org/urlrewrite/):
       <rule><from>^/search:(.*?)@pubmed</from><to>/rdfizer/ncbi-entrez2rdf.jsp?db=pubmed&amp;query=$1</to></rule>
       <rule><from>^/pubmed:(.*)</from><to>/rdfizer/ncbi-pubmed2rdf.jsp?id=$1</to></rule>
       <rule><from>^/pmid:(.*)</from><to>/rdfizer/lsid-sameas2rdf.jsp?from=pmid:$1&amp;to=pubmed:$1</to></rule>
       <rule><from>^/(.*):(.*)</from><to type="redirect">http://bio2rdf.org/$1:$2</to></rule>
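To make the effect of these rules concrete, the sketch below replays them in Python. The patterns and targets are transcribed from the slide (with `&amp;` unescaped to `&`); the "first matching rule wins" behaviour and the function name `rewrite` are simplifying assumptions, and the real filter has its own rule-chaining semantics that are not modelled here.

```python
import re

# (pattern, target, is_redirect), transcribed from the rules on the slide.
RULES = [
    (r"^/search:(.*?)@pubmed", "/rdfizer/ncbi-entrez2rdf.jsp?db=pubmed&query=$1", False),
    (r"^/pubmed:(.*)", "/rdfizer/ncbi-pubmed2rdf.jsp?id=$1", False),
    (r"^/pmid:(.*)", "/rdfizer/lsid-sameas2rdf.jsp?from=pmid:$1&to=pubmed:$1", False),
    (r"^/(.*):(.*)", "http://bio2rdf.org/$1:$2", True),
]


def rewrite(path):
    """Apply the first matching rule; return (new_url, is_redirect) or None."""
    for pattern, target, is_redirect in RULES:
        match = re.match(pattern, path)
        if match:
            # Substitute $1, $2, ... with the corresponding captured groups.
            url = re.sub(r"\$(\d)", lambda m: match.group(int(m.group(1))), target)
            return url, is_redirect
    return None


for path in ["/search:parkinson@pubmed", "/pubmed:11992264",
             "/pmid:11992264", "/uniprot:p26838"]:
    print(path, "->", rewrite(path))
```

Rule order matters: the catch-all `^/(.*):(.*)` pattern also matches the pubmed and pmid paths, so it has to come last.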
  27. URL vs LSID: http://bio2rdf.org/uniprot:p26838 owl:sameAs urn:lsid:uniprot.org:uniprot:p26838. Links: http://bio2rdf.org/uniprot:p26838 and http://bio2rdf.org/urn:lsid:uniprot.org:uniprot:p26838
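The URL and LSID forms on this slide differ only in their prefix, so the mapping can be sketched as a string transformation. The authority table below holds just the uniprot.org entry from the slide; how authorities are chosen for other namespaces is not stated in the talk, so any other entry would be an assumption. The function names are hypothetical.

```python
BASE = "http://bio2rdf.org/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# LSID authority per namespace; only the uniprot entry comes from the slide.
LSID_AUTHORITY = {"uniprot": "uniprot.org"}


def to_lsid(uri):
    """Convert http://bio2rdf.org/namespace:id to urn:lsid:authority:namespace:id."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a Bio2RDF URI: {uri}")
    namespace, sep, identifier = uri[len(BASE):].partition(":")
    if not sep or namespace not in LSID_AUTHORITY:
        raise ValueError(f"no LSID authority known for: {uri}")
    return f"urn:lsid:{LSID_AUTHORITY[namespace]}:{namespace}:{identifier}"


def same_as_triple(uri):
    """Return an N-Triples line linking the URL form to the LSID form."""
    return f"<{uri}> <{OWL_SAME_AS}> <{to_lsid(uri)}> ."


print(same_as_triple("http://bio2rdf.org/uniprot:p26838"))
```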
  28. Our method to answer a question: to answer a very specialized question, we build a specific knowledge base (the mashup, stored in an RDF triplestore) and then query it with SeRQL.
  29. Parkinson examples: 1. What is the semantic network of OMIM records describing Parkinson’s disease? 2. Which MeSH terms are most often cited in Parkinson’s disease publications? 3. What genes related to Parkinson’s disease are involved in pathways according to KEGG?
  30. Time for demo!
  31. The big everything about Parkinson:
       http://localhost:8080/bio2rdf/search:parkinson@omim
       http://localhost:8080/bio2rdf/search:parkinson@geneid
       http://localhost:8080/bio2rdf/search:parkinson@uniprot
       http://localhost:8080/bio2rdf/search:parkinson@kegg
       http://localhost:8080/bio2rdf/load:pubmed
       http://localhost:8080/bio2rdf/sameas:hsa-geneid
       http://localhost:8080/bio2rdf/learn:geneid
       http://localhost:8080/bio2rdf/load:cpd
       http://localhost:8080/bio2rdf/load:reactome
       http://localhost:8080/bio2rdf/load:biopax-xref
       http://localhost:8080/bio2rdf/load:chebi
       http://localhost:8080/bio2rdf/load:obo-xref
       http://localhost:8080/bio2rdf/sameas:keggcompound-cpd
       Result: 1,700 K triples, 97 MB in Turtle format, in 90 minutes
  32. Third example, SeRQL query: What genes related to Parkinson’s disease are involved in pathways according to KEGG?
       SELECT GeneticDisorder-label, Gene-label, pathway-label
       FROM {GeneticDisorder} rdf:type {<http://bio2rdf.org/omim#GeneticDisorder>},
            {GeneticDisorder} rdfs:label {GeneticDisorder-label},
            {GeneticDisorder} <http://www.w3.org/2002/07/owl#sameAs> {sameAs},
            {Gene} <http://bio2rdf.org/bio2rdf#xRef> {sameAs},
            {Gene} rdfs:label {Gene-label},
            {Gene2} <http://www.w3.org/2000/01/rdf-schema#seeAlso> {Gene},
            {xobject} <http://bio2rdf.org/kegg#xobject> {Gene2},
            {xentry1} <http://bio2rdf.org/kegg#xentry1> {xobject},
            {pathway} <http://bio2rdf.org/kegg#xrelation> {xentry1},
            {pathway} rdfs:label {pathway-label}
       WHERE GeneticDisorder-label like "*PARKINSON*"
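The query above is a chain of triple patterns that share variables, which is what lets it walk from an OMIM record to a gene and on to a KEGG pathway. The sketch below mimics that join with a tiny backtracking matcher over an in-memory list of triples. It is a stand-in for illustration only, not how a triplestore evaluates SeRQL; the toy graph, the shortened predicate names and the `match` function are all invented for the example.

```python
def match(patterns, triples, binding=None):
    """Yield variable bindings satisfying every (s, p, o) pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


OMIM_TYPE = "http://bio2rdf.org/omim#GeneticDisorder"

# Toy graph: identifiers and labels are invented for the example.
TRIPLES = [
    ("omim:1", "rdf:type", OMIM_TYPE),
    ("omim:1", "rdfs:label", "PARKINSON DISEASE (toy record)"),
    ("omim:1", "owl:sameAs", "x:1"),
    ("gene:1", "bio2rdf:xRef", "x:1"),
    ("gene:1", "rdfs:label", "toy gene"),
    ("gene:2", "rdfs:seeAlso", "gene:1"),
    ("obj:1", "kegg:xobject", "gene:2"),
    ("entry:1", "kegg:xentry1", "obj:1"),
    ("path:1", "kegg:xrelation", "entry:1"),
    ("path:1", "rdfs:label", "toy pathway"),
]

# Same chain of patterns as the SeRQL query, one tuple per path expression.
PATTERNS = [
    ("?disorder", "rdf:type", OMIM_TYPE),
    ("?disorder", "rdfs:label", "?disorderLabel"),
    ("?disorder", "owl:sameAs", "?sameAs"),
    ("?gene", "bio2rdf:xRef", "?sameAs"),
    ("?gene", "rdfs:label", "?geneLabel"),
    ("?gene2", "rdfs:seeAlso", "?gene"),
    ("?xobject", "kegg:xobject", "?gene2"),
    ("?xentry1", "kegg:xentry1", "?xobject"),
    ("?pathway", "kegg:xrelation", "?xentry1"),
    ("?pathway", "rdfs:label", "?pathwayLabel"),
]

rows = [
    (b["?disorderLabel"], b["?geneLabel"], b["?pathwayLabel"])
    for b in match(PATTERNS, TRIPLES)
    if "PARKINSON" in b["?disorderLabel"]  # stands in for: like "*PARKINSON*"
]
print(rows)
```

The join only succeeds because the disorder and the gene point at the same `?sameAs` node, which is the practical payoff of normalizing URIs before loading the mashup.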
  33. Query result
  34. Conclusion
  35. Before Bio2RDF integration
  36. Our main results: ● RDF is a framework that enables a very simple thing: scalability of the knowledge base complexity. ● The Bio2RDF project proposes to keep complexity in the bioinformatics knowledge space under control by applying this proven semantic web approach.
  37. Now with Bio2RDF semantic integration
  38. Bio2RDF’s vision of knowledge map
  39. Bio2RDF’s map of distributed bioinformatics knowledge: http://bio2rdf.org/bio2rdf-2007-02.owl
  40. Map of semantic resource
  41. Montreal’s subway map
  42. Bio2RDF’s actual knowledge map
  43. Achievement: Public data + open source software + RDF technology + rdfizer + normalized URIs = Bio2RDF knowledge integration. A bioinformatics integration ontology won’t exist if it is not adopted by the community; bio2rdf.owl is just a proposed starting point. 46 million RDF documents are now available at http://bio2rdf.org.
  44. The Bio2RDF project provides open source RDFizers to the community. So much still needs to be rdfized; if you are interested in contributing, join us! Now let’s build the big knowledge map of bioinformatics…
  45. Final words: Please tell Sir Tim Berners-Lee that he was right: ‘semantic web in bioinformatics’ is a killer app to illustrate all the potential of the semantic web. And also tell Mark Wilkinson that the semantic web in bioinformatics won’t be full of creeps if we organize it like we did…
  46. Thanks: Jean Morissette; Nicole Tourigny; Philippe Rigault; the bioinformatics lab’s team at CHUL Research Center; many open source communities (OpenRDF, Simile’s project, Tomcat, JSTL and many more); W3C Bio-RDF Group; Génome Québec; Génome Canada
  47. Visit: http://bio2rdf.org. Download: http://sourceforge.net/projects/bio2rdf/. Discover: http://bio2rdf.org/bio2rdf-2007-02.owl. Contact us at bio2rdf@gmail.com
