Use of Open Linked Data in Bioinformatics Space: A Case Study


Published in: Education

  1. Use of Open Linked Data in Bioinformatics Space: A Case Study
       Remzi Çelebi, Department of Computer Engineering, Ege University, İzmir, Turkey
       Özgür Gümüş, Department of Computer Engineering, Ege University, İzmir, Turkey
       Yeşim Aydın Son, Department of Health Informatics, Middle East Technical University, Ankara, Turkey
  2. Outline
       ● Semantic Web (very brief intro)
           – RDF, SPARQL, Linked Data
       ● Use Case Scenarios
       ● Conclusion
       ● Future work
  3. Semantic Web
       ● The Semantic Web, often described as the next generation of the web, extends the current web and provides a framework for integrating data from heterogeneous resources.
       ● It enables machines to perform more of the tedious work involved in finding, combining and extracting information on the web.
  4. Semantic Web – Open Linked Data
       ● Open linked data is a new approach that uses Semantic Web technology to publish, integrate and analyze open data on the web.
       ● It holds that data on the web should be linked and open for use by practical applications.
       ● It offers two advantages: the ability to search multiple datasets through a single framework, and the ability to search relationships, and paths of relationships, that span different datasets.
  5. Semantic Web Technologies - RDF
       ● The Resource Description Framework (RDF) is the fundamental means of describing resources, and the relationships between them, in the Semantic Web.
       ● An RDF triple is a statement about a resource in the form of a subject-predicate-object expression.
       ● RDF can be serialized in a variety of formats, including XML and JSON.
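A triple can be pictured as a three-field record. The sketch below is only an illustration in plain Python, not an RDF library; the `ex:` identifiers and the statements themselves are hypothetical placeholders, not Bio2RDF data.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    object: str


# Hypothetical statements; the ex: names are placeholders, not real Bio2RDF URIs.
triples = [
    Triple("ex:BRCA1", "rdf:type", "ex:Gene"),
    Triple("ex:BRCA1", "ex:associatedWith", "ex:BreastCancer"),
    Triple("ex:BreastCancer", "rdf:type", "ex:Disease"),
]


def objects_of(subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [t.object for t in triples
            if t.subject == subject and t.predicate == predicate]


print(objects_of("ex:BRCA1", "rdf:type"))  # ['ex:Gene']
```

Because the object of one triple can be the subject of another, a set of triples forms a graph, which is what lets statements from different datasets link together.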
  6. Uniform Resource Identifier - URI
       ● "The generic set of all names/addresses that are short strings that refer to resources"
           – URLs (Uniform Resource Locators) are a particular type of URI, used for resources that can be accessed on the WWW (e.g., web pages).
       ● In RDF, URIs typically look like “normal” URLs, often with fragment identifiers to point at specific parts of a document.
           Example:
           Shorthand notation: gene:BRCA1
       ● The PREFIX keyword is used to declare the short form of resources:
           PREFIX gene: <>
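The shorthand notation amounts to string substitution: the part before the colon is looked up and replaced by a namespace URI. A minimal sketch, assuming a made-up namespace (`http://example.org/gene/` is hypothetical; the slide does not give the real URI behind `gene:`):

```python
# Hypothetical namespace; the real URI behind "gene:" is not given in the slides.
PREFIXES = {"gene": "http://example.org/gene/"}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'gene:BRCA1' into a full URI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return prefixes[prefix] + local


print(expand("gene:BRCA1"))  # http://example.org/gene/BRCA1
```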
  7. SPARQL
       ● SPARQL is a query language for retrieving and manipulating data stored in RDF format.
       ● A SPARQL endpoint is a service that provides a SPARQL-queryable interface to a set of RDF statements stored in a triple store.
       ● SPARQL searches for all subgraphs that match the graph described by the triples in the query.
           SELECT * WHERE { ?subject ?predicate ?object . }
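To make the matching idea concrete, here is a toy matcher for a single triple pattern, written in plain Python. It is a sketch of the principle only, not how a triple store is implemented, and the `ex:` data is hypothetical.

```python
def match_pattern(pattern, triples):
    """Yield one variable binding per triple that matches a single triple pattern.

    Terms starting with '?' are variables; any other term must match exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No term failed, so this triple matches the pattern.
            yield binding


# Hypothetical data; the ex: names are placeholders.
triples = [
    ("ex:BRCA1", "rdf:type", "ex:Gene"),
    ("ex:BRCA1", "ex:associatedWith", "ex:BreastCancer"),
    ("ex:BreastCancer", "rdf:type", "ex:Disease"),
]

# The all-variable pattern from the slide matches every triple.
print(len(list(match_pattern(("?subject", "?predicate", "?object"), triples))))  # 3

# Fixing the predicate and object narrows the result.
print(list(match_pattern(("?g", "rdf:type", "ex:Gene"), triples)))  # [{'?g': 'ex:BRCA1'}]
```

A full query combines several such patterns and keeps only the bindings that agree on the variables they share.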
  8. Semantic Web for Health Care and Bioinformatics
       ● A large cloud of data, held in diverse formats, describes genes, proteins, gene networks, protein-protein interactions, genetic variations, chemical compounds, diseases and drugs.
       ● Much of the complexity of the life sciences lies in integrating and analyzing the enormous amounts of data that research in these varied domains produces.
  9. Bio2RDF Project
       ● Creates a knowledge space of RDF documents linked together with normalized URIs and sharing a common ontology.
       ● Documents from public bioinformatics databases such as KEGG, PDB, MGI, HGNC and several of NCBI’s databases are available in RDF format through a unique URL in the form of <>
       ● Bio2RDF has created an RDF warehouse that serves over 70 million triples describing the human and mouse genomes.
  10. Bio2RDF
       ● Bio2RDF differs in several ways from previous efforts to provide the life sciences with linked data, such as Neurocommons, LinkedLifeData, W3C HCLS, Chem2Bio2RDF and BioLOD:
           – First, Bio2RDF provides a unique linked data vocabulary and topology.
           – Second, Bio2RDF produces syntactically interoperable linked data across all datasets by defining a set of basic guidelines.
           – Third, the community can benefit from the Bio2RDF infrastructure, an expandable global network of mirrors that host Bio2RDF datasets and a federated network of SPARQL endpoints.
           – Finally, Bio2RDF is open source and free to use, modify or redistribute.
  11. Use Case Scenario
       ● As a case study to reveal the capabilities and benefits of the Bio2RDF project, we defined the following question: for a given pathway, what are the diseases associated with the individual genes in the pathway?
       ● Answering this question requires three data sources: CTD, OMIM and NCBI Gene. These datasets can be queried on the web as part of the Bio2RDF project.
  12. Use Case Scenario
       a) Query-1: CTD for gene-pathway information
           PREFIX ctd_vocabulary: <>
           SELECT ?geneID
           WHERE {
             ?geneID ctd_vocabulary:pathway <> .
           }
       b) Query-2: OMIM for gene-disease association
           PREFIX omim_vocabulary: <>
           PREFIX rdf: <>
           SELECT ?gene ?pheno
           WHERE {
             ?gene omim_vocabulary:phenotype ?pheno .
             ?pheno rdf:type omim_vocabulary:Phenotype .
             ?gene rdf:type omim_vocabulary:Gene .
           }
       c) Query-3: NCBI Gene for conversion of gene ID to ENSEMBL ID
           PREFIX geneid_vocabulary: <>
           SELECT ?geneID ?ensemblID
           WHERE {
             ?geneID geneid_vocabulary:has_ensembl_gene_identifier ?ensemblID .
           }
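Queries like these are typically sent to an endpoint over HTTP, with the query text passed in a URL-encoded `query` parameter as described by the SPARQL 1.1 Protocol. The sketch below only builds the request and does not send it. The endpoint address is a hypothetical placeholder, and the `LIMIT 5` is added here for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; substitute the address of a real SPARQL service.
ENDPOINT = "http://example.org/sparql"


def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request("SELECT * WHERE { ?subject ?predicate ?object . } LIMIT 5")
print(request.get_method())
print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would execute the query against a live endpoint.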
  13. Merged Query and Federated Query
       a) Merged Query
           PREFIX omim_vocabulary: <>
           PREFIX rdf: <>
           PREFIX ctd_vocabulary: <>
           PREFIX geneid_vocabulary: <>
           SELECT ?geneID ?pheno
           WHERE {
             ?geneID ctd_vocabulary:pathway <> .
             ?gene omim_vocabulary:xref ?geneID .
             ?gene omim_vocabulary:phenotype ?pheno .
             ?pheno rdf:type omim_vocabulary:Phenotype .
             ?gene rdf:type omim_vocabulary:Gene .
             ?geneID geneid_vocabulary:has_ensembl_gene_identifier ?ensemblID .
           }
       b) Federated Query
           PREFIX omim_vocabulary: <>
           PREFIX rdf: <>
           PREFIX ctd_vocabulary: <>
           PREFIX geneid_vocabulary: <>
           SELECT ?ensemblID ?pheno
           WHERE {
             SERVICE <> {
               ?geneID ctd_vocabulary:pathway <> .
             }
             SERVICE <> {
               ?gene omim_vocabulary:xref ?geneID .
               ?gene omim_vocabulary:phenotype ?pheno .
               ?pheno rdf:type omim_vocabulary:Phenotype .
               ?gene rdf:type omim_vocabulary:Gene .
             }
             SERVICE <> {
               ?geneID geneid_vocabulary:has_ensembl_gene_identifier ?ensemblID .
             }
           }
       Figure 2: a) Merged Query and b) Federated Query for the question defined
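The federated query is in effect a join on `?geneID` across three sources. The sketch below imitates that join over three in-memory lists standing in for the three services. All identifiers in the toy data are hypothetical, and the `rdf:type` checks are omitted for brevity.

```python
# Hypothetical toy data standing in for the three endpoints; no identifier is real.
ctd = [
    ("gene:1", "ctd_vocabulary:pathway", "path:A"),
    ("gene:2", "ctd_vocabulary:pathway", "path:A"),
    ("gene:3", "ctd_vocabulary:pathway", "path:B"),
]
omim = [
    ("omim:g1", "omim_vocabulary:xref", "gene:1"),
    ("omim:g1", "omim_vocabulary:phenotype", "omim:p10"),
    ("omim:g3", "omim_vocabulary:xref", "gene:3"),
    ("omim:g3", "omim_vocabulary:phenotype", "omim:p30"),
]
ncbi = [
    ("gene:1", "geneid_vocabulary:has_ensembl_gene_identifier", "ensembl:E1"),
    ("gene:2", "geneid_vocabulary:has_ensembl_gene_identifier", "ensembl:E2"),
]


def diseases_for_pathway(pathway):
    """Return sorted (ensemblID, phenotype) pairs for genes in the given pathway."""
    # Service 1: genes annotated with the pathway.
    gene_ids = {s for s, p, o in ctd
                if p == "ctd_vocabulary:pathway" and o == pathway}
    rows = set()
    for omim_gene, predicate, gene_id in omim:
        if predicate != "omim_vocabulary:xref" or gene_id not in gene_ids:
            continue
        # Service 2: phenotypes of the OMIM entry cross-referenced to this gene.
        phenotypes = [o for s, p, o in omim
                      if s == omim_gene and p == "omim_vocabulary:phenotype"]
        # Service 3: ENSEMBL identifiers of the same gene.
        ensembl_ids = [o for s, p, o in ncbi
                       if s == gene_id
                       and p == "geneid_vocabulary:has_ensembl_gene_identifier"]
        rows.update((e, ph) for e in ensembl_ids for ph in phenotypes)
    return sorted(rows)


print(diseases_for_pathway("path:A"))  # [('ensembl:E1', 'omim:p10')]
print(diseases_for_pathway("path:B"))  # []
```

In the toy data, `gene:2` has no OMIM entry and `gene:3` has no ENSEMBL identifier, so neither contributes a row: a gene must be present in all three sources to appear in the result.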
  14. Results
       ● Results from the BioMart search (after providing KEGG gene IDs) and the Bio2RDF search (all in one step) were compared for gene ID-OMIM ID matches.
       ● Bio2RDF matched 27 unique ENSEMBL gene IDs from the KEGG04520 pathway with 59 OMIM IDs, whereas the BioMart results included only 50 of these OMIM IDs for the same query, with no additional matches.
       ● The difference between the result sets is likely due to the different OMIM versions searched by the two services. The validity of all results was confirmed against the current build of the OMIM database.
  15. More Use Cases
       Finding important genes (hub genes) through pathway-related disease
           PREFIX ctd_vocabulary: <>
           PREFIX omim_vocabulary: <>
           PREFIX rdf: <>
           SELECT ?symbol (COUNT(DISTINCT ?pathway) AS ?indirect_num)
           WHERE {
             <> ctd_vocabulary:pathway ?pathway .
             ?geneid ctd_vocabulary:pathway ?pathway .
             ?geneid rdf:type ctd_vocabulary:Gene .
             ?geneid ctd_vocabulary:gene-symbol ?symbol .
           }
           GROUP BY ?symbol
           ORDER BY DESC(?indirect_num)
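The aggregation in this query counts, for each gene, how many distinct pathways it shares with the resource written as `<>` (presumably a disease, which is an assumption here), then sorts by that count. A sketch of the same logic in plain Python, with hypothetical gene symbols and pathway names:

```python
# Hypothetical inputs: pathways of the query resource, and pathways per gene symbol.
disease_pathways = {"p1", "p2", "p3"}
gene_pathways = {
    "GENE_A": {"p1", "p2", "p3"},
    "GENE_B": {"p2", "p3", "p9"},
    "GENE_C": {"p4"},
}


def rank_hub_genes(disease_pathways, gene_pathways):
    """Rank genes by the number of distinct pathways shared with the disease."""
    counts = {symbol: len(pathways & disease_pathways)
              for symbol, pathways in gene_pathways.items()}
    # Genes sharing no pathway produce no row in the SPARQL result, so drop them.
    shared = [(symbol, n) for symbol, n in counts.items() if n > 0]
    # Highest count first; ties are broken alphabetically for a stable order.
    return sorted(shared, key=lambda row: (-row[1], row[0]))


print(rank_hub_genes(disease_pathways, gene_pathways))
# [('GENE_A', 3), ('GENE_B', 2)]
```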
  16. More Use Cases
       Finding diseases related to a given SNP (by rsID) through gene association
           PREFIX ctd_vocabulary: <>
           PREFIX omim_vocabulary: <>
           PREFIX pharmgkb_vocabulary: <>
           PREFIX rdf: <>
           PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
           SELECT DISTINCT ?disease_label
           WHERE {
             ?assoc rdf:type pharmgkb_vocabulary:Disease-Gene-Association .
             ?assoc pharmgkb_vocabulary:disease ?disease .
             ?disease rdfs:label ?disease_label .
             ?assoc pharmgkb_vocabulary:gene ?gene .
             ?rsid pharmgkb_vocabulary:gene ?gene .
             FILTER regex( str(?rsid), "rs1801253" ) .
           }
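One detail of the FILTER is worth noting: SPARQL's `regex` succeeds if the pattern occurs anywhere in the string, so the pattern `rs1801253` would also accept a longer identifier such as `rs18012530`. The sketch below reproduces this with Python's `re.search`, which is likewise unanchored. The URIs are hypothetical, and anchoring with `$` assumes the rsID is the last component of the URI.

```python
import re

# Hypothetical SNP URIs; only the trailing rsID matters for this illustration.
rsids = [
    "http://example.org/snp/rs1801253",
    "http://example.org/snp/rs18012530",
    "http://example.org/snp/rs1801252",
]


def regex_filter(values, pattern):
    """Mimic FILTER regex(str(?x), pattern): keep values the pattern matches anywhere."""
    return [value for value in values if re.search(pattern, value)]


# Unanchored, as in the slide: the longer identifier slips through.
print(regex_filter(rsids, "rs1801253"))
# ['http://example.org/snp/rs1801253', 'http://example.org/snp/rs18012530']

# Anchoring the pattern at the end of the string keeps only the exact rsID.
print(regex_filter(rsids, r"rs1801253$"))
# ['http://example.org/snp/rs1801253']
```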
  17. Conclusion
       ● Through the use case built here (pathway-gene-disease), we have shown that Bio2RDF datasets allow different queries to be flexibly built, merged and run in a federated fashion, correctly retrieving in a single run data that cannot be obtained from any other single database or service.
       ● In this paper, we defined a use case that queries multiple distant data sources made semantically available through Bio2RDF. The results were also compared with, and validated against, traditional search techniques.
  18. Future work
       ● This work will continue in two directions:
           – The first is to develop a web interface that helps researchers query multiple data sources through visual query templates, without requiring knowledge of RDF or SPARQL.
           – The second is to develop a monitoring system that keeps researchers aware of updates, across multiple data sources, to data related to their research.