UniProt and the Semantic Web

Published in: Health & Medicine


  1. UniProt and the Semantic Web (Chimezie Ogbuji)
  2. ‘Omics’ Data Challenges
     • Advances in protein science are a major catalyst in the exploding availability of bioinformatics data
     • We have already discussed the dimensions of omics data:
        • Molecular components, interactions, and phenotype observations
     • Data from large-scale experiments are no longer published conventionally but stored in a database
     • Protein sequence databases are among the most comprehensive information resources for scientists
  3. Protein Sequence Databases
     • Universal protein sequence databases cover all species
     • Specialized protein databases are particular to a protein family or organism
     • Sequence repositories
        • A simple registry of sequence records
        • No annotations
     • Curated protein databases
        • Enrich sequence information with links to various sources (primarily scientific literature)
  4. Informatics Challenges
     • The standard data integration challenge is the lack of common conventions
     • This applies not just to notation but also to:
        • Use of identifiers
        • Representation of cross-references
        • Framework for defining terms and the relationships between them
     • Links between omics sources are another important component of data integration
  5. What is UniProt?
     • A comprehensive repository of protein sequences and their functional annotations
     • Curators add value to raw data by annotating it against the scientific literature
     • Objective: the creation and maintenance of stable, comprehensive, and high-quality protein databases, with a high level of accessibility, to facilitate cross-database information retrieval
     • Makes use of Semantic Web technologies to address its challenges
  6. UniProt: Core Activities
     • Sequence archiving
     • Manual (peer-reviewed) and automated curation of sequences
     • Development of a human- and machine-readable UniProt website
     • Interaction with other protein-related databases to expand cross-references
  7. UniProt: Components
     • UniProtKB: protein sequence annotations and metadata
        • Protein name, function, taxonomy, enzyme-specific information, domains, sites, subcellular location, interactions, relationships to disease, etc.
        • Links to external sources: DNA sequence repositories, protein structure databases, protein domain and family databases, and species- and function-specific data collections
     • UniRef: compresses (clusters) sequences at different resolutions
        • Parameterized by the percent identity between two sequences or sub-sequences (100, 90, 50)
     • UniParc: non-redundant database of all publicly available protein sequences
        • Manages globally unique identifiers, the sequence, information on the source database, and a CRC check number
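The UniParc idea of a non-redundant archive can be sketched roughly as follows. This is an illustration only, not UniProt's implementation: the `SequenceArchive` class, the `SEQ000001`-style identifiers, and the source database names are hypothetical, and `zlib.crc32` is used only as a standard-library stand-in for the "CRC check number" the slide mentions.

```python
import zlib


class SequenceArchive:
    """Toy non-redundant archive: one record per distinct sequence (hypothetical design)."""

    def __init__(self):
        self._by_sequence = {}

    def add(self, sequence, source_db):
        """Register a sequence seen in source_db and return its archive identifier."""
        record = self._by_sequence.get(sequence)
        if record is None:
            record = {
                # Hypothetical identifier scheme, assigned in order of first appearance.
                "id": f"SEQ{len(self._by_sequence) + 1:06d}",
                "sequence": sequence,
                # Stand-in checksum; the real archive may use a different CRC variant.
                "crc": zlib.crc32(sequence.encode("ascii")),
                "sources": set(),
            }
            self._by_sequence[sequence] = record
        record["sources"].add(source_db)
        return record["id"]

    def records(self):
        """Return all records, one per distinct sequence."""
        return list(self._by_sequence.values())


archive = SequenceArchive()
first_id = archive.add("MKTAYIAKQR", "sourceDbA")   # made-up sequence and database names
second_id = archive.add("MKTAYIAKQR", "sourceDbB")  # same sequence from another source
other_id = archive.add("MSSHEGGKKK", "sourceDbA")
```

The same sequence submitted from two sources maps to one identifier, and the record simply accumulates the list of source databases.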
  8. Semantic Web Technologies
     • A set of standards for managing web-based content in a way that emphasizes use by an automaton
        • Automaton: a machine that performs a function according to a predetermined set of coded instructions
     • The architectural vision (the Semantic Web) is to extend the standards and best practices behind the World Wide Web with new standards that emphasize the meaning of data over its structure
        • Common data formats
        • Provide a means to make assertions about the world such that an automaton can reason about it through them
     • The vision is often confused with the tools meant to achieve it (i.e., the set of standards)
  9. RDF: Data Model
     • Standardized format for representing arbitrary information as a labelled, directed graph
     • Composed of statements (triples): subject, predicate, object
     • Terms in statements can be Uniform Resource Identifiers (URIs), Blank Nodes (anonymous entities), or Literals
     • Abstract data model: a labelled, directed graph
     • Various serializations: XML-based and text-based
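A minimal sketch of the data model described on this slide, using plain Python data classes rather than an RDF library. The `http://example.org/` namespace, the property names, and the city value are hypothetical; only the subject/predicate/object structure and the three kinds of terms come from the slide.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Statement:
    subject: Union[URI, BlankNode]
    predicate: URI
    obj: Union[URI, BlankNode, Literal]


EX = "http://example.org/"  # hypothetical namespace
john = URI(EX + "JohnSmith")
address = BlankNode("b0")   # an anonymous entity: it has no URI of its own

# A graph is simply a set of statements; each statement is one labelled, directed edge.
graph = {
    Statement(john, URI(EX + "name"), Literal("John Smith")),
    Statement(john, URI(EX + "address"), address),
    Statement(address, URI(EX + "city"), Literal("Springfield")),
}


def outgoing(graph, node):
    """Return the (predicate, object) pairs of all edges leaving a node."""
    return {(s.predicate, s.obj) for s in graph if s.subject == node}
```

Because the graph is a set, adding the same statement twice has no effect, which matches the set-of-triples view of an RDF graph.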
  10. Information About John Smith
  11. Modelling vocabulary: RDFS/OWL
      • RDF Schema (RDFS)
         • Simple, minimal schema language for RDF
      • Web Ontology Language (OWL)
         • Vocabulary for defining classes, relationships, and various constraints that limit how RDF is interpreted
         • More powerful modeling language
      • Tools for constraining and defining reality that can be used to codify scientific understanding
         • The Gene Ontology is modelled in this way to capture our understanding of macromolecular reality
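One way to see what a schema language adds is subclass inference: if a resource is stated to belong to a class, it also belongs to every superclass. The sketch below illustrates that rule with a hypothetical hierarchy; the `ex:` class names are made up for illustration and are not UniProt's actual vocabulary.

```python
# Hypothetical subclass statements (child -> list of direct parents).
SUBCLASS_OF = {
    "ex:DiseaseAnnotation": ["ex:Annotation"],
    "ex:FunctionAnnotation": ["ex:Annotation"],
    "ex:Annotation": ["ex:Statement"],
}


def superclasses(cls, subclass_of=SUBCLASS_OF):
    """Return every class reachable from cls by following subclass links."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(stated_type, subclass_of=SUBCLASS_OF):
    """Return the stated type plus all types entailed by the subclass hierarchy."""
    return {stated_type} | superclasses(stated_type, subclass_of)
```

Here `inferred_types("ex:DiseaseAnnotation")` yields the stated class together with `ex:Annotation` and `ex:Statement`, none of which had to be stored explicitly.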
  12. Query Language: SPARQL
      • Provides a common graph-matching language for querying RDF data
      • Similar to SQL in many respects
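To make "graph matching" concrete, here is a toy matcher for basic graph patterns, written as a sketch rather than a SPARQL engine. Terms are plain strings, names starting with `?` are variables, and the `ex:` identifiers are hypothetical. The pattern list corresponds roughly to the SPARQL query shown in the comment.

```python
# Roughly equivalent SPARQL (with hypothetical ex: terms):
#   SELECT ?protein ?organism WHERE {
#     ?protein rdf:type ex:Protein .
#     ?protein ex:organism ?organism .
#   }

GRAPH = [
    ("ex:protein1", "rdf:type", "ex:Protein"),
    ("ex:protein1", "ex:organism", "ex:organismA"),
    ("ex:gene1", "rdf:type", "ex:Gene"),
]

PATTERNS = [
    ("?protein", "rdf:type", "ex:Protein"),
    ("?protein", "ex:organism", "?organism"),
]


def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple; return None if it cannot."""
    extended = dict(binding)
    for pattern_term, triple_term in zip(pattern, triple):
        if pattern_term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if extended.setdefault(pattern_term, triple_term) != triple_term:
                return None
        elif pattern_term != triple_term:
            return None
    return extended


def query(graph, patterns):
    """Return all variable bindings that satisfy every triple pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


results = query(GRAPH, PATTERNS)
```

The shared variable `?protein` plays the role of a join condition, which is where the resemblance to SQL comes from.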
  13. Nature of UniProt Data
      • Very large number of cross-references to external resources
      • The cross-reference topology is that of a graph, not a tree
      • Automated and manual annotation require storage of provenance information (how and when data was acquired)
      • Requires a framework for both data and metadata (data about data)
  14. UniProt Distribution
  15. UniProt: Data Conventions
      • All outbound RDF statements (statements about the same subject) are grouped together
      • Each dataset (a node in the previous slide's graph) is distributed as a single file
      • Only stated data is stored, not entailed data
         • For instance, relationships involving symmetric properties are stored in one direction only
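The stated-versus-entailed distinction can be illustrated with a symmetric property. In the sketch below the stored data holds the relationship in one direction, and a consumer derives the reverse direction on demand. The property and resource names are hypothetical.

```python
# Stated data: the symmetric relationship is stored in one direction only.
STATED = {
    ("ex:proteinA", "ex:interactsWith", "ex:proteinB"),
    ("ex:proteinA", "ex:locatedIn", "ex:membrane"),
}

# Hypothetical declaration of which properties are symmetric.
SYMMETRIC_PROPERTIES = {"ex:interactsWith"}


def with_symmetric_entailments(triples, symmetric_properties):
    """Return the stated triples plus the reverse of each symmetric statement."""
    entailed = set(triples)
    for subject, predicate, obj in triples:
        if predicate in symmetric_properties:
            entailed.add((obj, predicate, subject))
    return entailed


closure = with_symmetric_entailments(STATED, SYMMETRIC_PROPERTIES)
```

Storing only the stated direction keeps the distributed files smaller, at the cost of requiring consumers to apply this kind of rule themselves.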
  16. UniProt: Naming Conventions
      • Generally, in semiotics, a symbol denotes a referent
      • In Web architecture, URIs identify resources
         • URIs that can be resolved over the web are URLs
      • UniProt URIs identify:
         • Resources that correspond to database entries
         • Modeling vocabulary that uses standard namespaces: RDFS and OWL
         • Classes and properties used by UniProt
            • For example: http://purl.uniprot.org/core/Gene
         • Resources without stable identifiers (from their source)
  17. The Omics Identification Problem
      • UniProt uses a templated naming convention:
         • http://purl.uniprot.org/{database}/{identifier}
         • http://purl.uniprot.org/uniprot/{protein_identifier}
      • Problem
         • http://purl.uniprot.org/uniprot/P04926 denotes the Malaria protein EX-1
         • If loading that address in a browser returns a web page, can an automaton infer that Malaria protein EX-1 is a web page?
         • How do you identify abstract concepts vs. digital media?
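The templated naming convention lends itself to a small helper. The sketch below builds and parses URIs of the form shown on the slide; the function names `purl` and `split_purl` are hypothetical, and percent-encoding the components is an assumption made here for safety rather than something the slide specifies.

```python
from urllib.parse import quote, unquote

PURL_BASE = "http://purl.uniprot.org"


def purl(database, identifier):
    """Fill in the template http://purl.uniprot.org/{database}/{identifier}."""
    return f"{PURL_BASE}/{quote(database, safe='')}/{quote(identifier, safe='')}"


def split_purl(uri):
    """Recover (database, identifier) from a URI that follows the template."""
    prefix = PURL_BASE + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a purl.uniprot.org URI: {uri!r}")
    database, _, identifier = uri[len(prefix):].partition("/")
    if not database or not identifier:
        raise ValueError(f"URI does not match the template: {uri!r}")
    return unquote(database), unquote(identifier)


protein_uri = purl("uniprot", "P04926")
```

Note that the helper only manipulates names. It says nothing about what the URI denotes, which is exactly the problem the slide raises.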
  18. The PURL Solution
      • A Persistent Uniform Resource Locator (PURL) service is a public URI management service that allocates a ‘URI space’: a mapping of identifiers (aliases) to resources that the service is not itself responsible for
      • PURLs are web addresses that act as permanent identifiers in the face of a dynamic and changing Web infrastructure
      • A request to a PURL returns a 303 HTTP status code and a location:
         • 303 (See Other) indicates that a response can be found at the returned location
  19. The PURL Solution: Continued
      • PURL addresses can be used to identify abstract concepts
      • Requests to such addresses are redirected to an informative web page (for humans) that gives machines a means to extract other formats
      • RDF statements are about proteins, machines can reason about proteins, and humans resolve protein identifiers to view informative web pages
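The 303 behaviour can be sketched offline with a simulated web, so the example makes no real network requests. The table of responses below is an assumption made for illustration: the redirect target and content types are plausible guesses, not a record of what the UniProt servers actually return. The point is the rule a client applies: a URI that answers with 303 is treated as naming something other than the document found at the redirect target.

```python
# Simulated responses (assumed for illustration; not captured from real servers).
SIMULATED_WEB = {
    "http://purl.uniprot.org/uniprot/P04926": (
        303, {"Location": "http://www.uniprot.org/uniprot/P04926"}),
    "http://www.uniprot.org/uniprot/P04926": (
        200, {"Content-Type": "text/html"}),
    "http://www.uniprot.org/uniprot/P04926.rdf": (
        200, {"Content-Type": "application/rdf+xml"}),
}


def dereference(uri, web=SIMULATED_WEB, max_hops=5):
    """Follow 303 redirects and report whether uri itself names a document."""
    current = uri
    for _ in range(max_hops):
        status, headers = web[current]
        if status == 303:
            # See Other: a description of the resource lives elsewhere.
            current = headers["Location"]
            continue
        return {
            "identifies_document": current == uri,
            "description_url": current,
            "content_type": headers.get("Content-Type"),
        }
    raise RuntimeError("too many redirects")


protein = dereference("http://purl.uniprot.org/uniprot/P04926")
page = dereference("http://www.uniprot.org/uniprot/P04926")
```

With this rule an automaton has no reason to conclude that the protein is a web page: the PURL names the protein, and the page it is redirected to merely describes it.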
  20. RDF/XML link: http://www.uniprot.org/uniprot/P04926.rdf
  21. UniProt: Protein Class
  22. UniProt: Annotation Hierarchy
  23. Serendipitous Re-use
      • Having a rich repository of protein sequence metadata, annotations, and taxonomic classification in a distributed, standard format encourages scientific collaboration
  24. General UniProt Re-Use Scenario
      • User A refers to protein P1 in their dataset
      • User A’s dataset doesn’t include statements about P1 (the host organism, for instance)
      • User B comes across this dataset and, to find out more about protein P1, puts the URI of P1 in their browser and pulls up human-readable information about it (including the host organism)
      • Automaton C comes across the same dataset, fetches the web page, fetches the RDF about P1, has access to the same information as User B, and can reason about the major taxon the host organism belongs to
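Automaton C's side of this scenario can be sketched as a merge of two triple sets followed by a walk up a taxonomy. Everything named here is hypothetical: the `ex:` properties, the placeholder URI for P1, and the taxa stand in for whatever the real datasets contain, and the fetch step is replaced by a hard-coded set of triples.

```python
P1 = "ex:P1"  # placeholder for the URI of protein P1

# User A's dataset mentions P1 but says nothing else about it.
dataset_a = {("ex:experiment1", "ex:measured", P1)}

# What Automaton C would obtain by fetching the RDF about P1 (hard-coded here).
fetched_rdf = {
    (P1, "ex:organism", "ex:organismX"),
    ("ex:organismX", "ex:subTaxonOf", "ex:genusY"),
    ("ex:genusY", "ex:subTaxonOf", "ex:kingdomZ"),
}

# Because both sources use the same identifier for P1, merging is a set union.
merged = dataset_a | fetched_rdf


def organisms_of(graph, protein):
    """Return the organisms stated for a protein."""
    return {o for s, p, o in graph if s == protein and p == "ex:organism"}


def lineage(graph, taxon):
    """Return the chain of ancestors of a taxon, nearest first."""
    parents = {s: o for s, p, o in graph if p == "ex:subTaxonOf"}
    chain = []
    seen = {taxon}
    while taxon in parents and parents[taxon] not in seen:
        taxon = parents[taxon]
        seen.add(taxon)
        chain.append(taxon)
    return chain


hosts = organisms_of(merged, P1)
major_taxa = {host: lineage(merged, host)[-1] for host in hosts}
```

No mapping step is needed between the two sources because they already agree on the identifier, which is the practical payoff of the shared URI scheme.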
  25. References
      • Wu, C. et al., “The Universal Protein Resource (UniProt): an expanding universe of protein information”. Nucleic Acids Research, vol. 34, 2006.
      • Swiss Institute of Bioinformatics, “UniProt RDF (project page)”. http://dev.isb-sib.ch/projects/uniprot-rdf/
      • Redaschi, N. and UniProt Consortium, “UniProt in RDF: Tackling Data Integration and Distributed Annotation”. Nature Precedings, 3rd International Biocuration Conference, April 2009. http://precedings.nature.com/documents/3193/version/1
