UniProt and the Semantic Web
                     Chimezie Ogbuji
‘Omics’ Data Challenges
 Advances in protein science are a major catalyst in the
  exploding availability of bioinformatics data
 We have already discussed the dimensions of omics
  data:
   Molecular components, interactions, and phenotype
    observations

 Data from large-scale experiments are no longer
  published conventionally but stored in a database
 Protein sequence databases are among the most
  comprehensive information resources for scientists
Protein Sequence Databases
 Universal protein sequence databases cover all species

 Specialized protein databases are particular to a protein
  family or organism

 Sequence repositories
    A simple registry of sequence records
   No annotations

 Curated protein databases
   Enrich sequence information with links to various sources
    (scientific literature primarily)
Informatics Challenges
 A standard data integration challenge is the lack of
  common conventions

 This applies not just to notation but also to:
   Use of identifiers
   Representation of cross-references
   Framework for defining terms and relationships between
    them

 Links between omics sources are another important
  component of data integration
What is UniProt?
 A comprehensive repository of protein sequences and
  their functional annotations

 Curators add value to raw data by annotating it against
  the scientific literature

 Objective: the creation and maintenance of stable,
  comprehensive, and high-quality protein databases
  with a high level of accessibility, to facilitate
  cross-database information retrieval

 Makes use of Semantic Web technologies to address its
  challenges
UniProt: Core Activities
 Sequence archiving

 Manual (peer-reviewed) and automated curation of
  sequences

 Development of a human- and machine-readable UniProt web
  site

 Interaction with other protein-related databases for
  expanding cross references
UniProt: Components
  UniProtKB – Protein sequence annotations and metadata:
     Protein name, function, taxonomy, enzyme-specific
     information, domains, sites, subcellular location, interactions,
     relationships to disease, etc.
    Links to external sources: DNA sequence repositories, protein
     structure databases, protein domain and family databases, and
     species & function-specific data collections
  UniRef – Compresses sequences into clusters at different resolutions
     Parameterized by the percent identity between two sequences or
     sub-sequences (100%, 90%, 50%).
  UniParc – Non-redundant database of all publicly
   available protein sequences
     Manages globally unique identifiers, the sequence, information
     on the source database, and a CRC checksum.
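A minimal sketch of the non-redundancy idea behind UniParc, assuming a toy in-memory archive: each distinct sequence gets one record, and every source database that reports it is attached to that record. The class name, the identifier scheme, and the use of `zlib.crc32` are illustrative assumptions; the slide says only "CRC" and does not specify the algorithm.

```python
import zlib


class SequenceArchive:
    """Toy non-redundant archive: one record per distinct sequence."""

    def __init__(self):
        self._by_sequence = {}  # sequence string -> record dict

    def add(self, sequence, source_db):
        """Register a sequence seen in source_db and return its single record."""
        record = self._by_sequence.get(sequence)
        if record is None:
            record = {
                # Hypothetical identifier scheme, not the real UniParc format.
                "id": f"SEQ{len(self._by_sequence) + 1:06d}",
                "sequence": sequence,
                "sources": set(),
                # Stand-in checksum; the actual CRC variant is an assumption.
                "crc": zlib.crc32(sequence.encode("ascii")),
            }
            self._by_sequence[sequence] = record
        record["sources"].add(source_db)
        return record

    def __len__(self):
        return len(self._by_sequence)


archive = SequenceArchive()
first = archive.add("MKTAYIAKQR", "db_a")   # made-up sequence and source names
second = archive.add("MKTAYIAKQR", "db_b")  # same sequence, different source
third = archive.add("MSLNQ", "db_a")
```

Adding the same sequence twice returns the same record, so the archive grows only when a new sequence appears.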
Semantic Web Technologies
 Set of standards for managing web-based content in a way
  that emphasizes use by an automaton
   Automaton: a machine that performs a function according to
    a predetermined set of coded instructions
 The architectural vision (the Semantic Web) is to extend the
  standards and best practices behind the World Wide Web with
  new standards that emphasize the meaning of data over its
  structure.
   Common data formats
   Provide a means to make assertions about the world such that
     an automaton can reason about it through them
 The vision is often confused with the tools meant to achieve
  it (i.e., the set of standards)
RDF: Data Model
 Standardized format for representing arbitrary
  information as a labelled, directed graph

 Comprised of statements: subject, predicate, object

 Terms in statements can be Uniform Resource
  Identifiers (URIs), Blank Nodes (anonymous entities), or
  Literals

 Abstract data model: a labelled, directed graph

 Various serializations: XML-based and text-based
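As a rough illustration of the data model above, the following sketch represents an RDF graph as a Python set of (subject, predicate, object) statements. The three term classes and the `example.org` predicates are hypothetical and exist only for this example; the two `purl.uniprot.org` URIs are the ones that appear elsewhere in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str


# Hypothetical vocabulary namespace, used only for illustration.
EX = "http://example.org/vocab/"

protein = URI("http://purl.uniprot.org/uniprot/P04926")
gene_class = URI("http://purl.uniprot.org/core/Gene")
gene = BlankNode("g1")  # an anonymous entity with no URI of its own

# A graph is simply a set of statements; each statement is one labelled edge.
graph = {
    (protein, URI(EX + "label"), Literal("Malaria protein EX-1")),
    (protein, URI(EX + "encodedBy"), gene),
    (gene, URI(EX + "type"), gene_class),
}


def outgoing(graph, subject):
    """Return the (predicate, object) pairs of all edges leaving subject."""
    return {(p, o) for s, p, o in graph if s == subject}
```

Because the terms are frozen dataclasses, a URI and a literal with the same text are still distinct nodes, which mirrors the distinction the data model draws between the three kinds of term.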
Information About John Smith
Modelling vocabulary: RDFS/OWL
 RDF Schema (RDFS)
   Simple, minimal schema language for RDF

 Web Ontology Language (OWL)
   Vocabulary for defining classes, relationships, and various
    constraints that limit how RDF is interpreted
   More powerful modeling language

 Tools for constraining & defining reality that can be
  used to codify scientific understanding
 Gene Ontology is modelled in this way to capture our
  understanding of macromolecular reality
Query Language: SPARQL
 Provides a common graph-matching language for
  querying RDF data

 Similar to SQL in many respects
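To give a feel for what "graph matching" means, here is a small sketch of how a conjunction of triple patterns could be evaluated against a set of statements. It is not a SPARQL engine, and the `ex:` terms and data are made up for the example. Terms beginning with `?` are treated as variables, as in SPARQL.

```python
def match_pattern(pattern, triple, bindings):
    """Extend bindings so that pattern matches triple, or return None."""
    new = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def query(graph, patterns):
    """Return every variable binding that satisfies all patterns at once."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in graph
            if (extended := match_pattern(pattern, triple, bindings)) is not None
        ]
    return solutions


# Hypothetical data: two proteins, their host organisms, and organism names.
graph = {
    ("ex:P1", "ex:organism", "ex:T1"),
    ("ex:P2", "ex:organism", "ex:T2"),
    ("ex:T1", "ex:name", "Organism one"),
    ("ex:T2", "ex:name", "Organism two"),
}

patterns = [
    ("?protein", "ex:organism", "?taxon"),
    ("?taxon", "ex:name", "?name"),
]
results = query(graph, patterns)
```

The shared variable `?taxon` plays the role that a join condition plays in SQL, which is one way to read the comparison on this slide.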
Nature of UniProt Data
 Very large number of cross references to external
  resources

 Cross-reference topology is that of a graph, not a tree

 Automated and manual annotation require storage of
  provenance information (how / when data was
  acquired)

 Requires a framework for both data and metadata
  (data about data)
UniProt Distribution
UniProt: Data Conventions
 All outbound RDF statements (statements about the same
  subject) are grouped together

 Each dataset (a node in the previous graph) is distributed
  as a single file

 Only stores stated data, not entailed data.
   For instance, relationships involving symmetric properties
    are only stored in one direction
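The stated-versus-entailed distinction can be sketched as follows, assuming a consumer that wants the reverse direction of symmetric relationships materialized. The property names and data are hypothetical, and a real reasoner would derive which properties are symmetric from the ontology rather than from a hand-written set.

```python
def with_symmetric_entailments(graph, symmetric_properties):
    """Return the stated triples plus the reverse of each symmetric one."""
    entailed = set(graph)
    for s, p, o in graph:
        if p in symmetric_properties:
            entailed.add((o, p, s))
    return entailed


# Stated data: the symmetric relationship is stored in one direction only.
stated = {
    ("ex:P1", "ex:interactsWith", "ex:P2"),
    ("ex:P1", "ex:organism", "ex:T1"),
}

full = with_symmetric_entailments(stated, {"ex:interactsWith"})
```

The distributed files stay smaller, and the extra statements can be regenerated by whoever needs them.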
UniProt: Naming Conventions
 Generally, in semiotics: a symbol denotes a referent.

 In Web architecture, URIs identify resources
   URIs that can be resolved over the web are URLs

 UniProt URIs identify:
   Resources that correspond to database entries
    Modeling vocabulary that uses standard namespaces: RDFS
    and OWL
    Classes and properties used by UniProt
      For example: http://purl.uniprot.org/core/Gene
   Resources without stable identifiers (from their source)
The Omics Identification Problem
 UniProt uses a templated naming convention:
   http://purl.uniprot.org/{database}/{identifier}
   http://purl.uniprot.org/uniprot/{protein_identifier}

 Problem
     http://purl.uniprot.org/uniprot/P04926 denotes the Malaria
      protein EX-1
     If loading that address in a browser returns a web page, can an
      automaton infer that Malaria protein EX-1 is a web page?
      How do you identify abstract concepts vs. digital media?
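The templated naming convention on this slide can be sketched as a pair of helper functions. The function names are hypothetical; only the base address and the `{database}/{identifier}` layout come from the slide.

```python
BASE = "http://purl.uniprot.org"


def build_uri(database, identifier):
    """Fill in the template http://purl.uniprot.org/{database}/{identifier}."""
    return f"{BASE}/{database}/{identifier}"


def parse_uri(uri):
    """Split a templated URI back into (database, identifier)."""
    prefix = BASE + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a templated UniProt URI: {uri}")
    database, sep, identifier = uri[len(prefix):].partition("/")
    if not sep or not database or not identifier:
        raise ValueError(f"missing database or identifier: {uri}")
    return database, identifier


protein_uri = build_uri("uniprot", "P04926")
```

Note that the resulting string is only a name. Nothing in the template itself says whether it names a protein or a document about the protein, which is the problem the slide raises.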
The PURL Solution
 A Persistent Uniform Resource Locator (PURL) service is a
  public URI management service that allocates a ‘URI space’:
  a mapping of identifiers (aliases) to resources that the
  service is not itself responsible for
 PURLs are web addresses that act as permanent
  identifiers in the face of a dynamic and changing Web
  infrastructure
 A request to a PURL returns a 303 HTTP status code and
  a location:
   303 indicates that a response can be found under the
    returned location
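The redirect behaviour described above can be simulated without any network access. In this sketch the lookup table and function names are hypothetical, and the target address is an assumption derived from the RDF/XML link shown later in the slides; a real PURL service would answer over HTTP.

```python
from http import HTTPStatus

# Hypothetical alias table: persistent identifier -> current location.
# The target address is an assumption made for this example.
PURL_TABLE = {
    "http://purl.uniprot.org/uniprot/P04926":
        "http://www.uniprot.org/uniprot/P04926",
}


def resolve(purl):
    """Mimic a PURL server: answer 303 with a Location header, or 404."""
    target = PURL_TABLE.get(purl)
    if target is None:
        return HTTPStatus.NOT_FOUND, {}
    return HTTPStatus.SEE_OTHER, {"Location": target}


def follow(purl):
    """Mimic a client: return the address to fetch next, or None."""
    status, headers = resolve(purl)
    if status == HTTPStatus.SEE_OTHER:
        return headers["Location"]
    return None


status, headers = resolve("http://purl.uniprot.org/uniprot/P04926")
```

If the target site is reorganized, only the table entry changes; the identifier used in published data stays the same.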
The PURL Solution: Continued
 Can use PURL addresses to identify abstract concepts

 Redirect requests to such addresses to an informative
  web page (for humans) with a means for machines to
  extract other formats

 RDF statements are about proteins, machines can
  reason about proteins, and humans resolve protein
  identifiers to view informative web pages
 RDF/XML link:

    http://www.uniprot.org/uniprot/P04926.rdf
UniProt: Protein Class
UniProt: Annotation Hierarchy
Serendipitous Re-use
 Having a rich repository of protein sequence metadata,
  annotations, and taxonomic classification in a
  distributed, standard format encourages scientific
  collaboration
General UniProt Re-Use Scenario
 User A refers to protein P1 in their dataset
    User A’s dataset does not include further statements about P1
    (its host organism, for instance)

 User B comes across this dataset and (in order to find
  out more about protein P1) puts the URI of protein P1
  in their browser and pulls up human-readable
  information about it (including the host organism)
 Automaton C comes across the same dataset, fetches
  the web page, fetches the RDF about P1 and has access
  to the same information as user B and can reason about
  the major taxon the host organism belongs to
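The scenario for Automaton C can be sketched as a merge of two sets of statements followed by a walk up the taxonomy. Everything here is placeholder data under `example.org`: the URIs, the property names, and the `fetch_rdf` stand-in, which reads from a local table instead of making an HTTP request.

```python
# All URIs below are hypothetical placeholders.
P1 = "http://example.org/protein/P1"
ORGANISM = "http://example.org/vocab/organism"
PARENT = "http://example.org/vocab/parentTaxon"
SPECIES = "http://example.org/taxon/species"
GENUS = "http://example.org/taxon/genus"
KINGDOM = "http://example.org/taxon/kingdom"

# User A's dataset mentions P1 but says nothing else about it.
dataset_a = {
    ("http://example.org/experiment/1", "http://example.org/vocab/measured", P1),
}

# Stand-in for what dereferencing P1 would return as RDF.
REMOTE_RDF = {
    P1: {
        (P1, ORGANISM, SPECIES),
        (SPECIES, PARENT, GENUS),
        (GENUS, PARENT, KINGDOM),
    }
}


def fetch_rdf(uri):
    """Hypothetical fetch: return the statements published about uri."""
    return set(REMOTE_RDF.get(uri, set()))


def objects(graph, subject, predicate):
    """Return all objects of statements with this subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def ancestors(graph, node, predicate):
    """Follow predicate edges transitively, nearest ancestor first."""
    found, seen, frontier = [], {node}, [node]
    while frontier:
        current = frontier.pop()
        for parent in objects(graph, current, predicate):
            if parent not in seen:
                seen.add(parent)
                found.append(parent)
                frontier.append(parent)
    return found


# Merging two RDF graphs is a set union of their statements.
merged = dataset_a | fetch_rdf(P1)
host = objects(merged, P1, ORGANISM)[0]
lineage = ancestors(merged, host, PARENT)
```

Before the merge, the dataset yields no host organism for P1; afterwards the automaton can reach both the organism and its higher-level taxa, because both sources use the same URI for the protein.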
References

 Wu, C. et al., “The Universal Protein Resource
  (UniProt): an expanding universe of protein
  information”. Nucleic Acids Research, vol. 34, 2006.

 Swiss Institute of Bioinformatics, “UniProt RDF (project
  page)”. http://dev.isb-sib.ch/projects/uniprot-rdf/

 Redaschi, N. and UniProt Consortium, “UniProt in RDF:
  Tackling Data Integration and Distributed Annotation”.
  Nature Precedings, 3rd International Biocuration
  Conference, April 2009.
  http://precedings.nature.com/documents/3193/version/1
