UniProt and the Semantic Web

Chimezie Ogbuji

‘Omics’ Data Challenges
 Advances in protein science are a major catalyst for the
exploding availability of bioinformatics data
 We have already discussed the dimensions of omics
data:
 Molecular components, interactions, and phenotype
observations

 Data from large-scale experiments are no longer
published conventionally but stored in a database
 Protein sequence databases are among the most
comprehensive information resources for scientists

Protein Sequence Databases
 Universal protein sequence databases cover all species

 Specialized protein databases are particular to a protein
family or organism

 Sequence repositories
 A simple registry of sequence records
 No annotations

 Curated protein databases
 Enrich sequence information with links to various sources
(scientific literature primarily)

Informatics Challenges
 A standard data integration challenge is the lack of
common conventions

 This applies not just to notation but also to:
 Use of identifiers
 Representation of cross-references
 Framework for defining terms and relationships between
them

 Links between omics sources are another important
component of data integration

What is UniProt?
 A comprehensive repository of protein sequences and
their functional annotations

 Curators add value to raw data by annotating it against
the scientific literature

 Objective: the creation and maintenance of stable,
comprehensive, and high-quality protein databases,
with a high level of accessibility, to facilitate
cross-database information retrieval

 Makes use of Semantic Web technologies to address its
challenges

UniProt: Core Activities
 Sequence archiving

 Manual (peer-reviewed) and automated curation of
sequences

 Development of a human- and machine-readable UniProt
web site

 Interaction with other protein-related databases for
expanding cross references

UniProt: Components
 UniProtKB – Protein sequence annotations and metadata:
 Protein name, function, taxonomy, enzyme-specific
information, domains, sites, subcellular location, interactions,
relationships to disease etc.
 Links to external sources: DNA sequence repositories, protein
structure databases, protein domain and family databases, and
species & function-specific data collections
 UniRef – Compresses sequences into clusters at different resolutions
 Parameterized by the percent identity between two sequences or
sub-sequences (100, 90, 50).
 UniParc – Non-redundant database of all publicly
available protein sequences
 Manages globally unique identifiers, the sequence, information
on the source database, and a CRC check number.

Semantic Web Technologies
 Set of standards for managing web-based content in a way
that emphasizes use by an automaton
 Automaton: a machine that performs a function according to
a predetermined set of coded instructions
 The architectural vision (the Semantic Web) is to extend the
standards and best practices behind the World Wide Web with
new standards that emphasize the meaning of data over its
structure.
 Common data formats
 Provide a means to make assertions about the world such that
an automaton can reason about it through them
 The vision is often confused with the tools meant to achieve
it (i.e., the set of standards)

RDF: Data Model
 Standardized format for representing arbitrary
information as a labelled, directed graph

 Composed of statements (triples): subject, predicate, object

 Terms in statements can be Uniform Resource
Identifiers (URIs), Blank Nodes (anonymous entities), or
Literals

 Abstract data model: a labelled, directed graph

 Various serializations: XML-based and text-based
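
A minimal sketch of the data model above, using plain Python values rather than any RDF library. The `URI`, `BlankNode`, and `Literal` classes and the `core/annotation` property are illustrative assumptions, not UniProt's actual schema; only the `rdfs:label` and `rdfs:comment` URIs are standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class URI:
    value: str

@dataclass(frozen=True)
class BlankNode:
    label: str  # only meaningful inside this graph

@dataclass(frozen=True)
class Literal:
    value: str

RDFS = "http://www.w3.org/2000/01/rdf-schema#"
CORE = "http://purl.uniprot.org/core/"

protein = URI("http://purl.uniprot.org/uniprot/P04926")
note = BlankNode("a1")  # anonymous entity: it has no URI of its own

# Each statement is a (subject, predicate, object) triple; the set of
# triples is the labelled, directed graph.
graph = {
    (protein, URI(RDFS + "label"), Literal("Malaria protein EX-1")),
    (protein, URI(CORE + "annotation"), note),  # hypothetical property
    (note, URI(RDFS + "comment"), Literal("example annotation text")),
}

def outgoing(graph, subject):
    """Return the (predicate, object) edges that leave `subject`."""
    return [(p, o) for s, p, o in graph if s == subject]
```

Calling `outgoing(graph, protein)` yields the two edges that leave the protein node: one ends in a literal, the other in the blank node.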

Modelling vocabulary: RDFS/OWL
 RDF Schema (RDFS)
 Simple, minimal schema language for RDF

 Web Ontology Language (OWL)
 Vocabulary for defining classes, relationships, and various
constraints that limit how RDF is interpreted
 More powerful modeling language

 Tools for constraining & defining reality that can be
used to codify scientific understanding
 Gene Ontology is modelled in this way to capture our
understanding of macromolecular reality
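
As a rough illustration of how a modelling vocabulary constrains interpretation, the sketch below applies one RDFS rule (an instance of a class is also an instance of its superclasses) to a toy graph. The `example.org` classes are hypothetical; only the `rdf:type` and `rdfs:subClassOf` URIs are standard.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/"  # hypothetical namespace

stated = {
    (EX + "Enzyme", RDFS_SUBCLASS, EX + "Protein"),
    (EX + "Protein", RDFS_SUBCLASS, EX + "Macromolecule"),
    (EX + "p1", RDF_TYPE, EX + "Enzyme"),
}

def entail_types(triples):
    """Add rdf:type triples implied by rdfs:subClassOf until nothing changes."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, cls in list(closed):
            if p != RDF_TYPE:
                continue
            for sub, p2, sup in list(closed):
                if p2 == RDFS_SUBCLASS and sub == cls:
                    inferred = (s, RDF_TYPE, sup)
                    if inferred not in closed:
                        closed.add(inferred)
                        changed = True
    return closed
```

Only `p1 rdf:type Enzyme` is stated; after `entail_types(stated)` the graph also says that `p1` is a `Protein` and a `Macromolecule`.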

Query Language: SPARQL
 Provides a common graph-matching language for
querying RDF data

 Similar to SQL in many respects
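
To make "graph matching" concrete, here is a toy matcher for a list of triple patterns, which is roughly what the WHERE clause of a SPARQL query expresses. It is a sketch, not a SPARQL engine; the `ex:` names and the data are made up for illustration.

```python
# Roughly equivalent SPARQL (hypothetical ex: vocabulary):
#   SELECT ?protein ?name
#   WHERE { ?protein ex:organism ?taxon . ?taxon ex:name ?name }

GRAPH = [
    ("ex:P1", "ex:organism", "ex:T1"),
    ("ex:P2", "ex:organism", "ex:T2"),
    ("ex:T1", "ex:name", "Organism one"),
    ("ex:T2", "ex:name", "Organism two"),
]

QUERY = [
    ("?protein", "ex:organism", "?taxon"),
    ("?taxon", "ex:name", "?name"),
]

def match(graph, patterns, binding=None):
    """Yield every variable binding under which all patterns occur in graph."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(graph, rest, candidate)

results = list(match(GRAPH, QUERY))
```

The shared variable `?taxon` plays the role of a join condition in SQL: it ties each protein to the name of its own organism.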

Nature of UniProt Data
 Very large number of cross references to external
resources

 Cross-reference topology is that of a graph, not a tree

 Automated and manual annotation require storage of
provenance information (how / when data was
acquired)

 Requires a framework for both data and metadata
(data about data)

UniProt: Data Conventions
 All outbound RDF statements are grouped together
(statements about the same subject)

 Each dataset (a node in the previous graph) is distributed
as a single file

 Only stores stated data, not entailed data.
 For instance, relationships involving symmetric properties
are only stored in one direction
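
The stated/entailed distinction can be sketched as follows: the file holds one direction of a symmetric relationship, and a consumer derives the other on load. The property name `ex:interactsWith` and the protein names are hypothetical.

```python
def with_symmetric_entailments(stated, symmetric_properties):
    """Return the stated triples plus the reverse of every symmetric one."""
    entailed = set(stated)
    for s, p, o in stated:
        if p in symmetric_properties:
            entailed.add((o, p, s))
    return entailed

# Only one direction is stored (hypothetical data).
stated = {("ex:P1", "ex:interactsWith", "ex:P2")}

full = with_symmetric_entailments(stated, {"ex:interactsWith"})
```

After this step `full` contains both `P1 interactsWith P2` and `P2 interactsWith P1`, while the distributed file only needed the first.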

UniProt: Naming Conventions
 Generally, in semiotics: a symbol denotes a referent.

 In Web architecture, URIs identify resources
 URIs that can be resolved over the web are URLs

 UniProt URIs identify:
 Resources that correspond to database entries
 Modeling vocabulary that uses standard namespaces: RDFS
and OWL
 Classes and properties used by UniProt
 For example: http://purl.uniprot.org/core/Gene
 Resources without stable identifiers (from their source)

The Omics Identification Problem
 UniProt uses a templated naming convention:
 http://purl.uniprot.org/{database}/{identifier}
 http://purl.uniprot.org/uniprot/{protein_identifier}
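
The template above can be filled in and taken apart mechanically. The helper names below are illustrative, not part of any UniProt tooling.

```python
from urllib.parse import quote, unquote

BASE = "http://purl.uniprot.org"

def uniprot_uri(database, identifier):
    """Fill in the {database}/{identifier} template."""
    return f"{BASE}/{quote(database, safe='')}/{quote(identifier, safe='')}"

def split_uniprot_uri(uri):
    """Recover (database, identifier) from a URI built with the template."""
    prefix = BASE + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a purl.uniprot.org URI: {uri}")
    database, _, identifier = uri[len(prefix):].partition("/")
    if not database or not identifier:
        raise ValueError(f"URI does not match the template: {uri}")
    return unquote(database), unquote(identifier)
```

For example, `uniprot_uri("uniprot", "P04926")` gives `http://purl.uniprot.org/uniprot/P04926`, and `split_uniprot_uri` turns that back into `("uniprot", "P04926")`.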

 Problem
 http://purl.uniprot.org/uniprot/P04926 denotes the Malaria
protein EX-1
 If loading that address in a browser returns a web page, can an
automaton infer that Malaria protein EX-1 is a web page?
 How do you distinguish identifiers for abstract concepts from
identifiers for digital media?

The PURL Solution
 A Persistent Uniform Resource Locator (PURL) service is a
public URI management service: it allocates a ‘URI space’ of
identifiers (aliases) that map to resources the service is not
itself responsible for
 PURLs are web addresses that act as permanent
identifiers in the face of a dynamic and changing Web
infrastructure
 A request to a PURL returns a 303 HTTP status code and
a location:
 303 (See Other) indicates that a response can be found at the
returned location

The PURL Solution: Continued
 Can use PURL addresses to identify abstract concepts

 Redirect requests to such addresses to an informative
web page (for humans) with a means for machines to
extract other formats

 RDF statements are about proteins, machines can
reason about proteins, and humans resolve protein
identifiers to view informative web pages

 RDF/XML link:

 http://www.uniprot.org/uniprot/P04926.rdf
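
The redirect behaviour can be simulated in memory, without touching the network. The sketch assumes that the PURL for P04926 redirects to a page at `http://www.uniprot.org/uniprot/P04926` (inferred from the RDF/XML link above, not confirmed here); the table and function names are hypothetical.

```python
# Hypothetical PURL table: path of an abstract-thing URI -> describing page.
PURL_TABLE = {
    "/uniprot/P04926": "http://www.uniprot.org/uniprot/P04926",
}

def purl_response(path):
    """Simulate the PURL server: 303 plus a Location header, or 404."""
    target = PURL_TABLE.get(path)
    if target is None:
        return 404, {}
    return 303, {"Location": target}

def interpret(path):
    """What a client may conclude from the response to a GET on `path`."""
    status, headers = purl_response(path)
    if status != 303:
        return None
    page = headers["Location"]
    return {
        # A 303 does not say the requested URI *is* the returned document,
        # so the URI stays free to name the protein itself.
        "identifies": "a thing described elsewhere",
        "human_page": page,
        "rdf_document": page + ".rdf",
    }
```

Because the answer is a redirect rather than the page itself, an automaton has no grounds to conclude that the protein is a web page; it only learns where descriptions of the protein can be found.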

Serendipitous Re-use
 Having a rich repository of protein sequence metadata,
annotations, and taxonomic classification in a
distributed, standard format encourages scientific
collaboration

General UniProt Re-Use Scenario
 User A refers to protein P1 in their dataset
 User A’s dataset doesn’t include statements about P1 (the
host organism for instance)

 User B comes across this dataset and (in order to find
out more about protein P1) puts the URI of protein P1
in their browser and pulls up human-readable
information about it (including the host organism)
 Automaton C comes across the same dataset, fetches
the web page, fetches the RDF about P1 and has access
to the same information as user B and can reason about
the major taxon the host organism belongs to
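
The scenario for automaton C might look like the sketch below. Everything in it is a stand-in: the `example.org` URIs, the `organism` and `parentTaxon` properties, and the `PUBLISHED` dictionary that replaces a real HTTP fetch of the RDF.

```python
EX = "http://example.org/"  # hypothetical namespace
P1 = EX + "protein/P1"

# User A's dataset mentions P1 but says nothing about its host organism.
dataset_a = {(EX + "experiment/1", EX + "observed", P1)}

# Stand-in for the RDF that would be fetched by resolving P1's URI.
PUBLISHED = {
    P1: {
        (P1, EX + "organism", EX + "taxon/species-x"),
        (EX + "taxon/species-x", EX + "parentTaxon", EX + "taxon/genus-y"),
        (EX + "taxon/genus-y", EX + "parentTaxon", EX + "taxon/kingdom-z"),
    }
}

def fetch_rdf(uri):
    """Pretend to dereference `uri`; unknown URIs yield no statements."""
    return PUBLISHED.get(uri, set())

def lineage(graph, taxon):
    """Walk parentTaxon links upward from `taxon`, guarding against cycles."""
    parents = {s: o for s, p, o in graph if p == EX + "parentTaxon"}
    chain = [taxon]
    while chain[-1] in parents and parents[chain[-1]] not in chain:
        chain.append(parents[chain[-1]])
    return chain

def automaton_c(dataset, protein):
    """Merge the dataset with fetched RDF, then find the protein's top taxon."""
    merged = set(dataset)
    for _, _, obj in dataset:
        merged |= fetch_rdf(obj)
    hosts = [o for s, p, o in merged if s == protein and p == EX + "organism"]
    return lineage(merged, hosts[0])[-1] if hosts else None
```

Merging is just a set union because both graphs use the same URI for P1; that shared identifier is what lets the two datasets be combined without prior coordination.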

References

 Wu, C. et al., “The Universal Protein Resource
(UniProt): an expanding universe of protein
information”. Nucleic Acids Research, vol. 34, 2006.

 Swiss Institute of Bioinformatics, “UniProt RDF (project
page)”. http://dev.isb-sib.ch/projects/uniprot-rdf/

 Redaschi, N. and UniProt Consortium, “UniProt in RDF:
Tackling Data Integration and Distributed Annotation”.
Nature Precedings, 3rd International Biocuration
Conference, April 2009.
http://precedings.nature.com/documents/3193/version/1
