RESEARCH ARTICLE
Assignment of protein function and discovery of novel
nucleolar proteins based on automatic analysis of
MEDLINE
Martijn Schuemie1*, Christine Chichester2,3,4*, Frederique Lisacek5, Yohann Coute6,
Peter-Jan Roes1, Jean Charles Sanchez6, Jan Kors1 and Barend Mons1,2,3
1 Biosemantics Group, Medical Informatics Department, Erasmus Medical Center, Rotterdam, The Netherlands
2 Human and Clinical Genetics, Leiden University Medical Center, Leiden, The Netherlands
3 Knewco Inc., Rockville, MD, USA
4 Swiss-Prot Group, Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
5 Proteome Informatics Group, Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
6 Biomedical Proteomics Research Group, Departement de Biologie Structurale et Bioinformatique, Centre Medical Universitaire, Geneva, Switzerland
Attribution of the most probable functions to proteins identified by proteomics is a significant
challenge that requires extensive literature analysis. We have developed a system for automated
prediction of implicit and explicit biologically meaningful functions for a proteomics study of the
nucleolus. This approach uses a set of vocabulary terms to map and integrate the information from
the entire MEDLINE database. Based on a combination of cross-species sequence homology
searches and the corresponding literature, our approach facilitated the direct association between
sequence data and information from biological texts describing function. Comparison of our
automated functional assignment to manual annotation demonstrated our method to be highly
effective. To establish the sensitivity, we defined the functional subtleties within a family contain-
ing a highly conserved sequence. Clustering of the DEAD-box protein family of RNA helicases
confirmed that these proteins shared similar morphology although functional subfamilies were
accurately identified by our approach. We visualized the nucleolar proteome in terms of protein
functions using multi-dimensional scaling, showing functional associations between nucleolar
proteins that were not previously realized. Finally, by clustering the functional properties of the
established nucleolar proteins, we predicted novel nucleolar proteins. Subsequently, non-
proteomics studies confirmed the predictions of previously unidentified nucleolar proteins.
Received: September 14, 2006
Revised: December 18, 2006
Accepted: December 20, 2006
Keywords:
Exosome / Multi-dimensional scaling / Nucleolus / RNA helicase / Sequence alignment
Proteomics 2007, 7, 921–931
DOI 10.1002/pmic.200600693
© 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.com
Correspondence: Dr Christine Chichester, Human and Clinical Genetics, Leiden University Medical
Center, gebouw 2, Einthovenweg 20, 2333ZC Leiden, The Netherlands
E-mail: chichester@knewco.com
Fax: +31-79-593-1601
Abbreviations: AC, UniProtKB/Swiss-Prot accession number; AUC, area under the curve; GO, Gene
Ontology; MDS, multi-dimensional scaling; MeSH, Medical Subject Headings; PARN, poly(A)-specific
ribonuclease; PMID, PubMed ID; Rbm3, RNA-binding protein 3; rpL12, ribosomal protein L12; SC35,
splicing factor, arginine/serine-rich 2 protein; SRP, signal recognition particle; TTF-1,
transcription factor 1
* Both authors have contributed equally to this work.
1 Introduction
There is a high demand for approaches and techniques that
can accelerate the annotation of genes, proteins, and other
entities in the biomedical sciences. The manual annotation
of proteins, such as performed by UniProtKB/Swiss-Prot, is
a demanding process that could benefit enormously from
computer assistance. Here we describe the evaluation of a
novel approach that uses a combination of techniques that
process free text resources to reconstruct and refine the
manual annotation of 585 proteins identified in the human
nucleolus.
The nucleolus is a subnuclear compartment of eukar-
yotic cells that has classically been characterized as the site of
ribosomal RNA (rRNA) synthesis and ribosome biogenesis.
Recently, remarkable progress in understanding the func-
tional organization of the nucleolus has occurred, mainly
due to a direct application of proteomics research which has
enabled the identification of more than 705 proteins [1–4].
These studies and more recent findings [5] have suggested
that the nucleolus may also play a key role in growth and cell
cycle control, tumorigenesis, aging, and sequestration of
nonribosomal macromolecules and the consequent modula-
tion of their molecular pathways [6]. The recent high-
throughput studies have delivered massive amounts of data,
which in turn lead to time-consuming manual examination
of information contained in databases and biomedical litera-
ture to attribute functional characteristics to the proteins
discovered. Many of the protein entities have been previously
described in the literature, thus they need to be correlated
with the established knowledge and eventually categorized
into functional pathways. Many of the nucleolar proteins
identified via proteomics are still classified as “function
unknown.” Proteins that represent new entities present the
challenge of deciphering the biological function without lit-
erature validation specific for the individual human protein.
The first step in these processes is to employ techniques that
support educated inferences by the scientist.
An established method of attributing possible biological
functions to newly discovered proteins is through the use of
sequence alignments. Significant similarity between
sequences implies biological relatedness and often, although
not always, functional similarity. Current prediction of pro-
tein function is based on extrapolation of the information
accumulated for a relatively small set of proteins for which
direct functions have been determined experimentally. Ad-
ditional clues to individual protein function reside in the vast
ocean of the biomedical literature.
Existing methodologies mine functional information by
relying on document classification methods based on the
relationships between a limited set of GO terms and manually
assigned Medical Subject Headings (MeSH) [7], by associat-
ing literature with proteins and GO terms using a dictionary
[8], or sequence homology clustering with a GOA protein [9]
for the transfer of GO terms. The increasing number and
diversity of protein sequence families require new methods to
define and predict details regarding function. Our method
extracts relations between biological entities from the entire
MEDLINE literature database to produce a condensed version
of it, an impossible task for a single scientist. In the present
work, we focus on human nucleolar proteins and their
sequence homologs. The condensed literature is represented
by a set of vocabulary terms taken from a thesaurus built from
the MeSH and several gene and protein databases. These
representations are called concept profiles. Concept profiles
calculated from papers related to human nucleolar proteins
and their sequence homologs are then compared to the con-
cept profiles calculated from papers associated with each GO
term. This comparison functions as a bridge from the molec-
ular function data to the sequence data and allows the associ-
ation of the correct term to each nucleolar protein. For new
sequences known to be homologous to an existing family, but
of unknown function, our method can exploit the information
stored in an unstructured form in literature and complement
the information in structured databases by predicting specific
functions at a high level of detail. We demonstrate the method
by application to a well-characterized protein family, the
DEAD-box RNA helicases.
Particularly important from the perspective of systems
biology, we show how our methodology can generate biolog-
ical leads concerning proteins associated with the exosome
surveillance mechanism. This is possible because the method
captures the correlation between the concept profiles
representing the exosome proteins and the concept profiles
of the exosome functional categories, which points to this
overlooked biological system. Equally relevant to systems biology,
the concept profiles resulting from the condensed literature
are also used to perform a large-scale prediction of new
potential nucleolar proteins; we automatically derived and
aggregated the functional concepts for all known nucleolar
proteins and used these to predict novel nucleolar proteins.
Several of our predictions, although not found by proteomics
analysis of the nucleolus, have been subsequently confirmed
by manual evaluation of published experimental findings.
2 Materials and methods
We automatically assigned GO terms to each of the nucleolar
proteins that were identified by Coute et al. [16] for which
enough information was available. The process is depicted in
Fig. 1 and consists of the following steps.
2.1 Step 1: Creating literature sets per protein
We used the information from several sequence databases to
connect proteins to the PubMed IDs (PMIDs) of relevant articles
about these proteins. For human proteins, we used the
articles referred to in Entrez Gene [10]. For mouse, Drosophila, and
yeast, we used the Mouse Genome Database [11], FlyBase [12],
and the Saccharomyces Genome Database [13], respectively. We
removed articles that were referred to in more than 25 entries,
since these generally focus on sequencing projects and do not
contain functional information for specific proteins.
To provide a potential function for previously undocu-
mented proteins we combined the information found for
similar sequences. This technique is based on the assump-
tion that proteins with high sequence homology will have
similar function and that using all the identified literature,
including cross-species information, will provide more functional
insight than the unique sequence literature alone. We
aligned the protein sequences from UniProtKB/Swiss-Prot
to similar sequences using blastp with the default parameters
[14]. We pooled the literature from human sequences
that matched the query sequence with an E-value of 10^-90 or
lower. For nonhuman sequences, we relaxed the criteria to
sequences with an E-value of 10^-20 or lower to allow for
cross-species differences in sequence that were not related to
the function of the protein.
Figure 1. Overview of the process for assigning GO terms to
proteins. Step 1: A set of PMIDs is extracted from sequence
databases for each protein in the databases. Step 2: A set of
PMIDs is extracted from the GOA database for each GO term.
Step 3: For each set of PMIDs, a concept profile is generated. Step
4: The concept profiles of proteins are compared to the concept
profiles of GO terms using vector matching.
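The pooling rule of Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tuple layout of the BLAST hits, and the dictionary of PMIDs per database entry are hypothetical, and only the two E-value thresholds and the 25-entry cut-off come from the text.

```python
from collections import Counter

HUMAN_MAX_EVALUE = 1e-90      # threshold given in the text for human hits
NONHUMAN_MAX_EVALUE = 1e-20   # relaxed threshold for other species
MAX_ENTRIES_PER_ARTICLE = 25  # articles cited by more entries are dropped


def drop_broad_articles(entry_to_pmids, max_entries=MAX_ENTRIES_PER_ARTICLE):
    """Remove PMIDs that are referred to in more than `max_entries` entries."""
    counts = Counter(p for pmids in entry_to_pmids.values() for p in set(pmids))
    return {
        entry: {p for p in pmids if counts[p] <= max_entries}
        for entry, pmids in entry_to_pmids.items()
    }


def pool_literature(query, blast_hits, entry_to_pmids):
    """Pool the PMIDs of the query and of homologs passing the E-value cut-offs.

    blast_hits: iterable of (subject_id, species, evalue) tuples
    (a hypothetical layout, not the blastp output format).
    """
    pooled = set(entry_to_pmids.get(query, ()))
    for subject, species, evalue in blast_hits:
        limit = HUMAN_MAX_EVALUE if species == "human" else NONHUMAN_MAX_EVALUE
        if evalue <= limit:
            pooled |= set(entry_to_pmids.get(subject, ()))
    return pooled
```

In this sketch the broad sequencing-project articles would be filtered first with `drop_broad_articles`, and the cleaned mapping then passed to `pool_literature` for each query protein.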
2.2 Step 2: Finding articles related to GO terms
The GOA database lists proteins and their corresponding GO
terms, and when applicable, mentions the PMIDs of articles
used for assigning a specific GO term. We used these articles
to represent each GO term. We found that many GO terms
were seldom or never assigned to gene products. We therefore
included only those GO terms for which more than five
PMIDs were found.
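A minimal sketch of the GO term filter in Step 2, assuming the GOA annotations have already been parsed into simple records. The record layout and the function name are hypothetical; only the rule "more than five PMIDs" is taken from the text.

```python
from collections import defaultdict

MIN_PMIDS = 5  # the text keeps GO terms with more than five PMIDs


def go_term_literature(annotations, min_pmids=MIN_PMIDS):
    """Group PMIDs by GO term and keep terms with more than `min_pmids` articles.

    annotations: iterable of (protein_id, go_id, pmid_or_None) records;
    records without a PMID are ignored.
    """
    by_term = defaultdict(set)
    for _protein, go_id, pmid in annotations:
        if pmid is not None:
            by_term[go_id].add(pmid)
    return {go: pmids for go, pmids in by_term.items() if len(pmids) > min_pmids}
```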
2.3 Step 3: Creating concept profiles
Concepts are defined via a thesaurus. A thesaurus contains
terms and synonyms that are used to identify textual refer-
ences to concepts. Our thesaurus was created by combining
biomedical concepts obtained from MeSH, gene product-
specific information obtained from GO, and gene/protein
names obtained from five different databases for four differ-
ent species [15]. There is some overlap between MeSH and
GO, resulting in some concepts being identified twice.
However, this has little consequence for the subsequent pro-
cessing steps.
The next step is to create concept profiles for each protein
and each GO term. A concept profile is a list of concepts with
weights. These weights indicate the association between a
concept and the protein or GO term to which the concept
profile belongs.
For each PMID related to a protein or a GO term, the title
and abstract were retrieved from the MEDLINE database.
Using the thesaurus and the Collexis software, concepts
appearing in the title or abstract were identified.
The concept profiles were extracted from the literature sets
belonging to each protein and each GO term. Concepts that
were found in at least one abstract of a literature set were added
to the concept profile. The weight of each concept was calcu-
lated as follows: for each concept the hypothesis was tested
whether there was a statistical dependency between (a) the
occurrence of that concept in an abstract and (b) whether the
abstract in which the concept occurred was in the literature set.
The weight of a concept in the profile was defined as the log-
likelihood of this test, where a high weight indicated a concept
that was most distinguishing for the abstracts in the set when
compared to all the other abstracts used in this experiment.
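The weighting scheme can be illustrated with a short Python sketch. The text only states that the weight is the log-likelihood of a dependency test; the code below assumes the common log-likelihood ratio (G2) statistic on a 2x2 contingency table, which may differ from the exact statistic used by the authors. All names are hypothetical. Note that G2 is symmetric, so a concept that is strongly under-represented in a literature set would also score highly; because only concepts occurring in the set are added to the profile, this mostly affects very common concepts.

```python
import math
from collections import Counter


def log_likelihood_ratio(k11, k12, k21, k22):
    """G2 statistic for a 2x2 table (0 * log 0 is treated as 0).

    k11: abstracts in the set with the concept, k12: in the set without it,
    k21: outside the set with the concept, k22: outside the set without it.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, rows[0], cols[0]), (k12, rows[0], cols[1]),
             (k21, rows[1], cols[0]), (k22, rows[1], cols[1]))
    g2 = 0.0
    for observed, row_total, col_total in cells:
        if observed > 0:
            g2 += observed * math.log(observed * n / (row_total * col_total))
    return 2.0 * g2


def concept_profile(literature_set, concepts_by_pmid):
    """Weight every concept found in at least one abstract of the set.

    concepts_by_pmid maps each abstract in the experiment to its concepts.
    """
    in_set = [p for p in literature_set if p in concepts_by_pmid]
    n_in, n_total = len(in_set), len(concepts_by_pmid)
    total = Counter(c for cs in concepts_by_pmid.values() for c in set(cs))
    inside = Counter(c for p in in_set for c in set(concepts_by_pmid[p]))
    profile = {}
    for concept, k11 in inside.items():
        k21 = total[concept] - k11
        profile[concept] = log_likelihood_ratio(
            k11, n_in - k11, k21, n_total - n_in - k21)
    return profile
```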
2.4 Step 4: Comparing concept profiles
The concept profiles of the proteins were compared to the
concept profiles of the GO terms using cosine vector match-
ing, resulting in a matching score. The highest matching GO
terms for each protein were considered to be the most rele-
vant GO terms for that protein.
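Cosine vector matching between two sparse concept profiles can be sketched as follows. The profiles are assumed to be dictionaries mapping concept identifiers to weights; the function names and the `top_n` parameter are illustrative and not part of the original system.

```python
import math


def cosine(profile_a, profile_b):
    """Cosine of the angle between two sparse concept-weight vectors."""
    dot = sum(w * profile_b[c] for c, w in profile_a.items() if c in profile_b)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_go_terms(protein_profile, go_profiles, top_n=3):
    """Return the `top_n` GO terms with the highest matching score."""
    scored = [(cosine(protein_profile, prof), go) for go, prof in go_profiles.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first
    return [(go, score) for score, go in scored[:top_n]]
```

Because the cosine is normalized by vector length, a protein with a short literature set and a GO term with a large one can still match well if the same concepts dominate both profiles.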
2.5 Evaluation of functional assignment
To evaluate the performance of our system, we used the
manually defined categorization by Coute et al. [16], who
grouped the nucleolar proteins into nine main categories,
some of which were decomposed further into subcategories. In
collaboration with the creators of that classification, a mapping
between categories and subcategories, and GO terms was
performed manually. This mapping is provided in Table 1. When
the GO terms assigned to a protein by our system matched
the GO terms mapped to the manual categorization, the
automatic assignment was assumed to be correct.
Table 1. Mapping between the functional categories defined by Coute et al. [16] and terms from the Gene Ontology (GO)
Category                                     GO term                                   GO identifier
Chaperones                                   Protein folding                           GO:0006457
Chromatin structure                          Chromatin assembly or disassembly         GO:0006333
Fibrous proteins                             Intermediate filament                     GO:0005882
mRNA metabolism
  3' end cleavage and polyadenylation        mRNA polyadenylylation                    GO:0006378
  Editing                                    mRNA editing                              GO:0006381
  Export/trafficking                         Ribosome–nucleus export                   GO:0000054
  Splicing                                   Nuclear mRNA splicing, via spliceosome    GO:0000398
  Stability/degradation                      mRNA catabolism                           GO:0006402
  Transcription                              Transcription, DNA-dependent              GO:0006351
Others
  Cytoskeleton organization/transport        Cytoskeleton organization and biogenesis  GO:0007010
  DNA repair                                 DNA repair                                GO:0006281
  DNA replication                            DNA replication                           GO:0006260
  Kinases/phosphatases                       Kinase activity                           GO:0016301
  Mitosis/cytokinesis/cell cycle regulation  Mitosis                                   GO:0007067
                                             Cytokinesis                               GO:0000910
                                             Regulation of cell cycle                  GO:0000074
  Nucleocytoplasmic transport                Protein–nucleus export                    GO:0006611
  Other enzymes                              Catalytic activity                        GO:0003824
  Sumoylation                                Protein sumoylation                       GO:0016925
  Ubiquitination/protein degradation         Protein ubiquitination                    GO:0016567
Ribosomal proteins                           Structural constituent of ribosome        GO:0003735
Ribosome biogenesis                          rRNA processing                           GO:0006364
Translation
  Methionyl aminopeptidase activity          Methionyl aminopeptidase activity         GO:0004239
  Regulation of translation                  Regulation of translation                 GO:0006445
  SRP                                        SRP binding                               GO:0005047
  Translation elongation                     Translation elongation                    GO:0006414
  Translation initiation                     Translation initiation                    GO:0006413
  Translation termination                    Translation termination                   GO:0006415
  tRNA methyltransferase activity            tRNA processing                           GO:0008175
Column 1 shows the functional terms originally designated to the nucleolar proteins. The creators of the original
classification selected specific GO terms (Column 2) that were determined to match most closely to their original
categories. Column 3 contains the GO identifiers for the terms found in Column 2.
The performance of the automated assignment was scored
using receiver operating characteristic (ROC) curves; for each
category, all proteins were ranked according to the vector match-
ing score between the protein profile and the profile of the GO
term associated with that category. If more than one GO term
was associated with a category, as is the case for several higher-
level categories, the ranking was based on the sum of the vector
matching scores. Based on the rankings, the area under the
curve (AuC) was calculated as a measure of the similarity be-
tween the manual classification and the automated assignment.
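The ranking-based evaluation can be sketched as below. The area under the ROC curve is computed here through its rank interpretation, the probability that a randomly chosen protein of the category is scored above a randomly chosen protein outside it, with ties counted as one half. The function names and data layout are assumptions made for illustration.

```python
def category_scores(protein_to_go_scores, category_go_terms):
    """Sum the matching scores of all GO terms mapped to one category."""
    return {
        protein: sum(scores.get(go, 0.0) for go in category_go_terms)
        for protein, scores in protein_to_go_scores.items()
    }


def auc(scores, labels):
    """Area under the ROC curve from scores and binary category labels."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of proteins, which is acceptable for a few hundred proteins; a sort-based rank sum would be the usual choice for larger sets.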
2.6 Visualization
We visualized the functional relationships by using a
technique known as multi-dimensional scaling (MDS)
[17]. This technique optimizes the positions of objects, in
this case nucleolar proteins, in a 2-D space to make the
distances in this space as similar as possible to a given
distance measure. Our distance measure was the cosine
vector matching score between the protein concept pro-
files, used as an indication of the functional similarity
between proteins. Thus, in the resulting 2-D MDS display,
proteins that are functionally similar should appear close
together.
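As an illustration of the MDS step in Section 2.6, the sketch below places objects in a 2-D space by iterative stress majorization (the Guttman transform used in SMACOF-style algorithms). This is one standard way to perform metric MDS and is not necessarily the algorithm of reference [17]. Converting a cosine similarity into a dissimilarity as 1 minus the similarity is an assumption of this sketch, as are the function names and the fixed circular starting layout, which is chosen only to make the result reproducible.

```python
import math


def cosine_to_dissimilarity(similarity):
    """Turn a cosine similarity matrix into dissimilarities (assumed: 1 - s)."""
    return [[1.0 - s for s in row] for row in similarity]


def mds_2d(dissim, iterations=300):
    """Place n objects in 2-D so that distances approximate `dissim`.

    dissim: symmetric n x n list of lists with zeros on the diagonal.
    """
    n = len(dissim)
    # Deterministic start: points evenly spaced on a unit circle.
    pts = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
           for i in range(n)]
    for _ in range(iterations):
        new_pts = []
        for i in range(n):
            x = y = 0.0
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(pts[i], pts[j])
                ratio = dissim[i][j] / d if d > 0 else 0.0
                # Guttman transform: pull or push i relative to j.
                x += ratio * (pts[i][0] - pts[j][0])
                y += ratio * (pts[i][1] - pts[j][1])
            new_pts.append((x / n, y / n))
        pts = new_pts
    return pts
```

Each iteration cannot increase the stress, the sum of squared differences between target and displayed distances, but the layout it converges to can depend on the starting configuration.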
3 Results
3.1 Creation of concept profiles
Using the multi-level approach described in Section 2, we created
concept profiles from the scientific literature for 585 of the
704 known nucleolar proteins. For the remaining 119 proteins,
no literature was available, either for the protein itself
or for any of its homologs in Entrez Gene. Sixty-two of these
proteins were labeled function unknown by Coute et al. [16].
For the remaining 57, literature was apparently available
to enable manual annotation, but we restricted our
analysis to proteins for which Entrez Gene references were
available. The numbers of homologs and papers
used for the creation of the 585 protein concept profiles are
displayed in Table 2.
Table 2. Summary of the data used for generating protein concept profiles for all proteins for which at least one article was found
                 Minimum  Maximum  Mean
Homologs used
  Human          0        9        1.66
  Mouse          0        10       1.37
  Fruitfly       0        5        0.70
  Yeast          0        8        1.21
Articles used
  Articles       1        1046     91.31
This table shows the minimum, maximum, and mean number of
concept profiles from homologous proteins and the number of
articles used to create the concept profiles. The number of human
homologs includes the query protein itself, for which a
concept profile was not always available.
3.2 Assignment of functional categories
Each main category of functional classification showed a
significant relationship between the functional term attrib-
uted automatically and the function of the proteins within
the category (Table 3). For each nucleolar protein, the best
matching GO terms and their associated categories were
determined. A complete table is included in Supplemen-
tary Table S1, showing these categories and the vector
matching scores between the GO term profiles and protein
profiles.
The automatically assigned categories did not always
match the manually assigned categories. An investigation
showed three main reasons for the observed discrepancies.
First, each protein is assigned to only one
class by the manually defined categorization, but in reality
several classes could be correct. For instance, SFRS protein
kinase 1 (AC Q8IY12) plays a central role in the
splicing regulatory network and therefore is manually
assigned to the “Splicing” subcategory of “mRNA metabolism.”
Additionally, it is a kinase and thus the GO term
“Kinase activity” is assigned by the automated system as the
primary term.
Table 3. AuC measures correlating the main nucleolar functional categories and subcategories as assigned by Coute et al. [16] with those automatically derived by literature analysis
Category                                     AuC   p
Chaperones                                   1.00  <0.001
Chromatin structure                          0.98  <0.001
Fibrous proteins                             0.97  <0.001
mRNA metabolism                              0.72  <0.001
  3' end cleavage and polyadenylation        0.65  <0.001
  Editing                                    0.68  <0.001
  Export/trafficking                         0.66  <0.001
  Splicing                                   0.74  <0.001
  Stability/degradation                      0.55  0.060
  Transcription                              0.56  <0.001
Others                                       0.81  <0.001
  Cytoskeleton organization/transport        0.69  <0.001
  DNA repair                                 0.72  <0.001
  DNA replication                            0.70  <0.001
  Kinases/phosphatases                       0.58  0.002
  Mitosis/cytokinesis/cell cycle regulation  0.74  <0.001
  Nucleocytoplasmic transport                0.57  0.015
  Other enzymes                              0.65  <0.001
  Sumoylation                                0.73  <0.001
  Ubiquitination/protein degradation         0.67  <0.001
Ribosomal proteins                           0.97  <0.001
Ribosome biogenesis                          0.69  <0.001
Translation                                  0.88  <0.001
  Methionyl aminopeptidase activity          0.56  0.273
  Regulation of translation                  0.77  <0.001
  SRP                                        0.64  0.009
  Translation elongation                     0.80  <0.001
  Translation initiation                     0.80  <0.001
  Translation termination                    0.83  <0.001
  tRNA methyltransferase activity            0.55  0.331
This table gives the AuC summary measure for the ROC curves
used to describe the capacity of our method to correctly discriminate
between the possible functional categories for each
protein. For each functional category one or more GO terms were
selected that best represented that category. The membership of
a protein to a category was calculated by determining the distance
between the profile belonging to the GO term and the protein
profile. An AuC of 0.5 represents a random assignment of the
functional category and 1.0 represents a perfect assignment.
Second, words occurring in the documents related to a
protein do not always indicate its function. For example,
Structural maintenance of chromosome 3 (AC Q9UQE7) is
involved in chromosome cohesion during cell cycle, and is
primarily assigned by the system to the functional category
“mitosis” instead of the category “chromatin structure,” be-
cause the word mitosis is often mentioned in the articles
related to this protein.
Third, in some cases ambiguity in language introduces
errors. For instance, the ambiguous term “factor 1” is a
synonym of the protein transcription factor 1 (TTF-1, AC
Q15361). This term is erroneously recognized in the name
of the protein cleavage and polyadenylation specificity factor,
160 kDa subunit (CPSF1, AC Q10570), which in the literature
may be referred to as the protein complex polyadenylation
factor 1. The TTF-1 protein is therefore assigned to many
papers related to the GO term “mRNA polyadenylylation”
and incorporated in the concept profile of that GO term,
resulting in the faulty assignment of the GO term to TTF-1.
Efforts are underway to remove these errors from the system
following the method described by Schijvenaars et
al. [18].
3.3 Functional categories of DEAD-box motif proteins
As a case study to investigate whether our system has added
value over purely sequence based annotations, we con-
structed a multiple sequence alignment of the nucleolar
members of the DEAD-box proteins (Fig. 2) using the tool
ClustalW [19]. The aligned proteins were all related by the
DEAD-box element, nine conserved sequence motifs
associated with RNA-dependent ATPase and ATP-dependent RNA
helicase activities. Although the specific function of many
DEAD-box proteins is uncertain, they are generally found to
play important roles in RNA metabolism by affecting splicing,
ribosome biogenesis, RNA turnover, and mRNA export.
The proteins in the alignment were divided into three clusters.
Assuming most of the functional aspects of the
sequences matching the DEAD-box family are known, we
attributed to each cluster the functions, from a limited set of
manually chosen GO terms, most likely possessed by those
proteins. Given the fact that these proteins are all related by
the DEAD-box motif [20], a simple transfer of sequence
annotations would assign the same functional attributes to
all DEAD-box proteins, as seen in protein sequence classification
databases [21]. At the highest level of functional
assignment our system behaved in this manner, as predicted.
Figure 2. Multiple sequence alignment of nucleolar DEAD-box
proteins and the automatic assignment of functional attributes.
Each of the proteins in the alignment is represented by the gene
name followed by the UniProtKB/Swiss-Prot accession number
in parentheses. The alignment shows three main clusters. The
primary assigned term, ATP-dependent RNA helicase activity,
which represents the generally accepted activity of the DEAD-box
motif, is assigned to all clusters. Differences between the clusters
are seen at the level of the secondary terms.
Assignment of functional attributes showed that all of the
clusters possessed ATP-dependent RNA helicase activity as
the major functional category (Fig. 2). However, at the level of
the secondary term, the clusters appeared to have the diver-
gent functions: translational initiation, mRNA-nucleus
export, and ribosome biogenesis (Fig. 2). Manual investiga-
tion showed that within full text articles there was evidence
to demonstrate that the automated assignment of the specific
concept was correct for many of the proteins in each cluster.
This fulfills an important goal in high-throughput proteom-
ics. The automated incorporation of prior knowledge across
species supports the generation of functional predictions at a
level more refined than the simple transfer of annotations
based on sequence similarities alone. The propagation of
erroneous assignments based on repeated reference in the
literature to a singular alignment probably plays a minor role
in literature derived automated annotation. We show that the
system is not limited to the highest level designation, and
most of the protein profiles are based on a large number of
papers (96% of the profiles are based on more than 20
papers), thus marginalizing errors. We have no indication
that this bias plays more than an inconsequential role, a role
it would equally play in manual annotation. Moreover, the
integration of automated literature derived information facilitates
prediction strategies based on the analysis of diverse sources
of information, which if viewed individually may provide
incomplete or inconsistent representation of a biological
phenomenon.
3.4 MDS display
MDS is a technique for representing high-dimensional ob-
jects in typically two dimensions. In the main MDS display
(Fig. 3A), the proteins of each of the nucleolar functional
categories are represented by a different color and shape. The
MDS display appears to capture the correlation between the
concepts representing the proteins and the underlying features
of the functional categories. The proteins in the functional
categories of chaperones, chromatin structure,
fibrous proteins, cytoskeleton organization/transport, DNA
repair, DNA replication, kinases/phosphatases, mitosis/
cytokinesis/cell cycle regulation, nucleocytoplasmic transport,
ubiquitination–protein degradation, and ribosomal
proteins appeared to cluster essentially along functional
category lines. Some categories featured a few outliers
from their respective clusters. The proteins in the functional
categories of ribosome biogenesis, mRNA metabolism, and
translation appeared to cluster but had more dispersion than
other categories.
Outliers from the principal cluster are of special interest
when they appear to fall directly within a different functional
cluster. For example, the ribosomal protein, 40S ribosomal
protein S27a (AC P62979) is located centrally within a group
of ubiquitination and sumoylation proteins. To make the
position of this outlier in relation to these groups and the
other ribosomal proteins more apparent, we have displayed
only the relevant proteins (Fig. 3B). Note that the MDS technique
attempts to optimize the positions of the proteins to
reflect the relationships between the proteins, and that displaying
a subset of the proteins will result in a slightly different
figure as compared to the complete MDS (Fig. 3A).
This is because only the relationships of the proteins being
displayed are taken into account.
Figure 3. MDS display of nucleolar proteins. The automatically derived concept profile for each
protein is used to place each point on the map. MDS achieves a drastic reduction in dimensionality
and therefore may not visually preserve the strong separation of distinct clusters in the original
high dimensional space. The manually described functional categories for each protein are
represented by different colors and shapes. (A) The map of the complete set of the 700 nucleolar
proteins is shown with an inset of the exosome cluster. In the inset, the proteins belonging to the
exosome functional category are circled. The three proteins detailed in the text, exosome
component 10, PARN, and SRP9 are indicated. The three function unknown proteins described in the
text are indicated by their UniProtKB/Swiss-Prot accession numbers (O43390, P98179, Q8N220).
(B) The clusters of the nucleolar proteins from the functional categories of ubiquitin–protein
degradation and sumoylation were evaluated in combination with the ribosomal protein category to
investigate the placement of the ribosomal S27a protein (AC P62979).
Analysis of the ubiquitination and sumoylation func-
tional clusters shows that like ubiquitination, sumoylation
(small ubiquitin-related modifier modification) modulates
protein function through post-translational covalent attach-
ment to lysine residues within targeted proteins. Over the
past few years, it has become apparent that the two mod-
ification systems often communicate and jointly affect the
properties of common substrate proteins, sometimes even
involving the same lysine residue. This justifies their close
association on the MDS diagram. The placement of the 40S
ribosomal protein S27a within these clusters is explained
by the fact that ubiquitin is synthesized in eukaryotes as
a linear fusion either to itself or to one of two carboxy-terminal
extension proteins (CEPs). The 40S ribosomal protein
S27a is the nucleolar protein that is generated during the
post-translational processing events which generate the
ribosomal forms of CEPs [22].
Many proteins appear to be multi-functional and therefore,
in annotation, can correctly be attributed to a variety
of different functional classes. In the original manual classification
each nucleolar protein was associated with only one
category and the selection of category depended greatly upon
the bias of the annotator. Coute et al. [16] have leaned toward
finding the link to ribosome biogenesis for the classification
of proteins since the nucleolus is classically known as the site
of ribosome biogenesis. They have also reduced the granular-
ity to facilitate the annotation by limiting the number of cate-
gories to the main functional roles of the proteins.
The MDS visualization can offer an interpretation of the
functional categorization that is potentially different from the bias
of the scientist and can potentially suggest unrealized categories.
Evaluation of the ribosome biogenesis proteins
(Fig. 3A, purple dots) showed a small group of exosome proteins
located away from the principal cluster. The exosome
has been implicated in both biogenesis and degradation of
many RNA species including rRNA. Inclusion of exosome
proteins in ribosomal biogenesis is thus warranted as is the
separation from the main cluster due to their role in RNA
degradation. In close proximity to the exosome proteins is
the 3′-exoribonuclease that preferentially cleaves poly(A) tails,
poly(A)-specific ribonuclease (PARN, AC O95453), which was manually
assigned to the functional category mRNA metabolism, stability and
degradation. This is interesting in light
of the recent discovery in Saccharomyces cerevisiae that all of
the rRNAs can be polyadenylated at a certain level [23];
PARN, along with exosome component 10 (AC
Q01780), could therefore have a role in removing these poly(A) tails.
Additionally, the exosome group contains a protein manually
classified in the translation category. The exosome might
play a role in the biogenesis of the small RNA component of
the cytoplasmic signal recognition particle (SRP), which is
assembled in the nucleolus of eukaryotic cells [24, 25]. Due to
this association, the SRP 9 kDa protein is also found within
the exosome cluster (Fig. 3A, inset). The clustering of these
proteins from varied functional categories, but with an
underlying functional cohesion, suggests another possible
category, exosome.
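To make the geometric reading of the MDS diagram concrete, the following minimal sketch embeds a handful of concept profiles in two dimensions with classical (Torgerson) MDS. It is an illustration only: the profile weights, the protein labels, the cosine-based distance, and the choice of classical MDS are assumptions made here and are not necessarily the settings used to produce Fig. 3A.

```python
import numpy as np

def classical_mds(dist, dims=2):
    """Embed points in `dims` dimensions from a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain a Gram matrix.
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (d ** 2) @ centring
    # Keep the eigenvectors that belong to the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Hypothetical toy profiles: rows are proteins, columns are concept weights.
names = ["exosome_a", "exosome_b", "ribosomal_a", "ribosomal_b"]
profiles = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
])

# Distance between length-normalized profiles (assumed measure).
unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
distance = np.sqrt(np.clip(2.0 - 2.0 * (unit @ unit.T), 0.0, None))

coords = classical_mds(distance)
for name, (x, y) in zip(names, coords):
    print(f"{name}: ({x:.3f}, {y:.3f})")
```

In such an embedding, proteins whose profiles share concepts are placed close together, which is the property used above when a protein from one manual category is found next to the exosome group.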
3.5 Elucidating function unknown proteins from
MDS clusters
Coute et al. [16] have classified a number of proteins as
function unknown because explicit experiments to deter-
mine their function have not yet been carried out. Several of
these function unknown proteins appeared to fall within the
cluster analyzed for exosome proteins (Fig. 3A). Literature
concerning these “unknown” proteins was evaluated to cor-
relate their position on the MDS with the other clustered
proteins.
The heterogeneous nuclear ribonucleoprotein (hnRNP)
R (AC O43390, Fig. 3A, inset) was found near the exosome
proteins. hnRNPs are among the most abundant proteins in
the eukaryotic cell nucleus and play a direct role in several
aspects of the RNA life including pre-mRNA splicing,
mRNA transport and mRNA translation, stability, and turn-
over. It is commonly accepted that these proteins bind nas-
cent RNA in a transcript-specific assembly and that both
protein–RNA and protein–protein interactions govern the
final outcome of the ribonucleoprotein fiber, which forms
the substrate of the ensuing processing events [26, 27]. The
cooperative interactions between the proteins that assemble
onto an intricate splicing-regulatory sequence and the
hnRNP assembly are altered in different cell types by incor-
porating different but highly related proteins [28]. The KH-
type splicing-regulatory protein (KSRP) and the hnRNP each
bind to distinct sequence elements of specific transcripts to
mediate RNA binding, mRNA decay, and interactions with
the exosome and PARN [29]. It has been suggested that
NSAP1, which shares 80% homology with hnRNP R, may be
a common factor recruited by different destabilizing ele-
ments for directing the deadenylation of transcripts. This
suggests a possible role for hnRNP R with exosome proteins.
Another protein classified as function unknown found in
the exosome proximity is the RNA-binding protein 3 (Rbm3,
AC P98179). This protein belongs to the hnRNP subgroup of
RNA-binding proteins [30] that contain a single RNA recog-
nition motif (RRM) and glycine-rich tail. Although the func-
tion of this family of proteins is not yet precisely known, it
has been suggested that they affect both transcription and
translation [31]. Additionally, it has been hypothesized that
microRNAs are part of a homeostatic mechanism that con-
trols global protein synthesis [32] and play a crucial role in
post-transcriptional gene silencing. Dresios et al. [33] found
that Rbm3 had a major effect on the relative abundance of
microRNA-containing complexes in cells and that the
expression of an Rbm3 fusion protein led to a decrease in
microRNA levels. Since the cellular levels of different RNA
molecules must be tightly controlled, the inclusion of Rbm3
in the cluster with the exosome proteins involved in main-
taining correct RNA levels is justifiable.
One additional protein of the function unknown category
found associated with the exosome cluster was the
hypothetical protein FLJ36307 (AC Q8N220). Upon detailed
inspection of the sequence data, this protein appears to be a
new splice variant of the splicing factor, arginine/serine-rich
2 protein (SC35, AC Q01130) (Supplementary Fig. S1). SC35
belongs to the family of SR proteins that regulate alternative
splicing in a concentration-dependent manner in vitro and in
vivo. A striking feature of most of the SR protein genes is that
they express alternative mRNAs encoding truncated proteins
whose biological roles remain to be established [34]. It has
been reported that SC35 overexpression leads to a pro-
nounced decrease of its endogenous mRNA levels in HeLa
cells [35]. Moreover, SC35 accumulation is also correlated to
changes in the splicing pattern of its own mRNAs. Finally, it
appears that SC35 autoregulates its expression by promoting
alternative splicing events that alter the stability of its own
mRNAs, which possibly makes them targets of the exosome
surveillance mechanism.
In summary, the correlations between the concepts
representing the proteins and the underlying features of the
functional categories are explicit enough to suggest coherent
mechanistic descriptions for exosome proteins that were
manually classified in other functional categories or as function
unknown. The system makes it possible to evaluate
proteins in the absence of
personal bias so that associations previously hidden from the
researcher can be found.
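The reasoning used in this section, where a function unknown protein is interpreted through the annotated proteins that lie closest to it, can be sketched as a nearest-neighbor lookup over concept profiles. This is a minimal sketch under stated assumptions: the cosine measure, the function names, and all weights below are hypothetical and chosen for illustration, not taken from the system described here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse concept profiles (concept -> weight)."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_annotated(query, annotated, k=2):
    """Return the k annotated proteins most similar to the query profile."""
    scored = [(cosine(query, profile), name, category)
              for name, (category, profile) in annotated.items()]
    # Highest similarity first; ties are broken alphabetically by name.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(name, category, round(score, 3)) for score, name, category in scored[:k]]

# Hypothetical toy data: manual category plus invented concept weights.
annotated = {
    "exosome component 10": ("ribosome biogenesis",
                             {"rna degradation": 0.9, "exosome": 0.8}),
    "PARN": ("mRNA metabolism",
             {"rna degradation": 0.7, "deadenylation": 0.9}),
    "rpS3": ("ribosomal protein",
             {"ribosome": 0.9, "translation": 0.6}),
}
unknown = {"deadenylation": 0.6, "rna degradation": 0.5, "splicing": 0.3}

neighbours = nearest_annotated(unknown, annotated)
for name, category, score in neighbours:
    print(name, category, score)
```

The neighbors returned for an uncharacterized protein only indicate which literature to read first; the functional interpretation itself still rests on the cited experimental evidence.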
3.6 Prediction of novel nucleolar proteins
The concept profiles of the 90 proteins determined to have a
nucleolar location as part of their functional spectrum by all
four proteomic studies described in Coute et al. [16] were
used to derive a typical “nucleolar profile.” The proteins
selected to collectively form this profile either were localized in the
human nucleolus by direct visualization methods,
such as antibody detection, protein tagging with green fluorescent
protein, or electron microscopy, or were thought to be
involved in ribosome biogenesis. These profiles were compared
to the profiles of all the other human proteins in the
UniProtKB/Swiss-Prot database. The resulting matching
scores were aggregated, and the overall 14 best matching
proteins are shown in Table 4.

Table 4. Prediction of novel nucleolar proteins

AC       Protein name                        Description
Q92901   60S ribosomal protein L3-like       Ribosomal protein
Q9NQI0   DDX4                                DEAD-box protein
O15523   DDX3Y                               DEAD-box protein
Q9UI26   Importin-11                         Interacts with nucleolar protein rpL12 [38]
Q9UHI6   DDX20                               DEAD-box protein
P05388   60S acidic ribosomal protein P0     Ribosomal protein
Q15477   Helicase SKI2W                      Shown to be nucleolar [41]
P60866   40S ribosomal protein S20           Ribosomal protein
P26196   DDX6                                DEAD-box protein
O95793   STAU1                               Shown to be nucleolar [40]
Q9UHL0   DDX25                               DEAD-box protein
Q14240   Eukaryotic initiation factor 4A-II  DEAD-box protein
Q9UMR2   DDX19B                              DEAD-box protein
P23396   40S ribosomal protein S3            Ribosomal protein

This table shows the top 14 predicted nucleolar proteins. Columns
1 (AC) and 2 (Protein name) contain the accession number
and the protein name. Column 3 contains a brief description of
the protein or the citation containing evidence for nucleolar
localization.

In the groups of the highest
scoring proteins, many represent members of the DEAD/
DEAH-box protein and the ribosomal protein families. These
families have a precedent for nucleolar localization [36, 37]
and thus are not unexpected as potential nucleolar proteins,
although explicit nucleolar localization has not yet been
demonstrated. In addition, two of the potential nucleolar
DEAD-box proteins, ATP-dependent RNA helicase DDX3Y
(AC O15523) and eukaryotic initiation factor 4A-II (AC
Q14240), share a high degree of sequence similarity with two
proteins, ATP-dependent RNA helicase DDX3X (AC
O00571) and eukaryotic initiation factor 4A-I (AC P60842),
respectively, both of which have previously been identified as
nucleolar proteins by proteomics [16]. The appearance of
three of the top 14 proteins not classified in the DEAD box or
ribosomal protein families was justified by a literature search
of nonproteomic eukaryotic studies.
Importin-11 (AC Q9UI26) serves as a transport receptor
for the ribosomal protein L12 (rpL12) [38], a nucleolar pro-
tein. This interaction with the nucleolar protein, rpL12,
rationalizes its inclusion. Double-stranded RNA-binding
protein staufen homolog 1 (AC O95793) also interacts with a
nucleolar molecule, telomerase RNA, and has been inde-
pendently visualized by immunofluorescence in nucleoli [39,
40]. Additionally, immunofluorescence and colocalization
with the nucleolar protein nucleophosmin have been used to
detect helicase SKI2W (AC Q15477) in nucleoli [41].
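The ranking procedure described at the start of this section (compare each candidate with the reference nucleolar profiles, aggregate the matching scores, keep the best matches) can be sketched as follows. This is a hedged illustration: the cosine matching score, the use of the mean as the aggregate, the function names, and the toy profiles are assumptions introduced here, since the text above does not fix these details.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse concept profiles (concept -> weight)."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(reference_profiles, candidates, top_n=14):
    """Rank candidates by their mean matching score against all references."""
    ranked = []
    for name, profile in candidates.items():
        scores = [cosine(profile, ref) for ref in reference_profiles]
        ranked.append((sum(scores) / len(scores), name))
    # Best aggregate score first; ties are broken alphabetically by name.
    ranked.sort(key=lambda t: (-t[0], t[1]))
    return [name for _, name in ranked[:top_n]]

# Hypothetical stand-ins for the reference nucleolar proteins.
references = [
    {"ribosome biogenesis": 0.9, "nucleolus": 0.8},
    {"rna helicase": 0.7, "nucleolus": 0.9},
]
# Hypothetical candidates drawn from the rest of the proteome.
candidates = {
    "candidate_helicase": {"rna helicase": 0.8, "nucleolus": 0.3},
    "candidate_ribosomal": {"ribosome biogenesis": 0.9, "translation": 0.4},
    "candidate_kinase": {"phosphorylation": 0.9},
}

top = rank_candidates(references, candidates, top_n=2)
print(top)
```

With a real proteome, `candidates` would hold one profile per protein outside the reference set and `top_n` would be 14 to mirror Table 4; other aggregates, such as the maximum or the median, would be equally plausible choices.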
4 Discussion
Both the unmanageable amount of information waiting
to be incorporated into curated databases and the rate
at which new information is produced today emphasize
the need for the system presented here. We demonstrate a
technique that allows a rapid and accurate assignment of
functions to individual proteins and clusters proteins/genes
at the functional level. Such approaches can drastically de-
crease the time needed for manual annotation, but also
reveal functional clusters and correlations between result
sets that are not easily discernable at the individual gene
level. The techniques and approaches we propose are widely
applicable for genomics microarrays and proteomics re-
search as well as investigation of individual proteins in spe-
cific pathways.
The manual annotation of many biological objects and
molecules for databases, such as UniProtKB/Swiss-Prot and
GO, is crucial for modern complex biological research.
However, manual annotation by professionals is increasingly
challenging because of the information overload. With
greater than 200 000 proteins presently in UniProtKB/Swiss-
Prot and only slightly more than 70 annotators, it becomes
quite a daunting task to keep up with the existing informa-
tion, let alone to face the millions of proteins still to be
annotated. Our system includes more literature than
humanly possible and presents a concise overview of the
most important concepts associated with a given protein. In
regard to the 704 proteins of the nucleolar proteome, the method
has reproduced the annotation of a 6-month literature
search with high accuracy in only a couple of hours. Appar-
ently, this wide coverage of literature compensates for the
fact that our system only used abstracts from MEDLINE,
while Coute et al. [16] had the full text articles at their dis-
posal. Evidently, UniProtKB/Swiss-Prot annotation will
remain the necessary standard for many years to come, and
therefore our system can function as computer assistance
to speed up, widen the scope of, and increase the objectivity of
annotation.
Data emerging from any single species or experimental
approach provides only one perspective on protein function.
Combining the literature on the basis of similarities in sequence
alignment increases the scope of evaluation even further, in
that it may suggest the function of a protein in one species
based on conserved sequences in multiple other species for
which more information is available in the specialized liter-
ature. Certainly, the simple transfer of sequence annotations
can wrongly assign functional attributes but our meta-analy-
sis method makes it possible to go beyond identification of
the strict sequence associated activities to determine the
subtle functional subtypes as seen with the different clusters
of the family of DEAD-box proteins.
The aggregation of information over multiple documents
has great potential for actual discovery of knowledge in
genomics and proteomics, which are fundamentally net-
work-based systems biology disciplines. Complex cell prop-
erties such as those that control the exosome arise from net-
works of molecular interactions. Control of these networks
involves cellular processes at multiple levels, including
mRNA and protein quantities, and their molecular modifications,
localization, and interactions. The integration of
information from disparate literature and subsequent automatic
analysis enabled the elucidation of information that
may not even be explicit in the literature. This is exemplified
by our functional prediction for proteins previously identi-
fied as function unknown by Coute et al. [16]. Information
from multiple levels was extracted from the literature and
integrated to suggest biological functions for these less well-
characterized proteins. Moreover, a new splice variant of
SC35 was discovered due to inferences from this system.
Systems biology research is characterized by the devel-
opment and application of technologies that allow the mon-
itoring of biology at the system level. In our study, based on
the aggregation of the nucleolar protein functional profiles,
other proteins within the nucleolar system were identified.
Systems biology predicts that the protein composition of the nucleolus
may vary as the biological processes respond to specific
conditions. Ninety proteins are consistently found in the
nucleolus by proteomic studies indicating that they are loca-
ted in the nucleolus at least for part of their functional life
span. Although these proteins have a wide range of func-
tions, the aggregated conceptual information appeared to
produce correct predictions of other proteins that may be
related to this biological system. Additionally, we have pre-
liminary evidence from other groups using the same
approach that the profiles from a focused group of proteins
such as those centered on a property like pathogenicity are
also powerful tools to predict the additional proteins
involved.
Our method for the extraction of biological information
from literature associated with high-throughput data appears
to accurately represent biological phenomena due to the fact
that we:
(i) exploit sequence similarities at a level more refined
than the simple transfer of annotations,
(ii) correctly capture relationships between the functional
terms attributed automatically and the function of the pro-
tein, and
(iii) visualize relationships by clustering proteins using
their underlying functional cohesion.
This process is essentially a generic tool for network-
based systems biology, and we have no reason to believe that
the approach described here will be any less powerful with
different biological datasets. We are currently working on a
number of features that systematize the approach and make
the steps more user-friendly. We anticipate having an open
access prototype system online early 2007, to be used by the
scientific community at large, which will be made available
through our website: http://www.biosemantics.org.
This study was supported by the Biorange project sp 4.1.1.
from the Netherlands Bioinformatics Centre, by Leiden Uni-
versity Medical Centre, Erasmus University Medical Centre, and
Knewco Inc., and was conducted within the Centre for Medical
Systems Biology (CMSB), established by the Netherlands
Genomic Initiative/Netherlands Organisation for Scientific
Research (NGI/NWO).
5 References
[1] Andersen, J. S., Lyon, C. E., Fox, A. H., Leung, A. K. et al., Curr.
Biol. 2002, 12, 1–11.
[2] Scherl, A., Coute, Y., Deon, C., Calle, A. et al., Mol. Biol. Cell
2002, 13, 4100–4109.
[3] Andersen, J. S., Lam, Y. W., Leung, A. K., Ong, S. E. et al.,
Nature 2005, 433, 77–83.
[4] Vollmer, M., Horth, P., Rozing, G., Coute, Y. et al., J. Sep. Sci.
2006, 29, 499–509.
[5] Raska, I., Shaw, P. J., Cmarko, D., Curr. Opin. Cell Biol. 2006,
18, 325–334.
[6] Olson, M. O., Dundr, M., Histochem. Cell Biol. 2005, 123, 203–
216.
[7] Raychaudhuri, S., Chang, J. T., Sutphin, P. D., Altman, R. B.,
Genome Res. 2002, 12, 203–214.
[8] Koike, A., Niwa, Y., Takagi, T., Bioinformatics 2005, 21, 1227–
1236.
[9] Xie, H., Wasserman, A., Levine, Z., Novik, A. et al., Genome
Res. 2002, 12, 785–794.
[10] Maglott, D., Ostell, J., Pruitt, K. D., Tatusova, T., Nucleic
Acids Res. 2005, 33, D54–D58.
[11] Blake, J. A., Richardson, J. E., Bult, C. J., Kadin, J. A., Eppig,
J. T., Nucleic Acids Res. 2003, 31, 193–195.
[12] Drysdale, R. A., Crosby, M. A., Nucleic Acids Res. 2005, 33,
D390–D395.
[13] Dwight, S. S., Balakrishnan, R., Christie, K. R., Costanzo, M.
C. et al., Brief. Bioinform. 2004, 5, 9–22.
[14] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. et al.,
Nucleic Acids Res. 1997, 25, 3389–3402.
[15] Schuemie, M. J., Mons, B., Weeber, M., Kors, J. A., J.
Biomed. Inform., In Press.
[16] Coute, Y., Burgess, J. A., Diaz, J. J., Chichester, C. et al., Mass
Spectrom. Rev. 2006, 25, 215–234.
[17] Borg, I., Groenen, P., Modern Multidimensional Scaling:
Theory and Applications, Springer, New York 1997.
[18] Schijvenaars, B. J., Mons, B., Weeber, M., Schuemie, M. J. et
al., BMC Bioinformatics 2005, 6, 149.
[19] Thompson, J. D., Higgins, D. G., Gibson, T. J., Nucleic Acids
Res. 1994, 22, 4673–4680.
[20] Schmid, S. R., Linder, P., Mol. Microbiol. 1992, 6, 283–291.
[21] Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L. et al., Nucleic
Acids Res. 2006, 34, D227–D230.
[22] Williamson, N. A., Raliegh, J., Morrice, N. A., Wettenhall, R.
E., Eur. J. Biochem. 1997, 246, 786–793.
[23] Kuai, L., Fang, F., Butler, J. S., Sherman, F., Proc. Natl. Acad.
Sci. USA 2004, 101, 8581–8586.
[24] Grosshans, H., Deinert, K., Hurt, E., Simos, G., J. Cell. Biol.
2001, 153, 745–762.
[25] Politz, J. C., Yarovoi, S., Kilroy, S. M., Gowda, K. et al., Proc.
Natl. Acad. Sci. USA 2000, 97, 55–60.
[26] Bennett, M., Pinol-Roma, S., Staknis, D., Dreyfuss, G., Reed,
R., Mol. Cell Biol. 1992, 12, 3165–3175.
[27] Matunis, E. L., Matunis, M. J., Dreyfuss, G., J. Cell Biol. 1993,
121, 219–228.
[28] Markovtsov, V., Nikolic, J. M., Goldman, J. A., Turck, C. W. et
al., Mol. Cell Biol. 2000, 20, 7463–7479.
[29] Gherzi, R., Lee, K. Y., Briata, P., Wegmuller, D. et al., Mol. Cell
2004, 14, 571–583.
[30] Wellmann, S., Buhrer, C., Moderegger, E., Zelmer, A. et al., J.
Cell Sci. 2004, 117, 1785–1794.
[31] Wright, C. F., Oswald, B. W., Dellis, S., J. Biol. Chem. 2001,
276, 40680–40686.
[32] Bartel, D. P., Cell 2004, 116, 281–297.
[33] Dresios, J., Aschrafi, A., Owens, G. C., Vanderklish, P. W. et
al., Proc. Natl. Acad. Sci. USA 2005, 102, 1865–1870.
[34] Screaton, G. R., Caceres, J. F., Mayeda, A., Bell, M. V. et al.,
EMBO J. 1995, 14, 4336–4349.
[35] Sureau, A., Gattoni, R., Dooghe, Y., Stevenin, J., Soret, J.,
EMBO J. 2001, 20, 1785–1796.
[36] Leung, A. K., Lamond, A. I., Crit. Rev. Eukaryot. Gene Expr.
2003, 13, 39–54.
[37] Rocak, S., Linder, P., Nat. Rev. Mol. Cell Biol. 2004, 5, 232–
241.
[38] Plafker, K., Macara, I. G., J. Biol. Chem. 2002, 277, 30121–
30127.
[39] Marion, R. M., Fortes, P., Beloso, A., Dotti, C., Ortin, J., Mol.
Cell Biol. 1999, 19, 2212–2219.
[40] Le, S., Sternglanz, R., Greider, C. W., Mol. Biol. Cell 2000, 11,
999–1010.
[41] Qu, X., Yang, Z., Zhang, S., Shen, L. et al., Nucleic Acids Res.
1998, 26, 4068–4077.
Personalised Luxury Writing Paper By Able Labels Not
 
Homework Help Best Topics For An Argumentative Essa
Homework Help Best Topics For An Argumentative EssaHomework Help Best Topics For An Argumentative Essa
Homework Help Best Topics For An Argumentative Essa
 
🌈 Essay Writing My Teacher. Essay On My
🌈 Essay Writing My Teacher. Essay On My🌈 Essay Writing My Teacher. Essay On My
🌈 Essay Writing My Teacher. Essay On My
 
Guide To The 2019-20 Columbia University Suppl
Guide To The 2019-20 Columbia University SupplGuide To The 2019-20 Columbia University Suppl
Guide To The 2019-20 Columbia University Suppl
 
Help Writing Papers For College - The Best Place T
Help Writing Papers For College - The Best Place THelp Writing Papers For College - The Best Place T
Help Writing Papers For College - The Best Place T
 
Essay Def. What Is An Essay The Definition And Main Features Of
Essay Def. What Is An Essay The Definition And Main Features OfEssay Def. What Is An Essay The Definition And Main Features Of
Essay Def. What Is An Essay The Definition And Main Features Of
 

Recently uploaded

Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Romantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptxRomantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptxsqpmdrvczh
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxLigayaBacuel1
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 

Recently uploaded (20)

Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Romantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptxRomantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptx
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptx
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 

Assignment Of Protein Function And Discovery Of Novel Nucleolar Proteins Based On Automatic Analysis Of MEDLINE

RESEARCH ARTICLE

Assignment of protein function and discovery of novel nucleolar proteins based on automatic analysis of MEDLINE

Martijn Schuemie1*, Christine Chichester2,3,4*, Frederique Lisacek5, Yohann Coute6, Peter-Jan Roes1, Jean Charles Sanchez6, Jan Kors1 and Barend Mons1,2,3

1 Biosemantics Group, Medical Informatics Department, Erasmus Medical Center, Rotterdam, The Netherlands
2 Human and Clinical Genetics, Leiden University Medical Center, Leiden, The Netherlands
3 Knewco Inc., Rockville, MD, USA
4 Swiss-Prot Group, Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
5 Proteome Informatics Group, Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
6 Biomedical Proteomics Research Group, Departement de Biologie Structurale et Bioinformatique, Centre Medical Universitaire, Geneva, Switzerland

Attribution of the most probable functions to proteins identified by proteomics is a significant challenge that requires extensive literature analysis. We have developed a system for automated prediction of implicit and explicit biologically meaningful functions for a proteomics study of the nucleolus. This approach uses a set of vocabulary terms to map and integrate the information from the entire MEDLINE database. Based on a combination of cross-species sequence homology searches and the corresponding literature, our approach facilitated the direct association between sequence data and information from biological texts describing function. Comparison of our automated functional assignment to manual annotation demonstrated our method to be highly effective. To establish the sensitivity, we defined the functional subtleties within a family containing a highly conserved sequence. Clustering of the DEAD-box protein family of RNA helicases confirmed that these proteins shared similar morphology although functional subfamilies were accurately identified by our approach.
We visualized the nucleolar proteome in terms of protein functions using multi-dimensional scaling, showing functional associations between nucleolar proteins that were not previously realized. Finally, by clustering the functional properties of the established nucleolar proteins, we predicted novel nucleolar proteins. Subsequently, non-proteomics studies confirmed the predictions of previously unidentified nucleolar proteins.

Received: September 14, 2006
Revised: December 18, 2006
Accepted: December 20, 2006

Keywords: Exosome / Multi-dimensional scaling / Nucleolus / RNA helicase / Sequence alignment

Proteomics 2007, 7, 921–931
DOI 10.1002/pmic.200600693
© 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
www.proteomics-journal.com

Correspondence: Dr Christine Chichester, Human and Clinical Genetics, Leiden University Medical Center, gebouw 2, Einthovenweg 20, 2333ZC Leiden, The Netherlands
E-mail: chichester@knewco.com
Fax: +31-79-593-1601

Abbreviations: AC, UniProtKB/Swiss-Prot accession number; AUC, area under the curve; GO, Gene Ontology; MDS, multi-dimensional scaling; MeSH, Medical Subject Headings; PARN, poly(A)-specific ribonuclease; PMID, PubMed ID; Rbm3, RNA-binding protein 3; rpL12, ribosomal protein L12; SC35, splicing factor, arginine/serine-rich 2 protein; SRP, signal recognition particle; TTF-1, transcription factor 1

* Both authors have contributed equally to this work.

1 Introduction

There is a high demand for approaches and techniques that can accelerate the annotation of genes, proteins, and other entities in the biomedical sciences. The manual annotation of proteins, such as performed by UniProtKB/Swiss-Prot, is a demanding process that could benefit enormously from computer assistance. Here we describe the evaluation of a novel approach that uses a combination of techniques that
process free text resources to reconstruct and refine the manual annotation of 585 proteins identified in the human nucleolus.

The nucleolus is a subnuclear compartment of eukaryotic cells that has classically been characterized as the site of ribosomal RNA (rRNA) synthesis and ribosome biogenesis. Recently, remarkable progress has been made in understanding the functional organization of the nucleolus, mainly through the direct application of proteomics research, which has enabled the identification of more than 700 proteins [1–4]. These studies and more recent findings [5] have suggested that the nucleolus may also play a key role in growth and cell cycle control, tumorigenesis, aging, and sequestration of nonribosomal macromolecules and the consequent modulation of their molecular pathways [6]. The recent high-throughput studies have delivered massive amounts of data, which in turn lead to time-consuming manual examination of information contained in databases and biomedical literature to attribute functional characteristics to the proteins discovered. Many of the protein entities have been previously described in the literature; they therefore need to be correlated with the established knowledge and eventually categorized into functional pathways. Many of the nucleolar proteins identified via proteomics are still classified as "function unknown." Proteins that represent new entities present the challenge of deciphering the biological function without literature validation specific for the individual human protein. The first step in these processes is to employ techniques that support educated inferences by the scientist.

An established method of attributing possible biological functions to newly discovered proteins is through the use of sequence alignments. Significant similarity between sequences implies biological relatedness and often, although not always, functional similarity.
Current prediction of protein function is based on extrapolation of the information accumulated for a relatively small set of proteins for which direct functions have been determined experimentally. Additional clues to individual protein function reside in the vast ocean of the biomedical literature.

Existing methodologies mine functional information by relying on document classification methods based on the relationships between a limited set of GO terms and manually assigned Medical Subject Headings (MeSH) [7], by associating literature with proteins and GO terms using a dictionary [8], or by sequence homology clustering with a GOA protein [9] for the transfer of GO terms. The increasing number and diversity of protein sequence families require new methods to define and predict details regarding function. Our method extracts relations between biological entities from the entire MEDLINE literature database to produce a condensed version of it, an impossible task for a single scientist. In the present work, we focus on human nucleolar proteins and their sequence homologs. The condensed literature is represented by a set of vocabulary terms taken from a thesaurus built from the MeSH and several gene and protein databases. These representations are called concept profiles. Concept profiles calculated from papers related to human nucleolar proteins and their sequence homologs are then compared to the concept profiles calculated from papers associated with each GO term. This comparison functions as a bridge from the molecular function data to the sequence data and allows the association of the correct term to each nucleolar protein.

For new sequences known to be homologous to an existing family, but of unknown function, our method can exploit the information stored in an unstructured form in literature and complement the information in structured databases by predicting specific functions at a high level of detail.
We demonstrate the method by application to a well-characterized protein family, the DEAD-box RNA helicases.

Particularly important from the perspective of systems biology, we show how our methodology can generate biological leads concerning proteins associated with the exosome surveillance mechanism. This is possible because we capture the correlation between the concept profiles representing the exosome proteins and the concept profiles of the exosome functional categories, which points to this overlooked biological system. Equally relevant to systems biology, the concept profiles resulting from the condensed literature are also used to perform a large-scale prediction of new potential nucleolar proteins; we automatically derived and aggregated the functional concepts for all known nucleolar proteins and used these to predict novel nucleolar proteins. Several of our predictions, although not found by proteomics analysis of the nucleolus, have been subsequently confirmed by manual evaluation of published experimental findings.

2 Materials and methods

We automatically assigned GO terms to each of the nucleolar proteins that were identified by Coute et al. [16] for which enough information was available. The process is depicted in Fig. 1 and consists of the following steps.

2.1 Step 1: Creating literature sets per protein

We used the information from several sequence databases to connect proteins to the PubMed IDs (PMIDs) of relevant articles about these proteins. For human proteins, we used the articles referred to in Entrez Gene [10]. For mouse, Drosophila, and yeast, we used the Mouse Genome Database [11], FlyBase [12], and the Saccharomyces Genome Database [13], respectively. We removed articles that were referred to in more than 25 entries; these generally focus on sequencing projects and do not contain functional information for specific proteins.
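The article filter described above can be illustrated with a minimal sketch. It assumes the protein-to-PMID links have already been loaded into a dictionary; the function name `drop_broad_articles`, the `max_entries` parameter, and the toy identifiers are hypothetical and are not taken from the original system.

```python
from collections import Counter
from typing import Dict, Set


def drop_broad_articles(
    entry_to_pmids: Dict[str, Set[str]], max_entries: int = 25
) -> Dict[str, Set[str]]:
    """Remove PMIDs that are referenced by more than `max_entries` entries."""
    # Count how many database entries cite each article.
    counts = Counter(pmid for pmids in entry_to_pmids.values() for pmid in pmids)
    # Keep only articles cited by at most `max_entries` entries.
    return {
        entry: {pmid for pmid in pmids if counts[pmid] <= max_entries}
        for entry, pmids in entry_to_pmids.items()
    }


if __name__ == "__main__":
    # Toy data (hypothetical): PMID "100" is cited by all three entries.
    toy = {"geneA": {"100", "101"}, "geneB": {"100", "102"}, "geneC": {"100"}}
    # With a threshold of 2, the widely cited article is dropped everywhere.
    print(drop_broad_articles(toy, max_entries=2))
```

Counting citations per article across all entries first, and filtering afterwards, keeps the threshold independent of the order in which entries are read.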
To provide a potential function for previously undocumented proteins, we combined the information found for similar sequences. This technique is based on the assumption that proteins with high sequence homology will have similar function and that using all the identified literature, including cross-species information, will provide more functional insight than the unique sequence literature alone. We aligned the protein sequences from UniProtKB/Swiss-Prot to similar sequences using blastp with the default parameters [14]. We pooled the literature from human sequences that matched the query sequence with an E-value of 10^-90 or lower. For nonhuman sequences, we relaxed the criteria to sequences with an E-value of 10^-20 or lower to allow for cross-species differences in sequence that were not related to the function of the protein.

Figure 1. Overview of the process for assigning GO terms to proteins. Step 1: A set of PMIDs is extracted from sequence databases for each protein in the databases. Step 2: A set of PMIDs is extracted from the GOA database for each GO term. Step 3: For each set of PMIDs, a concept profile is generated. Step 4: The concept profiles of proteins are compared to the concept profiles of GO terms using vector matching.

2.2 Step 2: Finding articles related to GO terms

The GOA database lists proteins and their corresponding GO terms and, when applicable, mentions the PMIDs of articles used for assigning a specific GO term. We used these articles to represent each GO term. We found that many GO terms were seldom or never assigned to gene products. We included only those GO terms for which more than five PMIDs were found.

2.3 Step 3: Creating concept profiles

Concepts are defined via a thesaurus. A thesaurus contains terms and synonyms that are used to identify textual references to concepts. Our thesaurus was created by combining biomedical concepts obtained from MeSH, gene product-specific information obtained from GO, and gene/protein names obtained from five different databases for four different species [15]. There is some overlap between MeSH and GO, resulting in some concepts being identified twice. However, this has little consequence for the subsequent processing steps.

The next step is to create concept profiles for each protein and each GO term.
A concept profile is a list of concepts with weights. These weights indicate the association between a concept and the protein or GO term to which the concept profile belongs. For each PMID related to a protein or a GO term, the title and abstract were retrieved from the MEDLINE database. Using the thesaurus and the Collexis software, concepts appearing in the title or abstract were identified.

The concept profiles were extracted from the literature sets belonging to each protein and each GO term. Concepts that were found in at least one abstract of a literature set were added to the concept profile. The weight of each concept was calculated as follows: for each concept, we tested the hypothesis that there was a statistical dependency between (a) the occurrence of that concept in an abstract and (b) whether the abstract in which the concept occurred was in the literature set. The weight of a concept in the profile was defined as the log-likelihood of this test, where a high weight indicated a concept that was most distinguishing for the abstracts in the set when compared to all the other abstracts used in this experiment.

2.4 Step 4: Comparing concept profiles

The concept profiles of the proteins were compared to the concept profiles of the GO terms using cosine vector matching, resulting in a matching score. The highest matching GO terms for each protein were considered to be the most relevant GO terms for that protein.

2.5 Evaluation of functional assignment

To evaluate the performance of our system, we used the manually defined categorization by Coute et al. [16], who grouped the nucleolar proteins into nine main categories, some of which were decomposed further into subcategories. In collaboration with the creators of that classification, a mapping between categories and subcategories, and GO terms was performed manually. This mapping is provided in Table 1. When the GO terms assigned to a protein by our system matched the GO terms mapped to the manual categorization, the automatic assignment was assumed to be correct.

Table 1. Mapping between the functional categories defined by Coute et al. [16] and terms from the Gene Ontology (GO)

Category | GO term | GO identifier
Chaperones | Protein folding | GO:0006457
Chromatin structure | Chromatin assembly or disassembly | GO:0006333
Fibrous proteins | Intermediate filament | GO:0005882
mRNA metabolism
  3' end cleavage and polyadenylation | mRNA polyadenylylation | GO:0006378
  Editing | mRNA editing | GO:0006381
  Export/trafficking | Ribosome–nucleus export | GO:0000054
  Splicing | Nuclear mRNA splicing, via spliceosome | GO:0000398
  Stability/degradation | mRNA catabolism | GO:0006402
  Transcription | Transcription, DNA-dependent | GO:0006351
Others
  Cytoskeleton organization/transport | Cytoskeleton organization and biogenesis | GO:0007010
  DNA repair | DNA repair | GO:0006281
  DNA replication | DNA replication | GO:0006260
  Kinases/phosphatases | Kinase activity | GO:0016301
  Mitosis/cytokinesis/cell cycle regulation | Mitosis | GO:0007067
  Mitosis/cytokinesis/cell cycle regulation | Cytokinesis | GO:0000910
  Mitosis/cytokinesis/cell cycle regulation | Regulation of cell cycle | GO:0000074
  Nucleocytoplasmic transport | Protein–nucleus export | GO:0006611
  Other enzymes | Catalytic activity | GO:0003824
  Sumoylation | Protein sumoylation | GO:0016925
  Ubiquitination/protein degradation | Protein ubiquitination | GO:0016567
Ribosomal proteins | Structural constituent of ribosome | GO:0003735
Ribosome biogenesis | rRNA processing | GO:0006364
Translation
  Methionyl aminopeptidase activity | Methionyl aminopeptidase activity | GO:0004239
  Regulation of translation | Regulation of translation | GO:0006445
  SRP | SRP binding | GO:0005047
  Translation elongation | Translation elongation | GO:0006414
  Translation initiation | Translation initiation | GO:0006413
  Translation termination | Translation termination | GO:0006415
  tRNA methyltransferase activity | tRNA processing | GO:0008175

Column 1 shows the functional terms originally designated to the nucleolar proteins. The creators of the original classification selected specific GO terms (Column 2) that were determined to match most closely to their original categories. Column 3 contains the GO identifiers for the terms found in Column 2.

The performance of the automated assignment was scored using receiver operating characteristic (ROC) curves; for each category, all proteins were ranked according to the vector matching score between the protein profile and the profile of the GO term associated with that category. If more than one GO term was associated with a category, as is the case for several higher-level categories, the ranking was based on the sum of the vector matching scores. Based on the rankings, the area under the curve (AUC) was calculated as a measure of the similarity between the manual classification and the automated assignment.

2.6 Visualization

We visualized the functional relationships by using a technique known as multi-dimensional scaling (MDS) [17]. This technique optimizes the positions of objects, in this case nucleolar proteins, in a 2-D space to make the distances in this space as similar as possible to a given distance measure. Our distance measure was the cosine vector matching score between the protein concept profiles, used as an indication of the functional similarity between proteins. Thus, in the resulting 2-D MDS display, proteins that are functionally similar should appear close together.
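The weighting, matching, and evaluation steps of Sections 2.3 to 2.5 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Collexis-based implementation: the text does not spell out the exact test statistic, so the sketch assumes the log-likelihood ratio (G²) of a 2x2 contingency table, and it computes the AUC as the fraction of positive/negative pairs ranked correctly. The function names `g2_weight`, `cosine`, and `auc` and all toy values are hypothetical.

```python
import math
from typing import Dict, List


def g2_weight(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio (G^2) of a 2x2 table (assumed statistic).

    k11: abstracts in the literature set that mention the concept
    k12: abstracts in the set that do not mention it
    k21: abstracts outside the set that mention it
    k22: abstracts outside the set that do not mention it
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    g2 = 0.0
    for i, row in enumerate(((k11, k12), (k21, k22))):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse concept profiles."""
    dot = sum(w * b[c] for c, w in a.items() if c in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def auc(scores: List[float], labels: List[bool]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy profiles (hypothetical weights) for one GO term and three proteins.
    go_profile = {"rRNA": 12.0, "ribosome": 8.0}
    proteins = {
        "P1": {"rRNA": 10.0, "ribosome": 5.0},
        "P2": {"kinase": 9.0, "rRNA": 1.0},
        "P3": {"splicing": 7.0},
    }
    scores = [cosine(p, go_profile) for p in proteins.values()]
    # Only P1 is manually assigned to the category in this toy example.
    print(auc(scores, [True, False, False]))
```

Note that G² is symmetric: a concept that is strongly under-represented in a literature set would also receive a high value, so a real system may need to check the direction of the association separately.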
3 Results

3.1 Creation of concept profiles

Using the multi-step approach described in Section 2, we created concept profiles from the scientific literature for 585 of the 704 known nucleolar proteins. For the remaining 119 proteins, no literature was available, either for the protein itself or for one of its homologs in Entrez Gene. Sixty-two of these proteins were labeled function unknown by Coute et al. [16]. For the remaining 57, literature was apparently available to enable manual annotation, but we restricted our analysis to proteins for which Entrez Gene references were available. The number of homologs and subsequent papers used for the creation of the 585 protein concept profiles is displayed in Table 2.

Table 2. Summary of the data used for generating protein concept profiles for all proteins for which at least one article was found

 | Minimum | Maximum | Mean
Homologs used
  Human | 0 | 9 | 1.66
  Mouse | 0 | 10 | 1.37
  Fruitfly | 0 | 5 | 0.70
  Yeast | 0 | 8 | 1.21
Articles used
  Articles | 1 | 1046 | 91.31

This table shows the minimum, maximum, and mean number of concept profiles from homologous proteins and the number of articles used to create the concept profiles. The number of human homologs includes the query protein itself, for which a concept profile was not always available.

3.2 Assignment of functional categories

Each main category of functional classification showed a significant relationship between the functional term attributed automatically and the function of the proteins within the category (Table 3). For each nucleolar protein, the best matching GO terms and their associated categories were determined. A complete table is included in Supplementary Table S1, showing these categories and the vector matching scores between the GO term profiles and protein profiles. The automatically assigned categories did not always match the manually assigned categories.
Table 3. AUC measures correlating the main nucleolar functional categories and subcategories as assigned by Coute et al. [16] with those automatically derived by literature analysis

Category | AUC | p
Chaperones | 1.00 | <0.001
Chromatin structure | 0.98 | <0.001
Fibrous proteins | 0.97 | <0.001
mRNA metabolism | 0.72 | <0.001
  3' end cleavage and polyadenylation | 0.65 | <0.001
  Editing | 0.68 | <0.001
  Export/trafficking | 0.66 | <0.001
  Splicing | 0.74 | <0.001
  Stability/degradation | 0.55 | 0.060
  Transcription | 0.56 | <0.001
Others | 0.81 | <0.001
  Cytoskeleton organization/transport | 0.69 | <0.001
  DNA repair | 0.72 | <0.001
  DNA replication | 0.70 | <0.001
  Kinases/phosphatases | 0.58 | 0.002
  Mitosis/cytokinesis/cell cycle regulation | 0.74 | <0.001
  Nucleocytoplasmic transport | 0.57 | 0.015
  Other enzymes | 0.65 | <0.001
  Sumoylation | 0.73 | <0.001
  Ubiquitination/protein degradation | 0.67 | <0.001
Ribosomal proteins | 0.97 | <0.001
Ribosome biogenesis | 0.69 | <0.001
Translation | 0.88 | <0.001
  Methionyl aminopeptidase activity | 0.56 | 0.273
  Regulation of translation | 0.77 | <0.001
  SRP | 0.64 | 0.009
  Translation elongation | 0.80 | <0.001
  Translation initiation | 0.80 | <0.001
  Translation termination | 0.83 | <0.001
  tRNA methyltransferase activity | 0.55 | 0.331

This table gives the AUC summary measure for the ROC curves used to describe the capacity of our method to correctly discriminate between the possible functional categories for each protein. For each functional category one or more GO terms were selected that best represented that category. The membership of a protein to a category was calculated by determining the distance between the profile belonging to the GO term and the protein profile. An AUC of 0.5 represents a random assignment of the functional category and 1.0 represents a perfect assignment.

An investigation showed three main reasons for the observed discrepancies. First, each protein is assigned to only one class by the manually defined categorization, but in reality several classes could be correct. For instance, SFRS protein kinase 1 (AC Q8IY12) plays a central role in the splicing regulatory network and therefore is manually assigned to the "Splicing" subcategory of "mRNA metabolism." Additionally, it is a kinase and thus the GO term "Kinase activity" is assigned by the automated system as the primary term.

Second, words occurring in the documents related to a protein do not always indicate its function. For example, Structural maintenance of chromosome 3 (AC Q9UQE7) is involved in chromosome cohesion during cell cycle, and is primarily assigned by the system to the functional category "mitosis" instead of the category "chromatin structure," because the word mitosis is often mentioned in the articles related to this protein.
Third, in some cases ambiguity in language introduces errors. For instance, the ambiguous term “factor 1” is a synonym of the protein transcription factor 1 (TTF-1, AC Q15361). This term is erroneously recognized in the name of the protein cleavage and polyadenylation specificity factor, 160 kDa subunit (CPSF1, AC Q10570), which in the literature may be referred to as the protein complex polyadenylation factor 1. As a result, the TTF-1 protein is assigned to many papers related to the GO term “mRNA polyadenylylation” and incorporated in the concept profile of that GO term, resulting in the faulty assignment of the GO term to TTF-1. Efforts are underway to remove these errors from the system following the method described by Schijvenaars et al. [18].

3.3 Functional categories of DEAD-box motif proteins

As a case study to investigate whether our system has added value over purely sequence-based annotations, we constructed a multiple sequence alignment of the nucleolar members of the DEAD-box proteins (Fig. 2) using the tool ClustalW [19]. The aligned proteins were all related by the DEAD-box element, nine conserved sequence motifs associated with RNA-dependent ATPase and ATP-dependent RNA helicase activities. Although the specific function of many DEAD-box proteins is uncertain, they are generally found to play important roles in RNA metabolism by affecting splicing, ribosome biogenesis, RNA turnover, and mRNA export. The proteins in the alignment were divided into three clusters. Assuming most of the functional aspects of the sequences matching the DEAD-box family are known, we attributed to each cluster the functions, from a limited set of manually chosen GO terms, most likely possessed by those proteins.
Given that these proteins are all related by the DEAD-box motif [20], a simple transfer of sequence annotations would assign the same functional attributes to all DEAD-box proteins, as seen in protein sequence classification databases [21]. At the highest level of functional assignment our system behaved in this manner, as predicted. Assignment of functional attributes showed that all of the clusters possessed ATP-dependent RNA helicase activity as the major functional category (Fig. 2). However, at the level of the secondary term, the clusters appeared to have divergent functions: translational initiation, mRNA-nucleus export, and ribosome biogenesis (Fig. 2). Manual investigation showed that within full text articles there was evidence to demonstrate that the automated assignment of the specific concept was correct for many of the proteins in each cluster.

Figure 2. Multiple sequence alignment of nucleolar DEAD-box proteins and the automatic assignment of functional attributes. Each of the proteins in the alignment is represented by the gene name followed by the UniProtKB/Swiss-Prot accession number in parentheses. The alignment shows three main clusters. The primary assigned term, ATP-dependent RNA helicase activity, which represents the generally accepted activity of the DEAD-box motif, is assigned to all clusters. Differences between the clusters are seen at the level of the secondary terms.

This fulfills an important goal in high-throughput proteomics. The automated incorporation of prior knowledge across species supports the generation of functional predictions at a level more refined than the simple transfer of annotations based on sequence similarities alone. The propagation of erroneous assignments based on repeated reference in the literature to a single alignment probably plays a minor role in literature-derived automated annotation.
We show that the system is not limited to the highest level designation, and most of the protein profiles are based on a large number of papers (96% of the profiles are based on more than 20 papers), thus marginalizing errors. We have no indication that this bias plays more than an inconsequential role, a role it would also play in manual annotation. Moreover, the integration of automated literature-derived information facilitates prediction strategies based on the analysis of diverse sources of information, which, if viewed individually, may provide an incomplete or inconsistent representation of a biological phenomenon.

3.4 MDS display

MDS is a technique for representing high-dimensional objects in typically two dimensions. In the main MDS display (Fig. 3A), the proteins of each of the nucleolar functional categories are represented by a different color and shape. The MDS display appears to capture the correlation between the concepts representing the proteins and the underlying features of the functional categories. The proteins in the functional categories chaperones, chromatin structure, fibrous proteins, cytoskeleton organization/transport, DNA repair, DNA replication, kinases/phosphatases, mitosis/cytokinesis/cell cycle regulation, nucleocytoplasmic transport, ubiquitination/protein degradation, and ribosomal proteins appeared to cluster essentially along the lines of the functional categories. Some categories featured a few outliers from their respective clusters. The proteins in the functional categories ribosome biogenesis, mRNA metabolism, and translation appeared to cluster but showed more dispersion than other categories. Outliers from the principal cluster are of special interest when they appear to fall directly within a different functional cluster. For example, the ribosomal protein 40S ribosomal protein S27a (AC P62979) is located centrally within a group of ubiquitination and sumoylation proteins.
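To make the idea of an MDS display more concrete, the sketch below places items in two dimensions from a matrix of pairwise distances. It is only an illustration under stated assumptions: it uses classical (Torgerson) MDS with a simple power iteration, which is not necessarily the MDS variant used to produce Fig. 3, and it takes a ready-made distance matrix as input because the way distances are derived from concept profiles is not specified in this section. The function names `classical_mds` and `submatrix` are hypothetical.

```python
import math
import random


def classical_mds(dist, dims=2, iters=200, seed=0):
    """Place n items in `dims` dimensions from an n x n symmetric distance matrix."""
    n = len(dist)
    # Double-centre the squared distances to obtain an inner-product matrix B.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration: converge on the leading eigenvector of the current B.
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(x * y for x, y in zip(v, bv))  # Rayleigh quotient (v has unit length)
        if lam <= 1e-12:
            break  # no positive eigenvalue left; remaining coordinates stay 0
        for i in range(n):
            coords[i][k] = math.sqrt(lam) * v[i]
        # Deflate B so that the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


def submatrix(dist, keep):
    """Distances among the items whose indices are listed in `keep`."""
    return [[dist[i][j] for j in keep] for i in keep]


if __name__ == "__main__":
    points = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (0.0, 1.0)]
    d = [[math.dist(p, q) for q in points] for p in points]
    print(classical_mds(d))                        # layout of all four items
    print(classical_mds(submatrix(d, [0, 1, 2])))  # layout computed from a subset only
```

Calling `classical_mds(submatrix(d, keep))` computes a layout from the distances among the selected items only. For high-dimensional data that cannot be represented exactly in two dimensions, such a subset layout will generally differ from the positions the same items receive in the layout of the full set.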
To make the position of this outlier in relation to these groups and the other ribosomal proteins more apparent, we have displayed only the relevant proteins (Fig. 3B). Note that the MDS technique attempts to optimize the positions of the proteins to reflect the relationships between the proteins, and that displaying a subset of the proteins will result in a slightly different figure as compared to the complete MDS (Fig. 3A). This is because only the relationships of the proteins being displayed are taken into account.

Figure 3. MDS display of nucleolar proteins. The automatically derived concept profile for each protein is used to place each point on the map. MDS achieves a drastic reduction in dimensionality and therefore may not visually preserve the strong separation of distinct clusters in the original high-dimensional space. The manually described functional categories for each protein are represented by different colors and shapes. (A) The map of the complete set of the 700 nucleolar proteins is shown with an inset of the exosome cluster. In the inset, the proteins belonging to the exosome functional category are circled. The three proteins detailed in the text, exosome component 10, PARN, and SRP9, are indicated. The three function unknown proteins described in the text are indicated by their UniProtKB/Swiss-Prot accession numbers (O43390, P98179, Q8N220). (B) The clusters of the nucleolar proteins from the functional categories of ubiquitination/protein degradation and sumoylation were evaluated in combination with the ribosomal protein category to investigate the placement of the ribosomal S27a protein (AC P62979).

Analysis of the ubiquitination and sumoylation functional clusters shows that, like ubiquitination, sumoylation (small ubiquitin-related modifier modification) modulates protein function through post-translational covalent attachment to lysine residues within targeted proteins.
Over the past few years, it has become apparent that the two modification systems often communicate and jointly affect the properties of common substrate proteins, sometimes even involving the same lysine residue. This justifies their close association on the MDS diagram. The placement of the 40S ribosomal protein S27a within these clusters was rationalized by the fact that ubiquitin is synthesized in eukaryotes as a linear fusion either with itself or with one of two carboxy-terminal extension proteins (CEPs). The 40S ribosomal protein S27a is the nucleolar protein that is generated during the post-translational processing events which generate the ribosomal forms of CEPs [22].

Many proteins appear to be multifunctional and therefore, in annotation, can correctly be attributed to a variety of different functional classes. In the original manual classification each nucleolar protein was associated with only one category, and the selection of category depended greatly upon the bias of the annotator. Coute et al. [16] have leaned toward finding the link to ribosome biogenesis for the classification of proteins, since the nucleolus is classically known as the site of ribosome biogenesis. They have also reduced the granularity to facilitate the annotation by limiting the number of categories to the main functional roles of the proteins.

The MDS visualization can offer an interpretation of the functional categorization that potentially differs from the bias of the scientist and may suggest previously unrecognized categories. Evaluation of the ribosome biogenesis proteins (Fig. 3A, purple dots) showed a small group of exosome proteins located away from the principal cluster. The exosome has been implicated in both biogenesis and degradation of many RNA species, including rRNA. Inclusion of exosome proteins in ribosomal biogenesis is thus warranted, as is their separation from the main cluster due to their role in RNA degradation. In close proximity to the exosome proteins is the 3′-exoribonuclease that preferentially cleaves poly(A) tails, poly(A)-specific ribonuclease (PARN, AC O95453), which was manually assigned to the functional category mRNA metabolism, stability and degradation. This is interesting in light of the recent discovery in Saccharomyces cerevisiae that all of the rRNAs can be polyadenylated at a certain level [23]; therefore, PARN along with exosome component 10 (AC Q01780) could have a role in removing these poly(A) tails. Additionally, the exosome group contains a protein manually classified in the translation category. The exosome might play a role in the biogenesis of the small RNA component of the cytoplasmic signal recognition particle (SRP), which is assembled in the nucleolus of eukaryotic cells [24, 25].
Due to this association, the SRP 9 kDa protein is also found within the exosome cluster (Fig. 3A, inset). The clustering of these proteins from varied functional categories, but with an underlying functional cohesion, suggests another possible category, exosome.

3.5 Elucidating function unknown proteins from MDS clusters

Coute et al. [16] have classified a number of proteins as function unknown because explicit experiments to determine their function have not yet been carried out. Several of these function unknown proteins appeared to fall within the cluster analyzed for exosome proteins (Fig. 3A). Literature concerning these “unknown” proteins was evaluated to correlate their position on the MDS with the other clustered proteins.

The heterogeneous nuclear ribonucleoprotein (hnRNP) R (AC O43390, Fig. 3A, inset) was found near the exosome proteins. hnRNPs are among the most abundant proteins in the eukaryotic cell nucleus and play a direct role in several aspects of the RNA life cycle, including pre-mRNA splicing, mRNA transport, and mRNA translation, stability, and turnover. It is commonly accepted that these proteins bind nascent RNA in a transcript-specific assembly and that both protein–RNA and protein–protein interactions govern the final outcome of the ribonucleoprotein fiber, which forms the substrate of the ensuing processing events [26, 27]. The cooperative interactions between the proteins that assemble onto an intricate splicing-regulatory sequence and the hnRNP assembly are altered in different cell types by incorporating different but highly related proteins [28]. The KH-type splicing-regulatory protein (KSRP) and the hnRNP each bind to distinct sequence elements of specific transcripts to mediate RNA binding, mRNA decay, and interactions with the exosome and PARN [29].
It has been suggested that NSAP1, which shares 80% homology with hnRNP R, may be a common factor recruited by different destabilizing elements for directing the deadenylation of transcripts. This suggests a possible role for hnRNP R with exosome proteins.

Another protein classified as function unknown and found in the proximity of the exosome is the RNA-binding protein 3 (Rbm3, AC P98179). This protein belongs to the hnRNP subgroup of RNA-binding proteins [30] that contain a single RNA recognition motif (RRM) and a glycine-rich tail. Although the function of this family of proteins is not yet precisely known, it has been suggested that they affect both transcription and translation [31]. Additionally, it has been hypothesized that microRNAs are part of a homeostatic mechanism that controls global protein synthesis [32] and play a crucial role in post-transcriptional gene silencing. Dresios et al. [33] found that Rbm3 had a major effect on the relative abundance of microRNA-containing complexes in cells and that the expression of an Rbm3 fusion protein led to a decrease in microRNA levels. Since the cellular levels of different RNA molecules must be tightly controlled, the inclusion of Rbm3 in the cluster with the exosome proteins involved in maintaining correct RNA levels is justifiable.

One additional protein of the function unknown category found associated with the exosome cluster was the Hypothetical protein FLJ36307 (AC Q8N220). Upon detailed inspection of the sequence data, this protein appears to be a new splice variant of the splicing factor, arginine/serine-rich 2 protein (SC35, AC Q01130) (Supplementary Fig. S1). SC35 belongs to the family of SR proteins that regulate alternative splicing in a concentration-dependent manner in vitro and in vivo. A striking feature of most of the SR protein genes is that they express alternative mRNAs encoding truncated proteins whose biological roles remain to be established [34].
It has been reported that SC35 overexpression leads to a pronounced decrease of its endogenous mRNA levels in HeLa cells [35]. Moreover, SC35 accumulation is also correlated with changes in the splicing pattern of its own mRNAs. Finally, it appears that SC35 autoregulates its expression by promoting alternative splicing events that alter the stability of its own mRNAs, possibly making it a target of the exosome surveillance mechanism.

In summary, the correlation between the concepts representing the proteins and the underlying features of the functional categories is explicit enough to suggest coherent mechanistic descriptions for exosome proteins that were manually classified in other functional categories or as function unknown. The system makes it possible to evaluate proteins free of personal bias, so that associations previously hidden from the researcher can be found.

3.6 Prediction of novel nucleolar proteins

The concept profiles of the 90 proteins determined to have a nucleolar location as part of their functional spectrum by all four proteomic studies described in Coute et al. [16] were used to derive a typical “nucleolar profile.” The proteins selected to collectively form this profile were localized in the human nucleolus by direct visualization methods, such as antibody detection, protein tagging with green fluorescent protein, or electron microscopy, or were thought to be involved in ribosome biogenesis. These profiles were compared to the profiles of all the other human proteins in the UniProtKB/Swiss-Prot database. The resulting matching scores were aggregated, and the overall 14 best matching proteins are shown in Table 4.

Table 4. Prediction of novel nucleolar proteins

AC       Protein name                         Description
Q92901   60S ribosomal protein L3-like        Ribosomal protein
Q9NQI0   DDX4                                 DEAD-box protein
O15523   DDX3Y                                DEAD-box protein
Q9UI26   Importin-11                          Interacts with nucleolar protein rpL12 [38]
Q9UHI6   DDX20                                DEAD-box protein
P05388   60S acidic ribosomal protein P0      Ribosomal protein
Q15477   Helicase SKI2W                       Shown to be nucleolar [41]
P60866   40S ribosomal protein S20            Ribosomal protein
P26196   DDX6                                 DEAD-box protein
O95793   STAU1                                Shown to be nucleolar [40]
Q9UHL0   DDX25                                DEAD-box protein
Q14240   Eukaryotic initiation factor 4A-II   DEAD-box protein
Q9UMR2   DDX19B                               DEAD-box protein
P23396   40S ribosomal protein S3             Ribosomal protein

This table shows the top 14 predicted nucleolar proteins. Columns 1 (AC) and 2 (Protein name) contain the accession number and the protein name. Column 3 contains a brief description of the protein or the citation containing evidence for nucleolar localization.

Among the highest scoring proteins, many are members of the DEAD/DEAH-box protein and the ribosomal protein families. These families have a precedent for nucleolar localization [36, 37] and thus are not unexpected as potential nucleolar proteins, although explicit nucleolar localization has not yet been demonstrated. Notably, two of the potential nucleolar DEAD-box proteins, ATP-dependent RNA helicase DDX3Y (AC O15523) and eukaryotic initiation factor 4A-II (AC Q14240), share a high degree of sequence similarity with ATP-dependent RNA helicase DDX3X (AC O00571) and eukaryotic initiation factor 4A-I (AC P60842), respectively, both of which have previously been identified as nucleolar proteins by proteomics [16].

The appearance of three of the top 14 proteins not classified in the DEAD-box or ribosomal protein families was justified by a literature search of nonproteomic eukaryotic studies. Importin-11 (AC Q9UI26) serves as a transport receptor for the ribosomal protein L12 (rpL12) [38], a nucleolar protein. This interaction with rpL12 rationalizes its inclusion. Double-stranded RNA-binding protein staufen homolog 1 (AC O95793) also interacts with a nucleolar molecule, telomerase RNA, and has been independently visualized by immunofluorescence in nucleoli [39, 40]. Additionally, immunofluorescence and colocalization with the nucleolar protein nucleophosmin have been used to detect helicase SKI2W (AC Q15477) in nucleoli [41].

4 Discussion

Both the unmanageable amount of information waiting to be incorporated into curated databases and the rate at which new information is produced today emphasize the need for the system presented here.
We demonstrate a technique that allows a rapid and accurate assignment of functions to individual proteins and clusters proteins/genes at the functional level. Such approaches can not only drastically decrease the time needed for manual annotation, but also reveal functional clusters and correlations between result sets that are not easily discernible at the individual gene level. The techniques and approaches we propose are widely applicable to genomics microarray and proteomics research as well as to the investigation of individual proteins in specific pathways.

The manual annotation of many biological objects and molecules for databases, such as UniProtKB/Swiss-Prot and GO, is crucial for modern complex biological research. However, manual annotation by professionals is increasingly challenging because of the information overload. With more than 200 000 proteins presently in UniProtKB/Swiss-Prot and only slightly more than 70 annotators, it becomes quite a daunting task to keep up with the existing information, let alone to face the millions of proteins still to be annotated. Our system includes more literature than humanly possible and presents a concise overview of the most important concepts associated with a given protein. With regard to the 704 proteins of the nucleolar proteome, the method has reproduced the annotation of a 6-month literature search with high accuracy in only a couple of hours. Apparently, this wide coverage of literature compensates for the fact that our system only used abstracts from MEDLINE, while Coute et al. [16] had the full text articles at their disposal. Evidently, UniProtKB/Swiss-Prot annotation will remain the necessary standard for many years to come, and therefore our system can function as computer assistance to speed up annotation, widen its scope, and increase its objectivity.

Data emerging from any single species or experimental approach provide only one perspective on protein function. Combining the literature on the basis of sequence similarity increases the scope of evaluation even further, in that it may suggest the function of a protein in one species based on conserved sequences in multiple other species for which more information is available in the specialized literature. Certainly, the simple transfer of sequence annotations can wrongly assign functional attributes, but our meta-analysis method makes it possible to go beyond identification of the strictly sequence-associated activities to determine the subtle functional subtypes, as seen with the different clusters of the family of DEAD-box proteins.

The aggregation of information over multiple documents has great potential for actual discovery of knowledge in genomics and proteomics, which are fundamentally network-based systems biology disciplines. Complex cell properties such as those that control the exosome arise from networks of molecular interactions. Control of these networks involves cellular processes at multiple levels, including mRNA and protein quantities, and their molecular modifications, localization, and interactions.
The integration of information from disparate literature and its subsequent automatic analysis enabled the elucidation of information that may not even be explicit in the literature. This is exemplified by our functional prediction for proteins previously identified as function unknown by Coute et al. [16]. Information from multiple levels was extracted from the literature and integrated to suggest biological functions for these less well-characterized proteins. Moreover, a new splice variant of SC35 was discovered due to inferences from this system.

Systems biology research is characterized by the development and application of technologies that allow the monitoring of biology at the system level. In our study, based on the aggregation of the nucleolar protein functional profiles, other proteins within the nucleolar system were identified. Systems biology predicts that the protein composition of the nucleolus may vary as the biological processes respond to specific conditions. Ninety proteins are consistently found in the nucleolus by proteomic studies, indicating that they are located in the nucleolus for at least part of their functional life span. Although these proteins have a wide range of functions, the aggregated conceptual information appeared to produce correct predictions of other proteins that may be related to this biological system. Additionally, we have preliminary evidence from other groups using the same approach that the profiles from a focused group of proteins, such as those centered on a property like pathogenicity, are also powerful tools to predict the additional proteins involved.
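The aggregation step behind Table 4 (Section 3.6) can be sketched as follows. The sketch rests on assumptions that this section does not confirm: concept profiles are represented as sparse mappings from concept to weight, the matching score is a cosine similarity, and the scores against the reference nucleolar proteins are aggregated by taking their mean. The names `cosine` and `rank_candidates` and the toy concept labels are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse profiles (concept -> weight)."""
    dot = sum(w * b.get(concept, 0.0) for concept, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(reference_profiles, candidate_profiles, top_n=14):
    """Rank candidates by their mean matching score against all reference profiles."""
    if not reference_profiles:
        raise ValueError("at least one reference profile is required")
    scored = []
    for name, profile in candidate_profiles.items():
        scores = [cosine(profile, ref) for ref in reference_profiles.values()]
        scored.append((sum(scores) / len(scores), name))
    # Highest mean score first; ties are broken alphabetically by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored[:top_n]]


if __name__ == "__main__":
    # Toy profiles with made-up concepts and weights.
    reference = {
        "R1": {"ribosome": 1.0, "rRNA": 1.0},
        "R2": {"rRNA": 1.0, "helicase": 1.0},
    }
    candidates = {
        "A": {"rRNA": 1.0},
        "B": {"kinase": 1.0},
        "C": {"helicase": 1.0, "kinase": 1.0},
    }
    for name, score in rank_candidates(reference, candidates, top_n=2):
        print(name, round(score, 3))
```

Taking the mean is only one possible way to aggregate the scores. A sum over the same reference set would give the same ranking, whereas a maximum would favour candidates that closely resemble a single reference protein.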
Our method for the extraction of biological information from literature associated with high-throughput data appears to accurately represent biological phenomena because we: (i) exploit sequence similarities at a level more refined than the simple transfer of annotations, (ii) correctly capture relationships between the functional terms attributed automatically and the function of the protein, and (iii) visualize relationships by clustering proteins using their underlying functional cohesion.

This process is essentially a generic tool for network-based systems biology, and we have no reason to believe that the approach described here will be any less powerful with different biological datasets. We are currently working on a number of features that systematize the approach and make the steps more user-friendly. We anticipate having an open access prototype system online in early 2007, to be used by the scientific community at large, which will be made available through our website: http://www.biosemantics.org.

This study was supported by the Biorange project sp 4.1.1. from the Netherlands Bioinformatics Centre, by Leiden University Medical Centre, Erasmus University Medical Centre, and Knewco Inc., and was conducted within the Centre for Medical Systems Biology (CMSB), established by the Netherlands Genomic Initiative/Netherlands Organisation for Scientific Research (NGI/NWO).

5 References

[1] Andersen, J. S., Lyon, C. E., Fox, A. H., Leung, A. K. et al., Curr. Biol. 2002, 12, 1–11.
[2] Scherl, A., Coute, Y., Deon, C., Calle, A. et al., Mol. Biol. Cell 2002, 13, 4100–4109.
[3] Andersen, J. S., Lam, Y. W., Leung, A. K., Ong, S. E. et al., Nature 2005, 433, 77–83.
[4] Vollmer, M., Horth, P., Rozing, G., Coute, Y. et al., J. Sep. Sci. 2006, 29, 499–509.
[5] Raska, I., Shaw, P. J., Cmarko, D., Curr. Opin. Cell Biol. 2006, 18, 325–334.
[6] Olson, M. O., Dundr, M., Histochem. Cell Biol. 2005, 123, 203–216.
[7] Raychaudhuri, S., Chang, J. T., Sutphin, P. D., Altman, R. B., Genome Res. 2002, 12, 203–214.
[8] Koike, A., Niwa, Y., Takagi, T., Bioinformatics 2005, 21, 1227–1236.
[9] Xie, H., Wasserman, A., Levine, Z., Novik, A. et al., Genome Res. 2002, 12, 785–794.
[10] Maglott, D., Ostell, J., Pruitt, K. D., Tatusova, T., Nucleic Acids Res. 2005, 33, D54–D58.
[11] Blake, J. A., Richardson, J. E., Bult, C. J., Kadin, J. A., Eppig, J. T., Nucleic Acids Res. 2003, 31, 193–195.
[12] Drysdale, R. A., Crosby, M. A., Nucleic Acids Res. 2005, 33, D390–D395.
[13] Dwight, S. S., Balakrishnan, R., Christie, K. R., Costanzo, M. C. et al., Brief. Bioinform. 2004, 5, 9–22.
[14] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. et al., Nucleic Acids Res. 1997, 25, 3389–3402.
[15] Schuemie, M. J., Mons, B., Weeber, M., Kors, J. A., J. Biomed. Inform., In Press.
[16] Coute, Y., Burgess, J. A., Diaz, J. J., Chichester, C. et al., Mass Spectrom. Rev. 2006, 25, 215–234.
[17] Borg, I., Groenen, P., Modern Multidimensional Scaling: Theory and Applications, Springer, New York 1997.
[18] Schijvenaars, B. J., Mons, B., Weeber, M., Schuemie, M. J. et al., BMC Bioinformatics 2005, 6, 149.
[19] Thompson, J. D., Higgins, D. G., Gibson, T. J., Nucleic Acids Res. 1994, 22, 4673–4680.
[20] Schmid, S. R., Linder, P., Mol. Microbiol. 1992, 6, 283–291.
[21] Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L. et al., Nucleic Acids Res. 2006, 34, D227–D230.
[22] Williamson, N. A., Raliegh, J., Morrice, N. A., Wettenhall, R. E., Eur. J. Biochem. 1997, 246, 786–793.
[23] Kuai, L., Fang, F., Butler, J. S., Sherman, F., Proc. Natl. Acad. Sci. USA 2004, 101, 8581–8586.
[24] Grosshans, H., Deinert, K., Hurt, E., Simos, G., J. Cell. Biol. 2001, 153, 745–762.
[25] Politz, J. C., Yarovoi, S., Kilroy, S. M., Gowda, K. et al., Proc. Natl. Acad. Sci. USA 2000, 97, 55–60.
[26] Bennett, M., Pinol-Roma, S., Staknis, D., Dreyfuss, G., Reed, R., Mol. Cell Biol. 1992, 12, 3165–3175.
[27] Matunis, E. L., Matunis, M. J., Dreyfuss, G., J. Cell Biol. 1993, 121, 219–228.
[28] Markovtsov, V., Nikolic, J. M., Goldman, J. A., Turck, C. W. et al., Mol. Cell Biol. 2000, 20, 7463–7479.
[29] Gherzi, R., Lee, K. Y., Briata, P., Wegmuller, D. et al., Mol. Cell 2004, 14, 571–583.
[30] Wellmann, S., Buhrer, C., Moderegger, E., Zelmer, A. et al., J. Cell Sci. 2004, 117, 1785–1794.
[31] Wright, C. F., Oswald, B. W., Dellis, S., J. Biol. Chem. 2001, 276, 40680–40686.
[32] Bartel, D. P., Cell 2004, 116, 281–297.
[33] Dresios, J., Aschrafi, A., Owens, G. C., Vanderklish, P. W. et al., Proc. Natl. Acad. Sci. USA 2005, 102, 1865–1870.
[34] Screaton, G. R., Caceres, J. F., Mayeda, A., Bell, M. V. et al., EMBO J. 1995, 14, 4336–4349.
[35] Sureau, A., Gattoni, R., Dooghe, Y., Stevenin, J., Soret, J., EMBO J. 2001, 20, 1785–1796.
[36] Leung, A. K., Lamond, A. I., Crit. Rev. Eukaryot. Gene Expr. 2003, 13, 39–54.
[37] Rocak, S., Linder, P., Nat. Rev. Mol. Cell Biol. 2004, 5, 232–241.
[38] Plafker, K., Macara, I. G., J. Biol. Chem. 2002, 277, 30121–30127.
[39] Marion, R. M., Fortes, P., Beloso, A., Dotti, C., Ortin, J., Mol. Cell Biol. 1999, 19, 2212–2219.
[40] Le, S., Sternglanz, R., Greider, C. W., Mol. Biol. Cell 2000, 11, 999–1010.
[41] Qu, X., Yang, Z., Zhang, S., Shen, L. et al., Nucleic Acids Res. 1998, 26, 4068–4077.