New strategy for the viral protein structure predictions: “Consensus
model” approach to take advantage of sequence diversity
Introduction
Viral surface proteins are attractive targets for vaccine development. They are the major targets of
neutralizing antibodies, but eliciting broadly reactive neutralizing antibodies (brNAbs) has proved
difficult. Structural information on these proteins is extremely important for the rational design of
immunogens, because it reveals which surface residues are conserved. Drug targets such as viral
proteases and reverse transcriptases have been studied extensively by crystallography and NMR in
order to design effective antiviral drugs, but surface proteins have received far less structural
attention. Some well-studied surface proteins, such as HIV gp120 and influenza hemagglutinin, show
significant sequence diversity and have been shown to be challenging targets for the elicitation of
brNAbs. Peptide-based immunogens and epitope-scaffold approaches that use known brNAb epitopes
have so far not succeeded in eliciting brNAbs against HIV-1 or influenza [1,2], and a minimal-sized,
rigid protein that presents a conserved sequence region in its native structural context would be
necessary [3]. To design such an immunogen rationally, structural information on a region of the
protein that can fold independently is crucial. Because too few structures have been solved, we must
rely on computational structure prediction for these proteins.

Predicting the structure of viral proteins is not an easy task. This class of proteins does not share
high sequence similarity with known cellular proteins [4,5,6], and solved structures of viral proteins
are sparse compared with other protein families. Viral proteins are therefore poor targets for
comparative modeling. In addition, sequence diversity within the same protein is extremely high [7,8]
because of the high mutation rates that viruses use for immune evasion, which makes the protein a
moving target for structure prediction (the choice of sequence may affect the outcome). Some proteins
show sequence identities below the “twilight zone” [9] between copies of the same protein from
different strains of the same species. In these cases it is not even clear on what principle the target
sequence for prediction should be chosen, since even the consensus sequence is just another variant.
These proteins are thus difficult targets for de novo prediction as well. Beyond the extensive diversity
within a single protein, the sequence space of viral proteins is vast yet overlaps little with “cellular”
(prokaryotic/archaeal/eukaryotic) sequence space [5,6,10]. Predicting their structure with
knowledge-based algorithms [11,12] therefore appears difficult, for both comparative modeling and
fragment-assembly-based free modeling. Because completely physics-based modeling is still not
feasible or practical [13], we need to use currently available algorithms [14,15] to predict the
structure of viral proteins. Although the overall sequence spaces of cellular and viral proteins overlap
little and are largely unique to each other, apart from some genes exchanged between the two,
fragment libraries do contain short “viral” sequences (such as 9mers in the Rosetta fragment library)
to some degree, even though they are assembled from a database composed mostly of cellular
proteins. We assume that, because of physicochemical constraints, the same short sequence in
cellular and viral proteins should share a similar ensemble of fragment structures. For this reason,
physics-based fragment assembly algorithms such as Rosetta should be able to produce decoys with
the same fold as the native protein, though probably at low frequency.
  
The strategy for computational structure prediction of these viral proteins was developed from a
simple principle, in order to address the difficulties associated with the high diversity and uniqueness
of viral sequence space. Despite genetic distance, the same protein must retain its function and
therefore the same structure (similar enough to keep its function intact). Although the C-terminal
domain of the human papillomavirus E7 protein shares as little as 15% sequence identity among
strains (data not shown), all variants are functional proteins from isolates. In principle, therefore,
genetically distant sequences should share the same fold, even if each sequence generates a
considerably different decoy population. By including the most distant sequence pairs among the
sequences of the same protein, so as to minimize the overlap between the structure spaces belonging
to each sequence, this strategy is designed to capture the common fold populated by the maximum
number of sequences. That fold may be neither the most populated nor the lowest in energy, but it is
common across the whole sequence space, and it is the “consensus model”. Multiple sequence
alignment information has been used to identify hydrophobic core-forming residues or to improve
secondary/tertiary structure prediction, with reported improvements in the predicted structures
[16,17], but the direct use of different sequences of the same protein has not been reported,
apparently because sequence diversity is negligible for cellular proteins. Using Rosetta for decoy
generation, 5 small viral proteins with solved structures were used as benchmarks to evaluate this
strategy. The increase in workstation computing power allows such a “redundant” approach on a
manageable time scale with affordable resources. A total of 800,000 decoys (10,000 x 16 x 5) were
generated and clustered by their pairwise TM-scores [18] using just two 8-core workstations. For this
approach we chose the computationally expensive TM-score instead of RMSD (this is the most
time-consuming step of the whole process) in order to capture the overall “fold” [19,20] rather than
atomic details. In this report we first show the Rosetta-generated decoys and the relation between
TM-score and energy score for these benchmark viral proteins, and then show the performance of the
“consensus” approach.
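To make the contrast with RMSD concrete, the following is a minimal sketch of the TM-score formula for one fixed superposition. It is not the in-house C code used in this study: the function names are hypothetical, the list of per-residue distances is assumed to come from an alignment computed elsewhere, and the 0.5 angstrom floor on d0 is an assumption for short chains. The full TM-score is the maximum of this quantity over superpositions, which this sketch does not search for.

```python
def tm_d0(l_target: int) -> float:
    """Length-dependent distance scale d0 (angstroms) used by the TM-score."""
    # The cube-root expression is only defined for l_target > 15; the 0.5 floor
    # for short chains is an assumption of this sketch.
    if l_target <= 15:
        return 0.5
    return max(0.5, 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_fixed(distances, l_target: int) -> float:
    """TM-score of ONE given superposition.

    distances: distances (angstroms) between aligned residue pairs.
    l_target:  length of the reference (native) chain used for normalization.
    """
    d0 = tm_d0(l_target)
    # Each aligned pair contributes between 0 and 1, so a few badly placed
    # residues cannot dominate the score the way they dominate RMSD.
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target
```

Because every residue pair contributes at most 1/L, a misplaced terminus lowers the score only in proportion to its length, which is why the score tracks the overall fold rather than local atomic detail.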
Methods
Selection of benchmark proteins
Five small viral proteins with known structures and varying degrees of sequence diversity were
chosen to test the “consensus” approach. The selection criteria were the availability of a solved
structure, a sequence length below 100 residues, and sequences ranging from mildly to extremely
diverse. We chose the HIV p24 C-terminal dimerization domain, vpr and vpu cytoplasmic domain, and
the HPV E2 DNA-binding and E7 C-terminal domains, which match these criteria. For each protein,
16 sequences were chosen to be the most distant from one another, both pairwise and on average.
Pairwise sequence identity among the target sequences ranges from 15 to 92%, and the mean
pairwise identity within each protein ranges from 33 to 79%. Table 1 summarizes the sequence
statistics of the target proteins. Sequences were obtained from the Los Alamos HIV database
(http://www.hiv.lanl.gov/) (HIV p24/vpr/vpu) or collected manually from GenBank [21] (HPV E2/E7)
and subjected to multiple alignment with Clustal W [22] and T-Coffee [23]. The sequences were then
trimmed to the domains covered by the solved structures, and N- or C-terminal regions that are
flexible in the solution structure were removed from the target sequences.
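The text does not state how the 16 most distant sequences were picked, so the following is only one plausible sketch: a greedy farthest-first selection over an existing multiple alignment. The function names, the gap handling, and the rule of adding the candidate with the lowest mean identity to the already chosen set are all assumptions of this example.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def pick_most_distant(aligned: dict, k: int) -> list:
    """Greedily pick up to k sequence names that are mutually distant.

    aligned: name -> aligned sequence (equal lengths, '-' for gaps).
    """
    names = sorted(aligned)
    if len(names) < 2:
        return names[:k]
    ident = {
        frozenset(p): percent_identity(aligned[p[0]], aligned[p[1]])
        for p in combinations(names, 2)
    }
    # Seed with the least identical pair.
    seed = min(combinations(names, 2), key=lambda p: ident[frozenset(p)])
    chosen = list(seed)[:k]
    while len(chosen) < min(k, len(names)):
        rest = [n for n in names if n not in chosen]
        # Add the candidate with the lowest mean identity to the chosen set.
        best = min(
            rest,
            key=lambda n: sum(ident[frozenset((n, c))] for c in chosen) / len(chosen),
        )
        chosen.append(best)
    return chosen
```

A greedy choice is not guaranteed to minimize the average identity globally, but it is cheap and tends to spread the picks across the alignment, which is the property the consensus strategy needs.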
Decoy generation, calculation of pairwise TM-scores and clustering
Sets of 10,000 decoys were generated for each sequence using Rosetta 2.2 without the full-atom
relaxation step. For each decoy set, pairwise TM-scores were calculated for all decoy combinations
with an in-house program written in C (with the TM-align source code ported to C). The decoys were
then clustered by their decoy-decoy pairwise TM-scores. Clustering was done with an in-house
program based on the algorithm in the Cluster 3.0 source code [24,25], using complete linkage
clustering with a TM-score threshold of 0.5. The clusters were further examined for the distances
between all members of each pair of clusters; if all pairwise TM-scores between the members were
above the threshold, the two clusters were merged. TM-scores against the solved structures (the
structures used as native were p24: 2JYL, vpr: 1M8L, vpu: 1VPU, E2: 1A7G, E7: 2B9D) were
calculated for all decoys (the 1VPU PDB file contains 9 models, so all 9 were used for the
calculation), and Rosetta energy scores were extracted from the Rosetta silent files. All calculations
were run on two Mac Pro workstations with dual quad-core 2.66 GHz Xeon processors running
Mac OS X 10.6, and all programs were compiled with the Intel compiler v11.1 with full optimization
flags.
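The clustering step can be illustrated with a naive complete linkage procedure on a similarity matrix. This is a sketch, not the in-house program or the Cluster 3.0 implementation: the function names are hypothetical, the TM-score matrix is assumed to be symmetric and precomputed, and the loop is far too slow for 10,000 decoys.

```python
def complete_linkage(sim, threshold):
    """Cluster items by complete linkage on a symmetric similarity matrix.

    sim:       n x n list of lists of pairwise TM-scores.
    threshold: two clusters merge only if EVERY cross pair scores >= threshold.
    Returns clusters as sorted index lists, largest first.
    """
    clusters = [[i] for i in range(len(sim))]

    def linkage(a, b):
        # Complete linkage: similarity of the least similar cross pair.
        return min(sim[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, pair = None, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if best is None or s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break  # no remaining pair of clusters satisfies the threshold
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return sorted((sorted(c) for c in clusters), key=len, reverse=True)


def cluster_center(sim, members):
    """Member with the highest summed similarity to the other members."""
    return max(members, key=lambda i: sum(sim[i][j] for j in members if j != i))
```

Because the linkage value is the minimum over all cross pairs, every pair inside a cluster is guaranteed to score at or above the threshold, which is the property the text relies on when it equates one cluster with one fold.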
Calculation of consensus cluster
Table 1
Protein   Length   Sequence ID (%)
                   Mean    Max    Min
p24       70       79.3    92     62
vpr       56       72.4    89     50
vpu       81       57.7    82     41
E2        84       40.1    90     22
E7        44-54    33.3    64     15
  
The 10 largest clusters for each sequence were taken, and their cluster-center decoys were
assembled into the Top 10 decoy set for each protein. Pairwise TM-scores were then calculated
within the Top 10 decoy set, and the set was clustered again with a TM-score threshold of 0.3. The
largest cluster of the Top 10 decoy set was assigned as the consensus cluster and its cluster center
was taken as the consensus model. When a cluster other than the largest had the highest sequence
coverage, that cluster was taken as the consensus cluster, so as to cover a larger part of sequence
space rather than a larger decoy population.
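The selection rule in this paragraph can be sketched as follows. The function name and data layout are hypothetical, and using cluster size as the tie-breaker between clusters with equal sequence coverage is an assumption of this example.

```python
def pick_consensus(clusters, seq_of):
    """Pick the consensus cluster from clusters of the Top 10 decoy set.

    clusters: list of clusters, each a list of decoy identifiers.
    seq_of:   decoy identifier -> identifier of the sequence it was built from.
    The cluster covering the most distinct sequences wins; ties go to the
    larger cluster (an assumption of this sketch).
    """
    def key(cluster):
        coverage = len({seq_of[d] for d in cluster})
        return (coverage, len(cluster))

    return max(clusters, key=key)
```

For example, a cluster of four cluster centers that all come from one sequence loses to a cluster of three centers drawn from three different sequences, which matches the stated preference for sequence coverage over decoy population.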
Calculation of e-values between viral proteins and their fragment libraries
Rosetta 9mer fragment libraries for the 5 viral and 3 cellular proteins were parsed and assembled
into plain FASTA-formatted files by a script. All viral sequences were likewise converted into
overlapping 9mer fragments in FASTA format. The viral 9mer fragments and the Rosetta fragment
library 9mers were then compared with ssearch version 3.5, and e-values were collected for all 16
sequences. Because the total number of fragments depends on sequence length, frequencies were
calculated relative to the total number of fragments.
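The conversion of a sequence into overlapping 9mers is simple enough to sketch. The script used in the study is not shown in the text, so the function names and the FASTA header format below are hypothetical.

```python
def overlapping_kmers(seq: str, k: int = 9) -> list:
    """All overlapping windows of length k, shifted by one residue."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmers_to_fasta(name: str, seq: str, k: int = 9) -> str:
    """FASTA text with one record per k-mer, labelled by 1-based residue range."""
    lines = []
    for i, frag in enumerate(overlapping_kmers(seq, k), start=1):
        lines.append(f">{name}_{i}-{i + k - 1}")
        lines.append(frag)
    return "\n".join(lines) + "\n" if lines else ""
```

A sequence of length L yields L - 8 fragments (48 for a 56-residue domain), which is why the frequencies have to be normalized by the total fragment count before proteins of different length are compared.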
Results
Decoys of 5 viral proteins
The decoys generated by Rosetta 2.2 for the viral proteins did not show a good correlation between
Rosetta energy and TM-score. Figure 1 shows plots of Rosetta energy score vs. TM-score for the
decoys of the 5 viral proteins. The bottom right panel is the same plot for cellular proteins (acyl
carrier protein/ACP, ubiquitin and thioredoxin). In many, though not all, cases Rosetta generates
decoy sets for such cellular proteins with a modest to very good correlation between energy score
and RMSD/TM-score (low-energy decoys are also structurally similar to the native protein).
Thioredoxin shows a very good correlation between low energy and high TM-score (as high as 0.9,
RMSD ~1 angstrom), and ubiquitin also performed well. Rosetta did not produce high TM-score
decoys for ACP, but it did produce decoys with a good correlation between energy and TM-score.
The 5 viral proteins selected for this study, on the other hand, do not show such a trend. They show
remarkably different Rosetta energy/TM-score distributions, ranging from completely uncorrelated
(HPV E2) to negatively correlated (HIV vpu; only the plot for model 1 is shown) or sequence
dependent (HPV E7). E7 in particular shows great variation in decoy populations between
sequences, as expected for such remote sequence identities even within the same protein. HPV E2,
by contrast, shows a very uniform population across the 16 sequences, although the sequence
identities are as low as 22% (mean 40%). All of its Top 10 clusters fall into a single cluster with an
average pairwise TM-score of 0.500 and an average TM-score to the cluster center of 0.552. With
such a poor correlation between energy score and TM-score, choosing the lowest-energy decoy as
the prediction would not be a good strategy for these 5 viral proteins. The overall distributions of
TM-scores are skewed toward low scores compared with the non-viral proteins. The majority of
decoys have TM-scores below 0.5, which is a rough threshold for the same fold [20]. It is thus
difficult to select a decoy with a TM-score above 0.5 (against the native structure) as the final model.

Figure 1: Energy vs. TM-score plots of decoys for 5 viral proteins. Five panels show the 5
benchmark viral proteins and the bottom right panel shows the plot for the cellular proteins
thioredoxin, ubiquitin and ACP (acyl carrier protein). For the viral proteins, the 16 most distant
sequences were used for decoy generation and are plotted in different colors. Viral proteins show
poor to reversed correlation between energy and TM-score: completely uncorrelated for the HPV E2
protein, reversed (low energy is structurally distant) for HIV vpu, and sequence dependent (relatively
small overlap among different sequences) for HPV E7. Non-viral proteins, on the other hand, show
very good (thioredoxin) to decent (ACP) correlation between energy and TM-score.
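The text judges the energy/TM-score relation from the plots and does not name a statistic. If one wanted to quantify it, a plain Pearson coefficient over the decoys of one sequence would be a reasonable first choice; the sketch below is such an illustration under that assumption, with a hypothetical function name.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)
```

Note the sign convention: with raw Rosetta energies, where lower is better, the favorable case described above (low energy with high TM-score) gives a coefficient near -1, whereas the "reversed" behavior of HIV vpu would give a positive value and an uncorrelated set such as HPV E2 a value near 0.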
Clustering of decoys
The decoys for these proteins were clustered with a hierarchical complete linkage algorithm,
because each cluster should contain only decoys that are structurally close enough (same fold) for
the consensus cluster to be captured later. A TM-score of 0.5 was therefore set as the threshold for
complete linkage clustering. The number of clusters varies to a great extent among the proteins and
sequences. At the threshold of 0.5, HPV E2 has significantly more clusters (Table 2; mean 276.5)
than HIV vpr (mean 12.5). The number of clusters for HPV E7 fluctuates greatly with sequence
(30-438, Table 2), in the same way as the energy/TM-score plots. A similar trend is observed in the
coverage of the Top 10 clusters (the sum of the members of the 10 largest clusters divided by the
number of all decoys). The threshold of 0.3 for clustering the Top 10 decoy sets was determined as
the value that gives enough cluster members for choosing a “consensus” cluster without including
too diverse a structural ensemble. The threshold of 0.5 (used for the initial clustering) produced too
many clusters with few members from the Top 10 decoy set, so no “consensus” could be reached
because each cluster covered too few of the sequences. To determine the consensus cluster, the
TM-score threshold was therefore lowered until clusters were populated by most of the sequences,
while the pairwise distances among the Top 10 decoy set and the average distance from each
cluster center stayed not far below 0.5. A TM-score threshold of 0.3 appears low, but because
clustering is performed with complete linkage, even the most distant decoys within a cluster have a
TM-score around or above 0.3. For all 5 viral proteins the distances are within the range of the same
fold: across all Top 10 decoy sets, the average pairwise TM-score is 0.49 (0.46-0.51) and the
average TM-score to the cluster center is 0.55 (0.51-0.57). We therefore conclude that these
clusters represent decoys of the same fold; the clusters with the highest sequence coverage were
taken as consensus clusters and their centers as consensus models.
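The coverage figure used in Table 2 follows directly from the cluster sizes. The helper below is a small sketch of that calculation; the function name is hypothetical.

```python
def top_n_coverage(cluster_sizes, n: int = 10) -> float:
    """Fraction of all decoys that fall in the n largest clusters."""
    total = sum(cluster_sizes)
    if total == 0:
        return 0.0
    return sum(sorted(cluster_sizes, reverse=True)[:n]) / total
```

A sequence whose decoys collapse into a few large clusters has a coverage close to 1, while a sequence whose decoys scatter over hundreds of small clusters has a low coverage, so the value is a compact measure of how concentrated each decoy population is.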
Consensus model strategy
Table 2
Clust.: number of clusters at TM-score threshold 0.5. Cover.: coverage of the Top 10 clusters.
       p24             vpr             vpu             E2              E7
Seq#   Clust.  Cover.  Clust.  Cover.  Clust.  Cover.  Clust.  Cover.  Clust.  Cover.
1      65      0.816   9       0.999   97      0.701   224     0.480   432     0.165
2      62      0.800   8       0.999   60      0.803   411     0.372   54      0.893
3      41      0.802   12      0.997   74      0.810   292     0.322   226     0.610
4      50      0.657   67      0.809   93      0.770   320     0.380   57      0.866
5      30      0.924   8       0.999   164     0.495   145     0.700   46      0.921
6      44      0.866   10      0.999   150     0.533   303     0.323   113     0.767
7      32      0.868   8       0.999   103     0.671   345     0.295   56      0.916
8      51      0.584   12      0.994   135     0.547   237     0.614   43      0.923
9      38      0.843   8       0.999   112     0.663   293     0.360   31      0.927
10     43      0.831   6       0.999   123     0.654   267     0.398   318     0.391
11     147     0.307   19      0.976   110     0.702   235     0.503   30      0.956
12     48      0.836   7       0.999   18      0.987   368     0.469   291     0.339
13     47      0.719   8       0.999   77      0.804   255     0.605   64      0.896
14     29      0.945   6       0.999   161     0.456   257     0.514   37      0.969
15     23      0.965   9       0.999   94      0.693   227     0.720   399     0.349
16     29      0.902   3       0.998   63      0.917   245     0.455   438     0.242
Ave    48.7    0.792   12.5    0.985   102.1   0.700   276.5   0.469   164.7   0.696
S.D.   28.8    0.164   14.9    0.047   39.4    0.147   64.3    0.133   157.9   0.294
  
Figure 2: Aligned predicted models (brown) and solved structures (white). The consensus models of
the 5 viral proteins are aligned to their solved structures with TMalign. In the figure, “Aligned”
indicates the residues within 5 angstroms of the solved structure. The solved HIV vpu structure is an
NMR solution structure with 9 models; model 1, which aligned best with the consensus model, is
displayed. The other 4 structures are crystal structures.
  
The centers of the consensus clusters for the 5 viral proteins were taken as the consensus models
and compared with the solved crystal/NMR structures. HIV p24 and vpr were the most successfully
predicted of the 5 proteins (Figure 2). These predictions are not spectacularly accurate (TM-scores
of 0.56 and 0.55 for p24 and vpr, respectively), but they are good enough to represent the essence
of the folds. Since the consensus model approach is designed not to reach atomic accuracy but to
capture the correct overall fold, they are quite successful. In particular, vpr has 51 of its 56 residues
aligned within 5 angstroms of the native structure. The p24 prediction is mostly well aligned, but the
very C-terminal end points in the wrong direction, which lowers the TM-score. HIV vpu is less
successful in terms of TM-score (0.48), but the overall fold/topology is well captured. The prediction
for HPV E2 was not great, but not too far off either. For HPV E7 the consensus approach did not
work well. As Figure 2 shows, the α-helix connected to the two consecutive β-strands is positioned
on the wrong side. Despite cases such as HPV E7, the overall success of the consensus strategy
becomes clear when these results are compared with the performance of conventional strategies for
selecting the best model. Except for HPV E7, for which it failed to capture the native fold, the
consensus model approach performed modestly to significantly better on the other four viral proteins
(Figure 3). HPV E2 shows minimal improvement over the largest cluster center or the lowest-energy
decoy, because the decoys generated for this protein did not cover a large structure space. Despite
its large sequence space, HPV E2 covers only a very small, relatively similar and degenerate
structure space (although with many clusters), so the three methods examined here did not show
any major differences. HPV E7 has too diverse a sequence space (identity as low as 15%, with an
average of 33%), and the structure spaces of some sequences appear not to overlap. For the
consensus cluster strategy to work, the most distant sequences need to share some degree of
structure space. HPV E7 appears too diverged in both sequence (even the length of the domain
varies from 44 to 54 residues) and structure space. Indeed, the matrix representation of pairwise
TM-scores for HPV E7 shows that some pairs of sequences do not share predicted structure space
(Figure S1), and the sequences behave as if they were different proteins. The consensus model
approach performs well, on the other hand, with reasonably variable sequence space. It should be
noted that even for HPV E7 the consensus strategy did not underperform the other strategies
(Figure 3); there was simply no improvement. Figure 4 shows the distributions of decoy TM-scores
against the native structures as histograms. Well-behaved proteins such as p24, vpr and vpu have
two major peaks in their distributions, populated to different degrees. The consensus approach
captures the populations with “native-like” structures (the “close” peaks at higher TM-score) rather
than the structures in the other, “distant” peaks. This observation is reasonable and hoped for: the
“distant” peaks are likely to be composed of multiple populations with greatly varied folds because of
the diverse sequence space, whereas “native-like” structures are under more stringent constraints to
be close to the native structure and are therefore likely to cluster together within the “close” peak.
Figure 4 also shows that even for HPV E2 the consensus model strategy captures one of the
clusters closest to native among the generated decoys, although the decoys as a whole are not
close enough to the native structure. For HPV E7 the figure makes it even clearer that some
sequences produce only decoys with structures that are too distant (some sequences have an
extremely small population above a TM-score of 0.4), so that there is no way to capture a cluster
close to the native structure. HPV E7 thus represents the limitation of this strategy, which relies on
finding the structure common to different sequences.

Figure 3: Performance of different strategies for selecting the best model from the decoy sets of 5
viral proteins. The data for HIV vpu are the average over an ensemble of 9 NMR structures; the error
bars indicate the range of the 9 data points.

Figure 4: Histograms of the TM-score distributions of all decoy sets for 5 viral proteins. Each protein
has 16 decoy sets, whose distributions were calculated and plotted independently. The solid lines
indicate the TM-scores of the consensus models and the broken lines mark a TM-score of 0.5.
Discussion
Predictive performance
The consensus model approach works well with viral proteins, although not every protein can be
predicted with a TM-score above 0.5. There are two major limitations, one technical and one in
principle. As stated in the last paragraph of the Results section, HPV E7 reveals the fundamental
limitation of this approach: when sequence space diverges too far, current protocols cannot
converge on the same structure space. This is not a limitation of the approach per se, but if the
predicted structure spaces of the sequences overlap too little or not at all, the strategy fails. It is also
related to the technical limitation. Rosetta is one of the best de novo structure prediction tools, yet it
cannot produce native-like decoys for the HIV vpu and HPV E2 proteins. It is unclear whether the
sampling space is too small, the fragment libraries are biased toward non-viral proteins, or there are
other reasons. Limited sampling should not be a severe issue for this approach. The consensus
cluster approach relies heavily on the population ensemble rather than on the rare discovery of a
decoy that is both low in energy and structurally close to native. If 10,000 decoys cannot capture a
structurally close population, it is unlikely that such a population would be captured even with a
decoy set orders of magnitude larger. The similarity calculation of overlapping 9mers generated from
the 16 sequences of each of the 5 viral proteins against the Rosetta fragment library for each protein
indicates that the frequency of low e-value fragments is significantly lower for viral proteins than for
cellular proteins, and that the available fragments are distributed evenly from low to high e-values
(Figure 5). It is likely that the low number of solved structures of viral proteins compared with other
protein classes results in a small population of fragments with low e-values. Although the availability
of low e-value fragments does not guarantee accurate predictions, the converse is likely to hold:
fragment library sequences that are distant from the actual protein are likely to have lower structural
correlation with the actual structure. These libraries had to be used for fragment assembly and are
likely to have kept the decoy structures away from the solved structures. Because the sequence
space of viral proteins is generally unique compared with other protein classes, the low number of
structures available for fragment library generation is probably to blame for the poor performance of
decoy generation for viral proteins. Taken together, the available sampling space is limited by the
lack of solved structures corresponding to viral sequence space. Another factor related to sequence
uniqueness is the secondary structure prediction used for Rosetta decoy generation. The secondary
structure of HIV vpu was not accurately predicted by any of the three methods supported by Rosetta
(PSIPRED, JUFO and SAM), and the resulting model has an extended C-terminus rather than the
actual α-helical structure. Considering the uniqueness of viral sequence and structure spaces, it is
not surprising that all three secondary structure prediction algorithms fail for some viral proteins. Had
the secondary structure been predicted accurately, the result for HIV vpu would likely have been
much more accurate, with a TM-score well over 0.5.

Figure 5: E-value distributions between fragment libraries and target sequences. The 9mer
sequences from the Rosetta fragment libraries and from the target cellular and viral sequences were
compared using ssearch v3.5. Viral sequences show a significantly lower frequency of low e-value
fragments and a mostly uniform distribution over the whole range of e-values. Note that because the
sequences are short 9mers, an e-value over 1 does not necessarily mean no correlation; with
frequently observed residues, an e-value over 5 can correspond to a 4-5 residue match. Viral
proteins give smoother curves because they have 16x more sampling than cellular proteins.
An attempt to improve performance further by combining the “consensus” approach with sampling of
low Rosetta energy decoys was unsuccessful. In 2 cases (E2 and E7) filtering the decoys by energy
score improved the results, but in the other 3 cases (p24, vpr and vpu) filtering actually worsened
them (Figure S2). For the HPV proteins, Rosetta generated a population of lower-energy decoys that
is closer to native than the overall ensemble. Unfortunately, the opposite is true for the HIV proteins.
It is puzzling, and not known, why Rosetta energy and TM-score are uncorrelated for viral proteins.
Nonetheless, because either viral proteins do not have their lowest energy state at the native
structure or the Rosetta energy function does not represent the energy of native viral proteins, the
attempt to improve results by taking low-energy decoys failed. Although the full-atom relaxation step
was not performed, it is unlikely that the Rosetta energy scores would change dramatically with
full-atom relaxation. The frequently used strategy of taking the center of the largest cluster also did
not work as well as the consensus strategy. The largest cluster does not necessarily represent the
decoy population with the smallest distance from the native structure (it did so in 7, 6, 5, 1 and 2
cases out of 16 for p24, vpr, vpu, E2 and E7, respectively). It is therefore also difficult to determine
the correct fold by selecting the center of the largest cluster. And again, there is the problem of
choosing among many different sequence variants. The consensus strategy, on the other hand, took
advantage of sequence diversity (mean sequence identities ranging from 33 to 79%) and
successfully captured the cluster structurally closest to the native fold, even when the overall decoy
population was not very close to native (Figure 4). The initial assumption that short viral sequence
fragments should overlap sufficiently with cellular protein sequence space appears not to hold; it is
probable that even 9mer fragments of viral sequences are unique compared with cellular sequences,
as indicated by the significantly lower number of low e-value fragments and the abundance of high
e-value fragments observed in the fragment library (Figure 5).
Usefulness
This approach was developed mainly for the structure prediction of small viral proteins for
antigen/immunogen design. The strategy captures the same or a close fold (TM-score around or
above 0.45) for 4 out of 5 proteins. For this purpose the result is a success: the predicted model can
be used to evaluate properties of a designed antigen/immunogen, such as mapping of epitopes and
immunodominant regions. For these purposes atomic-level accuracy is desirable but not necessary.
In fact, this approach has been used to predict the structure of a small protein-based immunogen for
an HIV-1 vaccine, and it generated quite useful information for interpreting the immunological and
biochemical data obtained in antigenic and immunogenic studies. Many biochemical data can be
interpreted with a low-resolution structure model (correct fold) instead of a high-resolution atomic
model, for applications such as function or protein family inference [26]. This approach can thus
produce models that are useful in many areas of biology, although in principle it cannot produce
models of atomic accuracy.
The limitation of the approach is clearly illustrated by the HPV E7 protein. This is a rare case,
however: sequence identities among HPV E7 C-terminal domains are as low as 15%. At this level of
sequence diversity the consensus approach breaks down. It is also possible that the difference in
sequence length (44 to 54 residues) affected the results. Proteins with such sequence diversity are
not common, and the E7 protein has a tandem CXXC motif as a metal coordination site, which was
not included as a constraint but significantly limits the possible architectures/topologies. Moreover,
its C-terminal domain may serve only as a scaffold (for dimer formation) for the unstructured
N-terminal domain, whose LXCXE motif for pRb binding is necessary and sufficient for function. It is
therefore possible that the structure of the domain is not strictly conserved, since it may not be
essential for function, and that Rosetta’s inability to generate decoys of the same fold for different
sequences reflects actual structural variability.
This strategy can also use phylogenetic information (as sequence diversity) among related species
instead of genetic variability within a single species. Proteins with known orthologs in diverse
species (with the sequence identity threshold chosen according to the sequence diversity) can be
used in an identical manner to capture the fold of their protein family. How far the approach can be
extended in this way needs to be tested, since the case of HPV E7 revealed the limitation of the
strategy for too diverse a sequence space (and potentially dissimilar/diverse structures between
orthologs).
Processing time
Decoy generation was not the most time-consuming step in this study. Generating 10,000 decoys
took less than a day with a single Rosetta thread, since the computationally expensive full-atom
relaxation step was omitted. Because each workstation has 8 cores with hyperthreading, the decoy
sets for one protein can be generated in a day (16 Rosetta threads running simultaneously). The
most time-consuming step was structural alignment with TMalign. To speed it up, TMalign’s Fortran
source code was ported to C and optimized for speed. In-house code was written with the ported
TMalign inside the loop, specifically because running the pairwise calculation from an interpreter
such as a shell script is very slow, with loop execution taking as much time as the score calculations
themselves. Using the Intel compiler with the best optimization switches improved the speed, but
TMalign is a computationally expensive algorithm and therefore remains slow. It was chosen despite
its speed because it gives a better measure for judging whether two structures share the same fold.
Since the number of pairwise score calculations is of order n^2 in the number of decoys, 10,000
decoys require about 50 million TM-score calculations. For a single sequence this took about a week
with a single thread. Each protein, with its 16 sequences, can thus be finished in about a week.
Clustering was quite quick (~30 min) compared with the previous two steps, although its high
memory requirement (~5 GB per calculation) limits the number of simultaneous calculations on one
machine. Overall, a single protein took about a week of calculation time on a single Mac Pro with
dual quad-core 2.66 GHz Xeon X5500 processors. With a modern mid-sized cluster, this calculation
should not take more than a day or two. The approach is “redundant” in the number of decoys
generated per protein, but the computational power of current workstations already reduces it to
days of calculation on a single workstation, so it can be used routinely without investing in expensive
resources.
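The figure of about 50 million comparisons can be checked with a few lines of arithmetic. The per-comparison time below is not a measured value; it is only what the stated "about a week" on a single thread would imply, and the function name is hypothetical.

```python
def n_pairs(n: int) -> int:
    """Number of unordered pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2


decoys_per_sequence = 10_000
sequences_per_protein = 16

pairs_per_sequence = n_pairs(decoys_per_sequence)            # 49,995,000
pairs_per_protein = pairs_per_sequence * sequences_per_protein

# Implied average cost of one comparison if one sequence takes about a week
# on one thread (a derived estimate, not a benchmark).
seconds_per_week = 7 * 24 * 3600
implied_seconds_per_pair = seconds_per_week / pairs_per_sequence

print(pairs_per_sequence, pairs_per_protein, round(implied_seconds_per_pair, 4))
```

The quadratic growth is the reason this step, rather than decoy generation, dominates the run time: doubling the number of decoys per sequence would roughly quadruple the number of comparisons.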
Figure S1: Similarity maps of sequences used for the study
The sequences shown are the maximum-distance pairs from the multiple alignments and were used for
consensus selection. In the maps, the colors Red > Green > Blue indicate the similarity of sequence
pairs and of the cluster centers of their decoys. Some sequences share very low similarity with the
other sequences although they are the same protein from the same species (but different strains).
In these cases, selecting a consensus is not easy because the clusters do not share the same folds
across their sequence spaces (cluster centers do not overlap among the populous clusters).
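The notion of cluster centers "overlapping" across sequences can be illustrated with a small sketch. This is not the author's selection procedure: the data layout, the function name, and the toy scores are invented for illustration, and the cutoff of 0.5 is an assumption based on the common reading of TM-score (values above about 0.5 suggest the same fold, see reference 20).

```python
from typing import Dict, FrozenSet, List, Tuple

# (sequence id, cluster rank): hypothetical labels for cluster centers.
Center = Tuple[str, int]


def consensus_support(centers: List[Center],
                      scores: Dict[FrozenSet[Center], float],
                      cutoff: float = 0.5) -> Dict[Center, int]:
    """For each center, count the OTHER sequences that have at least one
    cluster center with TM-score >= cutoff to it. Missing pairs count as 0."""
    support: Dict[Center, int] = {}
    for center in centers:
        matched = set()
        for other in centers:
            if other[0] == center[0]:
                continue  # skip centers from the same sequence
            if scores.get(frozenset((center, other)), 0.0) >= cutoff:
                matched.add(other[0])
        support[center] = len(matched)
    return support


if __name__ == "__main__":
    # Toy values, invented purely to exercise the function.
    centers = [("A", 1), ("A", 2), ("B", 1), ("C", 1)]
    scores = {
        frozenset((("A", 1), ("B", 1))): 0.62,
        frozenset((("A", 1), ("C", 1))): 0.55,
        frozenset((("A", 2), ("B", 1))): 0.31,
        frozenset((("B", 1), ("C", 1))): 0.58,
    }
    print(consensus_support(centers, scores))
```

A center supported by many other sequences would be a natural consensus candidate; when no center gathers support, as in the low-similarity cases described in the caption, no consensus emerges.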
Figure S2: The effect of low energy score filtering
Filtering decoys by low energy score does not work very well, as shown in the figure. In some
cases it works, and in others it actually worsens the prediction results. For the HPV proteins
(E2/E7), high-energy decoys actually occupy more ‘native-like’ structures, so filtering out this
population worsens the results. This is a problem of the decoy generation step, and the consensus
approach cannot solve it because it relies on Rosetta to generate the decoys. The main cause
appears to be poor sampling of structures for viral proteins (HPV sequences are very unique and
unlikely to have a good representation of the sequence/structure mapping in the library). At the
same time, it remains a concern whether the sequence/structure correlation reflects “biophysical
features” or “evolutionary features”, since a sparse sequence space does not necessarily always
mean a poor match to the library. In this study, HPV and HIV proteins are used as test cases. A
diverse sequence space with little overlap between sequence spaces (even within the same protein
in the same species) complicates the situation somewhat. The matching of sequence to structure may
be partly an “evolutionary memory” rather than completely “biophysically” determined; more likely,
fragments with independent sequence spaces share the same structure space in some cases (HIV
appears to share, but HPV does not).
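For concreteness, a low energy score filter of the kind evaluated in Figure S2 might look like the sketch below. The function name, the (name, score) tuple layout, and the kept fraction are assumptions for illustration; the text does not state the cutoff that was actually used.

```python
from typing import List, Tuple

# (decoy name, energy score): hypothetical layout, lower score = lower energy.
Decoy = Tuple[str, float]


def keep_lowest_energy(decoys: List[Decoy], keep_fraction: float) -> List[Decoy]:
    """Return the keep_fraction of decoys with the lowest energy scores,
    sorted from lowest to highest score."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(decoys, key=lambda decoy: decoy[1])
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    decoys = [("d1", -120.0), ("d2", -95.5), ("d3", -130.2), ("d4", -80.1)]
    print(keep_lowest_energy(decoys, 0.5))
```

As noted above for HPV E2/E7, such a filter silently discards the high-energy decoys, so it hurts whenever the native-like structures sit in that population.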
References
1. Ho J, Uger RA, Zwick MB, Luscher MA, Barber BH, et al. (2005) Conformational constraints imposed on a pan-neutralizing HIV-1 antibody epitope result in increased antigenicity but not neutralizing response. Vaccine 23: 1559-1573.
2. Ofek G, Guenaga FJ, Schief WR, Skinner J, Baker D, et al. (2010) Elicitation of structure-specific antibodies by epitope scaffolds. Proc Natl Acad Sci U S A 107: 17880-17887.
3. Penn-Nicholson A, Han DP, Kim SJ, Park H, Ansari R, et al. (2008) Assessment of antibody responses against gp41 in HIV-1-infected patients using soluble gp41 fusion proteins and peptides derived from M group consensus envelope. Virology 372: 442-456.
4. Brussow H (2009) The not so universal tree of life or the place of viruses in the living world. Philos Trans R Soc Lond B Biol Sci 364: 2263-2274.
5. Edwards RA, Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504-510.
6. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368.
7. McBurney SP, Ross TM (2008) Viral sequence diversity: challenges for AIDS vaccine designs. Expert Rev Vaccines 7: 1405-1417.
8. Palmenberg AC, Rathe JA, Liggett SB (2010) Analysis of the complete genome sequences of human rhinovirus. J Allergy Clin Immunol 125: 1190-1199; quiz 1200-1191.
9. Chung SY, Subbiah S (1996) A structural explanation for the twilight zone of protein sequence homology. Structure 4: 1123-1127.
10. Bamford DH (2003) Do viruses form lineages across different domains of life? Res Microbiol 154: 231-236.
11. Bystroff C, Baker D (1998) Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 281: 565-577.
12. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234: 779-815.
13. Ozkan SB, Wu GA, Chodera JD, Dill KA (2007) Protein folding by zipping and assembly. Proc Natl Acad Sci U S A 104: 11987-11992.
14. Das R, Baker D (2008) Macromolecular Modeling with Rosetta. Annu Rev Biochem.
15. Roy A, Kucukural A, Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc 5: 725-738.
16. Bonneau R, Strauss CE, Baker D (2001) Improving the performance of Rosetta using multiple sequence alignment information and global measures of hydrophobic core formation. Proteins 43: 1-11.
17. DeBartolo J, Hocky G, Wilde M, Xu J, Freed KF, et al. (2010) Protein structure prediction enhanced with evolutionary diversity: SPEED. Protein Sci 19: 520-534.
18. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57: 702-710.
19. Zhang Y, Skolnick J (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33: 2302-2309.
20. Xu J, Zhang Y (2010) How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 26: 889-895.
21. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2008) GenBank. Nucleic Acids Res 36: D25-30.
22. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947-2948.
23. Notredame C, Higgins DG, Heringa J (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302: 205-217.
24. de Hoon MJ, Imoto S, Nolan J, Miyano S (2004) Open source clustering software. Bioinformatics 20: 1453-1454.
25. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95: 14863-14868.
26. Zhang Y (2009) Protein structure prediction: when is it useful? Curr Opin Struct Biol 19: 145-155.

Viral Protein Structure Predictions - Consensus Strategy

  • 1. New strategy for the viral protein structure predictions: “Consensus model” approach to take advantage of sequence diversity
Introduction
Viral surface proteins are good targets for vaccine development. They are the major targets of neutralizing antibodies against viruses, but elicitation of broadly reactive neutralizing antibodies (brNAb) has proved to be difficult. Structural information on these viral surface proteins is extremely important for the rational design of immunogens targeting these viruses, in order to understand the conserved surface residues. Although drug targets such as viral proteases and reverse transcriptases are well studied by crystallographic or NMR structure determination in order to design effective drugs against these viruses, surface proteins are not studied as much as drug target proteins. Some well-studied surface proteins such as HIV gp120 and influenza hemagglutinin show significant sequence diversity and have been shown to be challenging targets for elicitation of brNAb. The peptide-based immunogen and epitope scaffold-based approaches utilizing known brNAb epitopes have not succeeded in eliciting brNAb against HIV-1 and influenza so far [1,2], and the use of a minimal-sized rigid protein containing the conserved sequence region in its native structural context would be necessary [3]. In order to rationally design such an immunogen, structural information for the region of the protein that can fold independently is crucial. With insufficient solved structures, we need to rely on computational structure prediction for these proteins. The structure prediction of viral proteins is not an easy task. This class of proteins does not share high sequence similarity with known cellular proteins [4,5,6], and solved structures of viral proteins are sparse compared to other protein families. Thus, it is a poor target for the comparative modeling strategy.
In addition, the sequence diversity within the same protein is extremely high [7,8] because of the high mutation rates that viruses use for immune evasion, which makes the protein a moving target for structure prediction (the choice of sequence may affect the outcome). Some proteins show sequence identities below the “twilight zone” [9] among versions of the same protein from different strains of the same species. In these cases, it is not even clear what the principle for choosing the target sequence for prediction should be, as even a consensus sequence is just another variation. Thus, these proteins are also difficult targets for de novo prediction. In addition to the extensive diversity within the same protein, the sequence space of viral proteins is vast yet does not overlap enough with the “cellular” (prokaryotic/archaeal/eukaryotic) sequence spaces [5,6,10]. Thus, predicting the protein structure with knowledge-based algorithms [11,12] appears to be difficult for both comparative and fragment assembly-based free modeling strategies. As completely physics-based modeling is still not feasible or practical at this time [13], we need to use currently available algorithms [14,15] to predict the structure of viral proteins. Although the overall sequence spaces of cellular and viral proteins may not overlap very much and are unique to each other except for some genes exchanged between the two, the fragment library still contains short “viral” sequences (such as 9mers for the Rosetta fragment library) to some degree, even though it is assembled from a database composed mostly of cellular proteins. It is assumed that, due to physicochemical constraints, the same sequence in cellular and viral proteins should share a similar ensemble of structures as fragments. For this reason, physics-based fragment assembly algorithms such as Rosetta should be able to produce decoys with the same fold as the native protein, probably with low frequency.
The strategy for computational structure prediction of these viral proteins was developed using a simple principle in order to address the difficulty associated with the high diversity and uniqueness of the viral sequence space. Despite genetic distance, the same protein must retain its function, and thus the same structure (similar enough to keep functionality intact). Although the C-terminal domain of the human papillomavirus E7 protein shares as little as 15% sequence identity among strains (data not shown), all variants are functional proteins from isolates. Thus, in principle, genetically distant sequences should share the same fold, even if each of the assembled sequences generates a considerably different decoy population. By incorporating the farthest sequence pairs among the sequences of the same protein in order to minimize the overlap between the structure spaces belonging to each sequence, this strategy is designed to capture the common fold populated by the maximum number of sequences.
  • 2. It may be neither the most abundantly populated member nor the lowest-energy structure, but it is common across the whole sequence space, and it is the “consensus model”. Multiple sequence alignment information has been used to identify hydrophobic core-forming residues or to improve secondary/tertiary structure predictions, with reported improvement of the predicted structures [16,17], but the direct use of different sequences of the same protein has not been reported, apparently because sequence diversity is negligible for cellular proteins. Utilizing Rosetta for decoy generation, 5 small viral proteins with solved structures were used as benchmark proteins for the evaluation of this strategy. The increase in workstation computational power allows such a “redundant” approach on a manageable time scale with affordable resources. A total of 800,000 decoys (10,000x16x5) were generated, and the decoys were clustered by their pairwise TM-score [18] using just two 8-core workstations. For this approach, we chose the computationally expensive TM-score instead of RMSD (this is the most time-consuming step of the whole process) in order to capture the overall “fold” [19,20] rather than atomic details. In this report, we first show the results for the Rosetta-generated decoys and the TM-score/energy score relations of these benchmark viral proteins, and then show the performance of the “consensus” approach.
Methods
Selection of benchmark proteins
Five small viral proteins with known structures and varying degrees of sequence diversity were chosen for this study to test the “consensus” approach. The criteria for selecting target proteins were availability of a solved structure, sequence length below 100 residues, and mildly to extremely diverse sequences. We chose the HIV p24 C-terminal dimerization domain, vpr and vpu cytoplasmic domain, and the HPV E2 DNA binding and E7 C-terminal domains, which match the criteria defined above.
The 16 sequences for each protein were chosen to be the most distant, both pairwise and on average. Pairwise sequence identity among the target sequences ranges from 15 to 92%, and the mean pairwise identity ranges from 33 to 79% among sequences of the same protein. Table 1 lists the sequence statistics of the target proteins. Sequences were obtained from the Los Alamos HIV database (http://www.hiv.lanl.gov/) (HIV p24/vpr/vpu) or manually collected from GenBank [21] (HPV E2/E7) and subjected to multiple alignment by Clustal W [22] and T-Coffee [23]. Sequences were then trimmed down to the domains with structure assessments, and N- or C-terminal regions that are flexible in the solution structure were removed from the target sequences.
Decoy generation, calculation of pairwise TM-scores and clustering
Sets of 10,000 decoys were generated for each sequence using Rosetta 2.2 without the full-atom relaxation step. Each decoy dataset was then subjected to the calculation of pairwise TM-scores for all decoy combinations. TM-scores were calculated by an in-house program written in C (with the TM-align source code ported to C). The decoys were then clustered by decoy-decoy pairwise TM-scores. Clustering was done with an in-house program based on the algorithm from the Cluster 3.0 software source code [24,25], using complete linkage clustering with a TM-score threshold of 0.5. The clusters were further examined for the distances among all members of each pair of clusters. If all pairwise scores among the members were above the threshold, the clusters were merged. TM-scores against the solved structures (the structures used as native are the following: p24:2JYL, vpr:1M8L, vpu:1VPU, E2:1A7G, E7:2B9D) were calculated for all decoys (1VPU has 9 models in the PDB file, so all 9 models were used for the calculation), and Rosetta energy scores were extracted from the Rosetta silent files. All calculations were executed on two Mac Pro workstations with 2.66 GHz dual quad-core Xeon processors running Mac OS X 10.6, and all programs were compiled with the Intel compiler v11.1 with full optimization flags.
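As an illustration of the clustering step described above, the following is a minimal sketch of complete linkage clustering on a similarity scale such as TM-score. It is not the in-house program based on Cluster 3.0; the function names, the toy similarity matrix, and the definition of the cluster center as the member with the highest mean similarity to the other members are assumptions made for this example. The naive search over cluster pairs is only practical for small inputs, not for 10,000 decoys.

```python
from itertools import combinations


def complete_linkage(sim, threshold):
    """Cluster items 0..n-1 from a symmetric similarity matrix `sim`.

    Two clusters merge only while the weakest similarity between their
    members is still at or above `threshold` (complete linkage).
    """
    clusters = [[i] for i in range(len(sim))]

    def linkage(a, b):
        # Complete linkage on a similarity scale: the lowest cross-pair score.
        return min(sim[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Pick the pair of clusters whose weakest link is the strongest.
        (x, y), best = max(
            (((p, q), linkage(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters


def cluster_center(sim, members):
    """Assumed definition: member with the highest mean similarity to the rest."""
    if len(members) == 1:
        return members[0]
    return max(
        members,
        key=lambda i: sum(sim[i][j] for j in members if j != i) / (len(members) - 1),
    )


# Toy pairwise scores for four hypothetical decoys (not data from the study).
toy = [
    [1.00, 0.70, 0.60, 0.20],
    [0.70, 1.00, 0.55, 0.10],
    [0.60, 0.55, 1.00, 0.30],
    [0.20, 0.10, 0.30, 1.00],
]
clusters = complete_linkage(toy, threshold=0.5)
center = cluster_center(toy, clusters[0])
```

With the toy matrix, decoys 0, 1 and 2 end up in one cluster because every pair among them scores at least 0.5, while decoy 3 stays alone.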
Table 1: Sequence statistics of target proteins
Protein  Length  Sequence ID (%)
                 Mean   Max   Min
p24      70      79.3   92    62
vpr      56      72.4   89    50
vpu      81      57.7   82    41
E2       84      40.1   90    22
E7       44-54   33.3   64    15
Calculation of consensus cluster
  • 3. The top 10 largest clusters for each sequence were taken, and their cluster center decoys were assembled into the top 10 decoy set for each protein. The top 10 decoy set was then subjected to the calculation of pairwise TM-scores and clustered again with a TM-score threshold of 0.3. The largest cluster of the top 10 decoy set was assigned as the consensus cluster, and its cluster center was taken as the consensus model. When a cluster other than the largest had the highest sequence coverage, that cluster was considered the consensus cluster, in order to cover the larger sequence space rather than the larger decoy population.
Calculation of e-values between viral proteins and their fragment libraries
Rosetta 9mer fragment libraries for 5 viral and 3 cellular proteins were parsed and assembled into plain FASTA-formatted files by a script. All viral sequences were also converted to overlapping 9mer fragments in FASTA-formatted files. The viral 9mer fragments and the Rosetta fragment library 9mers were then compared by ssearch version 3.5, and e-values were assembled for all 16 sequences. Because the total number of fragments varies with sequence length, frequencies were calculated relative to the total number of fragments.
Results
Decoys of 5 viral proteins
The decoys generated for the viral proteins by Rosetta 2.2 did not show a good correlation between Rosetta energy and TM-score. Figure 1 shows Rosetta energy score vs. TM-score plots of the decoys of the 5 viral proteins. The bottom right panel is the same plot for cellular proteins (acyl carrier protein/ACP, ubiquitin and thioredoxin). In many, though not all, cases for these cellular proteins, Rosetta can generate a decoy set with modest to very good correlation between energy score and RMSD/TM-score (low-energy decoys are also structurally similar to the native protein). Thioredoxin has a very good correlation between low energy and high TM-score (as high as 0.9, RMSD ~1 angstrom), and ubiquitin also performed well.
Rosetta did not produce high TM-score decoys for ACP but did produce decoys with a good correlation between energy and TM-score. On the other hand, the 5 viral proteins selected for this study do not show such a trend. They show remarkably different Rosetta energy/TM-score distributions, ranging from completely uncorrelated (HPV E2) to negatively correlated (HIV vpu; only the plot for model 1 is shown) or sequence dependent (HPV E7). In particular, E7 shows great variation in decoy populations between sequences, as expected with such remote sequence identities even within the same protein. In contrast, HPV E2 shows a very uniform population among the 16 sequences although the sequence identities are as low as 22% (mean 40%). All of its top 10 cluster centers belong to a single cluster, with an average pairwise distance of 0.500 and an average distance from the cluster center of 0.552.
Figure 1: Energy vs. TM-score plot of decoys for 5 viral proteins. Five panels show the 5 benchmark viral proteins, and the bottom right panel shows the plot for the cellular proteins thioredoxin, ubiquitin and ACP (acyl carrier protein). For viral proteins, the 16 most distant sequences were used for decoy generation and are plotted in different colors. Viral proteins show poor to reversed correlation between energy and TM-score, such as the completely uncorrelated HPV E2 protein, the reversed correlation (low energy is structurally distant) of HIV vpu, or the sequence dependence (relatively small overlap among different sequences) of HPV E7. On the other hand, non-viral proteins show very good (thioredoxin) to decent (ACP) correlation between energy and TM-scores.
With this poor correlation between energy score and TM-score, choosing the lowest-energy decoy for prediction would not be a good strategy for these 5 viral proteins. The overall distributions of TM-scores are skewed toward low scores compared to non-viral proteins. The majority of decoys have TM-scores below 0.5,
  • 4. which is a rough threshold for the same fold [20]. Thus, it is difficult to select a decoy with a TM-score above 0.5 (against the native structure) as the final model.
Clustering of decoys
The clustering of decoys for these proteins was performed with the hierarchical complete linkage clustering algorithm, since each cluster should contain only decoys that are structurally close enough (same fold) for the consensus cluster to be captured later. Therefore, a TM-score of 0.5 was set as the threshold for complete linkage clustering. The number of clusters varies to a great extent among the proteins and sequences. HPV E2 has significantly more clusters (Table 2) at a threshold value of 0.5 (mean 276.5) than HIV vpr (mean 12.5). The number of clusters for HPV E7 fluctuates greatly between sequences (30-438 in Table 2), in the same way as the energy/TM-score plots. A similar trend is also observed in the coverage of the top 10 clusters (the sum of the top 10 cluster members divided by all decoys). The threshold value of 0.3 for clustering the top 10 decoys was determined as the value that gives enough cluster members for choosing the “consensus” cluster without including too diverse a structural ensemble. The threshold value of 0.5 (used for the initial clustering) produced too many clusters with small numbers of members in the top 10 decoy set and could not reach a “consensus”, as each cluster had too incomplete a coverage of the sequences. In order to determine the consensus cluster, the TM-score threshold was lowered to yield enough cluster members populated by most of the sequences, while the pairwise distance among the top 10 decoy set and the average distance from each cluster center remained not far below 0.5. The threshold TM-score of 0.3 appears low, but because clustering is performed with the complete linkage algorithm, even the most distant decoys in a cluster have a TM-score around or above 0.3.
For all 5 viral proteins, the distances are within the range of the same fold: over all top 10 decoy sets, the average pairwise TM-score is 0.49 (0.46-0.51) and the average distance from the cluster center is 0.55 (0.51-0.57). Thus, it was concluded that these clusters are representative of decoys in the same fold; the cluster with the highest sequence coverage was taken as the consensus cluster and its center as the consensus model.
Table 2: Number of clusters (Clust) and coverage of the top 10 clusters (cover.) for each sequence
       p24            vpr            vpu            E2             E7
Seq#   Clust  cover.  Clust  cover.  Clust  cover.  Clust  cover.  Clust  cover.
1      65     0.816   9      0.999   97     0.701   224    0.480   432    0.165
2      62     0.800   8      0.999   60     0.803   411    0.372   54     0.893
3      41     0.802   12     0.997   74     0.810   292    0.322   226    0.610
4      50     0.657   67     0.809   93     0.770   320    0.380   57     0.866
5      30     0.924   8      0.999   164    0.495   145    0.700   46     0.921
6      44     0.866   10     0.999   150    0.533   303    0.323   113    0.767
7      32     0.868   8      0.999   103    0.671   345    0.295   56     0.916
8      51     0.584   12     0.994   135    0.547   237    0.614   43     0.923
9      38     0.843   8      0.999   112    0.663   293    0.360   31     0.927
10     43     0.831   6      0.999   123    0.654   267    0.398   318    0.391
11     147    0.307   19     0.976   110    0.702   235    0.503   30     0.956
12     48     0.836   7      0.999   18     0.987   368    0.469   291    0.339
13     47     0.719   8      0.999   77     0.804   255    0.605   64     0.896
14     29     0.945   6      0.999   161    0.456   257    0.514   37     0.969
15     23     0.965   9      0.999   94     0.693   227    0.720   399    0.349
16     29     0.902   3      0.998   63     0.917   245    0.455   438    0.242
Ave    48.7   0.792   12.5   0.985   102.1  0.700   276.5  0.469   164.7  0.696
S.D.   28.8   0.164   14.9   0.047   39.4   0.147   64.3   0.133   157.9  0.294
Consensus model strategy
Figure 2: Aligned predicted models (brown) and solved structures (white). The consensus models of the 5 viral proteins are aligned to their solved structures by TMalign. In the figure, “Aligned” indicates the residues within 5 angstroms of the solved structures. The solved HIV vpu structure is an NMR solution structure with 9 models; model 1, which aligned best with the consensus model, is displayed. The other 4 structures are crystal structures.
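The selection rule for the consensus cluster (prefer the cluster that covers the most sequences over the one that merely holds the most decoys) can be sketched as below. This is only an illustration under stated assumptions: the function name, the labeling of each cluster member as a (sequence_id, decoy_id) pair, and the use of cluster size as a tie-breaker are choices made here, not details given in the text.

```python
def pick_consensus_cluster(clusters):
    """Return the cluster covering the most distinct sequences.

    `clusters` is a list of clusters; each member is a
    (sequence_id, decoy_id) pair. Ties in sequence coverage are
    broken by cluster size (an assumption of this sketch).
    """
    def coverage(cluster):
        return len({sequence_id for sequence_id, _ in cluster})

    return max(clusters, key=lambda c: (coverage(c), len(c)))


# Hypothetical reclustered top 10 decoy set for three sequences.
reclustered = [
    # Largest cluster, but populated by only two sequences.
    [("seq1", "a"), ("seq1", "b"), ("seq1", "c"), ("seq2", "d")],
    # Smaller cluster that contains decoys from all three sequences.
    [("seq1", "e"), ("seq2", "f"), ("seq3", "g")],
    [("seq3", "h")],
]
consensus = pick_consensus_cluster(reclustered)
```

Here the second cluster is selected even though the first one is larger, which mirrors the stated aim of covering sequence space rather than decoy population.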
  • 5. The centers of the consensus clusters for the 5 viral proteins are taken as consensus models and examined in comparison with the solved crystal/NMR structures. HIV p24 and vpr were the most successfully predicted of the 5 proteins (Figure 2). Although these predictions are not spectacularly accurate (TM-scores of 0.56 and 0.55 for p24 and vpr, respectively), they are good enough to represent the essence of these folds. Since the consensus model approach is designed not to reach atomic accuracy but to capture the correct overall fold, they are quite successful. In particular, vpr has 51 out of 56 residues aligned within 5 angstroms of the native structure. P24 has a mostly well-aligned prediction, but the very C-terminal end is placed in the wrong direction, lowering the TM-score. HIV vpu is less successful in terms of TM-score (0.48), but the overall fold/topology is well captured. The prediction for HPV E2 was not great, but not too far off either. For HPV E7, the consensus approach did not work well. As Figure 2 indicates, the α-helix connected to the two consecutive β-sheets is positioned on the wrong side. Although there are cases such as HPV E7, the overall success of the consensus strategy becomes clear when these results are compared with the performance of conventional strategies for selecting the best model. Except for HPV E7, for which the native fold was not captured, the consensus model approach performed modestly to significantly better on the other four viral proteins (Figure 3). HPV E2 shows minimal improvement compared with the largest cluster center or the lowest-energy decoy, since the decoys generated for this protein did not cover a large structure space. The sequences of HPV E2 cover only a very small, relatively similar and degenerate structure space (although with many clusters) despite the large sequence space, and thus the three methods examined here did not show any major differences.
HPV E7 has too diverse a sequence space (as low as 15% identity, with an average of 33%), and it appears that the structure spaces of some sequences do not overlap. For the consensus cluster strategy to work, the most distant sequences need to share some degree of structure space. HPV E7 appears too diverged in both sequence (even the length of the domain varies from 44 to 54 residues) and structure space. In fact, the matrix representation of pairwise TM-scores for HPV E7 indicates that some pairs of sequences do not share predicted structure space (Figure S1), and the sequences behave as if they were different proteins. On the other hand, the consensus model approach performs well with a reasonably variable sequence space. It should be noted that even on HPV E7 the consensus strategy did not underperform in comparison to the other strategies (Figure 3); there was simply no improvement.
Figure 3: The performance of different strategies for best model selection from the decoy sets of the 5 viral proteins. The data for HIV vpu are averages over an ensemble of 9 NMR structures. The error bars indicate the range of the 9 data points.
Figure 4: Histograms of the TM-score distributions of all decoy sets for the 5 viral proteins. Each protein has 16 decoy sets, and the distributions are calculated and plotted independently. The solid lines indicate the TM-scores of the consensus models, and the broken lines mark a TM-score of 0.5.
Figure 4 shows the distributions of decoy TM-scores against the native structures as histograms. Well-behaved proteins such as p24, vpr or vpu have two major peaks in their distributions, with different degrees of population. The consensus approach captures the populations with “native-like” structures (the “close” peaks at higher TM-score) rather than the structures in the other, “distant” peaks. This observation is reasonable and hoped for, as “distant” peaks are likely to be composed of multiple populations with greatly varied folds due to the diverse sequence space, whereas “native-like” structures are under more stringent constraints to be close to the native structure and are likely to be clustered together
  • 6. within the “close” peak. Figure 4 also indicates that even for HPV E2, the consensus model strategy captures one of the clusters closest to native among the generated decoys, although the decoys as a whole are not close enough to the native structure. In the case of HPV E7, it is even clearer in this figure that some sequences produce only decoys with too distant structures (some sequences have an extremely small population beyond a TM-score of 0.4), so there is no way to capture a cluster close to the native structure. Thus, HPV E7 represents the limitation of this strategy, which relies on finding the common structure among different sequences.
Discussion
Predictive performance
The consensus model approach works well with viral proteins, although not every protein can be predicted with a TM-score above 0.5. There are two major limitations: one is technical and the other is a matter of principle. As stated in the last paragraph of the Results section, HPV E7 reveals the fundamental limitation of this approach. When the sequence space diverges too far, current protocols cannot converge into the same structure space. Divergence is not the limitation per se, but if the predicted structure spaces have too small an overlap, or none, among the sequences in the sequence space, the strategy fails. This is also related to the technical limitation. Rosetta is one of the best de novo structure prediction tools. Still, it cannot produce native-like decoys for the HIV vpu and HPV E2 proteins. It is unclear whether the sampling space is too small, the fragment libraries are biased toward non-viral proteins, or there are other reasons. A limited sampling space should not be a severe issue for this approach. The consensus cluster approach relies heavily on the population ensemble rather than on the rare discovery of a decoy that is both low in energy and structurally close to native. If 10,000 decoys cannot capture a structurally close population, it is unlikely that such a population would be captured even with a decoy set orders of magnitude larger.
The similarity calculation of overlapping 9mers generated from the 16 sequences of each of the 5 viral proteins against the Rosetta fragment library for each protein indicates that the frequency of low e-value fragments for viral proteins is significantly lower than that for cellular proteins, and that the available fragments are distributed evenly from low to high e-values (Figure 5).
Figure 5: E-value distribution between fragment library and target sequences. The 9mer sequences from the Rosetta fragment library and the target cellular and viral sequences are compared using ssearch v3.5. Viral sequences show significantly fewer low e-value sequences and a mostly uniform distribution over the whole range of e-values. Note that because the sequences are short 9mers, an e-value over 1 does not necessarily mean no correlation. With frequently observed residues, an e-value over 5 can correspond to a 4~5 residue match. Viral proteins show smoother graphs as they have 16x more sampling than cellular proteins.
It is likely that the low number of solved structures
  • 7. of viral proteins compared with other protein classes results in a small population of fragments with low e-values. Although the availability of low e-value fragments does not necessarily guarantee accurate predictions, the opposite would be true: fragment library sequences that are distant from the actual protein are likely to have lower structural correlation with the actual structures. These libraries had to be used for fragment assembly and are likely to keep decoy structures away from the solved structures. As the sequence space of viral proteins is generally unique compared to other protein classes, the low number of structures available for fragment library generation is likely to blame for the poor performance in decoy generation for viral proteins. Taken together, the available sampling space is limited by the lack of solved structures corresponding to the viral sequence space. Another factor related to the uniqueness of the sequences can affect the secondary structure prediction used for Rosetta decoy generation. The secondary structure of HIV vpu was not accurately predicted by any of the three methods supported by Rosetta (PSIPRED, JUFO and SAM), and the resultant model has an extended C-terminus rather than the actual α-helical structure. Considering the uniqueness of viral sequence and structure spaces, it is not surprising that all three secondary structure prediction algorithms fail for some viral proteins. Had the secondary structure been accurately predicted, it is likely that the result for HIV vpu would have been much more accurate and well over a TM-score of 0.5. Further improving the performance by combining the “consensus” approach with sampling of low Rosetta energy decoys was attempted but was unsuccessful. There are 2 cases (E2 and E7) in which improvement was observed by filtering the decoys by energy score, but in the other 3 cases (p24, vpr and vpu) filtering actually worsened the results (Figure S2).
For the HPV proteins, Rosetta generated a lower-energy population of decoys that is less distant from native than the overall ensemble. Unfortunately, the opposite is true for the HIV proteins, and filtering by energy score actually worsened the results for these proteins. It is puzzling, and not known, why Rosetta energies and TM-scores are uncorrelated for viral proteins. Nonetheless, because either viral proteins do not have their lowest energy states at their native structures or the Rosetta energy function does not represent the energy of native viral proteins, the attempt to further improve the results by taking low-energy decoys failed. Although the full-atom relaxation step was not performed, it is unlikely that the Rosetta energy scores would change dramatically with full-atom relaxation. The frequently used largest cluster center strategy also did not work as well as the consensus strategy. The largest cluster does not necessarily represent the decoy population with the smallest distance from the native structure (7, 6, 5, 1 and 2 cases out of 16 for p24, vpr, vpu, E2 and E7, respectively). It is, therefore, also difficult to determine the correct fold by selecting the center of the largest cluster. And again, there is the problem of choosing among many different sequence variants. On the other hand, the consensus strategy took advantage of sequence diversity (mean sequence identities ranging from 33 to 79%) and successfully captured the cluster structurally closest to the native fold even when the overall decoy population is not very close to native (Figure 4). The initial assumption that short viral sequence fragments should have enough overlap with the cellular protein sequence space does not appear to hold, and it is probable that even 9mer fragments are unique to viral sequences compared with cellular sequences, as indicated by the significantly lower number of low e-value fragments and the abundance of high e-value fragments observed in the fragment library (Figure 5).
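The conversion of a target sequence into overlapping 9mers in FASTA format, which underlies the e-value comparison in Figure 5, can be sketched as follows. The function names, the record naming scheme and the placeholder sequence are assumptions for illustration; the script actually used in the study is not described in detail, and the ssearch comparison itself is not reproduced here.

```python
def overlapping_kmers(sequence, k=9):
    """Return every window of length k, stepping one residue at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmers_to_fasta(name, sequence, k=9):
    """Format the overlapping k-mers as FASTA records named <name>_<start>."""
    records = []
    for start, fragment in enumerate(overlapping_kmers(sequence, k), start=1):
        records.append(f">{name}_{start}\n{fragment}")
    return "\n".join(records) + "\n"


# Placeholder 12-residue sequence (not taken from the study).
toy_sequence = "ACDEFGHIKLMN"
fragments = overlapping_kmers(toy_sequence)
fasta_text = kmers_to_fasta("toy", toy_sequence)
```

A sequence of length L yields L - 9 + 1 fragments, so a 56-residue domain gives 48 9mers while an 84-residue domain gives 76. This is why fragment frequencies have to be expressed relative to the total number of fragments before proteins of different length are compared.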
Usefulness
This approach was developed mainly for the structure prediction of small viral proteins for antigen/immunogen design. The strategy captures the same or a close fold (TM-score around or above 0.45) for 4 out of 5 proteins. For this purpose, the result is a success: the predicted model can be used to evaluate properties of a designed antigen/immunogen, such as mapping of epitopes and immunodominant regions. For these purposes, atomic-level accuracy is desirable but not necessary. In fact, this approach is being used to predict the structure of a small protein-based immunogen for an HIV-1 vaccine and is generating quite useful information for interpreting immunological and biochemical data obtained in antigenic and immunogenic studies. Many biochemical data can be interpreted with a low-resolution structure model (correct fold) instead of a high-resolution atomic model, for applications such as function or protein family inference [26]. Thus, this approach can produce
  • 8. a model that is useful in many areas of biology, although in principle it cannot produce a model with atomic accuracy. The limitation of the approach is clearly illustrated by the HPV E7 protein. But this is a rare case: sequence identities among HPV E7 C-terminal domains are as low as 15%. At this level of sequence diversity, the consensus approach breaks down. It is also possible that the difference in sequence length (44 to 54 residues) affected the results. Proteins with such sequence diversity are not so common, and the E7 protein has a tandem CXXC motif as a metal coordination site, which was not included as a constraint but significantly limits the possible architectures/topologies. Also, its C-terminal domain may serve only as a scaffold (for dimer formation) for the unstructured N-terminal domain, whose LXCXE motif for pRb binding is necessary and sufficient for functionality. Thus, it is possible that the domain’s structure is not strictly conserved, as it may not be essential for function, and Rosetta’s inability to generate decoys with the same fold among different sequences may reflect actual structural variability. This strategy can utilize phylogenetic information (as sequence diversity) among related species instead of genetic variability within the same species. Proteins with known orthologs in diverse species (with the choice of sequence identity threshold depending on sequence diversity) can be used in an identical manner to capture the fold of their protein family. How far the strategy can be extended in this way needs to be tested, as the case of HPV E7 revealed its limitation for too diverse a sequence space (and potentially dissimilar/diverse structures between orthologs).
Processing time
The generation of decoys was not the most time-consuming step in this study. It took less than a day to generate 10,000 decoys with a single thread of Rosetta, since the computationally expensive full-atom relaxation step was omitted.
As each workstation is equipped with 8 cores with hyperthreading capability, the decoy set for each protein can be generated in a day (16 threads of Rosetta running simultaneously). The most time-consuming step was structural alignment using TMalign. For speed, TMalign's Fortran source code was ported to C and optimized. The in-house code was written with the ported TMalign inside the loop specifically because running the pairwise calculation from an interpreter such as a shell script is very slow: loop execution takes as much time as the score calculations. Use of the Intel compiler with the best optimization switches improved the speed, but TMalign is a computationally expensive algorithm and hence slow. It was chosen despite its speed because it gives a better measure for judging whether two structures share the same fold. Since pairwise score calculation is of order n^2 in the number of decoys, 10,000 decoys require about 50 million TM-score calculations. For a single sequence, this took about a week with a single calculation thread. Thus, each protein can be finished in about a week with 16 sequences. Clustering was quite quick (~30 min) compared to the previous two steps, although its high memory requirement (~5 GB for a single calculation) limits the number of simultaneous calculations on the same machine. Overall, a single protein took about a week of calculation time on a single Mac Pro with 2.66 GHz dual quad-core Xeon X5500 processors. With a modern middle-sized cluster, this calculation should not take more than a day or two. This approach is “redundant” in terms of the generation of decoys per protein, but the current computational power of workstations already allows it to be completed in days on a single workstation, and it can be routinely utilized without investing in expensive resources.
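The counts quoted in this section can be checked with a few lines of arithmetic. The sketch below only restates numbers given in the text; the per-comparison time is a rough derived figure that assumes "about a week" means exactly seven days on one thread, which is an assumption of this example.

```python
from math import comb

decoys_per_sequence = 10_000
sequences_per_protein = 16
proteins = 5

# Total decoys generated in the study: 10,000 x 16 x 5.
total_decoys = decoys_per_sequence * sequences_per_protein * proteins

# Unordered decoy pairs per sequence: n(n-1)/2, the "about 50 million" figure.
pairs_per_sequence = comb(decoys_per_sequence, 2)

# Rough time per TM-score if one sequence takes exactly 7 days on one thread.
seconds_per_pair = 7 * 24 * 3600 / pairs_per_sequence

print(total_decoys)
print(pairs_per_sequence)
print(round(seconds_per_pair * 1000, 1), "ms per pair (rough estimate)")
```

Because the number of pairs grows quadratically, doubling the decoy set to 20,000 would roughly quadruple the alignment time, which is why this step dominates the overall run time.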
  • 9. Figure S1: Similarity maps of sequences used for the study. The sequences shown are the maximum-distance pairs from the multiple alignments and were used for consensus selection. In the maps, red > green > blue indicates the similarity of sequence pairs and of their decoy cluster centers. Some sequences share very low similarity with the others although they are the same protein from the same species (but different strains). In these cases, selecting a consensus is not easy because the clusters do not share the same folds within their sequence spaces (cluster centers do not overlap among the populous clusters).
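One way to picture the "overlap" of cluster centers described in this caption is to count, for a given cluster center, how many other sequences have at least one center in the same fold. The sketch below is a minimal illustration and not the code used in the study: the function name, the toy scoring function, the fold labels and the TM-score cutoff of 0.5 are all assumptions made for the example.

```python
def count_supporting_sequences(center, own_seq, centers_by_seq, tm_score, cutoff=0.5):
    """Count other sequences with at least one cluster center similar to `center`.

    `tm_score` is any callable returning a similarity in [0, 1];
    `cutoff` = 0.5 is an assumed threshold for "same fold".
    """
    support = 0
    for seq, centers in centers_by_seq.items():
        if seq == own_seq:
            continue  # a sequence does not support its own center
        if any(tm_score(center, other) >= cutoff for other in centers):
            support += 1
    return support


# Toy data: cluster centers are represented only by a hypothetical fold label.
centers_by_seq = {
    "seqA": ["fold1", "fold2"],
    "seqB": ["fold1", "fold3"],
    "seqC": ["fold4", "fold1"],
    "seqD": ["fold5"],
}


def toy_tm_score(a, b):
    """Stand-in for a real structural alignment score (made-up values)."""
    return 0.7 if a == b else 0.3


print(count_supporting_sequences("fold1", "seqA", centers_by_seq, toy_tm_score))
print(count_supporting_sequences("fold2", "seqA", centers_by_seq, toy_tm_score))
```

In this toy case "fold1" is supported by two other sequences and "fold2" by none. When no center gathers support across sequences, as the caption describes for the most divergent strains, no consensus can be selected this way.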
  • 10. Figure S2: The effect of low energy score filtering. Filtering by low energy score does not work very well, as shown in the figure. In some cases it works, and in others it actually worsens the prediction results. In the case of the HPV proteins (E2/E7), the more ‘native-like’ structures are found among the high-energy decoys, so filtering out this population worsens the results. This is a problem of the decoy generation step, and the consensus approach cannot solve it because it relies on Rosetta to generate the decoys. The main cause appears to be poor sampling of structures for viral proteins (HPV sequences are very unusual and unlikely to have a good representation of sequence/structure mapping in the library). At the same time, a further question is whether the sequence/structure correlation reflects “biophysical features” or “evolutionary features”, since a sparsely populated sequence space does not necessarily mean a poor match to the library. In this study, HPV and HIV proteins were used as test cases. A diverse sequence space with little overlap between sequence spaces (even within the same protein in the same species) complicates the situation. The matching of sequence to structure may be partly an “evolutionary memory” rather than completely “biophysically” determined; more likely, fragments with independent sequence spaces share the same structure space in some cases (HIV appears to share it, but HPV does not).
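The low energy score filter discussed in this caption can be sketched as keeping only the lowest-scoring fraction of the decoys before clustering. The example below is a minimal illustration under stated assumptions: the function name, the decoy names, the scores and the 50% fraction are all made up, and the actual filter used in the study may differ.

```python
def keep_lowest_energy(decoys, fraction):
    """Return the lowest-energy `fraction` of (name, score) decoys, best first."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(decoys, key=lambda d: d[1])  # lower score = lower energy
    n_keep = max(1, int(len(ranked) * fraction))
    return ranked[:n_keep]


# Made-up scores for illustration only.
decoys = [("d1", -35.2), ("d2", -12.8), ("d3", -41.0), ("d4", -5.1)]
kept = keep_lowest_energy(decoys, 0.5)
print(kept)
```

With such a filter, any native-like decoy that happens to score poorly (here, hypothetically, "d2" or "d4") is discarded before clustering, which is consistent with the worsening reported for the HPV E2/E7 cases.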
  • 11. References
1. Ho J, Uger RA, Zwick MB, Luscher MA, Barber BH, et al. (2005) Conformational constraints imposed on a pan-neutralizing HIV-1 antibody epitope result in increased antigenicity but not neutralizing response. Vaccine 23: 1559-1573.
2. Ofek G, Guenaga FJ, Schief WR, Skinner J, Baker D, et al. (2010) Elicitation of structure-specific antibodies by epitope scaffolds. Proc Natl Acad Sci U S A 107: 17880-17887.
3. Penn-Nicholson A, Han DP, Kim SJ, Park H, Ansari R, et al. (2008) Assessment of antibody responses against gp41 in HIV-1-infected patients using soluble gp41 fusion proteins and peptides derived from M group consensus envelope. Virology 372: 442-456.
4. Brussow H (2009) The not so universal tree of life or the place of viruses in the living world. Philos Trans R Soc Lond B Biol Sci 364: 2263-2274.
5. Edwards RA, Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504-510.
6. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368.
7. McBurney SP, Ross TM (2008) Viral sequence diversity: challenges for AIDS vaccine designs. Expert Rev Vaccines 7: 1405-1417.
8. Palmenberg AC, Rathe JA, Liggett SB (2010) Analysis of the complete genome sequences of human rhinovirus. J Allergy Clin Immunol 125: 1190-1199; quiz 1200-1191.
9. Chung SY, Subbiah S (1996) A structural explanation for the twilight zone of protein sequence homology. Structure 4: 1123-1127.
10. Bamford DH (2003) Do viruses form lineages across different domains of life? Res Microbiol 154: 231-236.
11. Bystroff C, Baker D (1998) Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 281: 565-577.
12. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234: 779-815.
13. Ozkan SB, Wu GA, Chodera JD, Dill KA (2007) Protein folding by zipping and assembly. Proc Natl Acad Sci U S A 104: 11987-11992.
14. Das R, Baker D (2008) Macromolecular Modeling with Rosetta. Annu Rev Biochem.
15. Roy A, Kucukural A, Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc 5: 725-738.
16. Bonneau R, Strauss CE, Baker D (2001) Improving the performance of Rosetta using multiple sequence alignment information and global measures of hydrophobic core formation. Proteins 43: 1-11.
17. DeBartolo J, Hocky G, Wilde M, Xu J, Freed KF, et al. (2010) Protein structure prediction enhanced with evolutionary diversity: SPEED. Protein Sci 19: 520-534.
18. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57: 702-710.
19. Zhang Y, Skolnick J (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33: 2302-2309.
20. Xu J, Zhang Y (2010) How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 26: 889-895.
21. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2008) GenBank. Nucleic Acids Res 36: D25-30.
22. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947-2948.
23. Notredame C, Higgins DG, Heringa J (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302: 205-217.
24. de Hoon MJ, Imoto S, Nolan J, Miyano S (2004) Open source clustering software. Bioinformatics 20: 1453-1454.
25. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95: 14863-14868.
26. Zhang Y (2009) Protein structure prediction: when is it useful? Curr Opin Struct Biol 19: 145-155.