Viral Protein Structure Predictions - Consensus Strategy

A new strategy for viral protein structure prediction: a “consensus
model” approach that takes advantage of sequence diversity
Introduction
Viral surface proteins are good targets for vaccine development. They are the major
targets of neutralizing antibodies against viruses, but eliciting broadly reactive neutralizing antibodies
(brNAb) has proved to be difficult. Structural information for these viral surface proteins is extremely
important for the rational design of immunogens targeting these viruses, because it identifies conserved
surface residues. Although drug targets such as viral proteases and reverse transcriptases have been well
studied by crystallographic or NMR structure determination in order to design effective drugs against
these viruses, surface proteins have not been studied as extensively. Some well-studied
surface proteins, such as HIV gp120 and influenza hemagglutinin, show significant sequence diversity
and have been shown to be challenging targets for elicitation of brNAb. Peptide-based immunogens and
epitope scaffold-based approaches utilizing known brNAb epitopes have not succeeded in eliciting brNAb
against HIV-1 and influenza so far [1,2], and the use of a minimal-sized rigid protein containing the
conserved sequence region in its native structural context would be necessary [3]. In order to rationally
design such an immunogen, structural information for a region of the protein that can fold independently is crucial.
With insufficient solved structures, we need to rely on computational structure prediction for these
proteins. Structure prediction of viral proteins is not an easy task. This class of proteins does not
share high sequence similarity with known cellular proteins [4,5,6], and solved structures of viral
proteins are sparse compared to other protein families. Thus, viral proteins are poor targets for the
comparative modeling strategy. In addition, the sequence diversity within the same protein is extremely high
[7,8] because of the high mutation rates that viruses use for immune evasion, which makes the protein a
moving target for structure prediction (the choice of sequence may affect the outcome). Some proteins show
sequence identities below the “twilight zone” [9] among copies of the same protein from different strains
of the same species. In these cases it is not even clear on what principle the target sequence for
prediction should be chosen, as even the consensus sequence is just another variant. Thus, these proteins
are also difficult targets for de novo prediction. In addition to the extensive diversity within the same
protein, the sequence space of viral proteins is vast yet does not overlap enough with the “cellular”
(prokaryotic/archaeal/eukaryotic) sequence spaces [5,6,10]. Predicting these protein structures
with knowledge-based algorithms [11,12], including both comparative and fragment assembly-based free
modeling strategies, therefore appears to be difficult. As completely physics-based modeling is still not
feasible or practical at this time [13], we need to use currently available algorithms [14,15] to predict
the structure of viral proteins. Although the overall sequence spaces of cellular and viral proteins
may not overlap very much and are largely unique to each other, except for some genes exchanged between
the two, the fragment library still contains short “viral” sequences (such as the 9mers of the Rosetta
fragment library) to some degree, even though it is assembled from a database composed mostly of cellular
proteins. It is assumed that, due to physicochemical constraints, the same sequence in cellular and viral
proteins should share a similar ensemble of structures as fragments. For this reason, physics-based
fragment assembly algorithms such as Rosetta should be able to produce decoys with the same fold as the
native protein, probably at low frequency.

The strategy for computational structure prediction of these viral proteins was developed from a
simple principle in order to address the difficulty associated with the high diversity and uniqueness of
viral sequence space. Despite genetic distance, the same protein must retain its function and thus the same
structure (similar enough to keep its functionality intact). Although the C-terminal domain of the human
papillomavirus E7 protein shares as little as 15% sequence identity among strains (data not shown), all
variants are functional proteins from isolates. Thus, in principle, genetically distant sequences should
share the same fold, even if each of the assembled sequences generates a considerably different decoy
population. By incorporating the most distant sequence pairs among the sequences of the same protein, in
order to minimize the overlap of the structure spaces belonging to each sequence, this strategy is designed
to capture the common fold populated by the maximum number of sequences.

It may be neither the most abundantly populated nor the lowest-energy structure, but it is common across
the whole sequence space, and it is the “consensus model”. Multiple sequence alignment information has
been used to identify the residues forming the hydrophobic core or to improve secondary/tertiary structure
predictions, with reported improvement of the predicted structures [16,17], but the direct use of
different sequences of the same protein has not been reported, apparently mainly because sequence
diversity is negligible for cellular proteins. Using Rosetta for decoy generation, 5 small viral
proteins with solved structures were used as benchmark proteins to evaluate this strategy.
The increase in workstation computational power allows such a “redundant” approach on a manageable
time scale with affordable resources. A total of 800,000 decoys (10,000 x 16 x 5) were generated and
clustered by their pairwise TM-scores [18] using just two 8-core workstations. For this approach, we
chose the computationally expensive TM-score instead of RMSD (this is the most time-consuming step of
the whole process) in order to capture the overall “fold” [19,20] rather than atomic details. In this
report, we first show the Rosetta-generated decoys and the TM-score/energy score relations of these
benchmark viral proteins, and then show the performance of the “consensus” approach.
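As an illustration of why TM-score rather than RMSD was used to judge the overall fold, the following minimal Python sketch evaluates both measures for a fixed superposition. It is not the in-house C program used in this study: TM-align additionally searches for the superposition that maximizes the score, whereas this sketch assumes the per-residue distances are already given. The function names and the example distances are hypothetical. The length-dependent scale d0 follows the published TM-score definition [18]; the lower bound of 0.5 angstrom on d0 is an assumption added here as a guard for very short chains.

```python
def tm_score(distances, target_length):
    """TM-score for a fixed superposition.

    distances: distances (angstroms) between aligned residue pairs.
    target_length: number of residues in the reference (native) structure.
    """
    if target_length > 15:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5
    d0 = max(d0, 0.5)  # assumed guard: keep the distance scale positive
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


def rmsd(distances):
    """Root mean square deviation over the same aligned pairs."""
    return (sum(d * d for d in distances) / len(distances)) ** 0.5


if __name__ == "__main__":
    # Hypothetical 56-residue model: uniformly 2 angstroms from native ...
    uniform = [2.0] * 56
    # ... versus the same model with a misplaced 5-residue tail.
    with_tail = [2.0] * 51 + [15.0] * 5
    for name, dist in (("uniform", uniform), ("with tail", with_tail)):
        print(name, round(tm_score(dist, 56), 3), round(rmsd(dist), 2))
```

In this made-up example the misplaced tail more than doubles the RMSD while lowering the TM-score only modestly, which is the behavior wanted when the goal is the overall fold rather than atomic detail.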
Methods
Selection of benchmark proteins
Five small viral proteins with known structures and varying degrees of sequence diversity were
chosen for this study to test the “consensus” approach. The criteria for selection of target proteins
were availability of a solved structure, sequence length below 100 residues, and extremely to mildly
diverse sequences. We chose the HIV p24 C-terminal dimerization domain, vpr and vpu cytoplasmic domain,
and the HPV E2 DNA binding and E7 C-terminal domains, which match the criteria defined above. For each
protein, 16 sequences were chosen to be the most distant both pairwise and on average. Pairwise sequence
identity among the target sequences ranges from 15 to 92%, and the mean pairwise identity ranges
from 33 to 79% among sequences of the same protein. Table 1 shows the sequence statistics of the target
proteins. Sequences were obtained from the Los Alamos HIV database (http://www.hiv.lanl.gov/)
(HIV p24/vpr/vpu) or manually collected from GenBank [21] (HPV E2/E7) and subjected to multiple
alignment by Clustal W [22] and T-Coffee [23]. Sequences were then trimmed down to the domains
for which structures were determined, and N- or C-terminal regions that are flexible in the solution
structure were removed from the target sequences.
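The text does not specify the procedure used to pick the 16 most distant sequences, so the following Python sketch shows only one possible way to do it: a greedy farthest-first selection on pairwise percent identity computed from an existing multiple alignment. The function names, the gap handling and the tie-breaking are assumptions made for illustration.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity of two rows of one alignment ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def pick_distant(aligned, k):
    """Greedy farthest-first choice of k row indices from an alignment."""
    n = len(aligned)
    if k < 2 or k > n:
        raise ValueError("k must be between 2 and the number of sequences")
    ident = [[percent_identity(aligned[i], aligned[j]) for j in range(n)]
             for i in range(n)]
    # Seed with the least identical pair.
    first = min(combinations(range(n), 2), key=lambda p: ident[p[0]][p[1]])
    chosen = list(first)
    while len(chosen) < k:
        rest = [i for i in range(n) if i not in chosen]
        # Add the sequence whose closest already-chosen relative is farthest.
        nxt = min(rest, key=lambda i: max(ident[i][j] for j in chosen))
        chosen.append(nxt)
    return chosen


if __name__ == "__main__":
    toy = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC", "AAAAACCCCC"]
    print(pick_distant(toy, 3))
```

A greedy choice like this does not guarantee the globally most diverse subset, but it is cheap and keeps near-duplicate sequences out of the selection.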
Decoy generation, calculation of pairwise TM-scores and clustering
Sets of 10,000 decoys were generated for each sequence using Rosetta 2.2 without the full-atom
relaxation step. Each decoy dataset was then subjected to calculation of pairwise TM-scores
for all decoy combinations. TM-scores were calculated by an in-house program written in C (with the
TM-align source code ported to C). The decoys were then clustered by their decoy-decoy pairwise TM-scores.
Clustering was done with an in-house program based on the algorithm from the Cluster 3.0 software source
code [24,25], using complete linkage clustering with a TM-score threshold of 0.5. The clusters were further
examined for the distances among all members of each pair of clusters; if all TM-scores among the members
were above the threshold, the two clusters were merged. TM-scores against the solved structures (the
structures used as native are p24: 2JYL, vpr: 1M8L, vpu: 1VPU, E2: 1A7G, E7: 2B9D) were calculated for all
decoys (the 1VPU PDB file contains 9 models, so all 9 models were used for the calculation), and
Rosetta energy scores were extracted from the Rosetta silent files. All calculations were executed on 2
Mac Pro workstations, each with dual quad-core 2.66 GHz Xeon processors and Mac OS X 10.6, and all
programs were compiled with the Intel compiler v11.1 with full optimization flags.
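The clustering step can be sketched as follows. This is a naive pure-Python complete linkage procedure on a precomputed TM-score matrix, not the in-house program derived from Cluster 3.0, and its cubic cost makes it suitable only for toy inputs. The function names are hypothetical, and the cluster center is taken here as the member with the highest summed TM-score to the other members, which is an assumption because the text does not define the center.

```python
def complete_linkage(sim, threshold):
    """Agglomerative complete-linkage clustering on a similarity matrix.

    sim[i][j] is the pairwise TM-score of decoys i and j (higher = closer).
    Two clusters are merged only if every cross pair scores >= threshold.
    Returns clusters as lists of decoy indices, largest first.
    """
    clusters = [[i] for i in range(len(sim))]

    def linkage(a, b):
        # Complete linkage: the least similar cross pair decides.
        return min(sim[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, pair = threshold, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if s >= best and (pair is None or s > best):
                    best, pair = s, (x, y)
        if pair is None:
            break  # no remaining pair of clusters satisfies the threshold
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return sorted(clusters, key=len, reverse=True)


def cluster_center(sim, members):
    """Member with the highest summed similarity to the other members."""
    return max(members, key=lambda i: sum(sim[i][j] for j in members if j != i))


if __name__ == "__main__":
    toy = [
        [1.00, 0.70, 0.60, 0.20, 0.20],
        [0.70, 1.00, 0.65, 0.20, 0.20],
        [0.60, 0.65, 1.00, 0.20, 0.20],
        [0.20, 0.20, 0.20, 1.00, 0.80],
        [0.20, 0.20, 0.20, 0.80, 1.00],
    ]
    found = complete_linkage(toy, 0.5)
    print(found, cluster_center(toy, found[0]))
```

Because a merge requires every cross pair to pass the threshold, all members of a final cluster are within the threshold of each other, which is the property the method relies on when it treats a cluster as a single fold.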
Calculation of consensus cluster
Table 1: Sequence statistics of target proteins
                    Sequence ID (%)
Protein   Length    Mean    Max    Min
p24       70        79.3    92     62
vpr       56        72.4    89     50
vpu       81        57.7    82     41
E2        84        40.1    90     22
E7        44-54     33.3    64     15

The top 10 largest clusters for each sequence were taken, and their cluster center decoys were
assembled as the Top 10 decoy set for each protein. The Top 10 decoy set was then subjected to
calculation of pairwise TM-scores and clustered again with a TM-score threshold of 0.3. The largest
cluster of the Top 10 decoy set was assigned as the consensus cluster and its cluster center was taken as
the consensus model. When a cluster other than the largest had the highest sequence coverage, that
cluster was considered to be the consensus cluster, in order to cover a larger sequence space rather
than a larger decoy population.
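The selection rule above can be written down compactly. The sketch below assumes the Top 10 decoy set has already been clustered at TM-score 0.3 and that each decoy in the set is labeled with the sequence it was generated from; the function name, the `owner` list and the use of cluster size as a tie-breaker are assumptions for illustration.

```python
def consensus_cluster(clusters, owner):
    """Pick the cluster covering the most distinct source sequences.

    clusters: list of clusters, each a list of indices into the Top 10 decoy set.
    owner: owner[i] is the index of the sequence that decoy i came from.
    Ties in sequence coverage are broken by cluster size (assumed rule).
    """
    def coverage(members):
        return len({owner[i] for i in members})

    return max(clusters, key=lambda m: (coverage(m), len(m)))


if __name__ == "__main__":
    # Eight hypothetical cluster-center decoys from four sequences.
    owner = [0, 0, 0, 1, 1, 2, 2, 3]
    clusters = [[0, 1, 2, 3], [4, 5, 7], [6]]
    print(consensus_cluster(clusters, owner))
```

In this toy input the largest cluster draws on only two sequences, so the smaller cluster that spans three sequences is chosen, mirroring the preference for sequence coverage over decoy population.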
Calculation of e-values between viral proteins and their fragment libraries
Rosetta 9mer fragment libraries for 5 viral and 3 cellular proteins were parsed and assembled
into plain FASTA-formatted files by a script. All viral sequences were also converted to overlapping
9mer fragments in FASTA-formatted files. The viral 9mer fragments and the Rosetta fragment library 9mers
were then compared by ssearch version 3.5, and e-values were collected for all 16 sequences.
Because the total number of fragments varies with sequence length, frequencies were calculated
relative to the total number of fragments.
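A minimal Python sketch of the two bookkeeping steps described above, splitting a sequence into overlapping 9mers written as FASTA records and turning a list of e-values into relative frequencies, is given below. The original script is not available, so the function names, the record naming scheme and the bin edges are hypothetical, and ssearch itself is not invoked here.

```python
def overlapping_kmers(sequence, k=9):
    """All overlapping k-residue windows of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmers_to_fasta(name, sequence, k=9):
    """FASTA text with one record per overlapping k-mer."""
    lines = []
    for i, fragment in enumerate(overlapping_kmers(sequence, k), start=1):
        lines.append(f">{name}_{i}")
        lines.append(fragment)
    return "\n".join(lines) + "\n"


def relative_frequencies(evalues, bin_edges):
    """Fraction of e-values in each half-open bin [edge[i], edge[i+1])."""
    if not evalues:
        raise ValueError("no e-values given")
    counts = [0] * (len(bin_edges) - 1)
    for e in evalues:
        for b in range(len(counts)):
            if bin_edges[b] <= e < bin_edges[b + 1]:
                counts[b] += 1
                break
    return [c / len(evalues) for c in counts]


if __name__ == "__main__":
    print(kmers_to_fasta("toy", "ABCDEFGHIJK"), end="")
    print(relative_frequencies([0.001, 0.5, 2, 7, 7], [0, 1, 5, 10]))
```

Dividing by the total number of e-values is what makes proteins of different lengths, and therefore different fragment counts, comparable on one plot.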
Results
Decoys of 5 viral proteins
The decoys generated by Rosetta 2.2 for the viral proteins did not show good correlation between
Rosetta energy and TM-score. Figure 1 shows Rosetta energy score vs. TM-score plots of the decoys of the
5 viral proteins. The bottom right panel is the same plot for cellular proteins (acyl carrier
protein/ACP, ubiquitin and thioredoxin). In many, though not all, cases for these cellular proteins,
Rosetta can generate a decoy set with modest to very good correlation between energy score and
RMSD/TM-score (low-energy decoys are also structurally similar to the native protein). Thioredoxin shows
very good correlation between low energy and high TM-score (as high as 0.9, RMSD ~1 angstrom), and
ubiquitin also performed well. Rosetta did not produce high TM-score decoys for ACP but did produce
decoys with good correlation between energy and TM-score. On the other hand, the 5 viral proteins
selected for this study do not show such a trend. They show remarkably different Rosetta energy/TM-score
distributions, ranging from completely uncorrelated (HPV E2) to negatively correlated (HIV vpu; only the
plot for model 1 is shown) or sequence dependent (HPV E7). In particular, E7 shows great variation
in decoy populations between sequences, as expected with such remote sequence identities even within
the same protein. In contrast, HPV E2 shows a very uniform population among the 16 sequences
although the sequence identities are as low as 22% (mean 40%). All Top 10 clusters belong to a single
cluster, with an average pairwise TM-score of 0.500 and an average TM-score from the cluster center of
0.552. With this poor correlation between energy score and TM-score, choosing a low-energy decoy for
prediction would not be a good strategy for these 5 viral proteins. The overall distributions of
TM-scores are skewed toward low scores compared to non-viral proteins. The majority of decoys have
TM-scores below 0.5, which is a rough threshold for the same fold [20]. Thus it is difficult to select a
decoy with a TM-score beyond 0.5 (against the native structure) as the final model.

Figure 1: Energy vs. TM-score plots of decoys for 5 viral proteins. Five
panels show the 5 benchmark viral proteins, and the bottom right panel shows the
plot for the cellular proteins thioredoxin, ubiquitin and ACP (acyl carrier protein).
For the viral proteins, the 16 most distant sequences were used for decoy generation
and are plotted in different colors. The viral proteins show poor to reversed
correlation between energy and TM-score: completely uncorrelated for
HPV E2, reversed (low energy is structurally distant) for
HIV vpu, and sequence dependent (relatively small overlap among different
sequences) for HPV E7. On the other hand, the non-viral proteins show very
good (thioredoxin) to decent (ACP) correlation between energy and TM-score.
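The energy/TM-score relation discussed above was judged from the plots. One way to put a number on it, offered here only as a sketch and not as part of the original analysis, is a Spearman rank correlation between Rosetta energy and TM-score. The function names and the toy values are hypothetical. Note the sign convention: a decoy set in which low energy goes with high TM-score (the desirable case) gives a coefficient near -1, whereas the reversed behavior described for HIV vpu would give a positive coefficient.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    energy = [-120.0, -110.0, -100.0, -90.0]   # hypothetical Rosetta scores
    tm = [0.8, 0.6, 0.5, 0.3]                  # hypothetical TM-scores
    print(spearman(energy, tm))
```

A rank-based coefficient is used in this sketch because the energy/TM-score relation need not be linear.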
Clustering of decoys
Clustering of the decoys for these proteins was performed with a hierarchical complete linkage
clustering algorithm, since each cluster should contain only decoys that are structurally close enough
(same fold) for the consensus cluster to be captured later. Therefore, a TM-score of 0.5 was set
as the threshold for complete linkage clustering. The number of clusters varies to a great extent among
the proteins and sequences. HPV E2 has significantly more clusters (Table 2) at the threshold value of
0.5 (mean 276.5) than HIV vpr (mean 12.5). The number of clusters for HPV E7 fluctuates greatly (30-438)
between sequences, in the same way as the energy/TM-score plots.
A similar trend is observed for the coverage of the Top 10 clusters (the sum of the Top 10 cluster
members divided by all decoys). The threshold value of 0.3 for clustering the Top 10 decoy set was
determined as the value that gives enough cluster members for choosing the “consensus” cluster without
including too diverse a structural ensemble. The threshold value of 0.5 (used for the initial clustering)
produced too many clusters with small numbers of members among the Top 10 decoy set, and no
“consensus” could be reached because each cluster had too incomplete a coverage of the sequences. In
order to determine the consensus cluster, the TM-score threshold was lowered to yield enough cluster
members populated by most of the sequences, while the pairwise TM-scores among the Top 10 decoy set
and the average TM-score from each cluster center remain not far below 0.5. The threshold TM-score of
0.3 appears to be low, but clustering is performed with the complete linkage algorithm, so
that even the most distant decoys in a cluster have a TM-score around or above 0.3. For all 5
viral proteins, the distances are within the range of the same fold: the average pairwise TM-score is
0.49 (0.46-0.51) and the average TM-score from the cluster center is 0.55 (0.51-0.57) for all Top
10 decoy sets. Thus, it is concluded that these clusters are representative of decoys in the same fold,
and the clusters with the highest sequence coverage were taken as consensus clusters and their centers
as consensus models.
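The Top 10 coverage defined above (the sum of the members of the ten largest clusters divided by all decoys) is simple enough to state as code. The following Python sketch uses a hypothetical function name and made-up cluster sizes.

```python
def top_coverage(cluster_sizes, top=10):
    """Fraction of all decoys that fall in the `top` largest clusters."""
    total = sum(cluster_sizes)
    if total == 0:
        raise ValueError("no decoys to cover")
    largest = sorted(cluster_sizes, reverse=True)[:top]
    return sum(largest) / total


if __name__ == "__main__":
    # Hypothetical sizes of 11 clusters holding 100 decoys in total.
    sizes = [40, 30, 10, 5, 5, 3, 2, 2, 1, 1, 1]
    print(top_coverage(sizes))
```

A coverage near 1 means that almost all decoys of a sequence sit in a few large clusters, while a low coverage indicates a decoy population scattered over many small clusters.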
Consensus model strategy
Table 2: Number of clusters (Clust) and coverage of Top 10 clusters (cover.) for each sequence
        p24             vpr             vpu             E2              E7
Seq#    Clust  cover.   Clust  cover.   Clust  cover.   Clust  cover.   Clust  cover.
1       65     0.816    9      0.999    97     0.701    224    0.480    432    0.165
2       62     0.800    8      0.999    60     0.803    411    0.372    54     0.893
3       41     0.802    12     0.997    74     0.810    292    0.322    226    0.610
4       50     0.657    67     0.809    93     0.770    320    0.380    57     0.866
5       30     0.924    8      0.999    164    0.495    145    0.700    46     0.921
6       44     0.866    10     0.999    150    0.533    303    0.323    113    0.767
7       32     0.868    8      0.999    103    0.671    345    0.295    56     0.916
8       51     0.584    12     0.994    135    0.547    237    0.614    43     0.923
9       38     0.843    8      0.999    112    0.663    293    0.360    31     0.927
10      43     0.831    6      0.999    123    0.654    267    0.398    318    0.391
11      147    0.307    19     0.976    110    0.702    235    0.503    30     0.956
12      48     0.836    7      0.999    18     0.987    368    0.469    291    0.339
13      47     0.719    8      0.999    77     0.804    255    0.605    64     0.896
14      29     0.945    6      0.999    161    0.456    257    0.514    37     0.969
15      23     0.965    9      0.999    94     0.693    227    0.720    399    0.349
16      29     0.902    3      0.998    63     0.917    245    0.455    438    0.242
Ave     48.7   0.792    12.5   0.985    102.1  0.700    276.5  0.469    164.7  0.696
S.D.    28.8   0.164    14.9   0.047    39.4   0.147    64.3   0.133    157.9  0.294

Figure 2: Aligned predicted models (brown) and solved structures
(white). The consensus models of the 5 viral proteins are aligned to
their solved structures by TMalign. In the figure, “Aligned” indicates
the residues within 5 angstroms of the solved structure.
The solved HIV vpu structure is an NMR solution structure with 9
models; model 1, which aligned best with the consensus model, is
displayed. The other 4 structures are crystal structures.

The centers of the consensus clusters for the 5 viral proteins were taken as consensus models and
compared with the solved crystal/NMR structures. HIV p24 and vpr were the most successfully predicted of
the 5 proteins (Figure 2). These predictions are not spectacularly accurate (TM-scores of 0.56 and 0.55
for p24 and vpr, respectively) but are good enough to represent the essence of these folds. Since the
consensus model approach is designed not to predict at atomic accuracy but to capture the correct
overall fold, they are quite successful. In particular, vpr has 51 of its 56 residues aligned within
5 angstroms of the native structure. The p24 prediction is mostly well aligned, but the very C-terminal
end is placed in the wrong direction, which lowers the TM-score. HIV vpu is less successful in terms of
TM-score (0.48), but the overall fold/topology is well captured. The prediction for HPV E2 was not
great, but not too far off either. For HPV E7, the consensus approach did not work well. As Figure 2
indicates, the α-helix connected to two consecutive β-sheets is positioned on the wrong side. Although
there is a case such as HPV E7, the overall success of the consensus strategy becomes clear when these
results are compared with the performance of conventional strategies for selecting the best model.
Except for HPV E7, for which it failed to capture the native fold, the consensus model approach
performed modestly to significantly better on the other four viral proteins (Figure 3). HPV E2 shows
minimal improvement compared with the largest cluster center or the lowest-energy decoy, since the
decoys generated for this protein did not cover a large structure space. Despite its large sequence
space, HPV E2 covers only a very small, relatively similar and degenerate structure space (although
with many clusters), so the three methods examined here did not show any major differences. HPV E7 has
too diverse a sequence space (as low as 15% identity, with an average of 33%), and it appears that the
structure spaces of some sequences do not overlap. For the consensus cluster strategy to work, these
most distant sequences need to share some degree of structure space. HPV E7 appears too diverged in
both sequence (even the length of the domain varies from 44 to 54 residues) and structure space. In
fact, the matrix representation of pairwise TM-scores of HPV E7 indicates that some pairs of sequences
do not share predicted structure space (Figure S1), and the sequences behave as if they were different
proteins. On the other hand, the consensus model approach performs well with a reasonably variable
sequence space. It should be noted that even for HPV E7 the consensus strategy did not underperform in
comparison to the other strategies (Figure 3); there was simply no improvement. Figure 4 shows the
distributions of decoy TM-scores against the native structures as histograms. Well-behaved proteins
such as p24, vpr and vpu have two major peaks in their distributions, with different degrees of
population. The consensus approach captures the populations with “native-like” structures (the “close”
peaks at higher TM-score) rather than the structures in the other, “distant” peaks. This observation is
reasonable and was hoped for: “distant” peaks are likely to be composed of multiple populations with
greatly varied folds due to the diverse sequence space, whereas “native-like” structures are under more
stringent constraints to be close to the native structure and are likely to cluster together within the
“close” peak. Figure 4 also indicates that even for HPV E2 the consensus model strategy captures one of
the clusters closest to native among the generated decoys, although none of the decoys is close enough
to the native structure. In the case of HPV E7, it is even clearer in this figure that some sequences
produce only decoys with too distant structures (some sequences have an extremely small population
beyond TM-score 0.4), so that there is no way to capture a cluster close to the native structure. Thus,
HPV E7 represents the limitation of this strategy, which relies on finding the common structure among
different sequences.

Figure 3: Performance of different strategies for selecting the best model from
the decoy sets of the 5 viral proteins. The data for HIV vpu are the average over an
ensemble of 9 NMR structures. The error bars indicate the range of the
9 data points.

Figure 4: Histograms of the TM-score distributions of all
decoy sets for the 5 viral proteins. Each protein has 16 decoy
sets, whose distributions are calculated and plotted independently.
The solid lines indicate the TM-scores of the consensus models
and the broken lines indicate a TM-score of 0.5.
Discussion
Predictive performance
The consensus model approach works well with viral proteins, although not all proteins can be
predicted with a TM-score above 0.5. There are two major limitations: one is technical and the other is
a matter of principle. As stated in the last paragraph of the Results section, HPV E7 reveals the
fundamental limitation of this approach. When the sequence space diverges too far, current protocols
cannot converge onto the same structure space. This is not a limitation per se, but if the predicted
structure spaces have too little or no overlap among the sequences in the sequence space, the strategy
fails. This is also related to the technical limitation. Rosetta is one of the best de novo structure
prediction tools. Still, it cannot produce native-like decoys for the HIV vpu and HPV E2 proteins. It is
unclear whether the sampling space is too small, the fragment libraries are biased toward non-viral
proteins, or there are other reasons. Limited sampling should not be a severe issue for this approach.
The consensus cluster approach relies heavily on the population ensemble rather than on the rare
discovery of a decoy that is both low in energy and structurally close to native. If 10,000 decoys
cannot capture a structurally close population, it is unlikely that such a population would be captured
even with a decoy set orders of magnitude larger. The similarity calculation of overlapping 9mers
generated from the 16 sequences of each of the 5 viral proteins against the Rosetta fragment library for
each protein indicates that the frequency of low e-value fragments is significantly lower for viral
proteins than for cellular proteins, and that the available fragments are distributed evenly from low
to high e-values (Figure 5). It is likely that the low number of solved structures of viral proteins
compared with other protein classes results in a small population of fragments with low e-values.
Although the availability of low e-value fragments does not necessarily guarantee accurate predictions,
the opposite would be true: fragment library sequences distant from the actual protein are likely to
have lower structural correlation with the actual structure. These libraries had to be used for
fragment assembly and are likely to keep the decoy structures away from the solved structures. As the
sequence space of viral proteins is generally unique compared to other protein classes, the low number
of structures available for fragment library generation is probably to blame for the poor performance
in decoy generation for viral proteins. Taken together, the available sampling space is limited by the
lack of solved structures corresponding to the viral sequence space. Another factor related to the
uniqueness of the sequences is the secondary structure prediction used for Rosetta decoy generation.
The secondary structure of HIV vpu was not accurately predicted by any of the three methods supported
by Rosetta (PSIPRED, JUFO and SAM), and the resultant model has an extended C-terminus rather than the
actual α-helical structure. Considering the uniqueness of viral sequence and structure spaces, it is not
surprising that all three secondary structure prediction algorithms fail for some viral proteins. If
the secondary structure had been accurately predicted, the result for HIV vpu would likely have been
much more accurate and well over a TM-score of 0.5.

Figure 5: E-value distribution between fragment library and target
sequences. The 9mer sequences from the Rosetta fragment library and the
target cellular and viral sequences were compared using ssearch v3.5.
Viral sequences show a significantly lower frequency of low e-value fragments and
a mostly uniform distribution over the whole range of e-values. Note that, as the
sequences are short 9mers, an e-value over 1 does not necessarily mean no
correlation. With frequently observed residues, an e-value over 5 can correspond
to a 4-5 residue match. Viral proteins show smoother curves as they have
16x more sampling than cellular proteins.
An attempt to further improve performance by combining the “consensus” approach with sampling of low
Rosetta energy decoys was unsuccessful. In 2 cases (E2 and E7) improvement was observed when the decoys
were filtered by energy score, but in the other 3 cases (p24, vpr and vpu) filtering actually worsened
the results (Figure S2). For the HPV proteins, Rosetta generated a population of lower-energy decoys
that is less distant from native than the overall ensemble. Unfortunately, the opposite is true for the
HIV proteins, and filtering by energy score worsened the results for these proteins. It is puzzling,
and not known, why Rosetta energy and TM-scores are uncorrelated for viral proteins. Nonetheless,
because either viral proteins do not have their lowest energy states at their native structures or the
Rosetta energy function does not represent the energy of native viral proteins, the attempt to further
improve results by taking low-energy decoys failed. Although the full-atom relaxation step was not
performed, it is unlikely that the Rosetta energy scores would change dramatically with full-atom
relaxation. The frequently used largest cluster center strategy also did not work as well as the
consensus strategy. The largest cluster does not necessarily represent the decoy population with the
smallest distance from the native structure (7, 6, 5, 1 and 2 cases out of 16 for p24, vpr, vpu, E2 and
E7, respectively). It is, therefore, also difficult to determine the correct fold by selecting the
center of the largest cluster. And again, there is the problem of choosing among many different
sequence variants. On the other hand, the consensus strategy took advantage of sequence diversity (mean
sequence identities ranging from 33 to 79%) and successfully captured the cluster structurally closest
to the native fold, even when the overall decoy population is not very close to native (Figure 4). The
initial assumption that short viral sequence fragments should have enough overlap with cellular protein
sequence space does not appear to hold; it is probable that even 9mer fragments are unique to viral
sequences compared with cellular sequences, as indicated by the significantly lower number of low
e-value fragments and the abundance of high e-value fragments observed in the fragment library (Figure 5).
Usefulness
This approach was developed mainly for structure prediction of small viral proteins for
antigen/immunogen design. The strategy captures the same or a close fold (TM-score around or above
0.45) for 4 out of 5 proteins. For this purpose, the result is a success: the predicted model can be
used to evaluate properties of a designed antigen/immunogen, such as mapping of epitopes and
immunodominant regions. For these purposes, atomic-level accuracy is desirable but not
necessary. In fact, this approach is being used to predict the structure of a small protein-based
immunogen for an HIV-1 vaccine and is generating quite useful information for interpreting the
immunological and biochemical data obtained from antigenic and immunogenic studies. Many biochemical
data can be interpreted with a low-resolution structure model (correct fold) instead of a
high-resolution atomic model, for applications such as function or protein family inference [26].
Thus, this approach can produce a model that is useful in many areas of biology, although in principle
it cannot produce a model at atomic accuracy.
The limitation of the approach is clearly illustrated by the HPV E7 protein. However, this is a rare
case: sequence identities among HPV E7 C-terminal domains are as low as 15%. At this level of
sequence diversity, the consensus approach breaks down. It is also possible that the difference in
sequence length (44 to 54 residues) affected the results. Proteins with such sequence
diversity are not common, and the E7 protein has a tandem CXXC motif as a metal coordination site, which
was not included as a constraint but significantly limits the possible architectures/topologies. Also,
its C-terminal domain may serve only as a scaffold (for dimer formation) for the unstructured N-terminal
domain, which carries the LXCXE motif for pRb binding that is necessary and sufficient for
functionality. Thus it is possible that the structure of the domain is not strictly conserved, as it may
not be essential for function, and Rosetta’s inability to generate decoys with the same fold among
different sequences may reflect actual structural variability.
This strategy can also utilize phylogenetic information (as sequence diversity) among related
species instead of genetic variability within the same species. Proteins with known orthologs in
diverse species (with the choice of sequence identity threshold depending on the sequence diversity)
can be used in an identical manner to capture the fold of their protein family. How far the strategy
can be extended in this way needs to be tested, as the case of HPV E7 revealed its limitation
for too diverse a sequence space (and potentially dissimilar/diverse structures between orthologs).
Processing time
Generation of decoys was not the most time-consuming step in this study. It took less than a day
to generate 10,000 decoys with a single thread of Rosetta, since the computationally expensive
full-atom relaxation step was omitted. As the workstation is equipped with 8 cores with hyperthreading
capability, the decoy sets for each protein can be generated in a day (16 threads of Rosetta running
simultaneously). The most time-consuming step was the structural alignment using TMalign. To speed it
up, TMalign’s Fortran source code was ported to C and optimized for speed. In-house code was written
with the ported TMalign inside the loop specifically for the pairwise calculation, because looping in an
interpreter such as a shell script is very slow and the loop execution takes as much time as the score
calculations. Use of the Intel compiler with the best optimization switches improved the speed, but
TMalign is a computationally expensive algorithm and hence it is slow. It was chosen despite its speed
because it gives a better measure for judging whether two structures have the same fold. Since the
pairwise score calculation is of order n^2 in the number of decoys, 10,000 decoys require about
50 million TM-score calculations. For a single sequence, this took about a week with a single thread.
Thus, each protein with its 16 sequences can be finished in about a week. Clustering was quite quick
(~30 min) compared to the previous two steps, although its high memory requirement (~5 GB for a single
calculation) limits the number of simultaneous calculations on the same machine. Overall, a single
protein took about a week of calculation time on a single Mac Pro with dual quad-core 2.66 GHz Xeon
X5500 processors. With a modern mid-sized cluster, this calculation should not take more than a day or
two. This approach is “redundant” in terms of the number of decoys generated per protein, but the
current computational power of workstations already allows the calculation to be completed in days on a
single workstation, so it can be used routinely without investing in expensive resources.
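The pair count quoted above follows from the number of unordered pairs among n decoys, n(n-1)/2. The short Python sketch below reproduces that arithmetic and derives a rough per-pair cost from the reported wall time; the function names are hypothetical, and "about a week" is taken as exactly seven days, so the per-pair figure is only an order-of-magnitude estimate.

```python
def pair_count(n):
    """Number of unordered decoy pairs among n decoys: n(n-1)/2."""
    return n * (n - 1) // 2


def seconds_per_pair(n, wall_seconds):
    """Average time per pairwise TM-score for one single-threaded run."""
    return wall_seconds / pair_count(n)


if __name__ == "__main__":
    n = 10_000
    print(pair_count(n))  # 49995000
    one_week = 7 * 24 * 3600
    print(f"{1000 * seconds_per_pair(n, one_week):.1f} ms per pair")
```

Because the count grows quadratically, doubling the number of decoys per sequence would roughly quadruple the alignment time.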

Figure S1: Similarity maps of the sequences used for the study
The sequences shown are the maximum-distance pairs from the multiple alignments and
were used for consensus selection. In the maps,
Red > Green > Blue indicates the similarity of
sequence pairs and of their decoy cluster centers.
Some sequences share very low
similarity with the other sequences although
they are the same protein from the same species
(but different strains). In these cases,
selection of a consensus is not easy, as the
clusters do not share the same folds within their
sequence spaces (cluster centers do not
overlap among populous clusters).

Figure S2: The effect of low energy score filtering
The idea of using a low energy score filter does not work very well, as shown in the figure. In some
cases it works, and in others it actually worsens the prediction results. In the case of the HPV
proteins (E2/E7), high-energy decoys actually occupy the more ‘native-like’ structures, so filtering
out this population worsens the results. This is a problem of the decoy generation step, and the
consensus approach cannot solve it because it relies on Rosetta to generate decoys. It appears to
stem mainly from poor sampling of structures for viral proteins (HPV sequences are very unique and
unlikely to have a good representation of sequence/structure mapping in the library). At the same
time, a further concern is whether the sequence/structure correlation reflects “biophysical features”
or “evolutionary features”, as a scarce sequence space does not necessarily always mean a low match
to the library. In this study, HPV and HIV proteins are used as test cases. However, a diverse
sequence space with little overlap between sequence spaces (even within the same protein in the same
species) makes the situation somewhat complicated. The matching of sequence to structure may be to
some extent an “evolutionary memory” and not completely “biophysically” determined; more likely,
fragments with independent sequence spaces share the same structure space in some cases (HIV appears
to share it, but HPV does not).
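
As a minimal sketch of what a low energy score filter does before clustering, the Python below keeps only the lowest-energy fraction of a decoy set. The function name `keep_lowest_energy`, the fraction kept, and the decoy names and scores are all illustrative assumptions and are not values or code from this study.

```python
from typing import List, Tuple

# (decoy name, energy score); a lower score is treated as better.
Decoy = Tuple[str, float]


def keep_lowest_energy(decoys: List[Decoy], fraction: float) -> List[Decoy]:
    """Keep the given fraction of decoys with the lowest energy scores."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, int(len(decoys) * fraction))
    return sorted(decoys, key=lambda d: d[1])[:n_keep]


# Made-up scores for illustration only.
decoys = [("d1", -120.5), ("d2", -98.0), ("d3", -133.2), ("d4", -87.4)]
kept = keep_lowest_energy(decoys, fraction=0.5)
print([name for name, _ in kept])  # ['d3', 'd1']
```

In this toy set, d2 and d4 are discarded regardless of their structure. If the native-like decoys happen to sit in that high-energy half, as described for HPV E2/E7, the filter removes them before consensus selection can use them.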

References
1. Ho J, Uger RA, Zwick MB, Luscher MA, Barber BH, et al. (2005) Conformational constraints imposed on a pan-neutralizing HIV-1 antibody epitope result in increased antigenicity but not neutralizing response. Vaccine 23: 1559-1573.
2. Ofek G, Guenaga FJ, Schief WR, Skinner J, Baker D, et al. (2010) Elicitation of structure-specific antibodies by epitope scaffolds. Proc Natl Acad Sci U S A 107: 17880-17887.
3. Penn-Nicholson A, Han DP, Kim SJ, Park H, Ansari R, et al. (2008) Assessment of antibody responses against gp41 in HIV-1-infected patients using soluble gp41 fusion proteins and peptides derived from M group consensus envelope. Virology 372: 442-456.
4. Brussow H (2009) The not so universal tree of life or the place of viruses in the living world. Philos Trans R Soc Lond B Biol Sci 364: 2263-2274.
5. Edwards RA, Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504-510.
6. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368.
7. McBurney SP, Ross TM (2008) Viral sequence diversity: challenges for AIDS vaccine designs. Expert Rev Vaccines 7: 1405-1417.
8. Palmenberg AC, Rathe JA, Liggett SB (2010) Analysis of the complete genome sequences of human rhinovirus. J Allergy Clin Immunol 125: 1190-1199; quiz 1200-1191.
9. Chung SY, Subbiah S (1996) A structural explanation for the twilight zone of protein sequence homology. Structure 4: 1123-1127.
10. Bamford DH (2003) Do viruses form lineages across different domains of life? Res Microbiol 154: 231-236.
11. Bystroff C, Baker D (1998) Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 281: 565-577.
12. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234: 779-815.
13. Ozkan SB, Wu GA, Chodera JD, Dill KA (2007) Protein folding by zipping and assembly. Proc Natl Acad Sci U S A 104: 11987-11992.
14. Das R, Baker D (2008) Macromolecular Modeling with Rosetta. Annu Rev Biochem.
15. Roy A, Kucukural A, Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc 5: 725-738.
16. Bonneau R, Strauss CE, Baker D (2001) Improving the performance of Rosetta using multiple sequence alignment information and global measures of hydrophobic core formation. Proteins 43: 1-11.
17. DeBartolo J, Hocky G, Wilde M, Xu J, Freed KF, et al. (2010) Protein structure prediction enhanced with evolutionary diversity: SPEED. Protein Sci 19: 520-534.
18. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57: 702-710.
19. Zhang Y, Skolnick J (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33: 2302-2309.
20. Xu J, Zhang Y (2010) How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 26: 889-895.
21. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2008) GenBank. Nucleic Acids Res 36: D25-30.
22. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947-2948.
23. Notredame C, Higgins DG, Heringa J (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302: 205-217.
24. de Hoon MJ, Imoto S, Nolan J, Miyano S (2004) Open source clustering software. Bioinformatics 20: 1453-1454.
25. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95: 14863-14868.
26. Zhang Y (2009) Protein structure prediction: when is it useful? Curr Opin Struct Biol 19: 145-155.
