Taking a walk on the W-side:
Comparing Epitopes on HIV-1
with the W-curve & TSP.
Douglas J. Cork1,2,4, Steven Lembark3, Bruce K. Brown1,4, Victoria
R. Polonis1,4, Jerome Kim1,4, Nelson L. Michael5
US Military HIV Research Program (MHRP)/Henry Jackson
Foundation(HJF)1, Rockville, MD., Illinois Institute of Technology2,
Chicago, IL., Workhorse Computing3, Woodhaven, NY., Walter Reed
Army Institute For Research4, Rockville, MD., Walter Reed Army
Institute for Research, Washington, DC5
Statistically, HIV-1 is a problem.
● One of the major problems in studying HIV-1 is
the apparent randomness of clinical response.
● Tests using clades based on genome sequences
show no correlation with immune response.
● Part of the answer may be clades based on
smaller, clinically specific sequences.
● HIV-1 mutates 10,000 times faster than people.
● Existing clades end up including too much white
noise to correlate well with anything.
The Structure of HIV-1
● gp120 is the
● gp120 and
gp41 make up
Standard Clades vs. Neutralization Data
● Standard clades of HIV-1 are based on
phylogenetic trees of the genome.
● They do not correlate well with neutralization data.
● Between-clade and within-clade comparisons show similar variability.
● Antibody and Cell studies have low correlation for
● Lack of a correlation prevents developing any
broadly neutralizing treatments.
● Today we have to sequence the virus to treat it.
Example: Cross-clade neutralization shows no
useful pattern in Peripheral Blood Mononuclear
Cell or Pseudovirus Assay studies.
● Distribution of
HIV-1 Genetics Complicate Analysis
● Genes and proteins are normally reported with
respect to a single strain, HXB2.
● Hard to compare local features between strains.
● Need to rediscover them for each study.
● Neutralization data are specific to gp120.
● Variable regions in gp120 leave corresponding
locations in different samples off by tens of bases.
● Antibody binding sites (epitopes) are only a few
bases long, with a majority in the variable regions.
Another approach: W-curves
● The W-curve is based on chaos and game
● It abstracts a sequence of DNA into a three
● Originally designed for visualization, we have now
adapted it for machine comparison.
● Geometric analysis of the curves allows for
piecewise comparison of the sequences.
● Start with a square at the origin and a discrete
Z-axis matching the sequence base numbers.
● Each point moves halfway towards the corner
for the next base.
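The construction just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the assignment of bases to corners (`CORNERS`) and the helper names `w_curve` and `gap` are assumptions made for this example, since the slides do not say which base sits on which corner.

```python
import math

# Assumed corner layout for a square centred on the origin.
# The slides do not specify which base maps to which corner.
CORNERS = {
    "A": (-1.0, -1.0),
    "C": (-1.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, -1.0),
}


def w_curve(seq):
    """Return a list of (x, y, z) points, one per base.

    The point starts at the origin and moves halfway towards the
    corner of each successive base; z is the 1-based base number.
    """
    x = y = 0.0
    points = []
    for z, base in enumerate(seq, start=1):
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y, z))
    return points


def gap(curve_a, curve_b, z):
    """Distance in the x-y plane between two curves at base number z."""
    ax, ay, _ = curve_a[z - 1]
    bx, by, _ = curve_b[z - 1]
    return math.hypot(ax - bx, ay - by)


reference = w_curve("ACGTACGT")
variant = w_curve("ACTTACGT")   # single substitution at base 3
print(gap(reference, variant, 3))   # 1.0
print(gap(reference, variant, 7))   # 0.0625
```

Because every step halves the distance to a corner, the gap opened by a single substitution also halves with each following base. In this sketch a substitution at base 3 leaves only 1/16 of its initial gap by base 7.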
● All curves
● The curve
“C” then “G”
● Converge by base 7 after a SNP at base 3.
is quick even
● Curves converge as SNPs do but with a phase
● Approximating the
distance smooths over
● Smaller angles reduce
angles add them.
Needle in a Haystack: CD4 Epitope
● The CD4 epitopes occupy only a few, widely
dispersed locations on gp120.
● Locating portions of the discontinuous epitope
● Variable regions between them change the
locations between samples.
● Portions of the epitope within the variable region
can be hidden by nearby changes.
Analyzing the 3D Structure
● The advantage to W-curves is that even small
features of the gene generate unique geometry.
● Features are easier to identify in 3D than the 1D
● By first locating large-scale features, we can
search for smaller ones more easily.
● First align extreme points on the curves.
● Then compare regions between them.
● With a library of fragments, we pick the best match.
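As a rough illustration of picking the best match for a fragment, the sketch below slides a fragment's curve along a target's curve and keeps the offset with the smallest summed squared distance. It is an assumption-laden stand-in, not the scoring used by the authors: the corner layout, the names `w_curve_xy` and `best_offset`, the squared-distance score, and the `skip` parameter (which ignores the first few points while the fragment's curve converges onto the target's) are all hypothetical choices for this example.

```python
# Assumed corner layout; the slides do not specify it.
CORNERS = {
    "A": (-1.0, -1.0),
    "C": (-1.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, -1.0),
}


def w_curve_xy(seq):
    """x-y points of the curve: each point moves halfway to the next corner."""
    x = y = 0.0
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def best_offset(target, fragment, skip=2):
    """Return (offset, score) for the best placement of fragment on target.

    The fragment's curve starts at the origin, so its first `skip`
    points are ignored while it converges towards the target's curve.
    """
    tgt = w_curve_xy(target)
    frag = w_curve_xy(fragment)
    best = None
    for offset in range(len(target) - len(fragment) + 1):
        score = 0.0
        for k in range(skip, len(fragment)):
            dx = tgt[offset + k][0] - frag[k][0]
            dy = tgt[offset + k][1] - frag[k][1]
            score += dx * dx + dy * dy
        if best is None or score < best[1]:
            best = (offset, score)
    return best


offset, score = best_offset("ACGTTGCAGGCTAACG", "GCAGGC")
print(offset)   # 5
```

The score at the true position is small but not zero, because the fragment's curve begins at the origin rather than wherever the target's curve happens to be. That is why the sketch skips the first points instead of demanding an exact overlap.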
W-curve Algorithm & Serial Comparison
● Large-scale features guide the search for
● Conserved regions anchor search.
● After aligning 'peaks' in the curves, we align smaller
and less discriminating features.
● A library of W-curve fragments finds best fit with
● Repeatable process allows examining and
scoring large numbers of finer features.
W-curves of HXB2 genome and gp120
● The curve for HXB2 illustrates the most
important features of W-curves.
● Looking at each section of the W-curve you'll notice
that each area is different from the others.
● This is what allows us to locate small features: it is
easier to discern them in 3D than a character string.
● This figure also highlights the location of gp120.
A detailed view of gp120
● The next slide shows the first portion of HXB2's
env gene: gp120.
● Again, notice that each portion of the curve is
distinct from the others.
● The different conserved (C) and variable (V)
regions are marked across the bottom of the
The CD4 epitope in gp120
● This is where the W-curve really becomes
useful: isolating the epitope locations within
● The highlighted areas show the epitope
locations with an additional 3 bases of
conformational region before and after (which
combines a few of the regions).
● Note that the epitope is dispersed and lives
largely in the variable regions.
Clustering With the TSP
● Solutions to the Traveling Salesman Problem
can be used to cluster genes.
● The shortest path clusters more-similar sequences.
● The difficulty is in getting clades out of the TSP.
● One approach uses dummy cities with small
distances to all other cities.
● Dummies end up in the inter-cluster regions.
● This approach has proven fast & repeatable.
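The dummy-city idea can be sketched on a toy example. The code below is illustrative only and makes several assumptions not stated in the slides: the function names, the dummy-to-city distance `dummy_dist`, and the large dummy-to-dummy distance `big` (chosen here so that two dummies never sit next to each other) are hypothetical. It also uses an exhaustive search, which is exact but only practical for a handful of cities; a real data set would need a heuristic TSP solver.

```python
from itertools import permutations


def add_dummy_cities(dist, n_dummies, dummy_dist, big=1e9):
    """Extend a distance matrix with dummy cities.

    Dummies are `dummy_dist` from every real city and `big` from each
    other (an assumption made here to keep dummies apart in the tour).
    """
    n = len(dist)
    size = n + n_dummies
    full = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue
            if i < n and j < n:
                full[i][j] = dist[i][j]
            elif i >= n and j >= n:
                full[i][j] = big
            else:
                full[i][j] = dummy_dist
    return full


def shortest_tour(full):
    """Exhaustive exact TSP: fix city 0 and try every ordering of the rest."""
    size = len(full)
    best_len, best_tour = float("inf"), None
    for perm in permutations(range(1, size)):
        tour = (0,) + perm
        length = sum(full[tour[k]][tour[(k + 1) % size]] for k in range(size))
        if length < best_len:
            best_len, best_tour = length, tour
    return list(best_tour), best_len


def cut_at_dummies(tour, n_real):
    """Split the cyclic tour into clusters wherever a dummy city appears.

    Requires at least one dummy city in the tour.
    """
    first = next(k for k, city in enumerate(tour) if city >= n_real)
    rotated = tour[first + 1:] + tour[:first + 1]   # now ends on a dummy
    clusters, current = [], []
    for city in rotated:
        if city >= n_real:
            if current:
                clusters.append(current)
            current = []
        else:
            current.append(city)
    return clusters


# Toy data: two well-separated groups of points on a line.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
full = add_dummy_cities(dist, n_dummies=2, dummy_dist=0.5)
tour, length = shortest_tour(full)
print(sorted(sorted(c) for c in cut_at_dummies(tour, len(positions))))
# [[0, 1, 2], [3, 4, 5]]
```

In the toy data the tour spends its two dummies on the two expensive jumps between the groups, so cutting at the dummies recovers the groups. Choosing how many dummies to add is the open question noted under further work.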
Further Work on Clusters
● Find algorithm for repeatably assigning the number
of dummy cities.
● Automate detecting “similar” clusters.
● Time-series analysis.
● Watch sample groups for new members.
● Track evolution of drug resistance in clinical trial
groups, individual patients.
● Our goal is to correlate neutralization outcomes.
● Compare small regions near the epitopes.
● Find DNA that clusters similarly to neutralization
● DNA clusters that match the Neutralization data
are “clinical” clades.
● Biggest issue will be deciding what “similar” is.
● Probably a good application for Fuzzy Logic.
● Thanks to the authors of the Brown et al. study.
All of the work we've shown you was done on a
computer. Without fieldwork and wet labs, it would
be empty. Next time you sit down to crunch some
numbers, stop and picture for a moment the
process of acquiring them. You'll get a whole new
appreciation for your work.