HIV-1, W-curves, & Shoe Leather
● Existing genetics tools fail on HIV-1.
    ● They make assumptions based on “normal” DNA that fail on HIV, cancer, or plants.
    ● Correlation tools look at evolution, not state.
● We are working on tools for clinical analysis.
    ● The W-curve abstracts DNA into geometry.
    ● The TSP clusters genes rather than trying to impute inheritance.
Sequences Inform Treatment
● Treating HIV requires sequencing it to choose appropriate drugs:
    ● HIV-1 evolves drug resistance in months.
    ● Multiple strains in a single patient are common, either from multiple sources or from evolution.
    ● Crossover recombination is relatively common due to cross-infected cells.
Problem: HIV is Hard to Analyze
● HIV is a non-correcting retrovirus.
● Evolves 10,000 times faster than humans or influenza: one new strain per patient per day.
● Genomes for wild types range from 8349 to 9829 bases, making localized comparisons difficult.
● The single FDA-approved algorithm directing treatment from sequence handles only type B; the U.S. Army has 15%+ non-B infections.
The Current Tools
● BLAST, FASTA, and ClustalW perform alignment.
    ● Table-driven analysis of base transitions.
    ● Score the entire sequence with a single value.
● Graphical tools are designed to display inheritance rather than state.
    ● Output is difficult to read in a clinical setting.
Phenogram of Drug-Resistant and Random Samples
● Tries to show ancestry, not state.
● Not very good for visually identifying which patients are drug resistant.
New Tools
● Clinical vs. evolutionary.
● Avoid assumptions that break current tools.
● Suitable for a repeatable process in clinics or data mining in research.
● We are using:
    ● W-curve for analysis.
    ● TSP for clustering.
    ● R for data management & display.
W-curve
● Geometric abstraction of DNA.
● Generated by a simple state machine.
● Geometry allows alignment at a finer scale than character strings.
● Avoids assumptions about transition probabilities by taking the figure as-is.
W-curve Generator is a State Machine
● C, A, T, G are assigned to corners of a square.
● Successive points move halfway to the next base's corner.
W-curve for “CG”
● Curve shown in blue.
● Halfway to C, then halfway to G, in X-Y; single steps in Z.
● Cylindrical-coordinate storage simplifies comparison.
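The generator described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the slides do not say which base sits on which corner or where the curve starts, so the `CORNERS` layout, the starting point at the centre of the square, and the function names `wcurve` and `to_cylindrical` are all assumptions.

```python
import math

# Assumed corner layout: the slides place C, A, T, G on the corners of a
# square but do not say which base goes where, so this mapping is illustrative.
CORNERS = {
    "C": (1.0, 1.0),
    "A": (-1.0, 1.0),
    "T": (-1.0, -1.0),
    "G": (1.0, -1.0),
}


def wcurve(sequence, corners=CORNERS):
    """Return one (x, y, z) point per base in the sequence."""
    x, y = 0.0, 0.0  # assumed start: the centre of the square
    points = []
    for z, base in enumerate(sequence.upper(), start=1):
        cx, cy = corners[base]
        # State-machine step: move halfway from the current point
        # to the corner belonging to this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        # Z advances by a single step per base.
        points.append((x, y, float(z)))
    return points


def to_cylindrical(points):
    """Convert (x, y, z) points to cylindrical (r, theta, z)."""
    return [(math.hypot(x, y), math.atan2(y, x), z) for x, y, z in points]


if __name__ == "__main__":
    for point in wcurve("CG"):
        print(point)
```

With this assumed layout, the curve for “CG” moves halfway to the C corner and then halfway from there to the G corner, one Z step at a time.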
W-curve of Wild HIV-1 POL Gene
[Figure: W-curve of wild HIV-1 POL]
Distance Metric
● Bases are arranged in a square to minimize the effects of SNPs.
● Synonymous SNPs are usually in the same quadrant.
● Points within the same quadrant have a small difference; points in opposite quadrants have a larger one.
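To make the quadrant effect concrete, the sketch below compares two curves point by point. The slides do not give the formula for the metric, so plain Euclidean distance in X-Y is an assumption, as are the corner layout and the helper names `xy_points` and `pointwise_distances`.

```python
import math

# Assumed corner layout (not specified in the slides).
CORNERS = {
    "C": (1.0, 1.0),
    "A": (-1.0, 1.0),
    "T": (-1.0, -1.0),
    "G": (1.0, -1.0),
}


def xy_points(sequence):
    """X-Y trace of a sequence: each step moves halfway to the base's corner."""
    x, y = 0.0, 0.0
    points = []
    for base in sequence.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def pointwise_distances(curve_a, curve_b):
    """Euclidean distance between corresponding points of two curves."""
    return [
        math.hypot(ax - bx, ay - by)
        for (ax, ay), (bx, by) in zip(curve_a, curve_b)
    ]


if __name__ == "__main__":
    reference = xy_points("AAAA")
    # Third base swapped for one on an adjacent corner, then an opposite corner.
    print(pointwise_distances(reference, xy_points("AACA")))
    print(pointwise_distances(reference, xy_points("AAGA")))
```

In this sketch a substitution toward an adjacent corner separates the curves less than one toward the opposite corner, and the gap halves with every matching base that follows, so a single SNP has only a local effect on the comparison.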
Comparison Produces “Chunks”
● Comparison yields a list of chunks.
● Curves are aligned within the chunk.
● Summing the chunks gives a single value for two curves.
● Analyzing them in detail allows mining local similarities and variations.
● Grouping allows examination of crossover recombination events.
Clustering: Traveling Salesman Problem
● The TSP is simple to describe, hard to solve:
    ● Start and finish in the same city.
    ● Visit a list of cities once each.
    ● Minimize the distance (cost).
● Optimal solutions will cluster the nearby cities.
● The problem was always in defining the clusters.
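For a handful of cities the problem statement above can be solved by exhaustive search, which is enough to see nearby cities end up adjacent in the tour. This is only an illustration of the problem definition; the slides do not say which solver is used, and brute force is impractical beyond roughly ten cities. The names `tour_length` and `brute_force_tour` are hypothetical.

```python
from itertools import permutations


def tour_length(tour, dist):
    """Total cost of a closed tour, including the leg back to the start."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def brute_force_tour(dist):
    """Try every ordering that starts at city 0 and keep the shortest."""
    n = len(dist)
    best_tour, best_length = None, None
    for rest in permutations(range(1, n)):
        tour = (0,) + rest
        length = tour_length(tour, dist)
        if best_length is None or length < best_length:
            best_tour, best_length = tour, length
    return best_tour, best_length


if __name__ == "__main__":
    # Four cities on a line: two near 0 and two near 10.
    positions = [0, 1, 10, 11]
    dist = [[abs(a - b) for b in positions] for a in positions]
    print(brute_force_tour(dist))
```

In the four-city example the shortest tours keep the two cities near 0 next to each other and the two near 10 next to each other, which is the clustering behaviour the slide refers to.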
Take a Walk and Cluster Your Genes
● Climer & Zhang, 2004.
● Method for detecting N clusters:
    ● Add N dummy cities to the distance map.
    ● Each one has the same, small distance to all other cities (we use 220).
    ● Dummy cities end up in the inter-cluster gaps.
● The process is trivial to implement: just add that many rows and columns to the original comparison matrix.
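The dummy-city step can be sketched as padding the distance matrix and then cutting a solved tour wherever a dummy appears. This is a sketch under assumptions: `add_dummy_cities` and `split_at_dummies` are hypothetical names, dummy-to-dummy distances are given the same small value as dummy-to-real ones, and the distance is left as a required parameter because the scale behind the slide's value is not stated. The TSP solver itself is outside the sketch.

```python
def add_dummy_cities(dist, n_dummies, dummy_distance):
    """Return a copy of the square matrix with n_dummies extra rows and columns.

    Every dummy city is dummy_distance away from every other city
    (assumed to include the other dummies); the diagonal stays at zero.
    """
    n_real = len(dist)
    size = n_real + n_dummies
    padded = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue  # distance from a city to itself
            if i < n_real and j < n_real:
                padded[i][j] = dist[i][j]  # original comparison value
            else:
                padded[i][j] = dummy_distance  # at least one end is a dummy
    return padded


def split_at_dummies(tour, n_real):
    """Cut a cyclic tour at the dummy cities (index >= n_real) into clusters."""
    dummy_positions = [k for k, city in enumerate(tour) if city >= n_real]
    if not dummy_positions:
        return [list(tour)]
    # Rotate so the walk starts just after a dummy; a cluster that wraps
    # around the end of the tour then stays in one piece.
    start = dummy_positions[0] + 1
    rotated = list(tour[start:]) + list(tour[:start])
    clusters, current = [], []
    for city in rotated:
        if city >= n_real:
            if current:
                clusters.append(current)
            current = []
        else:
            current.append(city)
    return clusters


if __name__ == "__main__":
    padded = add_dummy_cities([[0, 5], [5, 0]], n_dummies=2, dummy_distance=0.5)
    print(padded)
    print(split_at_dummies([1, 4, 2, 3, 5, 0], n_real=4))
```

Any solver that returns a tour as a list of city indices can be used between the two functions: pad the matrix, solve the tour on the padded matrix, then split the result.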
Displaying the Tour
● Mapping the tour onto a circle gives a good view of the distances.
● Coloring simplifies inspection.
    ● Black dots for dummy cities.
    ● Single type at the top (e.g. wild type).
    ● Color successive data points using the “rainbow” sequence with a large number of colors.
    ● Sequences more alike get more similar colors.
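A layout along these lines can be sketched as follows. The slides use R for display; Python is used here only for illustration, and the details are assumptions: angles are taken proportional to the distance travelled along the tour, the first city is placed at the top with the tour running clockwise, and hues are spread evenly over the real cities. The function names are hypothetical.

```python
import colorsys
import math


def circle_layout(tour, dist):
    """Place tour cities on the unit circle, spaced by tour distance.

    The first city in the tour sits at the top; the rest follow clockwise.
    """
    n = len(tour)
    steps = [dist[tour[i]][tour[(i + 1) % n]] for i in range(n)]
    total = sum(steps)
    positions = []
    travelled = 0.0
    for i, step in enumerate(steps):
        # Fall back to even spacing if every distance is zero.
        fraction = travelled / total if total else i / n
        angle = math.pi / 2.0 - 2.0 * math.pi * fraction
        positions.append((math.cos(angle), math.sin(angle)))
        travelled += step
    return positions


def rainbow_colors(tour, n_real):
    """Map each city to an RGB triple: rainbow for data, black for dummies."""
    real = [city for city in tour if city < n_real]
    colors = {}
    for rank, city in enumerate(real):
        # Neighbours in the tour get neighbouring hues.
        colors[city] = colorsys.hsv_to_rgb(rank / len(real), 1.0, 1.0)
    for city in tour:
        if city >= n_real:
            colors[city] = (0.0, 0.0, 0.0)
    return colors


if __name__ == "__main__":
    tour = [0, 1, 3, 2]  # city 3 is a dummy when n_real == 3
    dist = [[0 if a == b else 1 for b in range(4)] for a in range(4)]
    print(circle_layout(tour, dist))
    print(rainbow_colors(tour, n_real=3))
```

Putting the reference sequence first in the tour places it at the top of the circle, and because hue follows tour order, sequences that sit close together in the tour receive similar colors.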
Multiple uses for color sequence.
● Track an individual over time.
    ● Progression through colors shows history.
    ● Clustering highlights progression towards drug resistance.
● Track a sample population.
    ● Recycling the colors from one initial tour helps show changes in successive graphs.
    ● Simplifies tracking progression in anonymous populations found in HIV treatment centers.
Visualizing W-curves
● We use a WebGL-based package, “WebCurve”.
● Developed at IIT as a web-friendly solution for examining 3D geometry.
● Gracefully handles displaying 100+ sequences at 10K bases each on a notebook computer.
● Available from GitHub; the archive includes a web server and code to generate files for display.
Summary
● W-curve and TSP allow us to cluster genes.
● Provides a more useful output in a clinical setting.
● Color coding the TSP results allows tracking changes in a population or the progression of an individual over time.