Clustering Genes: W-curve + TSP

HIV-1, W-curves, & Shoe Leather
● Existing genetics tools fail on HIV-1.
● They make assumptions based on “normal” DNA
that fail on HIV – or cancer, or plants.
● Correlation tools look at evolution, not state.
● We are working on tools for clinical analysis.
● The W-curve abstracts DNA into geometry.
● The TSP clusters genes rather than trying to
impute inheritance.

Sequences Inform Treatment
● Treating HIV requires sequencing it to choose
appropriate drugs:
● HIV-1 evolves drug resistance in months.
● Multiple strains in a single patient are common,
either from multiple sources or from evolution.
● Crossover recombination is relatively common due to
cross-infected cells.

Problem: HIV is Hard to Analyze
● HIV is a non-correcting retrovirus.
● Evolves 10,000 times faster than humans or
influenza – one new strain per patient per day.
● Genomes for wild types range from 8349 to
9829 bases, making localized comparisons
difficult.
● The single FDA-approved algorithm directing
treatment from sequence handles only type B;
the U.S. Army has 15%+ non-B infections.

The Current Tools
● BLAST, FASTA, ClustalW perform alignment.
● Table-driven analysis of base transitions.
● Score the entire sequence with a single value.
● Graphical tools are designed to display
inheritance rather than state.
● Output is difficult to read in a clinical setting.

Phenogram of Drug
Resistant and Random
Samples
● Tries to show ancestry,
not state.
● Not very good for visual
identification of which
patients are drug
resistant.

Trees are not particularly
helpful either.

ClustalW of gp120
[Figure: ClustalW alignment of gp120 for HIVHXB2CG and AY736838-gp120_]
● Difficult to compare sequences visually.
● Not useful for large numbers of sequences.
● Gaps make analysis difficult.

New Tools
● Clinical vs. evolutionary.
● Avoid assumptions that break current tools.
● Suitable for a repeatable process in clinics or
data mining in research.
● We are using:
● W-curve for analysis.
● TSP for clustering.
● R for data management & display.

W-curve
● Geometric abstraction of DNA.
● Generated by a simple state machine.
● Alignment is available at a finer scale using
geometry than using character strings.
● Avoids assumptions about transition
probabilities by taking the figure as-is.

W-curve Generator is a State Machine
● C,A,T,G are assigned to corners of a square.
● Successive points move halfway to the next
base's corner.

W-curve for “CG”
● Curve shown
in Blue.
● Halfway to C
then G in
X‑Y, single
steps in Z.
● Cylindrical storage
simplifies
comparison.
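
The state machine described above can be sketched in a few lines of Python. The corner layout, the starting point at the origin, and the function names are assumptions made for illustration; the slides only state that C, A, T, G occupy the corners of a square, that each point moves halfway to the next base's corner, and that Z advances one step per base.

```python
import math

# Hypothetical corner layout (assumption): the slides do not say which
# base sits on which corner of the square.
CORNERS = {
    "C": (1.0, 1.0),
    "A": (-1.0, 1.0),
    "T": (-1.0, -1.0),
    "G": (1.0, -1.0),
}


def wcurve(sequence, start=(0.0, 0.0)):
    """Return (x, y, z) points: halfway to each base's corner, one Z step per base."""
    x, y = start  # starting at the origin is an assumption
    points = []
    for z, base in enumerate(sequence.upper(), start=1):
        corner_x, corner_y = CORNERS[base]
        x, y = (x + corner_x) / 2.0, (y + corner_y) / 2.0
        points.append((x, y, z))
    return points


def to_cylindrical(points):
    """Convert (x, y, z) points to (r, theta, z) for cylindrical storage."""
    return [(math.hypot(x, y), math.atan2(y, x), z) for x, y, z in points]


print(wcurve("CG"))  # [(0.5, 0.5, 1), (0.75, -0.25, 2)]
```

With this layout the “CG” curve moves halfway from the origin to C, then halfway from there to G, matching the two-point curve on the slide.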

W-curve of Wild HIV-1 POL Gene
W-curve of Wild HIV-1 POL

W-curves of Wild & Drug Resistant Pol

Detail of Wild & Drug Resistant Pol

Distance Metric
● Bases are arranged in a
square to minimize the
effects of SNPs.
● Synonymous SNPs
are usually in the
same quadrant.
● Points within the same
quadrant have a small
difference; opposite
quadrants have a larger one.
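
A rough way to see the quadrant effect is to compare two curves point by point. The sketch below assumes plain Euclidean distance in X-Y and a hypothetical corner layout; the slides do not give the actual metric or layout, so this only illustrates the geometry. In this sketch a substitution between adjacent corners displaces the curve less than one between opposite corners, and the displacement halves with every matching base that follows.

```python
import math

# Hypothetical corner layout (assumption, not taken from the slides).
CORNERS = {
    "C": (1.0, 1.0),
    "A": (-1.0, 1.0),
    "T": (-1.0, -1.0),
    "G": (1.0, -1.0),
}


def wcurve_xy(sequence):
    """X-Y track of the curve: each point moves halfway to the base's corner."""
    x, y = 0.0, 0.0
    points = []
    for base in sequence.upper():
        corner_x, corner_y = CORNERS[base]
        x, y = (x + corner_x) / 2.0, (y + corner_y) / 2.0
        points.append((x, y))
    return points


def pointwise_distance(seq_a, seq_b):
    """Euclidean distance between corresponding points (assumed metric)."""
    return [math.dist(p, q) for p, q in zip(wcurve_xy(seq_a), wcurve_xy(seq_b))]


reference = "ACGTACGT"
adjacent = "ACTTACGT"  # G -> T at index 2: adjacent corners in this layout
opposite = "ACATACGT"  # G -> A at index 2: opposite corners in this layout

print([round(d, 3) for d in pointwise_distance(reference, adjacent)])
print([round(d, 3) for d in pointwise_distance(reference, opposite)])
```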

Comparison Produces “Chunks”
● Comparison yields a list of chunks.
● Curves are aligned within the chunk.
● Summing chunks gives a single value for the two curves.
● Analyzing them in detail allows mining local
similarities and variations.
● Grouping allows examination of crossover
recombination events.

Clustering: Traveling Salesman Problem
● The TSP is simple to describe, hard to solve:
● Starting and finishing in the same city.
● Visit a list of cities once each.
● Minimize the distance (cost).
● Optimal solutions will cluster the nearby cities.
● The problem was always in defining the
clusters.

Take a Walk and Cluster Your Genes
● Climer & Zhang, 2004.
● Method for detecting N clusters:
● Add N dummy cities to the distance map.
● Each one has the same, small distance to all other
cities (we use 220).
● Dummy cities end up in the intercluster gaps.
● The process is trivial to implement: just add that
many rows and columns to the original
comparison matrix.
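
The dummy-city step can be sketched as follows. This is a minimal illustration, not the authors' code: the brute-force solver is a stand-in that only works for a handful of cities, the dummy distance of 0.1 is an illustrative value rather than the one quoted on the slide, and giving dummy cities the same small distance to each other is an assumption. All function names are hypothetical.

```python
from itertools import permutations


def add_dummy_cities(dist, n_dummies, dummy_dist):
    """Append n_dummies rows and columns, all at the same small distance."""
    n_real = len(dist)
    size = n_real + n_dummies
    out = [[dummy_dist] * size for _ in range(size)]
    for i in range(n_real):
        for j in range(n_real):
            out[i][j] = dist[i][j]
    for i in range(size):
        out[i][i] = 0.0
    return out


def brute_force_tour(dist):
    """Exhaustive closed tour starting at city 0 (stand-in for a real TSP solver)."""
    n = len(dist)
    best_cost, best_tour = float("inf"), None
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        cost = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if cost < best_cost:
            best_cost, best_tour = cost, tour
    return list(best_tour), best_cost


def split_at_dummies(tour, n_real):
    """Cut the cyclic tour at each dummy city; assumes at least one dummy."""
    first_dummy = next(i for i, city in enumerate(tour) if city >= n_real)
    rotated = tour[first_dummy + 1:] + tour[:first_dummy + 1]
    clusters, current = [], []
    for city in rotated:
        if city >= n_real:
            if current:
                clusters.append(current)
            current = []
        else:
            current.append(city)
    return clusters


# Six made-up "sequences" on a line, in two obvious groups.
positions = [0, 1, 2, 10, 11, 12]
dist = [[float(abs(a - b)) for b in positions] for a in positions]

augmented = add_dummy_cities(dist, n_dummies=2, dummy_dist=0.1)
tour, cost = brute_force_tour(augmented)
clusters = split_at_dummies(tour, n_real=len(positions))
print(tour, cost, clusters)
```

Because the tour is a closed loop, two dummy cities are needed to cut it into two clusters; the dummies settle into the two long hops between the groups.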

Displaying the Tour
● Mapping the tour onto a circle gives a good
view of the distances.
● Coloring simplifies inspection.
● Black dots for dummy cities.
● Single type at the top (e.g. wild type).
● Color successive data points using the “rainbow”
sequence with a large number of colors.
● Sequences more alike get more similar colors.
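
One possible way to lay out and color a tour is sketched below. Placing each city at an angle proportional to the distance travelled along the tour, and spacing hues evenly with `colorsys`, are assumptions standing in for whatever the original plots used; the function names are hypothetical. Dummy cities are the ones with an index at or above `n_real`.

```python
import colorsys
import math


def layout_tour_on_circle(tour, dist):
    """Place cities on a unit circle, first city at the top, going clockwise.

    The angle is proportional to the distance travelled so far (assumption).
    """
    n = len(tour)
    steps = [dist[tour[i]][tour[(i + 1) % n]] for i in range(n)]
    total = sum(steps)
    placed, travelled = [], 0.0
    for i, city in enumerate(tour):
        angle = 2.0 * math.pi * travelled / total
        placed.append((city, math.sin(angle), math.cos(angle)))
        travelled += steps[i]
    return placed


def rainbow_colors(tour, n_real):
    """Hue follows tour order for real cities; dummy cities are black."""
    real = [city for city in tour if city < n_real]
    colors = {}
    for rank, city in enumerate(real):
        colors[city] = colorsys.hsv_to_rgb(rank / len(real), 1.0, 1.0)
    for city in tour:
        if city >= n_real:
            colors[city] = (0.0, 0.0, 0.0)
    return colors


# Toy example: three real cities plus one dummy (index 3), unit distances.
example_dist = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
example_tour = [0, 1, 3, 2]
for city, x, y in layout_tour_on_circle(example_tour, example_dist):
    print(city, round(x, 3), round(y, 3), rainbow_colors(example_tour, 3)[city])
```

Starting the tour list at a reference sequence such as the wild type puts it at the top of the circle, and neighbors on the tour receive neighboring hues.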

Example with 8 DR, 100 Samples

Multiple uses for color sequence.
● Track individual over time.
● Progression through colors shows history.
● Clustering highlights progression towards drug
resistance.
● Track sample population.
● Recycling the colors from one initial tour helps show
changes in successive graphs.
● Simplifies tracking progression in anonymous
populations found in HIV treatment centers.

Visualizing Wcurves
● We use a WebGL-based package, “WebCurve”.
● Developed at IIT as a web-friendly solution for
examining 3D geometry.
● Gracefully handles displaying 100+ sequences
at 10K bases each on a notebook computer.
● Available from GitHub; the archive includes a web
server and code to generate files for display.

Summary
● W-curve and TSP allow us to cluster genes.
● Provides a more useful output in a clinical
setting.
● Color coding the TSP results allows tracking
changes in a population or the progression of an
individual over time.
