Application of W-curves and TSP to Clustering HIV1 Sequences Douglas Cork1,2,3, Steven Lembark4, Nelson Michael1,5, Jerome Kim1,5 US Military HIV Research Program1; Henry M. Jackson Foundation for the Advancement of Military Medicine2, Rockville, MD. 20850, BCPS Dept., Illinois Institute of Technology3, Chicago, IL. 60616, Workhorse Computing4, Woodhaven, NY., Walter Reed Army Institute of Research5, Rockville, MD.Abstract - The high mutation rate in HIV-1 makes it automated process that can identify and compare drugdifficult to treat and analyze. Monitoring the evolution of resistance sites would be an enormous help. The W-drug resistance requires frequent resequencing, but curves scoring process produces a set of localizedcomparing and visualizing the progress is difficult. One comparison values that can be used as landmarks thatdifficulty is simply locating the areas of interest: gaps and can be used to automate the process of locating relevantcrossover mutations make it difficult to isolate clinically areas for comparison. Clustering new samples withsignificant sequences for comparison. Effectively known ones using the Traveling Salesman Problemdisplaying the results of comparisons grouped according (“TSP”) can automate the grouping of new samples intoto multiple regions is also a problem. Our comparison drug-resistance categories. Combining them leads to analgorithm based on the W-curve helps automate the automated process for tracking medication incomparison process, producing results suitable for individuals or analysis of groups.clustering via a modified solution to the TravelingSalesman Problem (“TSP”). Appropriate color-coding ofthe TSP results allows us to display the results of multiple 3 Backgroundcomparisons effectively for single samples or time-series. Analyzing the data in many HIV studies is madeThe results can be useful for providing guidance in difficult by a combination of HIVs genetics and thetreatment, analyzing the membership in anonymous study sources of data. Patients in many HIV clinics arepopulations, tracking the evolution of drug resistance in anonymous, often because they are illegally engaged inpopulations, or rates of co-infection within study groups. prostitution, illegal drug use, or homosexual sex. Studies of groups look for new strains and how theyKey words: HIV1 Genomic Clustering, Wcurve, spread and have to check for changing members of theTraveling Salesman Problem, Drug Resistance sample population; individual patients have to be sequenced frequently to evaluate appropriate drugs.1 Disclaimer HIVs high mutation and crossover rates combined by discontinuous sequences for the drug responses makeThe opinions or assertions contained herein are the private the analysis difficult. The high mutation in areasviews of the author, and are not to be construed as between the regions of interest confound any analysisofficial, or as reflecting true views of the Dept. of the based on whole genes or regions: there is simply tooArmy or the DoD. much white noise on the genome to use large portions of it for comparison. High rates of mutation and co- 2 Introduction infection leave many individuals with multiple strains of HIV, further confounding the analysis, and HIVOne step in managing HIV infection response is strains have frequent crossover mutations, makingmonitoring individuals and study populations for things even worse.evolution of drug resistance. Given in vivo replication Treatment of HIV is still largely a manual process:kinetics with more than 109 new cells infected every day, patients have to be sampled and sequenced frequentlyeach and every possible single-point mutation occurs and doctors have to make informed decisions on how tobetween 104 and 105 times per day in an HIV infected deal with evolution of the strains infecting them.individuals . This leaves identifying the markers for Presenting sequence comparison results is particularlydrug resistance difficult, requiring multiple manual steps important: physicians have to evaluate the similarity ofin many cases. Especially for population studies, an a single individual to multiple known drug-resistant
strains. sequences will leave the curves in the same quadrant. This keeps our same-quadrant measure small for SNPs.In the US 98% of the infections are type-B, and the singleFDA-approved program for treatment using genetics We use a two-pass process for comparing W-curves.treats type-B only. The U.S. Army has to treat soldiers The first pass produces an array of alignment regionswho acquire HIV all over the world and only 85% of their with starts, stops, and a difference measure that we callcases are type-B, leaving them without any good options “chunks”. SNPs increase the difference measure, gapsfor 15% of their cases. Being able to at least classify the show up as differences in the starting values onnon-B cases and evaluate their treatment outcomes successive chunks of the comparison, indels aseffectively would be a huge help. successive chunks with no change in the relative offset. The second pass summarizes the chunks into whatever measures are useful, for example by averaging the4 Methodology differences over the length of a sequence. The advantage of chunking the results first is that4.1 W-curves similar chunks can be used to locate landmarks forThe W-curve was originally designed as the basis for a aligning sequences with one another. Areas with smallgraphical tool for visualizing very large regions of DNA differences provide an automated means of locating the[3,4]. Over time it has been adapted for numerical offsets between start and stop values in the sequences.comparison as well . By abstracting the genetic Given a library of sequences with known landmarkssequences into a three-dimensional space, W-curves offer such as points of known drug resistance, we can score aa wider range of comparisons rather than comparisons single incoming sequence against all of them.based on searching, aligning and tree-building (assuminga mutation model) with unidimensional strings ofcharacters. W-curves make it easier to find patterns in 4.2 TSP and Clusteringsequences or locate common features between genes. The Traveling Salesman Problem (“TSP”) is quite easyTheir 3D nature also makes it possible to align smaller to describe but difficult to solve. The problem is to takefeatures than are possible with string based techniques. a list of distances between cities and make a tour ofThe design of our comparison utilities builds on these them which visits each city once with the least totalcapabilities to provide a technique for matching the smalldistance. Substitute cost and the algorithm will find theregions of HIV-1s CD4 epitopes along its gp120 gene. “cheapest” route through all of the cities. As it turns out, this problem is NP-complete, requiring analysis ofThe order of bases along the corners is significant: all routes to guarantee the least distance. Much worknumber of hydrogen bonds (2 or 3) and chemical structure has gone into developing heuristics for solving this(purine or pyramidine) share quadrants around the square. problem and there are fast algorithms for approximatingThis means that most synonymous SNPs in the gene the solution. The utility of TSP problems is that an optimal tour will cluster the closest cities together. If the difference measures are for genes, they can similarly be clustered on any region of interest. A number of techniques for determining inter- and intra-clade distances have been developed. One technique developed by Climer and Zhang at Washington University is to add a fixed number of “dummy cities” to the list . Each dummy has a small distance to all other cities (we use 2-20). The non-zero distance leaves these cities in the intra-cluster gaps. We display the resulting tours as color-coded pinwheel diagrams. Appropriate color-coding makes these relatively easy to analyze individually or compare to one another.Figure 1: Wcurve generation showing initial points for curve "CG".
Figure 2: Example of tour generated by Rs TSP package of the wild, Library of drugresistant (“dr”), and random sample strains of HIV1. The library strains cluster tightly together at the top (red) with the wild strain (#33) outside the drugresistant cluster to the right. The approach shown here works well if individuals are sampled over time: preassigning the colors from a first tour makes it relatively easy to watch if samples from individuals drift into different groups.4.3 Generating and Analyzing TSP have found that this works better than simply assigning Clusters with R the colors sequentially from one to the sample size. The R statistical package includes a TSP library, available Assigning the colors by position along the tour gives from CRAN. We have used it here to generate an some additional visual feedback from the similarity of approximate solution for clustering the genes. In our case colors. This can be seen in Figure 2 where the closely an optimal tour is not required: any good approximation related “library” sequences are all red. We have found will cluster the genes properly. Our approach starts with a this color scheme also works well for comparing tours square distance matrix having zeros on the diagonals. The generated from different sections of the same genes: TSP package does not require a symmetric matrix but in even if the nodes appear in different orders on different our case we use one. graphs the similar colors grouping together help identify the clusters.The colors shown in Figure 2 are assigned by generating a list of 1024 colors and rotating the tour so that it starts A slightly different color scheme can be used to observe with the first library sequence. After that the colors are the evolution of drug resistant strains in a population. In assigned by taking the fractional tour length times 1024. that case coloring the time scale allows us to watch For example, if the total tour length were 20 and one of which direction the group is evolving. With the library the sequences fell at a cumulative distance of 9 then it sequences colored red and the population samples green would be assigned a color of 9 / 20 * 1024, or 460. We through blue over time, the migration of blue dots
Figure 3: Example Wcurves POL gene from wild (top) and example drug resistant HIV1 strains . Local features of the Wcurve geometry can provide landmarks for isolating smaller regions of the sequences for comparison .towards library samples is easy to pick out. The library sequences with known clinical results. With the library samples can be drug resistant, or susceptible to different samples in one color and the individuals sequence of drugs. Either way, the progression across clusters is samples colored over time we can see how the various relatively easy to view. samples migrate between clusters. This provides a nice way to integrate treatment information about an This approach also works well for integrating samples of individual that may rely on samples of unrelated genes.an individual over time. We can display results of comparing various regions of an individual to a library of
4.4 Combining the TSP and W-CurveThe TSP approach shown here will work for any comparison mechanism: ClustalW, Fasta, or Blast provide a suitable square comparison matrix and generate the 7 Referencesgraphic results from R. Our use of the Wcurve has an  Leitner T, Korber B, Daniels M, Calef C, Foleyadvantage due to chunks: we can automate scoring B. (2005) HIV-1 Subtype and Circulatingdiscontinuous regions that may have differing alignments. Recombinant Form (CRF) ReferenceThis matters with HIV1 where the high rate of SNPs Sequences., http://hiv.lanl.govleaves too much white noise in larger areas and the high  Brown BK, Sanders-Buell E, Rosa Borges A,rate of gaps makes locating the often small, discontinuous Robb ML, Birx DL, et al. (2008) Crosscladeareas causing drug resistance difficult. The curves neutralization patterns among HIV1 strains fromgeometry also provides us a more featurerich the six major clades of the pandemic evaluatedenvironment in which to compare the curves. Figure 3 and compared in two different models. Virologyshows the wild (HXB2 standard) and five drug resistant 375: 529–538.sequences [citation: website with the sequences we used].  Wu D, Roberge J, Cork DJ, Nguyen BG, GraceDifferences in geometry are visible, even when viewed at T (1993) Computer visualization of longdifferent scales [citation: last years publication]. The genomic sequences, in Visualization 1993, IEEEgeometric representation also offers us more options for Press, New York City, New York, CP 33:308–approximate matching using discrete spatial mathematics 315.than string comparison techniques allow.  Cork D, Lembark S, Tovanabutra S, Sanders- Buell E, Brown B, Robb M, Wieczorek L, Polonis V, Michael N, Kim J.5 Conclusions Application of W-curves and TSP to ClusteringThe W-curve provides us with a way of using landmarks HIV-1 Sequences by Epitope. 847-853to identify regions of interest and score only the relevant http://www.informatik.uni-portions of sample sequences. The TSP with Climer & trier.de/~ley/db/conf/biocomp/biocomp2010.htmlZhangs boundary techniques offers a fast, effective way . Coffin JM, (1995) HIV population dynamics into cluster genes. The R statistical package provides us vivo: implications for genetic variation,with the tools to analyze and color-code the results for pathogenesis, and therapy. Science 267, 483-analysis. Taken together this provides us with a useful 489.tool for comparing the status and evolution drug response  Stanford HIV RT and Protease Sequencefor HIV-1. Database http://hivdb.stanford.edu/pages/genotype-rx.html6 Acknowledgments  http://mypages.iit.edu/˜cork/houstep/introduction.Military HIV-1 Research Project (MHRP)/Henry Jackson  http://www.bioinformatics.org/wcurve/Foundation (HJF) and Workhorse Computing. This work  Cork DJ, Lembark S, Tovanabutra S, Robb ML,was funded by MHRP (Military HIV-1 Research Project) Kim JH (2010) W-curve Alignments for HIV1cooperative agreement (W81XWH-07-2-0067-P00001) Genomic Comparisons. PLoS ONE 5(6):between the Henry M. Jackson Foundation for the e10829. doi:10.1371/journal.plone.0010829Advancement of Military Medicine, Inc., and the U.S.  Climer S, Zhang W (2004) Take a walk andDepartment of Defense (DOD). The cohort work was cluster genes: a TSPbased approach to optimalsupported through NICHD by R01 HD34343-03 and rearrangement clustering. ACM Internationalpartly by a cooperative agreement between the Henry M. Conference Proceeding Series; Vol. 69, p. 22.Jackson Foundation for the Advancement of Military http://portal.acm.org/citation.cfm?id=1015419Medicine and the U.S. Dept. of Defense. The funders had  Zhou T, Xu L, Dey B, Hessell AJ, Van Ryk D,no role in study design, data collection and analysis, Xiang SH, Yang X, Zhang MY, Zwick MB,decision to publish, or preparation of the manuscript. Arthos J, Burton DR, Dimitrov DS, Sodroski J,The authors wish to thank Pranamee Sarma and Scott Wyatt R, Nabel GJ, Kwong PD (2007)Zintek for their assistance preparing the W-curve graphics Structural definition of a conservedused in this paper. neutralization epitope on HIV-1 gp120. Feb 15;445(7129):732-7.