Promises and Challenges of Next Generation Sequencing for HIV and HCV
AIDS CLINICAL ROUNDS
The UC San Diego AntiViral Research Center sponsors weekly presentations by infectious disease clinicians, physicians and researchers. The goal of these presentations is to provide the most current research, clinical practices and trends in HIV, HBV, HCV, TB and other infectious diseases of global significance.
The slides from the AIDS Clinical Rounds presentation that you are about to view are intended for the educational purposes of our audience. They may not be used for other purposes without the presenter’s express permission.
Sergei L Kosakovsky Pond, PhD. Associate Professor, UCSD Department of Medicine.
January 11, 2013
Outline
✤ Next generation / Ultradeep sequencing (NGS/UDS) technology
✤ NGS applications for HIV and HCV
  ✤ What are the unique advantages of NGS?
  ✤ What are the limitations of NGS?
✤ Clinical relevance of NGS-based assays
✤ Regulatory approval
Genomic sequencing
✤ In recent years, sequencing (DNA, RNA) has rapidly become among the cheapest and fastest assays in many applications
✤ Sub-$1000 human genome very shortly.
[Figure: sequencing costs over time, annotated at the point where NGS (Solexa) was introduced commercially. Source: http://www.genome.gov/sequencingcosts/]
Is NGS relevant for medicine?
✤ In 2012, 6 out of TIME magazine’s Top 10 Medical Breakthroughs relied on NGS
  1 The ENCODE project (non-coding DNA)
  2 The Human Microbiome Project
  6 Cancer Genome Atlas
  7 Neo-/pre-natal screening for rare diseases
  8 Pediatric Cancer Diagnostics
  10 P. acnes phage characterization
Next generation sequencing
✤ Traditional (Sanger) sequencing generates a small number of intermediate length reads (~1000 bp)
✤ All NGS technologies perform millions of parallel sequencing reactions to generate many, typically short, reads per run.
✤ Two canonical applications for NGS
  ✤ Assembling long sequences from short fragments (human genome, cancer)
  ✤ Characterizing diverse populations (HIV, HCV, immune repertoire, metagenomics)
Platform comparison
Instrument | First introduced | Output per run | Run-time | Use in HIV/HCV settings
Roche 454 FLX+/Junior | 2005 | 10^5-10^6 reads of 400-700 bp | 10-20 hrs | Extensive (>300 papers)
Illumina HiSeq/MiSeq | 2007 | 10^7-10^9 reads of 36-250 bp | 7 hrs - 11 days | Limited (~30 papers)
Life Sciences IonTorrent | 2010 | 10^5-10^7 reads of 35-400 bp | 1-8 hrs | Limited (<10 papers)
Pacific Biosciences PacBio RS | 2011 | 10^4-10^5 reads of 1000-10000 bp | 1-2 hrs | Limited (<10 papers)
Characterizing viral diversity within a host
✤ Being able to characterize HIV-1 populations rapidly and accurately is important for understanding pathogenesis, interplay between viruses and humoral responses, and the evolution of drug resistance
✤ Both HIV-1 and HCV exist as viral quasispecies in a host, i.e. many distinct viral strains are circulating at any given moment in time
✤ NGS has the potential to directly sequence many such strains
✤ Using multiplexing (multiple samples/run), high throughput can be achieved
Characterizing minority DRAMs
✤ Perhaps the clearest clinical application of NGS for HIV and HCV.
✤ We already know what mutations we are looking for (e.g. K103N).
✤ Which mutations are real?
  ✤ Sequencing error
  ✤ Assay error / reproducibility
✤ What frequency of mutations matters clinically?
Drug resistance associated mutations (DRAMs)
✤ Using bulk-sequencing (standard tests): all viral strains from a biological sample are PCR amplified and sequenced together
✤ Generates a “population” virus sequence that may hide mutations present in minority variants
✤ The basis of all current FDA approved sequencing tests
✤ Ambiguous peaks on the electropherogram reflect mixed populations
✤ Can detect minority variants at frequencies ≥20%
Bulk sequence
[Figure: two fragments of the bulk sequence with position rulers (approximately positions 1-57 and 100-153); mixed bases are highlighted]
BULK ATGTGCTGCCACAGGGATGGAAAGGATCACCAGCAATATTCCAATGTAGCATGACGA
BULK GTTATATATYAATACATGGATGATTTGTATGTGGGATCTGACTTAGAAATAGGRC
Mixed bases
✤ Are we missing lower frequency variants?
✤ Do all four combinations of resolved mixtures (CA, CG, TA, TG) actually exist in the sample?
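
The four combinations named on this slide follow from the standard IUPAC ambiguity codes (Y = C or T, R = A or G). A minimal Python sketch of that enumeration is given here; the function names are illustrative and not part of any tool mentioned in the talk. Note that it only lists what a bulk sequence is consistent with; it cannot say which combinations actually exist in the sample, which is the point of the slide.

```python
from itertools import product

# Standard IUPAC codes for two-base mixtures (three- and four-base codes omitted for brevity).
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG",
    "W": "AT", "K": "GT", "M": "AC",
}

def resolve_mixtures(seq):
    """Return every unambiguous sequence consistent with a bulk sequence."""
    choices = [IUPAC.get(base, base) for base in seq]
    return ["".join(combo) for combo in product(*choices)]

def mixture_combinations(seq):
    """Return only the bases chosen at the ambiguous positions, e.g. 'CA'."""
    positions = [i for i, base in enumerate(seq) if base in IUPAC]
    return ["".join(s[i] for i in positions) for s in resolve_mixtures(seq)]

# Second fragment of the bulk sequence from the slide (contains Y and R).
bulk = "GTTATATATYAATACATGGATGATTTGTATGTGGGATCTGACTTAGAAATAGGRC"
print(mixture_combinations(bulk))  # ['CA', 'CG', 'TA', 'TG']
```
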
Cloning/Single genome sequencing
✤ Cloning or limiting dilution PCR followed by Sanger sequencing: single genome sequencing (SGS)
✤ Generates ~10-100 sequences; how representative is this of the entire population?
[Figure: sequence labels pNL4-3 p6-rt, AB819 (12-11-2002), AB958 (11-6-2002), AB570 (12-13-2002) and AB595 (2-20-1997), several clones per sample]
Cloning/SGS
[Figure: tree of 20 clones (Clone_0 to Clone_19), scale bar 0.01]
✤ Now have 3 variants / 20 clones
✤ Are we still missing lower frequency variants?
✤ Would we get the same counts if the experiment were repeated?
NGS approach
[Figure: workflow Library Prep → emulsion PCR → Sequencing → Data analysis; instruments shown: 454 Junior, PacBio RS]
✤ Prepare amplicons, e.g. Blood → HIV RNA → cDNA → PCR 3 regions
  Gag: p24 (253 bp)
  Env: C2-V3-C3 (416 bp)
  Pol: RT (534 bp)
✤ Multiplex multiple samples/regions on the plate
✤ Obtain 1000s of reads / sample from a single run
Massive data sets: tools are needed to analyze them
FASTQ output, which needs to be converted to interpretable results: 10,000 - 1,000,000 records like this
>FYJLQU001AI1WJ rank=0036132 x=99.0 y=3537.0 length=250
GGACATCAAGCAGCCATGCAAATGTTAAAAGAGACCATCAATGAG...
>FYJLQU001AI1WJ rank=0036132 x=99.0 y=3537.0 length=250
28 28 28 35 37 37 37 37 37 35 33 33 35 35 35 ...
>FYJLQU001AWHGJ rank=0036147 x=252.0 y=3537.5
AAATCCATACAATACTCCAGTATTTGCCATAAAGAAAAAGACAGT...
>FYJLQU001AWHGJ rank=0036147 x=252.0 y=3537.5 length=354
21 18 18 32 33 33 35 35 35 35 25 27 31 28 31 ...
Quality informatics tools are essential.
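
As a rough illustration of the first processing step, the records shown on this slide (a ">" header line followed either by bases or by space-separated quality scores) can be read with a few lines of Python. This is a minimal sketch with hypothetical function names, not the parser used in the UCSD pipeline, and the example records are truncated exactly as on the slide.

```python
def parse_records(text):
    """Map read ID -> list of whitespace-separated tokens following its '>' header."""
    records = {}
    for block in text.strip().split(">")[1:]:
        header, _, body = block.partition("\n")
        read_id = header.split()[0]        # first field of the header is the read ID
        records[read_id] = body.split()    # tolerates payloads wrapped over several lines
    return records

def mean_quality(tokens):
    """Average of the per-base quality scores of one read."""
    scores = [int(t) for t in tokens]
    return sum(scores) / len(scores)

# Truncated example records copied from the slide.
sequence_text = """>FYJLQU001AI1WJ rank=0036132 x=99.0 y=3537.0 length=250
GGACATCAAGCAGCCATGCAAATGTTAAAAGAGACCATCAATGAG
"""
quality_text = """>FYJLQU001AI1WJ rank=0036132 x=99.0 y=3537.0 length=250
28 28 28 35 37 37 37 37 37 35 33 33 35 35 35
"""

bases = {rid: "".join(tok) for rid, tok in parse_records(sequence_text).items()}
quals = parse_records(quality_text)
for read_id, seq in bases.items():
    print(read_id, seq[:10], mean_quality(quals[read_id]))
```

A real pipeline would go on to trim or discard low-quality reads before mapping; the cut-offs for that step are not given in the slides.
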
NGS/454
✤ 9 variants identified.
✤ Would need >200 clones to detect lowest frequency ones reliably.
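
The clone-count argument can be made concrete with a simple sampling model. The sketch below assumes that clones are independent draws from the viral population, so a variant at frequency f is seen at least once among n clones with probability 1 - (1 - f)^n. The frequencies and the 95% confidence level are illustrative assumptions; the slide does not state the frequencies of the 9 variants.

```python
import math

def detection_probability(freq, n_clones):
    """Chance of seeing a variant at frequency `freq` at least once in `n_clones` clones."""
    return 1.0 - (1.0 - freq) ** n_clones

def clones_needed(freq, confidence=0.95):
    """Smallest number of clones that detects the variant with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - freq))

# Illustrative frequencies only (assumed, not taken from the slides).
for freq in (0.10, 0.015, 0.01):
    print(f"{freq:.1%}: P(detect | 20 clones) = {detection_probability(freq, 20):.2f}, "
          f"clones for 95% detection = {clones_needed(freq)}")
```

Under these assumptions, a 20-clone experiment misses a 1-2% variant most of the time, while a few hundred clones are needed to detect it reliably.
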
Sources of error
✤ Library Prep
  ✤ Viral template resampling
  ✤ PCR recombination
  ✤ PCR error
✤ emulsion PCR
  ✤ PCR error
  ✤ Multiple templates on a bead
✤ Sequencing
  ✤ Base calling errors
  ✤ Detection errors
✤ Data analysis
  ✤ Software limitations
  ✤ Improper statistical analyses
454 sequencing error rates
✤ Sequencing clonal populations of bacteriophages measured a sequencing error of 0.25% per base.
✤ Most common errors are homopolymer runs that are too long or too short, e.g. AAAA could be reported as AAA or AAAAA.
✤ Solution: We developed an algorithm to map reads to “reference sequences” (e.g. subtype-specific HIV/HCV sequences or germline IgG alleles) which corrects for most of such errors.
✤ Many such algorithms exist; we are currently conducting a rigorous comparison among them.
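
To illustrate why homopolymer errors are recognizable, the sketch below flags reads that differ from a reference only in the length of single-base runs. This is a deliberately simplified stand-in, not the mapping algorithm mentioned on the slide, which aligns reads to reference sequences; the function names and toy sequences are hypothetical.

```python
from itertools import groupby

def run_length_encode(seq):
    """'GCAAAT' -> [('G', 1), ('C', 1), ('A', 3), ('T', 1)]"""
    return [(base, len(list(run))) for base, run in groupby(seq)]

def homopolymer_collapse(seq):
    """Collapse every run of identical bases to a single base."""
    return "".join(base for base, _ in run_length_encode(seq))

def differs_only_in_homopolymers(read, reference):
    """True if the read matches the reference except for run lengths."""
    return (read != reference
            and homopolymer_collapse(read) == homopolymer_collapse(reference))

reference = "GCAAAAT"
print(differs_only_in_homopolymers("GCAAAT", reference))   # AAAA read as AAA: True
print(differs_only_in_homopolymers("GCATAAT", reference))  # a substitution: False
```
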
Correcting sequencing error
✤ If one has 10000 reads covering a 400 bp amplicon and the reported sequencing error rate is a uniform 1%, then, on average,
  ✤ each read will have 4 errors
  ✤ each nucleotide position will have 100 (random) mutations
✤ Just because a sequencer reports the presence of a mutation, that does not mean that the mutation is real.
✤ We (and other groups) have developed statistical models and algorithms that can reliably detect minority variants at 0.25-0.5% frequencies, given sufficient coverage.
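
The arithmetic on this slide is 400 bp × 0.01 = 4 expected errors per read and 10000 reads × 0.01 = 100 expected erroneous bases per position. The sketch below reproduces it and adds a naive binomial cut-off: the smallest mismatch count at a position that would be unlikely if every mismatch were a sequencing error. The uniform error model and the significance level `alpha` are assumptions made for illustration; this is not the statistical model the slide refers to.

```python
import math

def log_binom_pmf(n, k, p):
    """log P(X = k) for X ~ Binomial(n, p), via log-gamma to avoid overflow."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def min_significant_count(n, p, alpha=0.001):
    """Smallest count k with P(X >= k) < alpha if all mismatches were errors."""
    tail = 0.0
    for k in range(n, -1, -1):          # accumulate P(X >= k) from the top down
        tail += math.exp(log_binom_pmf(n, k, p))
        if tail >= alpha:
            return k + 1
    return 0

reads, amplicon_bp, error_rate = 10_000, 400, 0.01
errors_per_read = amplicon_bp * error_rate      # 4 expected errors per read
errors_per_position = reads * error_rate        # 100 expected errors per position
threshold = min_significant_count(reads, error_rate)

print(errors_per_read, errors_per_position)
print(f"counts of {threshold} or more at one position are unlikely to be error alone")
```

Under a uniform 1% error rate this naive cut-off sits above 1% of reads, which suggests why reaching the 0.25-0.5% range depends on better error models and error correction rather than on raw counts.
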
UCSD processing pipeline
[Figure: pipeline site report, with calls labelled “Real” and “Instrument error”]
http://www.datamonkey.org
Experimental error
✤ In order to detect low frequency variants, we need a lot of input templates (e.g. high viral load).
✤ For few input templates, NGS could create a sense of false depth, by resampling the same templates over and over again.
✤ PCR amplification biases can cause allelic skewing (inflate or decrease frequencies of specific variants)
One possible solution: Primer ID
✤ Tag each template with a random sequence tag/Primer ID in the cDNA primer.
✤ Use the sequence tag/Primer ID to identify PCR resampling.
✤ Use the resampled sequences to create a consensus sequence.
✤ Use the number of sequence tags/Primer IDs to define the number of templates.
Jabara C et al PNAS 2011
Resampled Templates with PCR and Sequencing Errors
[Figure: a stack of resampled reads that all carry the same Primer ID, ATGACGTC]
✤ Creating a consensus sequence for each resampled template using Primer ID mitigates error from PCR and sequencing
Jabara C et al PNAS 2011
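
The Primer ID idea can be sketched in a few lines: group reads by tag, take a per-position majority vote within each group, and count the tags to estimate the number of templates sampled. This is a minimal sketch, assuming the reads are already aligned to equal length and that a tag needs at least `min_reads` reads before its consensus is trusted. Both are illustrative assumptions, not parameters from Jabara et al; apart from ATGACGTC, the tags and reads are invented toy data.

```python
from collections import Counter, defaultdict

def consensus(seqs):
    """Majority base at each column of equal-length, pre-aligned reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def primer_id_consensus(tagged_reads, min_reads=3):
    """Map each Primer ID with enough reads to the consensus of its reads."""
    groups = defaultdict(list)
    for tag, seq in tagged_reads:
        groups[tag].append(seq)
    return {tag: consensus(seqs)
            for tag, seqs in groups.items() if len(seqs) >= min_reads}

# Toy data: ATGACGTC is the tag shown on the slide; the rest is hypothetical.
reads = [
    ("ATGACGTC", "ACGTAC"), ("ATGACGTC", "ACGTAC"), ("ATGACGTC", "ACTTAC"),
    ("GGCATTCA", "ACGAAC"), ("GGCATTCA", "ACGAAC"), ("GGCATTCA", "ACGAAC"),
    ("TTTTCCCC", "ACGTAC"),  # seen once only, so no consensus is built
]
templates = primer_id_consensus(reads)
print(templates)
print("templates sampled:", len(templates))
```

The single discordant read under ATGACGTC is outvoted, which is how PCR and sequencing errors are removed, while the number of surviving tags, not the number of reads, measures sampling depth.
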
Good reproducibility between runs
[Figure: scatter plot of Run 1 (vertical axis, 0 to 25) against Run 2 (horizontal axis, 0 to 25); fitted line y = 0.9943x, R² = 0.80872]
Ron Swanstrom (pers. comm.)
Clinical relevance
✤ NGS-based assays will detect many more DRAMs than current tests.
✤ Multiple studies provide evidence that SOME low level NRTI and NNRTI DRAMs are associated with subsequent virologic failure (also for FI)
✤ Picture less clear with PI, likely due to the polyallelic nature of resistance
✤ II to be investigated directly as are HCV antivirals
✤ “The extent to which the detection of low-abundance DRMs will affect patient management is still unknown but it is hoped that use of such an assay in clinical practice, will help resolve this important question”
Evaluation of a Bench-Top HIV Ultra-Deep Pyrosequencing Drug-Resistance Assay in the Clinical Laboratory. Avidor et al J Clin Microbiol 2013.
Tropism analysis using NGS
✤ Because NGS provides sequences, one can ask questions that require the knowledge of the entire sequence.
✤ CCR5 vs CXCR4 usage has implications for treatment (with fusion inhibitors), and clinical outcomes
Tropism analysis, clinical relevance
✤ Can either be measured experimentally (e.g. Enhanced Sensitivity Trofile Assay, ESTA), or by computational analyses of env V3 loop sequences (e.g. Geno2Pheno)
✤ Low level (e.g. 2%) X4 variants are predictive of FI failure, e.g. in the Maraviroc versus Efavirenz in Treatment-Naive Patients (MERIT) study
[Figure: outcome comparison for two patient groups, N=312 and N=35]
Swenson L C et al. Clin Infect Dis. 2011;53:732-742
Does the choice of platform matter?
✤ Largely, no.
Archer et al PLoS ONE 2012
High throughput dual infection detection
✤ Blood → HIV RNA → cDNA → PCR 3 regions
  Gag: p24 (253 bp)
  Env: C2-V3-C3 (416 bp)
  Pol: RT (534 bp)
✤ Sequenced 16 samples concurrently on single 454 GS FLX Titanium plate
✤ Processed reads (~5 mins/patient on a computer cluster) and generated phylogenies
✤ Interpreted nucleotide diversity > 2% (RT, gag) and > 5% (env), confirmed by phylogenetic bootstrap, as evidence of dual infection
Pacold et al ARHR 2010
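
The screening rule on this slide (diversity above 2% for RT and gag, above 5% for env) can be sketched as a threshold on pairwise distances between aligned reads. The sketch below uses the maximum pairwise proportion of mismatches as the diversity measure, which is an assumption, and it covers only the first screen; the phylogenetic bootstrap confirmation step is not reproduced. Sequences are toy data.

```python
from itertools import combinations

# Thresholds from the slide: > 2% for RT (pol) and gag, > 5% for env.
DI_THRESHOLD = {"gag": 0.02, "pol": 0.02, "env": 0.05}

def p_distance(a, b):
    """Proportion of mismatching sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def max_divergence(seqs):
    """Largest pairwise distance among the sequences of one sample-region."""
    return max(p_distance(a, b) for a, b in combinations(seqs, 2))

def exceeds_di_threshold(seqs, region):
    """First-pass screen only; the slide also requires bootstrap confirmation."""
    return max_divergence(seqs) > DI_THRESHOLD[region]

# Toy 100-site alignments (hypothetical): variants 1% and 6% away from the first read.
ref = "A" * 100
close_variant = "C" + "A" * 99
distant_variant = "C" * 6 + "A" * 94

print(exceeds_di_threshold([ref, close_variant], "gag"))                   # False
print(exceeds_di_threshold([ref, close_variant, distant_variant], "env"))  # True
```
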
[Excerpt from Pacold et al ARHR 2010, “Detection of HIV dual infection”, p. 1295]
FIG. 2. Sample I, UDS duplicate 1. First year of infection. DI in env, pol, and gag. UDS are represented as red circles and SGS as blue squares. Variant abundances per node and branches with >90% bootstrap support are labeled.
[...] identified samples A, B, C, E, F, and G as singly infected (Supplementary Figs. 1–6 and 11–13; Supplementary Data are available online at www.liebertonline.com/aid) and samples D1, D2, H, and I as dually infected (Fig. 2 and Supplementary Figs. 7–10 and 14–16). DI results specific to the coding regions of each sample are shown in Table 2. For nearly all the samples, the high read coverage of UDS identified greater maximum divergence than SGS (Table 2). Duplicate UDS runs performed on the same sample cDNA for the same coding regions agreed in DI status for all 20 cases. Combined phylogenies of UDS and SGS for each sample are [...]
[...] sequence was $278.18, for SGS of two coding regions $2,646.39, and for UDS of three coding regions $1,075.10. Costs of each sequencing type are summarized in Table 3. It took 3 hours to produce one sample’s population-based pol sequence, 42 hours for one sample’s SGS, and 9.5 hours for one sample’s UDS. Cost and time estimates for parallel steps like RNA extraction are highly throughput-dependent. UDS can be customized to produce fewer reads per sample at a lower cost. As previously noted,11 many factors (such as price reductions related to quantity) influence cost estimates and may cause large price differences for experiments using the [...]
High throughput dual infection detection
[Figure: per-sample results for samples A, B, C, D1, E, F, D2, G, H. SGS (“Gold-standard”): 25 reads per sample-region. UDS: 4,650 reads per sample-region. One sample is annotated “Low viral load”.]
✤ For all dually infected samples, UDS identified a greater within-sample divergence than SGS.
✤ Samples E and F both had divergence exceeding the DI threshold, but only Sample F exhibited DI-like population structure.
✤ UDS required 40% of the cost and 20% of the time for SGS.
Pacold et al ARHR 2010
Method comparison
 | SGS | NGS
Robustness for confirming DI | High | High
Throughput potential | Low | High
Labor | High | Medium
Time | High | Low
Cost | High | Medium (and dropping)
San Diego Primary Infection Cohort
✤ Samples sequenced to date show a prevalence of DI of 11/61 = 18%.
✤ Of the 7 SI cases:
  ✤ 5 were SI in the first year of initial infection (incidence: 8.2%)
  ✤ 2 in the second year (incidence: 3.3%)
✤ Dual infections are much more frequent than expected.
[Figure: per-subject timelines (12, 24, 36 months after initial infection) marking when 1 strain or 2 strains were detected. 4 CI: L537, N112, Q294, U189. 7 SI: D224, K613, K908, P265, P853, S155, U796.]
Pacold et al AIDS 2012
Molecular epidemiology of HIV-1
✤ Because HIV is a measurably evolving pathogen that accumulates sequence diversity within hosts at rates as high as 1-2% per year within the polymerase (pol) gene, viral sequences are nearly unique to each infected person.
✤ This distinct feature of the virus allows one to interrogate sequences for evidence of recent relatedness, and thus infer potential transmission links.
Establishing links
✤ Putative transmission links are established if the genetic distance between two pol sequences is below a threshold D (e.g. 1.5%)
✤ Median intra-subtype pairwise genetic distance is ~5%, and the probability that two randomly selected HIV-1 subtype B sequences are ≤1.5% distant is very low (p = 0.0022 for the SD AEH cohort and p = 0.0002 for a random sample)
[Figure: density (AU) of pairwise genetic distances in two panels, “San Diego Acute and Early Cohort” and “Random database sample”, each with an inset for distances from 0.0 to 1.5]
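
The linking rule can be sketched as a pairwise comparison against the threshold D. The example below uses the plain proportion of mismatching sites as the genetic distance, which is a simplifying assumption; published analyses typically use a model-corrected distance, and the slide does not say which one was used. Subject IDs and sequences are invented toy data.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of mismatching sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def transmission_links(sequences, threshold=0.015):
    """Return (id_a, id_b, distance) for every pair at or below the threshold D."""
    links = []
    for (id_a, seq_a), (id_b, seq_b) in combinations(sorted(sequences.items()), 2):
        d = p_distance(seq_a, seq_b)
        if d <= threshold:
            links.append((id_a, id_b, d))
    return links

# Toy 200-site pol alignments (hypothetical subjects P1 to P3).
pol = {
    "P1": "A" * 200,
    "P2": "C" * 2 + "A" * 198,    # 1% from P1: putative link
    "P3": "C" * 12 + "A" * 188,   # 6% from P1, 5% from P2: no link
}
for id_a, id_b, d in transmission_links(pol):
    print(f"{id_a} - {id_b}: {d:.1%}")
```
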
San Diego HIV molecular network (bulk sequences)
[Figure: network diagram; cluster sizes are printed next to clusters. Legend entries: Viral load, log10 (copies/ml): N/A, 1.5-2.5, 2.5-3.5, 3.5-4.5, 4.5-5.5, 5.5-6.5, >6.5; Direction resolved based on EDI; N = Number of timepoints (if > 1); TNS < 0.8; TNS ≥ 0.8]
Linking transmission partners using NGS
✤ Because a substantial proportion of individuals may be multiply infected, we need to be able to draw links between minority populations.
✤ NGS data have been used in HPTN 052 (to confirm transmission links between serodiscordant couples)
A denser network of connections
✤ 64 new edges and 16 new nodes (a yield of ~1 connection / 2 NGS samples) were added to the network.
✤ The inclusion of NGS data
  ✤ increased the size of the largest cluster from 62 to 156 nodes
  ✤ increased the number of “hubs” by 7 (from 51 to 58).
It pays to target highly connected nodes
✤ Targeting a low degree node has a local effect
✤ Targeting a high degree node could lead to a global effect
Concept | Contact Network | Transmission network
Node | Individual | HIV+ individual
Edge | A contact that could lead to HIV transmission, e.g. sexual, shared needle | Transmission event
Degree = edges connected to a node | Number of contacts associated with a node | Number of transmissions associated with a node
✤ Transmission network is a subset of the contact network
[Figure: example networks with node degrees labelled (Degree = 1, 3, 7). Legend: HIV+, HIV-, Contact w/o transmission, Transmission]
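
Degree, as defined on this slide, is simply the number of edges attached to a node, so candidate nodes for targeting can be ranked by counting edge endpoints. A minimal sketch with a hypothetical edge list follows; the node names are invented and the network is a toy example, not the San Diego network.

```python
from collections import Counter

def node_degrees(edges):
    """Count the number of edges attached to each node of an undirected network."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree

def highest_degree_nodes(edges, top=1):
    """Nodes ranked by degree, highest first."""
    return [node for node, _ in node_degrees(edges).most_common(top)]

# Toy transmission network: one hub linked to seven partners, plus an isolated pair.
edges = [("hub", f"n{i}") for i in range(1, 8)] + [("x", "y")]

degrees = node_degrees(edges)
print(degrees["hub"], degrees["n1"], degrees["x"])  # 7 1 1
print(highest_degree_nodes(edges))                  # ['hub']
```
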
Regulatory approval: the bad news
✤ No NGS platforms have been cleared/approved by FDA
✤ No standards to use for comparison
✤ No clear agreement on bioinformatics handling
✤ Lack of proficiency panels and reference materials
✤ Rapid change
Regulatory approval: the good news
✤ The industry, academia, and agencies (FDA, CAP, NCBI, etc) are actively collaborating on the issue
✤ Informatics rapidly improving and stabilizing
✤ Clinical relevance studies are ongoing
✤ This is primarily driven by human genomic applications, so HIV/HCV applications will benefit from the larger effort
✤ The Forum on Collaborative HIV research has held a series of roundtables to discuss issues relevant to HIV/HCV research, including the “Next Generation Sequencing Roundtable” in December 2012.
Acknowledgements
UCSD: Davey Smith, Jason Young, Sara Gianella Weibel, Susan Little, Douglas Richman, Richard Haubrich, Gabe Wagner, Lance Hepler
UBC: Richard Harrigan, Art FY Poon
Life Inc: Mary Pacold