Sergei L. Kosakovsky Pond, PhD (UC San Diego AntiViral Research Center) presents "Promises and Challenges of Next Generation Sequencing for HIV and HCV"
1. AIDS CLINICAL ROUNDS
The UC San Diego AntiViral Research Center sponsors weekly
presentations by infectious disease clinicians, physicians and
researchers. The goal of these presentations is to provide the most
current research, clinical practices and trends in HIV, HBV, HCV, TB
and other infectious diseases of global significance.
The slides from the AIDS Clinical Rounds presentation that you are
about to view are intended for the educational purposes of our
audience. They may not be used for other purposes without the
presenter’s express permission.
2. Promises and Challenges of Next
Generation Sequencing for HIV and HCV
Sergei L Kosakovsky Pond, PhD. Associate Professor, UCSD Department of Medicine.
January 11, 2013
3. Outline
✤ Next generation / Ultradeep sequencing (NGS/UDS) technology
✤ NGS applications for HIV and HCV
✤ What are the unique advantages of NGS?
✤ What are the limitations of NGS?
✤ Clinical relevance of NGS-based assays
✤ Regulatory approval
4. Genomic sequencing
✤ In recent years, sequencing (DNA, RNA) has rapidly become the cheapest
and fastest assay in many applications
✤ Sub-$1000 human genome very shortly.
[Figure annotation: NGS (Solexa) introduced commercially]
http://www.genome.gov/sequencingcosts/
5. Is NGS relevant for medicine?
✤ In 2012, 6 out of TIME magazine’s Top 10 Medical Breakthroughs
relied on NGS
1 The ENCODE project (non-coding DNA)
2 The Human Microbiome Project
6 Cancer Genome Atlas
7 Neo-/pre-natal screening for rare diseases
8 Pediatric Cancer Diagnostics
10 P. acnes phage characterization
6. Next generation sequencing
✤ Traditional (Sanger) sequencing generates a small number of
intermediate length reads (~1000 bp)
✤ All NGS technologies perform millions of parallel sequencing
reactions to generate many, typically short, reads per run.
✤ Two canonical applications for NGS
✤ Assembling long sequences from short fragments (human genome,
cancer)
✤ Characterizing diverse populations (HIV, HCV, immune
repertoire, metagenomics)
7. Platform comparison
Instrument                   | First introduced | Output per run                   | Run-time        | Use in HIV/HCV settings
Roche 454 FLX+ / Junior      | 2005             | 10^5-10^6 reads, 400-700 bp      | 10-20 hrs       | Extensive (>300 papers)
Illumina HiSeq/MiSeq         | 2007             | 10^7-10^9 reads, 36-250 bp       | 7 hrs - 11 days | Limited (~30 papers)
Life Sciences IonTorrent     | 2010             | 10^5-10^7 reads, 35-400 bp       | 1-8 hrs         | Limited (<10 papers)
Pacific Biosciences PacBioRS | 2011             | 10^4-10^5 reads, 1000-10000 bp   | 1-2 hrs         | Limited (<10 papers)
8. Characterizing viral diversity
within a host
✤ Being able to characterize HIV-1 populations rapidly and accurately
is important for understanding pathogenesis, interplay between
viruses and humoral responses, and the evolution of drug resistance
✤ Both HIV-1 and HCV exist as viral quasispecies in a host, i.e. many
distinct viral strains are circulating at any given moment in time
✤ NGS has the potential to directly sequence many such strains
✤ Using multiplexing (multiple samples/run), high throughput can be
achieved
9. Characterizing minority DRAMs
✤ Perhaps the clearest clinical application of NGS for HIV and HCV.
✤ Already know what mutations we are looking for (e.g. K103N).
✤ Which mutations are real?
✤ Sequencing error
✤ Assay error / reproducibility
✤ What frequency of mutations matters clinically?
10. Drug resistance associated
mutations (DRAMs)
✤ Using bulk-sequencing (standard tests): all
viral strains from a biological sample are
PCR amplified and sequenced together
✤ Generates a “population” virus sequence
that may hide mutations present in
minority variants
✤ The basis of all current FDA approved
sequencing tests
✤ Ambiguous peaks on the electropherogram
reflect mixed populations
✤ Can detect minority variants at frequencies
≥20%
11. Bulk sequence
5 10 15 20 25 30 35 40 45 50 55
BULK A T G T G C T G C C A C A G G G A T G G A A A G G A T C A C C A G C A A T A T T C C A A T G T A G C A T G A C G A
100 105 110 115 120 125 130 135 140 145 150
BULK G T T A T A T A T Y A A T A C A T G G A T G A T T T G T A T G T G G G A T C T G A C T T A G A A A T A G G R C
Mixed bases
✤ Are we missing lower frequency variants?
✤ Do all four combinations of resolved mixtures (CA, CG, TA, TG)
actually exist in the sample?
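The question about resolved mixtures can be made concrete with a minimal sketch. It assumes the standard IUPAC two-base codes (Y = C/T, R = A/G) and uses a short made-up fragment, not the actual bulk sequence; the function name `resolve_mixtures` is hypothetical.

```python
from itertools import product

# Standard IUPAC codes for two-base mixtures.
IUPAC = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}

def resolve_mixtures(seq):
    """Return every fully resolved sequence implied by the mixed bases in seq."""
    options = [IUPAC.get(base, base) for base in seq.upper()]
    return ["".join(combo) for combo in product(*options)]

# Toy fragment (an assumption for illustration) with one Y and one R.
haplotypes = resolve_mixtures("TAYAAGGRC")
print(len(haplotypes), haplotypes)
```

A bulk sequence with two mixed positions is compatible with four haplotypes, but it cannot say which of them are actually present in the sample.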
13. Cloning/SGS
[Figure: tree of 20 clones, Clone_0 through Clone_19; scale bar 0.01]
✤ Now have 3 variants / 20 clones
✤ Are we still missing lower frequency variants?
✤ Would we get the same counts if the experiment were repeated?
15. NGS approach
✤ Prepare amplicons, e.g.
Blood → HIV RNA → cDNA → PCR 3 regions
✤ Multiplex multiple samples/regions on the plate
✤ Obtain 1000s of reads / sample from a single run
[Workflow diagram: Library Prep → emulsion PCR → Sequencing → Data analysis; instruments shown: 454 Junior, PacBio RS]
Gag: p24 (253 bp)
Env: C2-V3-C3 (416 bp)
Pol: RT (534 bp)
16. Massive data sets: need tools to
analyze
FASTQ output, which needs to be converted to interpretable results:
10,000 - 1,000,000 records like this
>FYJLQU001AI1WJ rank=0036132 x=99.0 y=3537.0 length=250
GGACATCAAGCAGCCATGCAAATGTTAAAAGAGACCATCAATGAG...
>FYJLQU001AI1WJ rank=0036132 x=99.0 y=3537.0 length=250
28 28 28 35 37 37 37 37 37 35 33 33 35 35 35 ...
>FYJLQU001AWHGJ rank=0036147 x=252.0 y=3537.5
AAATCCATACAATACTCCAGTATTTGCCATAAAGAAAAAGACAGT...
>FYJLQU001AWHGJ rank=0036147 x=252.0 y=3537.5 length=354
21 18 18 32 33 33 35 35 35 35 25 27 31 28 31 ...
Quality informatics tools are essential.
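As a rough illustration of the first processing step, the sketch below parses records laid out like the ones shown: a ">" header line followed by one line of bases, and a matching header followed by space-separated quality scores. The read ID and values are made up, and the helper names `parse_records` and `mean_quality` are hypothetical; real pipelines would use an established parser.

```python
def parse_records(lines):
    """Parse '>header' lines, each followed by one data line, into (read_id, fields, data)."""
    records = []
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            parts = header.split()
            # key=value annotations such as rank=..., x=..., length=...
            fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
            records.append((parts[0], fields, line))
            header = None
    return records

def mean_quality(qual_line):
    """Average of the space-separated quality scores for one read."""
    scores = [int(tok) for tok in qual_line.split()]
    return sum(scores) / len(scores)

# Hypothetical toy read (not taken from the run shown above).
seq_text = ">READ1 rank=1 x=99.0 y=3537.0 length=5\nGGACA"
qual_text = ">READ1 rank=1 x=99.0 y=3537.0 length=5\n28 28 35 37 37"

seqs = {rid: data for rid, _, data in parse_records(seq_text.splitlines())}
quals = {rid: mean_quality(data) for rid, _, data in parse_records(qual_text.splitlines())}
for rid in seqs:
    print(rid, len(seqs[rid]), quals[rid])
```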
17. NGS/454
✤ 9 variants identified.
✤ Would need >200 clones to detect lowest frequency ones reliably.
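The clone-count argument can be sketched with a simple sampling model. This assumes each clone is an independent draw from the viral population, so a variant at frequency f is seen at least once among n clones with probability 1 - (1 - f)^n. The 95% target and the function names are illustrative assumptions, not figures from the talk.

```python
import math

def detection_probability(freq, n_clones):
    """Chance of seeing a variant at frequency freq at least once in n_clones."""
    return 1.0 - (1.0 - freq) ** n_clones

def clones_needed(freq, target=0.95):
    """Smallest number of clones giving at least `target` detection probability."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - freq))

print(detection_probability(0.05, 20))  # a 5% variant with 20 clones
print(clones_needed(0.01))              # clones needed for a 1% variant
```

Under these assumptions, 20 clones miss a 5% variant roughly a third of the time, and a 1% variant needs hundreds of clones.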
18. Sources of error
Library Prep:
  Viral template resampling
  PCR recombination
  PCR error
emulsion PCR:
  PCR error
  Multiple templates on a bead
Sequencing:
  Base calling errors
  Detection errors
Data analysis:
  Software limitations
  Improper statistical analyses
19. 454 sequencing error rates
✤ Sequencing clonal populations of bacteriophages measured a
sequencing error of 0.25% per base.
✤ Most common errors are homopolymer runs that are too long or too
short, e.g. AAAA could be reported as AAA or AAAAA.
✤ Solution: We developed an algorithm to map reads to “reference
sequences” (e.g. subtype-specific HIV/HCV sequences or germline
IgG alleles) which corrects for most of such errors.
✤ Many such algorithms exist; we are currently conducting a rigorous
comparison among them.
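To illustrate the homopolymer error mode only, here is a coarse check that is not the reference-mapping algorithm mentioned above. It tests whether a read and a reference differ solely in the lengths of their homopolymer runs by collapsing each run to a single base. The function names and example sequences are hypothetical.

```python
from itertools import groupby

def collapse_homopolymers(seq):
    """Collapse every run of identical bases to one base (AAAA -> A)."""
    return "".join(base for base, _ in groupby(seq))

def differs_only_by_homopolymer_length(read, reference):
    """True if the two sequences differ, but only in homopolymer run lengths."""
    return (read != reference
            and collapse_homopolymers(read) == collapse_homopolymers(reference))

print(differs_only_by_homopolymer_length("GCAAATG", "GCAAAATG"))  # run too short
print(differs_only_by_homopolymer_length("GCAAATG", "GCACATG"))   # a substitution
```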
20. Correcting sequencing error
✤ If one has 10000 reads covering a 400 bp amplicon and the reported
sequencing error rate is a uniform 1%, then, on average,
✤ each read will have 4 errors
✤ each nucleotide position will have 100 (random) mutations
✤ Just because a sequencer reports the presence of a mutation, that does
not mean that the mutation is real.
✤ We (and other groups) have developed statistical models and
algorithms that can reliably detect minority variants at 0.25-0.5%
frequencies, given sufficient coverage.
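The arithmetic in this slide, together with one simple way to ask whether an observed count exceeds what error alone would produce, can be sketched as follows. The binomial tail test here is a generic illustration under the slide's uniform 1% error assumption, not the statistical model referred to above; the function names are hypothetical.

```python
import math

def binom_log_pmf(n, k, p):
    """Log of the Binomial(n, p) probability of exactly k events."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in a numerically safe way."""
    return min(1.0, sum(math.exp(binom_log_pmf(n, i, p)) for i in range(k, n + 1)))

reads, amplicon_len, error_rate = 10_000, 400, 0.01
errors_per_read = amplicon_len * error_rate   # expected errors in one read
errors_per_position = reads * error_rate      # expected erroneous calls per position

print(errors_per_read, errors_per_position)
# How surprising are 100 vs 150 mismatching reads at one position if all are errors?
print(binom_tail(reads, 100, error_rate))
print(binom_tail(reads, 150, error_rate))
```

Seeing 100 mismatches at a position is entirely consistent with error alone, while 150 would be very unlikely under this simple model.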
22. Experimental error
✤ In order to detect low frequency variants, we need a lot of input
templates (e.g. high viral load).
✤ For few input templates, NGS could create a sense of false depth, by
resampling the same templates over and over again.
✤ PCR amplification biases can cause allelic skewing (inflate or decrease
frequencies of specific variants)
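The "false depth" point can be made numerically with a minimal sketch. It assumes reads are drawn uniformly at random, with replacement, from the input templates (no amplification bias), in which case the expected number of distinct templates observed is T(1 - (1 - 1/T)^N). The function name and example numbers are assumptions for illustration.

```python
def expected_distinct_templates(n_templates, n_reads):
    """Expected number of distinct templates seen when n_reads are drawn
    uniformly with replacement from n_templates."""
    return n_templates * (1.0 - (1.0 - 1.0 / n_templates) ** n_reads)

# 5,000 reads from only 50 input templates: the true depth is still about 50.
print(expected_distinct_templates(50, 5000))
# 5,000 reads from 10,000 templates: most reads come from different templates.
print(expected_distinct_templates(10_000, 5000))
```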
24. One possible solution: Primer ID
✤ Tag each template with a random sequence tag/Primer ID in the
cDNA primer.
✤ Use the sequence tag/Primer ID to identify PCR resampling.
✤ Use the resampled sequences to create a consensus sequence.
✤ Use the number of sequence tags/Primer IDs to define the number of
templates.
Jabara C et al PNAS 2011
25. Resampled Templates with PCR and Sequencing Errors; Primer ID
[Figure: eight resampled reads that share the same Primer ID tag, ATGACGTC]
✤ Creating a consensus sequence for each resampled
template using Primer ID mitigates error from PCR
and sequencing
Jabara C et al PNAS 2011
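The Primer ID idea can be sketched as follows. This is a minimal illustration, not the published pipeline: it assumes reads are already aligned and of equal length within a tag, uses a per-column majority vote, and drops tags seen fewer than `min_reads` times (an arbitrary cutoff chosen for the example).

```python
from collections import Counter, defaultdict

def primer_id_consensus(reads, min_reads=3):
    """Group (primer_id, sequence) pairs by tag and build a majority-vote
    consensus per tag; tags with fewer than min_reads reads are skipped."""
    groups = defaultdict(list)
    for tag, seq in reads:
        groups[tag].append(seq)
    consensus = {}
    for tag, seqs in groups.items():
        if len(seqs) < min_reads:
            continue
        consensus[tag] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus

# Toy data: one tag resampled four times with a single error, one clean tag,
# and one tag seen only once.
reads = [
    ("ATGACGTC", "ACGT"), ("ATGACGTC", "ACGT"),
    ("ATGACGTC", "ACCT"), ("ATGACGTC", "ACGT"),
    ("GGTTAACC", "ACGA"), ("GGTTAACC", "ACGA"), ("GGTTAACC", "ACGA"),
    ("TTTTCCCC", "AAAA"),
]
templates = primer_id_consensus(reads)
print(len(templates), templates)
```

The number of tags that survive gives the template count, and the isolated error in the first group is voted out.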
28. Clinical relevance
✤ NGS-based assays will detect many more DRAMs than current tests.
✤ Multiple studies provide evidence that SOME low level NRTI and NNRTI DRAMs are
associated with subsequent virologic failure (also for FI)
✤ Picture less clear with PI, likely due to the polyallelic nature of resistance
✤ II to be investigated directly as are HCV antivirals
✤ “The extent to which the detection of low-abundance DRMs will affect patient management is
still unknown but it is hoped that use of such an assay in clinical practice, will help resolve this
important question”
Evaluation of a Bench-Top HIV Ultra-Deep Pyrosequencing Drug-Resistance Assay in the Clinical Laboratory
Avidor et al J Clin Microbiol 2013.
29. Tropism analysis using NGS
✤ Because NGS provides sequences, one can ask questions that require
knowledge of the entire sequence.
✤ CCR5 vs CXCR4 usage has implications for treatment (with fusion
inhibitors), and clinical outcomes
30. Tropism analysis, clinical relevance
✤ Tropism can either be measured experimentally (e.g. Enhanced Sensitivity Trofile Assay,
ESTA), or by computational analyses of env V3 loop sequences (e.g. Geno2Pheno)
✤ Low level (e.g. 2%) X4 variants are predictive of FI failure, e.g. in the Maraviroc
versus Efavirenz in Treatment-Naive Patients (MERIT) study
[Figure labels: N=312, N=35]
Swenson L C et al. Clin Infect Dis. 2011;53:732-742
31. Does the choice of platform
matter?
✤ Largely, no.
Archer et al PLoS ONE 2012
32. High throughput dual infection
detection
✤ Blood → HIV RNA → cDNA → PCR 3 regions
Gag: p24
(253 bp)
Env: C2-V3-C3
(416 bp)
Pol: RT
(534 bp)
✤ Sequenced 16 samples concurrently on single 454 GS FLX Titanium
plate
✤ Processed reads (~5 mins/patient on a computer cluster) and
generated phylogenies
✤ Interpreted nucleotide diversity > 2% (RT, gag) and > 5% (env),
confirmed by phylogenetic bootstrap, as evidence of dual infection
Pacold et al ARHR 2010
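The threshold step can be sketched roughly as below. This assumes that "diversity" is measured as the maximum pairwise divergence between aligned variants and uses a simple proportion of mismatching sites; both are assumptions for illustration, and the phylogenetic bootstrap confirmation is not implemented. The region keys and function names are hypothetical.

```python
from itertools import combinations

# Thresholds from the slide: > 2% for RT and gag, > 5% for env.
THRESHOLDS = {"gag": 0.02, "pol_RT": 0.02, "env": 0.05}

def p_distance(a, b):
    """Proportion of mismatching sites between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def flag_dual_infection(variants, region):
    """Return (exceeds_threshold, max_divergence) for a list of aligned variants."""
    divergence = max(p_distance(a, b) for a, b in combinations(variants, 2))
    return divergence > THRESHOLDS[region], divergence

# Toy 100-base variants: v2 differs from v1 at 3 sites, v3 at 1 site.
v1 = "A" * 100
v2 = "A" * 97 + "CCC"
v3 = "A" * 99 + "C"
print(flag_dual_infection([v1, v3], "gag"))
print(flag_dual_infection([v1, v2, v3], "gag"))
print(flag_dual_infection([v1, v2, v3], "env"))
```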
33. DETECTION OF HIV DUAL INFECTION (journal page 1295)
FIG. 2. Sample I, UDS duplicate 1. First year of infection. DI in env, pol, and gag. UDS are represented as red circles and SGS
as blue squares. Variant abundances per node and branches with >90% bootstrap support are labeled.
... identified samples A, B, C, E, F, and G as singly infected (Supplementary Figs. 1–6 and 11–13; Supplementary Data are
available online at www.liebertonline.com/aid) and samples D1, D2, H, and I as dually infected (Fig. 2 and Supplementary
Figs. 7–10 and 14–16). DI results specific to the coding regions of each sample are shown in Table 2.
For nearly all the samples, the high read coverage of UDS identified greater maximum divergence than SGS (Table 2).
Duplicate UDS runs performed on the same sample cDNA for the same coding regions agreed in DI status for all 20 cases.
Combined phylogenies of UDS and SGS for each sample are ...
... sequence was $278.18, for SGS of two coding regions $2,646.39, and for UDS of three coding regions $1,075.10.
Costs of each sequencing type are summarized in Table 3. It took 3 hours to produce one sample’s population-based pol
sequence, 42 hours for one sample’s SGS, and 9.5 hours for one sample’s UDS. Cost and time estimates for parallel steps
like RNA extraction are highly throughput-dependent. UDS can be customized to produce fewer reads per sample at a
lower cost. As previously noted,11 many factors (such as price reductions related to quantity) influence cost estimates and
may cause large price differences for experiments using the ...
Pacold et al ARHR 2010
34. High throughput dual infection
detection
SGS (“Gold-standard”), 25 reads per sample-region: A B C D1 E F D2 G H
UDS, 4,650 reads per sample-region: A B C D1 E F D2 G H
[Figure annotation: Low viral load]
✤ For all dually infected samples, UDS identified a greater within-
sample divergence than SGS.
✤ Samples E and F both had divergence exceeding the DI threshold, but
only Sample F exhibited DI-like population structure.
✤ UDS required 40% of the cost and 20% of the time for SGS.
Pacold et al ARHR 2010
35. Method comparison
                             | SGS  | NGS
Robustness for confirming DI | High | High
Throughput potential         | Low  | High
Labor                        | High | Medium
Time                         | High | Low
Cost                         | High | Medium (and dropping)
36. San Diego Primary Infection Cohort
✤ Samples sequenced to date show a prevalence of DI of 11/61 = 18%.
✤ Of the 7 SI cases:
✤ 5 were SI in the first year of initial infection (incidence: 8.2%)
✤ 2 in the second year (incidence: 3.3%)
✤ Dual infections are much more frequent than expected.
[Figure: timelines (months after initial infection: 12, 24, 36) for 4 CI cases (L537, N112, Q294, U189) and 7 SI cases (D224, K613, K908, P265, P853, S155, U796); legend: 1 strain detected, 2 strains detected]
Pacold et al AIDS 2012
39. Molecular epidemiology of HIV-1
✤ Because HIV is a measurably evolving pathogen that accumulates
sequence diversity within hosts at rates as high as 1-2% per year
within the polymerase (pol) gene, viral sequences are nearly unique to
each infected person.
✤ This distinct feature of the virus allows one to interrogate sequences
for evidence of recent relatedness, and thus infer potential
transmission links.
40. Establishing links
✤ Putative transmission links are established if the genetic distance between two
pol sequences is below a threshold D (e.g. 1.5%)
✤ Median intra-subtype pairwise genetic distance is ~5%, and the probability that
two randomly selected HIV-1 subtype B sequences are ≤1.5% distant is very low
(p = 0.0022 for the SD AEH cohort and p = 0.0002 for a random sample)
[Figure: density plots (Density, AU); panels: San Diego Acute and Early Cohort; Random database sample]
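The linking rule can be sketched in a few lines. This is a simplified illustration: it uses the raw proportion of mismatching sites between aligned sequences as the genetic distance, which is an assumption (the distance measure actually used is not stated on the slide), and the sequence IDs are made up.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of mismatching sites between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def putative_links(sequences, threshold=0.015):
    """Return (id_a, id_b, distance) for every pair at or below the threshold D."""
    links = []
    for (id_a, seq_a), (id_b, seq_b) in combinations(sorted(sequences.items()), 2):
        d = p_distance(seq_a, seq_b)
        if d <= threshold:
            links.append((id_a, id_b, d))
    return links

# Toy 200-base pol sequences: P2 is 1% from P1; P3 is 5% from P1.
sequences = {
    "P1": "A" * 200,
    "P2": "A" * 198 + "CC",
    "P3": "A" * 190 + "C" * 10,
}
links = putative_links(sequences)
print(links)
```

Each returned pair becomes an edge in the network; unlinked individuals such as P3 stay isolated.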
41. San Diego HIV molecular network
(bulk sequences)
[Figure: network of bulk sequences. Legend: node color by viral load, log10 (copies/ml): N/A, 1.5-2.5, 2.5-3.5, 3.5-4.5, 4.5-5.5, 5.5-6.5, >6.5; direction resolved based on EDI; N = number of timepoints (if > 1); TNS < 0.8 vs TNS ≥ 0.8]
42. Linking transmission partners
using NGS
✤ Because a substantial proportion of
individuals may be multiply
infected, we need to be able to draw
links between minority populations.
✤ NGS data have been used in HPTN
052 (to confirm transmission links
between serodiscordant couples)
43. A denser network of connections
✤ 64 new edges and 16 new nodes (a
yield of ~1 connection / 2 NGS
samples) were added to the network.
✤ The inclusion of NGS data
✤ increased the size of the largest
cluster from 62 to 156 nodes
✤ increased the number of “hubs” by
7 (from 51 to 58).
44. It pays to target highly connected nodes
Targeting a low degree node has a local effect
Targeting a high degree node has a global effect
Concept | Contact Network | Transmission n…
Node | Individual | HIV+ individual
Edge | A contact that could lead to HIV transmission, e.g. sexual, shared needle | Transmission eve…
Degree = edges connected to a node | Number of contacts associated with a node | Number of transm… associated with a…
Transmission network i… subset of the contact net…
[Figure: example networks with nodes labeled Degree = 7, Degree = 3 and Degree = 1; legend: HIV+, HIV-, Transmission, Contact w/o transmission]
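The degree concept, and why removing a hub has a wider effect than removing a leaf, can be sketched on a made-up network. The node names and edge list below are purely illustrative and are not taken from the San Diego network.

```python
from collections import Counter

def node_degrees(edges):
    """Degree of each node = number of edges connected to it."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree

def edges_remaining(edges, removed_node):
    """Edges left in the network after one node is removed."""
    return [edge for edge in edges if removed_node not in edge]

# Toy network: hub H has 7 neighbours; n7 also connects to a leaf L.
edges = [("H", f"n{i}") for i in range(1, 8)] + [("n7", "L")]
degrees = node_degrees(edges)

print(degrees["H"], degrees["n7"], degrees["L"])
print(len(edges_remaining(edges, "L")))  # removing the degree-1 node
print(len(edges_remaining(edges, "H")))  # removing the degree-7 node
```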
45. Regulatory approval: the bad news
✤ No NGS platforms have been cleared/approved by FDA
✤ No standards to use for comparison
✤ No clear agreement on bioinformatics handling
✤ Lack of proficiency panels and reference materials
✤ Rapid change
46. Regulatory approval: the good
news
✤ The industry, academia, and agencies (FDA, CAP, NCBI, etc) are actively
collaborating on the issue
✤ Informatics rapidly improving and stabilizing
✤ Clinical relevance studies are ongoing
✤ This is primarily driven by human genomic applications, so HIV/HCV
applications will benefit from the larger effort
✤ The Forum for Collaborative HIV Research has held a series of roundtables
to discuss issues relevant to HIV/HCV research, including the “Next
Generation Sequencing Roundtable” in December 2012.
47. Acknowledgements
UCSD: Davey Smith, Jason Young, Sara Gianella Weibel, Susan Little, Douglas Richman, Richard Haubrich, Gabe Wagner, Lance Hepler
UBC: Richard Harrigan, Art FY Poon
Life Inc: Mary Pacold