Translational Science, Oncology, IMED Biotech Unit, AstraZeneca
IDT Webinar series
April 2018
The quest for high confidence mutations in plasma:
searching for a needle in a haystack
Using unique molecular identifiers in targeted sequencing to maximize sensitivity and
specificity
Iwanka Kozarewa Dan Stetson
Next generation sequencing (NGS) - revolutionizing molecular biology
and healthcare
• In comparison to Sanger sequencing, NGS has:
✓ Massively reduced cost
✓ Massively increased sequencing output and speed of data generation
• This allows broader and deeper profiling of patients’ genomic landscape
which in turn leads to:
✓ A better understanding of disease onset
✓ More precise matching of patients to drugs
✓ Easier monitoring of disease progression
Tissue Molecular Profiling to inform treatment – moving towards
standard practice
IMED Biotech Unit I Translational Science Oncology
From: Frampton et al (2013) Development and validation of a clinical
cancer genomic profiling test based on massively parallel DNA
sequencing. http://www.ncbi.nlm.nih.gov/pubmed/24142049
• Taking tumour tissue samples for various assessments is a
standard part of the pathology workup in oncology.
• Considerable effort has been dedicated to incorporating genomic
analysis within these assessments (at diagnosis or later).
• DNA from FFPE tissue is challenging to work with, but significant progress has been
made to optimize and standardize data generation procedures.
• In Nov 2017 the FDA approved the first NGS-based
companion diagnostic test (FoundationOne CDx).
Liquid biopsy - the Future of Personalised Medicine?
http://liquid-biopsy.gene-quantification.info/
• Liquid biopsies have many advantages over standard biopsies.
• Liquid biopsies can be a source of diverse materials:
✓ cfDNA, including ctDNA
✓ CTCs
✓ exosomes
✓ cfRNA
✓ Among these, ctDNA is currently the most widely used. It has been
used to study tumour evolution (Scherer et al., 2016), to predict response to therapy
(Goldberg et al., 2018) and as an early predictor of relapse (Garcia-Murillas et al.,
2015).
✓ Recently, as part of the CancerSEEK test, ctDNA sequencing has been trialed
for early detection of eight different solid tumours
(http://science.sciencemag.org/content/early/2018/01/17/science.aar3247)
Liquid biopsies were listed as one of the top ten
technology breakthroughs in 2015 by the MIT
Technology Review
(www.technologyreview.com/s/544996/10-breakthrough-technologies-of-2015-where-are-they-now/)
Unique Molecular Identifiers (UMIs) – significant advantage in ctDNA sequencing
workflow
• Problems with plasma material:
✓ Limited quantity: usually at most 5 mL of plasma is available for
extraction, often only 1-2 mL
✓ Uncertain quality: host genomic DNA contamination, both high
molecular weight and fragmented, is always present, at varying
levels
✓ Unknown tumour fraction: the proportion of ctDNA cannot be
predicted prior to analysis
Specificity: standard enrichment and sequencing methods
result in a very high number of false positive variants in the
<1% allele frequency bracket
Sensitivity: considered less of a problem since false
negatives (expected variants not detected) are rarely
observed
Good news:
Most of the ctDNA molecules are amenable
to molecular modifications required for
sequencing
A typical 1 mL plasma sample contains
~3,000 copies of each gene, implying a
sensitivity limit of 1 in 15,000
copies for a 5-mL sample (Leung et al., 2016)
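The arithmetic behind that limit can be sketched as follows. This is a minimal illustration that takes the ~3,000 copies per mL figure from the slide at face value; the function names are hypothetical and extraction or library conversion losses are ignored.

```python
# Assumption from the slide: ~3,000 copies of each gene per mL of plasma.
COPIES_PER_ML = 3_000


def total_copies(plasma_ml, copies_per_ml=COPIES_PER_ML):
    """Copies of a given gene expected in the whole plasma sample."""
    return round(plasma_ml * copies_per_ml)


def lowest_detectable_fraction(plasma_ml):
    """Best-case allele fraction: a single mutant copy in the sample."""
    return 1 / total_copies(plasma_ml)


def expected_mutant_copies(plasma_ml, allele_fraction):
    """Mutant copies expected at a given allele fraction (no losses assumed)."""
    return total_copies(plasma_ml) * allele_fraction


print(total_copies(5))                   # copies in a 5 mL sample
print(lowest_detectable_fraction(5))     # 1 in 15,000
print(expected_mutant_copies(2, 0.001))  # copies of a 0.1% variant in 2 mL
```

With only 1-2 mL of plasma, a 0.1% variant is represented by a handful of molecules, which is why every molecule lost during library preparation matters.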
Protocols have been developed to enrich for
ctDNA from total cfDNA, but so far they
have not been easy to implement and have led to
significant ctDNA losses
Unique Molecular
Tagging - History and
Principle
• Unique molecular identifiers (UMIs) are usually designed
as a string of random nucleotides that forms part of
an adapter.
• The assumption is that each original DNA molecule will be
ligated to an adapter duplex containing a different UMI.
• The amplified products of the original molecule can then be
distinguished from the ones generated from another
template and grouped into consensus read pairs or
‘families’.
• The concept has existed for over a decade (Miner et
al., 2004; McCloskey et al., 2007; Kinde et al., 2011; Schmitt et al.,
2012) and has been commercialized by several companies
(e.g. Agilent Technologies, Roche and Integrated DNA
Technologies).
• Commercial “clinical grade” assays are assumed to use
similar approaches coupled with bespoke informatics.
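The grouping principle above can be sketched in a few lines. This is an illustrative toy, not the pipeline used in this work: reads are represented as hypothetical (UMI, start position, sequence) tuples, and a family is keyed on the UMI together with the mapping position.

```python
from collections import defaultdict


def group_into_families(reads):
    """Group reads that share a UMI and start position into one family.

    reads: iterable of (umi, start_position, sequence) tuples (toy format).
    """
    families = defaultdict(list)
    for umi, start, sequence in reads:
        families[(umi, start)].append(sequence)
    return dict(families)


reads = [
    ("ACGTAC", 100, "TTGACCA"),
    ("ACGTAC", 100, "TTGACCA"),
    ("ACGTAC", 100, "TTGACGA"),  # same molecule, one copy carries an error
    ("GGATCC", 100, "TTGACCA"),  # different original molecule, same position
]

families = group_into_families(reads)
for key, members in families.items():
    print(key, len(members))
```

Without the UMI, all four reads would look like duplicates of a single molecule; with it, two original molecules are recovered.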
[Figure: adapter structure, with 6 nt and 8 nt sequence elements labelled]
Image courtesy of Integrated DNA Technologies (IDT)
Use of UMIs – best laboratory practice
• With the IDT Dual Index product the indices are part of the adapters. This
results in sample tagging at the earliest possible stage of the lab process,
minimizing the risk of cross contamination.
• The dual indexes overcome inaccuracies in sample assignment during
multiplexed sequencing on the Illumina platforms.
• In our workflow, IDT xGen® Dual Index UMI Adapters are used together with
KAPA HyperPrep reagents (Roche).
✓ To facilitate successful library preparation even from minimal or poor quality
material, the starting material is always split between two reactions.
✓ Adapter:insert molar ratios of 200:1, as recommended by KAPA, are used.
However, for starting material <10 ng the adapter molarity is capped at 1.5 µM.
✓ With UMIs, more stringent adapter and primer removal conditions are required:
0.7X AMPure after adapter ligation and 0.9X after pre-enrichment PCR.
✓ With UMIs, it is possible to use a higher number of PCR cycles than with
standard adapters, allowing even small panels to be run on HiSeq and
NovaSeq.
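As a rough worked example of the 200:1 ratio, the sketch below converts input mass to picomoles of insert and then to an adapter stock concentration. Several values are assumptions not stated on the slide: a mean cfDNA fragment length of 170 bp, the usual approximation of 660 g/mol per base pair, a 5 µL adapter volume per reaction, and the reading of "capped" as an upper limit applied below 10 ng.

```python
def insert_pmol(input_ng, mean_insert_bp=170):
    """Picomoles of double-stranded insert (assumes ~660 g/mol per bp)."""
    return input_ng * 1000 / (mean_insert_bp * 660)


def adapter_stock_um(input_ng, ratio=200, adapter_volume_ul=5.0,
                     cap_um=1.5, cap_below_ng=10.0, mean_insert_bp=170):
    """Adapter stock concentration (µM) giving the requested molar ratio.

    adapter_volume_ul and the capping rule are illustrative assumptions.
    """
    adapter_pmol = insert_pmol(input_ng, mean_insert_bp) * ratio
    concentration = adapter_pmol / adapter_volume_ul  # pmol/µL equals µM
    if input_ng < cap_below_ng:
        concentration = min(concentration, cap_um)
    return concentration


for ng in (1, 5, 10, 25):
    print(ng, "ng ->", round(adapter_stock_um(ng), 2), "µM")
```

The per-reaction input should be used here, since the starting material is split between two reactions.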
Images courtesy of Integrated DNA Technologies (IDT)
Use of UMIs – best analysis practice
• For the initial evaluation, we used a set of 8 commercially
acquired plasma samples and an in-house panel (n
genes = 112).
• Compared to the same set analyzed without
UMIs, the number of variants with AF ≤0.6% was reduced from
~11,500 to ~1,000.
• We set the minimum number of reads per consensus family
to 2. Increasing it to 3 did not affect the quality of the calls made.
• We managed to suppress most noise by requiring the sum of the
base qualities to be a minimum of 40, regardless of how many
reads were required to attain that number.
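The two thresholds above can be expressed as a simple per-position filter. This is a sketch under the assumption that both rules are applied together to the Phred qualities of the bases supporting a call within one UMI family; the function name and input format are hypothetical.

```python
MIN_READS_PER_FAMILY = 2  # from the slide
MIN_QUALITY_SUM = 40      # from the slide ("Q40")


def passes_family_filter(base_qualities,
                         min_reads=MIN_READS_PER_FAMILY,
                         min_quality_sum=MIN_QUALITY_SUM):
    """True if a family's supporting bases meet both thresholds.

    base_qualities: Phred scores of the bases supporting the call,
    one per read in the family.
    """
    return (len(base_qualities) >= min_reads
            and sum(base_qualities) >= min_quality_sum)


print(passes_family_filter([30, 30]))      # two good reads
print(passes_family_filter([15, 15]))      # two reads, too little quality
print(passes_family_filter([15, 15, 15]))  # three weaker reads reach 40
print(passes_family_filter([45]))          # single read, below family size
```

Summing qualities lets a small family of high-quality reads and a larger family of weaker reads pass on equal terms.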
UMI sequencing – relation to input mass
• Because of the need to build consensus ‘families’, we
sequence ctDNA to very high depth (1,000-25,000
prior to deduplication). The depth after
deduplication tends to be higher for samples with more input
material, but the increase is not linear.
• For samples with input <10 ng, ~30-40 million reads (exact
number depending on panel) are sufficient to achieve maximum
depth; further sequencing only increases PCR
duplicates.
• We found variant concordance to be better in samples
with higher input material.
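One way to see why deduplicated depth saturates is a simple sampling model, sketched below. It is an idealised approximation, not a description of the data above: it assumes ~3.3 pg of DNA per haploid genome copy, uniform amplification, and a hypothetical library conversion efficiency parameter.

```python
import math

HAPLOID_GENOME_PG = 3.3  # approximate mass of one haploid human genome


def genome_copies(input_ng):
    """Haploid genome copies in the input (upper bound on unique depth)."""
    return input_ng * 1000 / HAPLOID_GENOME_PG


def expected_unique_depth(input_ng, raw_depth, conversion=1.0):
    """Expected deduplicated depth under uniform (Poisson) sampling.

    conversion: assumed fraction of molecules converted into library.
    """
    molecules = genome_copies(input_ng) * conversion
    return molecules * (1 - math.exp(-raw_depth / molecules))


for raw in (1_000, 5_000, 25_000):
    print(raw, round(expected_unique_depth(5, raw)),
          round(expected_unique_depth(10, raw)))
```

In this model, extra raw depth on a 5 ng sample soon adds only duplicates, and doubling the input at a fixed raw depth raises the unique depth by less than a factor of two.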
UMI analysis and panel size - irrespective of panel size, applying the Q40 filter
reduces the number of variants reported
[Figure: variant counts by binned AF% for two panels (n=613 and n=303). Values extracted from the figure: 89109, 30596, 9363]
Use of UMIs – downstream manual and automated
data curation
• Manual curation is done in several stages:
i. Mapping quality in the vicinity of the variants
is checked.
ii. Reporting of any damage bias or callability
issues against the variant is checked.
• In the absence of matched or germline data,
the presence of the variant in internal
and external datasets guides the decision on whether
the variant is germline or somatic and on its expected
functional impact.
• Cohort analysis can suggest whether the variant of
interest is a sequencing or enrichment artifact.
• Our excellent bioinformatics team continues to
work on automation of the manual curation steps
with very promising results so far.
[Figure panels: damage bias reporting; callability issue reporting]
As a result of manual curation the number of variants per patient
sample generally drops from thousands to ~20.
In addition to single nucleotide variants, amplifications can be
detected in samples with a good ctDNA fraction (>20% tumour)
Machine learning approach to assign confidence to variants
• When there is no consensus among the reads of a UMI
family, the base at that position is replaced with an
N (no call). These no calls, along with poor
mappability regions, are visible in the NGB genome
browser.
• We are using a convolutional neural network
(CNN), built with Keras (keras.io), to classify the browser images into ‘clean’
and ‘noisy’ and assign a confidence value to the
variant.
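The no-call rule in the first bullet can be sketched as a column-wise consensus. This is an illustration only: it assumes the reads of a family are already aligned and of equal length, and the `min_agreement` threshold (unanimity by default) is a hypothetical parameter rather than the setting used here.

```python
from collections import Counter


def family_consensus(family, min_agreement=1.0):
    """Column-wise consensus of aligned reads; 'N' where agreement is too low."""
    consensus = []
    for column in zip(*family):
        base, count = Counter(column).most_common(1)[0]
        agreed = count / len(column) >= min_agreement
        consensus.append(base if agreed else "N")
    return "".join(consensus)


family = ["ACGT", "ACGT", "ACTT"]
print(family_consensus(family))                     # ACNT
print(family_consensus(family, min_agreement=0.6))  # ACGT
```

Positions masked with N are exactly what shows up as ‘noisy’ columns when the family is viewed in a genome browser.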
EGFR C797S Resistance Mutation – one of the first examples highlighting the power of
ctDNA sequencing
Slide courtesy of Brian Dougherty (Oncology TS)
Availability of multiple ‘longitudinal’ samples from a given patient (usually consecutive
time points on a study) greatly facilitates confident calling of variants with <1% AF –
example 1
• During exploratory sequencing of ctDNA samples
from an ongoing AZ-sponsored early phase trial in
mixed solid tumours, nine (9) different ESR1
mutations were found in a single ER-positive breast
cancer patient.
✓ These included 7 known activating and 2 novel mutations
✓ The frequency of the mutations ranged from 0.17 to
1.84%
✓ They were detected in only two of the three time points
tested
✓ No ESR1 mutations were detected in any other patient
samples in this cohort (n=10 pts)
Availability of multiple ‘longitudinal’ samples from a given patient (usually consecutive
time points on a study) greatly facilitates confident calling of variants with <1% AF –
example 2
[Figure: allele frequency (%) by treatment cycle]
• On the same trial, a sub-1% activating EGFR
variant was reported in a single NSCLC patient.
✓ Reported in 2 of 127 SCCHN cases by
Schwentner, I. et al. (2008)
✓ Noted as an activating mutation in prostate cancer
by Cai et al. (2008)
✓ Very recently reported as a novel acquired &
dominant mutation in a single osimertinib-treated
NSCLC patient by Ou et al. (2017)
Looking forward - Duplex
Seq principle
• Duplex molecular identifiers (UMIs) are designed to capture
and label each strand of the original parental molecule.
• By labeling both strands of the original parental molecule, only
alterations present in both strands are accepted. This is
expected to be especially beneficial in distinguishing real
variants from ‘damage-caused’ artifacts, e.g. ones generated
by aging, formalin preservation or high temperature during
hybridization.
• The informatics workflow requires additional processing steps
to eliminate background noise. Newman et al. (2016) show that by
using all of the duplex steps, 98% of noise can be removed.
• Various duplex sequencing concepts have been proposed
(Schmitt et al., 2012; Bettegowda et al., 2014; Newman et al., 2016).
Newman et al., Nat Biotechnol, 2016
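The both-strands rule can be sketched as follows. This is a toy illustration rather than any published duplex pipeline: it assumes the single-strand consensus sequences of the top and bottom strands have already been built and are expressed in the same reference orientation; the function names are hypothetical.

```python
def duplex_consensus(top, bottom):
    """Keep a base only where both strand consensuses agree; else 'N'."""
    return "".join(a if a == b else "N" for a, b in zip(top, bottom))


def duplex_variants(reference, top, bottom):
    """Variants (position, ref, alt) supported by both strands."""
    duplex = duplex_consensus(top, bottom)
    return [
        (position, ref, alt)
        for position, (ref, alt) in enumerate(zip(reference, duplex))
        if alt != "N" and alt != ref
    ]


reference = "ACGTAC"
top = "ACTTAC"     # G>T at position 2
bottom = "ACTTAT"  # G>T at position 2, plus C>T at position 5 on one strand

print(duplex_consensus(top, bottom))            # ACTTAN
print(duplex_variants(reference, top, bottom))  # [(2, 'G', 'T')]
```

The single-strand C>T change, typical of damage such as deamination, is masked, while the change seen on both strands is kept.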
Looking forward - Duplex
UMIs
[Figure: adapter structure, with 6 nt and 8 nt sequence elements labelled]
• Duplex UMI adapters provide an additional
layer of sensitivity, tagging each end of the
double-stranded molecule with its own
respective UMI sequence
• UMIs ligated to the 5’ and 3’ ends of each
strand are distinct from each other
• UMI sequences are randomly ligated to the
insert prior to PCR
• i7 and i5 indices are incorporated during
PCR
• Each originating molecule will have
strand-specific traceability, helping to further reduce
false positives, artifacts and noise
• Will be valuable in detecting low AF and novel
variants in cfDNA with stronger confidence
[Figure: Dual Indexed UMI vs. Duplex UMI adapters]
Images courtesy of Integrated DNA Technologies (IDT)
Take Home Message
• Identification of somatic, tumour-specific variants and copy number alterations is feasible
in ctDNA from different indications, obtained at different time points on a study.
• Use of dual indexes coupled with unique molecule tagging greatly facilitates true positive
identification.
• At the current stage of the sequencing technology, in-depth manual or automated
curation is required for confident identification of <1% variants.
• Beyond the current UMI strategy, Duplex sequencing may improve our internal NGS
sensitivity for all NGS-based clinical trial projects.
Acknowledgements
Bioinformatics &
Production Informatics
Sally Luke
Manasa Surakala
Krishna Bulusu
Avinash Reddy
Miika Ahdesmaki
Justin Johnson
Translational Genomics
Vimbayi Madamombe
Alan Barnicle
Hedley Carr
Barrett Nuttall
Amelia Raymond
Brian Dougherty
Translational Science Leadership and
Strategists
Liz Harrington
Andy Pierce
Carl Barrett
ALL PATIENTS AND THEIR FAMILIES
Confidentiality Notice
This file is private and may contain confidential and proprietary information. If you have received this file in error, please notify us and remove
it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the
contents of this file is not permitted and may be unlawful. AstraZeneca PLC, 2 Kingdom Street, London, W2 6BD, UK, T: +44(0)20 7604 8000,
F: +44 (0)20 7604 8151, www.astrazeneca.com