CancerSeek

Detection and localization of surgically resectable cancer
with a multi-analyte blood test
Science 2018
Bioinformatics Journal Club 04/18/2018
Thi Nguyen, Ph.D. Candidate
Graduate Biomedical Sciences | Immunology Theme
University of Alabama at Birmingham (UAB)
kimthi@uab.edu

Outline
1. Authors
2. Liquid biopsies / ctDNA
3. Study design / sample collection
4. Technologies: Safe-SeqS and Bioplex-200
5. CancerSEEK algorithm
6. Figures
7. Conclusion + Limitations

Authors
Joshua Cohen
• MD/Ph.D. student at Johns Hopkins University School of Medicine
• MSTP in biomedical engineering; mentors: Bert Vogelstein and Kenneth
Kinzler at the Ludwig Center.
• BS in chemical-biological engineering at MIT and M.Phil. in computational
biology at the University of Cambridge, UK.
Nickolas Papadopoulos
• Oncology Professor at Johns Hopkins
• International expert in cancer diagnostics
• Discovered the genetic basis of predisposition to hereditary nonpolyposis
colon cancer
• Scientific advisor at Personal Genome Diagnostics, Inc.

Cancer screening test
Sensitivity = true positives / (true positives + false negatives)
• ability to correctly identify those who have cancer among the population with cancer
Specificity = true negatives / (true negatives + false positives)
• ability to correctly identify those who do not have cancer among the population without cancer
1. Non-blood based
• Pap screening
• Colonoscopy
• Mammography
• Cervical cytology
• CT scan
2. Blood-based:
• Protein biomarkers
• ctDNA
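The sensitivity and specificity definitions on this slide can be checked with a few lines of Python. This is a minimal sketch; the counts are hypothetical and are not results from the study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of people with cancer that the test correctly flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of people without cancer that the test correctly clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, for illustration only.
tp, fn = 70, 30    # 100 people with cancer
tn, fp = 990, 10   # 1000 people without cancer

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
```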

Liquid Biopsies
• Liquid biopsy: “a test done on a sample of blood to
look for cancer cells from a tumor that are
circulating in the blood or pieces of DNA from
tumor cells that are in the blood.” (1)
• Analytes in a liquid biopsy from blood: ctDNA, CTCs, exosomes,
proteins, miRNA, mRNA, metabolites.
• Sample sources: blood, urine, or other body fluids
1. Definition of liquid biopsy - NCI Dictionary of Cancer Terms - National Cancer Institute.
Calaroma information 2017

cfDNA vs ctDNA
cfDNA (cell-free DNA)
• Non-encapsulated DNA fragments of 100-300 bp
• t1/2 ~ 2 h for ctDNA and ~ 1 h for fetal-derived cfDNA
• Source: dead or dying cells (necrosis/apoptosis)
• Used in noninvasive prenatal diagnostics and cancer assessment
• Concentration in blood varies and increases with the size of the fetus/tumor
ctDNA (circulating tumor DNA)
• How is it specific to the tumor? Cancer-specific mutations.
• As a biomarker: real-time, non-invasive, covers multiple lesions, potentially cheaper than biopsies
• Mutant DNA is often at low concentration in a sea of wild-type DNA, especially in early-stage
cancer. E.g., early-stage disease has < 1 mutant template/ml plasma -> beyond the detection limit (0.1%)
• Mutation information alone is not enough to predict the tissue of origin -> challenge
for follow-up tests

Study cohort
Table S11. Cancer patients evaluated in this study by tumor type and stage.

Tumor type    AJCC stage    Patients (n)    Proportion of cases (%)
Breast        I             32              15
              II            114             55
              III           63              30
              I-III         209             --
Colorectum    I             77              20
              II            191             49
              III           120             31
              I-III         388             --
Esophagus     I             5               11
              II            29              64
              III           11              24
              I-III         45              --
Liver         I             5               11
              II            19              43
              III           20              45
              I-III         44              --
Lung          I             46              44
              II            27              26
              III           31              30
              I-III         104             --
Ovary         I             9               17
              II            4               7
              III           41              76
              I-III         54              --
Pancreas      I             4               4
              II            83              89
              III           6               6
              I-III         93              --
Stomach       I             21              31
              II            30              44
              III           17              25
              I-III         68              --

• 1005 patients
• 8 cancer types: stage II (49%), stage III (31%),
and stage I (20%)
• Patients with neoadjuvant chemotherapy or metastasis were excluded
• Median age = 64 (range 22-93)
Control group:
• 812 “healthy” controls
• Median age = 55 (range 17-88)
• Criteria: no known history of cancer, high-grade dysplasia,
autoimmune or chronic kidney disease.

Sample processing
[Flowchart]
• Patients (n = 1005) -> blood -> plasma (7.5 ml) and PBMC / white blood cells
• Plasma -> cell-free DNA (QIAsymphony) -> PCR -> 1% of PCR products -> PCR (21 cycles) -> MiSeq/HiSeq
• Plasma -> Bioplex 200 -> protein concentrations
• White blood cells -> DNA (wild-type DNA)
• Tumor biopsies (n = 153) -> FFPE DNA -> MiSeq/HiSeq: 90% concordance in mutations

Sample identification
• Goal: confirm that plasma, WBC, and tumor DNA were from the same patient
• Primers amplify ~38,000 unique LINEs (long interspersed nucleotide elements)
• The LINEs contain 26,220 common polymorphisms, which can establish or refute sample identity.
• Concordance = number of matched polymorphic sites / total number of
genotypes that have adequate coverage in both samples.
• Match criteria: concordance > 0.99 and at least 5,000 amplicons with adequate
coverage
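The concordance check described above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: genotypes are stored in plain dictionaries, `None` marks a site without adequate coverage, and the coverage requirement is counted per site rather than per amplicon. The function names and the `min_covered` parameter are hypothetical.

```python
from typing import Dict, Optional, Tuple

Genotypes = Dict[str, Optional[str]]  # site id -> genotype, None = low coverage


def concordance(a: Genotypes, b: Genotypes) -> Tuple[float, int]:
    """Return (concordance, number of sites with coverage in both samples)."""
    shared = [s for s in a if a[s] is not None and b.get(s) is not None]
    if not shared:
        return 0.0, 0
    matched = sum(a[s] == b[s] for s in shared)
    return matched / len(shared), len(shared)


def same_patient(a: Genotypes, b: Genotypes,
                 min_concordance: float, min_covered: int) -> bool:
    """Apply the two match criteria: high concordance and enough covered sites."""
    value, covered = concordance(a, b)
    return value > min_concordance and covered >= min_covered


# Toy genotypes (hypothetical); real data would span thousands of sites.
plasma = {"s1": "AG", "s2": "GG", "s3": "CT", "s4": None}
wbc = {"s1": "AG", "s2": "GG", "s3": "CT", "s4": "AA"}
other = {"s1": "AA", "s2": "GG", "s3": "CC", "s4": "AA"}

print(concordance(plasma, wbc))
print(same_patient(plasma, wbc, min_concordance=0.99, min_covered=3))
print(same_patient(plasma, other, min_concordance=0.99, min_covered=3))
```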

Safe-SeqS
Multiplex PCR to detect and quantify rare mutations
Safe-SeqS procedure
1. Each fragment is assigned a unique identifier (UID)
DNA sequence (green or blue bars)
2. The uniquely tagged fragments are amplified, producing
UID families
3. A supermutant = a UID family in which ≥95% of members
have the same mutation.
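The UID-family idea can be illustrated with a short sketch. It assumes reads are already reduced to (UID, allele) pairs with "WT" for wild type, and it ignores details a real pipeline would need, such as a minimum family size. The function name and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict


def call_supermutants(reads, min_fraction=0.95):
    """Group reads by UID and keep families dominated by one mutant allele.

    reads: iterable of (uid, allele) pairs; allele is "WT" for wild type.
    Returns {uid: allele} for families where a single mutant allele is
    present in at least min_fraction of the family's reads.
    """
    families = defaultdict(list)
    for uid, allele in reads:
        families[uid].append(allele)

    supermutants = {}
    for uid, alleles in families.items():
        allele, count = Counter(alleles).most_common(1)[0]
        if allele != "WT" and count / len(alleles) >= min_fraction:
            supermutants[uid] = allele
    return supermutants


reads = (
    [("UID1", "G12D")] * 20                       # mutation in every read
    + [("UID2", "WT")] * 19 + [("UID2", "G12D")]  # a single erroneous read
    + [("UID3", "WT")] * 10                       # wild-type template
)
print(call_supermutants(reads))
```

Only the first family qualifies: the lone mutant read in the second family is treated as an amplification or sequencing error because it does not dominate its UID family.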
Authors’ PCR procedure
• 61 primer pairs -> 2 sets of primers (28 + 33 pairs)
• Plasma DNA divided -> 6 independent reactions
1/ reduces the complexity of the template to better detect rare alleles
2/ provides duplicate signals
• Initial amplification (15 cycles) -> 1% of PCR products
• 2nd amplification (21 cycles) -> Illumina MiSeq/HiSeq
Sensitivity = 9 in 1 million

Mutation detection and analysis
Mutation detection
• Reads were matched to reference sequences using custom scripts:
• https://github.com/InSilicoSolutions/SafeSeqS
• Reads from a common template molecule were grouped based on UID
• Artefactual mutations were removed by requiring a mutation to be present in > 90% of reads in
each UID family
• Redundant reads from optical duplication were removed by requiring reads to be at least
5000 pixels apart when located on the same tile.
• Mutations had to meet one of 2 criteria to be considered: (1) present in the
COSMIC database or (2) predicted to be inactivating in tumor suppressor genes.
• Synonymous mutations (except those at exon ends) and intronic mutations (except
those at splice sites) were excluded.
Mutation analysis
• Mutant allele frequency (MAF) = mutant fraction per well.
• MAF in a sample = sum of supermutants in 6 wells / total number of UIDs in 6 wells
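As a worked example of the two MAF definitions, the sketch below uses the (supermutant, UID) counts from the KRAS example given later in this deck. The helper names are illustrative only.

```python
# (supermutants, UIDs) per well, from the KRAS example in the algorithm slides.
wells = [(161, 3755), (78, 2198), (99, 2966), (84, 2013), (177, 3694), (117, 3427)]


def well_mafs(wells):
    """MAF per well = supermutants / UIDs in that well."""
    return [supermutants / uids for supermutants, uids in wells]


def sample_maf(wells):
    """MAF per sample = total supermutants / total UIDs across all wells."""
    total_supermutants = sum(s for s, _ in wells)
    total_uids = sum(u for _, u in wells)
    return total_supermutants / total_uids


print([round(m, 3) for m in well_mafs(wells)])
print(round(sample_maf(wells), 4))
```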

Bioplex-200
• xMAP technology multiplexes up to 100 different analytes per sample
• 100 color-coded magnetic beads are created by using 2 fluorescent dyes
at distinct concentration ratios.
Houser, B. (2012). Bio-Rad’s Bio-Plex® suspension array system, xMAP technology overview.
Archives of Physiology and Biochemistry,
[Figure labels: magnetic bead; charge-coupled device (CCD) technology]

Approach
• CancerSEEK approach: combined gene + protein biomarkers
• Features
1. Gene: 61-amplicon panel covering 16 genes: NRAS, CTNNB1, PIK3CA,
FBXW7, APC, EGFR, BRAF, CDKN2A, PTEN, FGFR2, HRAS, KRAS, AKT1,
TP53, PPP2R1A, GNAS
2. Protein: literature search for proteins that detect at least 1 of the 8
cancer types with > 10% sensitivity at 99% specificity: list of 41
proteins (39 could be reproducibly evaluated) -> narrowed down to
8 proteins for the test

CancerSEEK overview
“Cancer detection: Seeking signals
in blood.” Mark Kalinich and Daniel
A. Haber. Science. 2018

CancerSEEK algorithm-1
1. Mutant allele frequency (MAF) normalization:
• MAF = # supermutants / # UIDs in the same well
• Normalized by the MAFs observed (for each mutation) in a training set composed of normal
controls + 256 healthy WBC samples.
• Wells with < 100 UIDs: MAF set to zero
• Average MAF = ave_i for each mutation i = 1, …, n
• 25th percentile of the ave_i distribution -> ave_ref
• Normalized MAF = MAF * (ave_ref / ave_i)
2. Reference distributions and p-values:
• UID counts were split into 10 intervals (< 1000, 1000-2000, …, > 9000)
• Within the matching UID interval, each MAF was compared to 2 reference distributions,
(normal + 256 healthy WBC) or cancer patients in the training set, using 10-fold
cross-validation -> pN and pC values.
“The classification of a sample's ctDNA status was obtained from a statistical test comparing
the normalized mutation frequencies of the sample of interest to the distributions of the
normalized mutation frequencies of, respectively, normal and cancer samples in the training
set.”
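The MAF normalization in step 1 can be sketched in a few lines. This is an illustration under assumptions: the training MAFs and mutation names are made up, the 25th percentile uses Python's "inclusive" quantile method (the slide does not say which method was used), and a mutation with zero average background is left unscaled.

```python
from statistics import mean, quantiles


def reference_averages(training_mafs):
    """training_mafs: {mutation: [MAF in each training sample]}.

    Returns (ave, ave_ref): the per-mutation average MAF and the 25th
    percentile of those averages. Needs at least two mutations.
    """
    ave = {mutation: mean(values) for mutation, values in training_mafs.items()}
    ave_ref = quantiles(list(ave.values()), n=4, method="inclusive")[0]
    return ave, ave_ref


def normalize_maf(maf, n_uids, mutation, ave, ave_ref, min_uids=100):
    """Normalized MAF = MAF * (ave_ref / ave_i); zero if too few UIDs."""
    if n_uids < min_uids:
        return 0.0
    if ave[mutation] == 0:
        return maf  # assumption: nothing to rescale without background signal
    return maf * ave_ref / ave[mutation]


# Hypothetical training data for five mutations.
training = {
    "mutA": [0.001, 0.003],
    "mutB": [0.0002, 0.0002],
    "mutC": [0.0004, 0.0008],
    "mutD": [0.0001, 0.0001],
    "mutE": [0.0005, 0.0005],
}
ave, ave_ref = reference_averages(training)
print(ave_ref)
print(normalize_maf(0.004, 2000, "mutA", ave, ave_ref))
print(normalize_maf(0.004, 50, "mutA", ave, ave_ref))
```

A mutation with a noisy background (high ave_i) is scaled down, so the same raw MAF counts for less than it would at a quiet position.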

3. Log ratios and omega scores
• pC/pN was calculated for each mutation in each well (the min and max of the 6 wells were omitted)
• Log ratios are weighted by w_i = # UIDs / total UIDs for mutation i
Example for a KRAS mutation:
Ø The numbers of supermutants and UIDs in the six wells were
(161, 3755), (78, 2198), (99, 2966), (84, 2013), (177, 3694), (117, 3427), respectively.
Ø 6 MAFs: (0.043, 0.035, 0.033, 0.042, 0.048, 0.034),
or (0.0057, 0.0047, 0.0044, 0.0056, 0.0064, 0.0045) after normalization.
Ø pN = (1.06E-06, 5.70E-06, 1.02E-05, 1.03E-06, 3.09E-07, 8.83E-06)
Ø pC = (0.100, 0.124, 0.128, 0.114, 0.094, 0.112)
Ø pC / pN = (94243, 21716, 12510, 110752, 305090, 12680).
Ø Eliminate the min (12510) and max (305090)
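The KRAS example can be carried one step further in code. The formula itself is not reproduced in these notes, so the sketch below rests on two assumptions: the omega score is the UID-weighted sum of the natural log of each remaining well's ratio, and the weights are normalized over the four wells left after dropping the min and max. The function name is hypothetical.

```python
import math

# Per-well UID counts and p-value ratios from the KRAS example.
uids = [3755, 2198, 2966, 2013, 3694, 3427]
ratios = [94243, 21716, 12510, 110752, 305090, 12680]


def omega_score(ratios, uids):
    """UID-weighted sum of log ratios after dropping the min and max wells.

    Assumption: weights are each kept well's UIDs divided by the total UIDs
    of the kept wells, and the log is the natural log.
    """
    order = sorted(range(len(ratios)), key=lambda i: ratios[i])
    kept = order[1:-1]  # drop the wells with the smallest and largest ratio
    total_uids = sum(uids[i] for i in kept)
    return sum(uids[i] / total_uids * math.log(ratios[i]) for i in kept)


print(round(omega_score(ratios, uids), 2))
```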

Highest 𝜴 scores (table S5)

Lowest 𝜴 scores (table S5)

4. Protein normalization and transformation:
• Set all values below the lower limit of detection to m
• Set all values above the upper limit of detection to M
• Further transformation: if a protein concentration is below the 95th percentile of normal samples in
the training set, set it to 0; otherwise keep the original value
5. Logistic regression:
• Ω score + 8 protein concentrations (CA-125, CA19-9, CEA, HGF, MPO, OPN, PRL, TIMP-1)
• Selection of 8 of 39 proteins:
1/ Eliminate any protein with a higher median value in normal samples: 39 -> 26 left
2/ Forward selection: each protein was dropped in turn, and the decrease in accuracy of the test was
checked -> importance of each protein
3/ Perform 10 rounds of 10-fold cross-validation
6. Tissue localization:
• Random forest to predict cancer type using the Ω score + 8 proteins + 31 other proteins +
gender.
• Classification calls were obtained in an average round of 10-fold CV.
• Concordance between mutations in plasma vs. tumor was considered only when Ω > 3
and the primary tumor contained any mutation with MAF > 5%
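The protein normalization and transformation step can be sketched as a clamp followed by a threshold. This is an illustration under assumptions: the detection limits and concentrations are invented, and the 95th percentile uses Python's "inclusive" quantile method, which may differ from the authors' choice.

```python
from statistics import quantiles


def clamp(value, lower, upper):
    """Pull values outside the detection limits back to the nearest limit."""
    return max(lower, min(value, upper))


def percentile_95(values):
    """95th percentile: the last of the 19 cut points for n=20."""
    return quantiles(values, n=20, method="inclusive")[18]


def transform(value, lower, upper, normal_p95):
    """Zero out concentrations below the 95th percentile of normal samples."""
    clamped = clamp(value, lower, upper)
    return 0.0 if clamped < normal_p95 else clamped


# Hypothetical concentrations in normal training samples and detection limits.
normal_samples = list(range(1, 101))
lower_limit, upper_limit = 1, 1000
p95 = percentile_95(normal_samples)

for concentration in (50, 200, 5000):
    print(concentration, "->", transform(concentration, lower_limit, upper_limit, p95))
```

The effect is that a protein contributes to the classifier only when its level is unusually high relative to normal controls.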

Fig. 1. Rationale:
Challenges to design PCR-based mutation detection test
1. The test must query a sufficient number of bases to allow detection
of a large number of cancers
2. Each base must be sequenced thousands of times to detect low
prevalence mutations
3. However, there must be a limit on the number of bases to reduce
artefactual mutations
4. Cost-effective, amenable to high throughput

Fig.1: What is the minimum number of amplicons required to detect
at least 1 driver mutation?
Fig.1. Evaluate the 61 amplicon panel
• Curve = proportions of cancers detected as # of amplicons
increased
• Dots = fraction of cancers detected using the 61- amplicon panel
Sup. Fig. 1: Distribution of number
of detectable mutations in 805
tumors

Fig. 2. Performance of CancerSEEK
(A) ROC curve for CancerSEEK.
Red dot = test average performance at
>99% specificity.
(B) Median sensitivity by stage
• Error bar = SE of the median
(C) Sensitivity by tumor type
• Error bars = 95% CI
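Reading sensitivity off a ROC curve at a fixed specificity amounts to picking a score threshold on the controls and then counting the cases above it. The sketch below shows that idea with made-up scores; it is not the authors' procedure, and the function names are hypothetical.

```python
def threshold_for_specificity(control_scores, target=0.99):
    """Smallest observed control score t such that calling 'score > t'
    positive keeps specificity >= target on the controls."""
    ordered = sorted(control_scores)
    n = len(ordered)
    max_false_pos = int((1 - target) * n + 1e-9)  # tolerated false positives
    return ordered[n - max_false_pos - 1]


def fraction_above(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


# Hypothetical classifier scores.
controls = [i / 1000 for i in range(200)]     # 0.000 ... 0.199
cases = [0.1 + i / 100 for i in range(100)]   # 0.10 ... 1.09

t = threshold_for_specificity(controls, target=0.99)
spec = 1 - fraction_above(controls, t)
sens = fraction_above(cases, t)
print(f"threshold = {t}, specificity = {spec:.3f}, sensitivity = {sens:.2f}")
```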

Sup. Fig. 2. Performance of CancerSEEK
Table S9. Logistic regression model coefficients and
importance scores.
Feature            Logistic regression coefficient    Importance score
Ω score            1.77E+00                           7.55E+00
CA-125             4.15E-02                           1.37E+00
CEA                2.33E-04                           1.17E+00
CA19-9             1.20E-02                           5.18E-01
Prolactin          3.51E-05                           4.76E-01
HGF                2.45E-03                           3.03E-01
OPN                1.45E-05                           1.72E-01
Myeloperoxidase    5.40E-03                           9.31E-02
TIMP-1             7.34E-06                           7.05E-02

Logistic regression
• Goal: predict a binary outcome from a set of independent variables (used to classify samples)
• Instead of fitting a line to the data (linear regression), logistic regression fits an S-shaped logistic
curve, which is limited to values between 0 and 1.
• The curve is constructed using the natural logarithm of the odds of the target variable.
Therefore, ln(p / (1 - p)) = b0 + b1*x1 + … + bk*xk
• Logistic regression can have many independent variables (numerical/categorical…)
• Just as least-squares regression is used to estimate the coefficients that best fit a linear regression,
logistic regression uses maximum likelihood estimation to obtain the best-fit coefficients. After
the initial function is estimated, the process is repeated until the log likelihood does not change
significantly.
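The link between the log-odds and the S-shaped curve can be made concrete with a short sketch. The feature names, coefficients, and intercept below are hypothetical and are not the fitted CancerSEEK model.

```python
import math


def logistic(z: float) -> float:
    """Map a linear predictor z (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p: float) -> float:
    """Inverse of the logistic function: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def predict_probability(features, coefficients, intercept):
    """Probability from a linear combination of features."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return logistic(z)


# Hypothetical model with two features.
coefficients = {"score": 1.5, "protein": 0.02}
intercept = -4.0

print(round(predict_probability({"score": 0.5, "protein": 10.0}, coefficients, intercept), 3))
print(round(predict_probability({"score": 4.0, "protein": 80.0}, coefficients, intercept), 3))
```

Whatever the size of the linear predictor, the output stays between 0 and 1, which is why the curve suits binary classification.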

Sup. Fig. 3. PCA of ctDNA and 8 proteins clusters cancer patients vs. controls.

Sup. Fig. 4. Effect of each CancerSEEK feature on sensitivity
Each panel displays the
sensitivity achieved when a
particular feature is excluded
from the logistic regression

Fig. 3. Supervised Machine learning to identify cancer type
• Percentages = proportions of patients
correctly classified by 1 of the 2 most
likely types (sum of light and dark
blue bars) or by the most likely type
(light blue bar).
• Error bars = 95% CI
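The two bar heights in this figure correspond to top-1 and top-2 accuracy of the tissue classifier. A minimal sketch of that metric follows; the predicted probabilities are invented and do not come from the study's random forest.

```python
def top_k_accuracy(true_labels, class_probs, k):
    """Fraction of samples whose true label is among the k most probable classes."""
    hits = 0
    for truth, probs in zip(true_labels, class_probs):
        ranked = sorted(probs, key=probs.get, reverse=True)
        hits += truth in ranked[:k]
    return hits / len(true_labels)


# Hypothetical predictions for four patients.
truth = ["Colorectum", "Ovary", "Lung", "Liver"]
predicted = [
    {"Colorectum": 0.7, "Ovary": 0.2, "Lung": 0.1},
    {"Colorectum": 0.5, "Ovary": 0.4, "Lung": 0.1},
    {"Colorectum": 0.2, "Ovary": 0.3, "Lung": 0.5},
    {"Colorectum": 0.6, "Ovary": 0.3, "Liver": 0.1},
]

print("top-1:", top_k_accuracy(truth, predicted, k=1))
print("top-2:", top_k_accuracy(truth, predicted, k=2))
```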

Conclusion
• CancerSEEK = multi-analyte blood
test that can detect the presence of 8
common solid tumors (an estimated 60%
of cancer deaths in the US) by
combining 8 protein biomarkers with
genetic biomarkers (61 amplicons of
16 genes)
• Estimated cost: < $500
This study lays a foundation for a single multi-analyte blood test that combines other blood
biomarkers (metabolites, mRNA, miRNA, and methylated DNA) to detect cancer for early
intervention.

Limitations of study
1. The patient cohort consists of individuals with known cancers and marked symptoms. In a true
screening setting, patients would have less advanced disease and the sensitivity would
most likely be lower than estimated here.
2. Controls are healthy individuals, whereas in a true screening setting some individuals
might have inflammatory or other diseases that could result in a greater proportion of
false-positive results.
3. No independent validation cohort
4. Only 8 cancer types were examined
