Investigation of Clonal Hematopoiesis in Whole-Exome Sequencing of ClinSeq Individuals
Dahlia Shvets, James C. Mullikin, Leslie Biesecker, and Nancy F. Hansen
Comparative Genomics Analysis Unit, Cancer Genetics and Comparative Genomics Branch, National Human Genome Research Institute (NHGRI)
1. Abstract
Cancer is thought to arise from the gradual
accumulation of specific genetic mutations,
sometimes years before the presence of clinical
symptoms. Early mutations can result in clonal
expansion of mutated stem or progenitor cells.
Clonal expansion subsequently increases the
likelihood of cooperating mutations occurring in
cells already harboring initiating mutations.
Clonal hematopoiesis--the clonal expansion of
hematopoietic stem cells--may signal the onset
of many hematologic cancers. In previous work,
clonal hematopoiesis has been shown to occur
with higher incidence in the elderly and is a risk
factor for later hematopoietic cancers.
Two recent studies published in the New England Journal of Medicine, by Genovese et al. and Jaiswal et al., performed large-scale analyses searching for recurrent somatic mutations in whole-exome sequencing of DNA isolated from blood. These studies conclude that clonal hematopoiesis with somatic mutations can be detected via DNA sequencing, that it increases in prevalence with age, and that it is associated with an increased risk of hematologic cancer. We aim to replicate their work using whole-exome sequences from 1,001 individuals in the ClinSeq cohort. We examine blood-derived DNA sequence data from ClinSeq individuals to identify the genes and mutations that may drive clonal expansion.
Prior to this work, the ClinSeq data had already been aligned with NovoAlign to the GRCh37 human reference sequence. We used the program LoFreq to call low-frequency variants from these alignments. Following Genovese et al., we attempted to remove unreliable data from our analysis by excluding genomic regions of low complexity, excess coverage, segmental duplications, and known large insertions, as well as sites failing Hardy-Weinberg equilibrium tests. Then, to separate germline from somatic variants, we performed binomial tests of the null hypothesis that the true allelic fraction is 50%, with a false discovery rate correction using the Benjamini-Hochberg method. In addition, any variants occurring with an allele fraction of less than 10% or occurring three or more times in the cohort were removed.
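The germline-versus-somatic step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the poster: the function names, the dictionary keys (`variant`, `alt_reads`, `depth`), the two-sided form of the test, and the choice to count recurrence over all input calls are all hypothetical. The defaults (FDR 0.05, allele fraction 10%, at most two occurrences) follow the cutoffs quoted on the poster.

```python
from collections import Counter
from math import exp, lgamma, log


def binomial_two_sided_p(alt_reads, depth, p=0.5):
    """Exact two-sided binomial p-value for the null 'allelic fraction = p'.

    Sums the probabilities of every outcome that is no more likely than the
    observed one. Two-sidedness is an assumption of this sketch.
    """
    def log_pmf(k):
        return (lgamma(depth + 1) - lgamma(k + 1) - lgamma(depth - k + 1)
                + k * log(p) + (depth - k) * log(1 - p))

    observed = log_pmf(alt_reads)
    tail = sum(exp(log_pmf(k)) for k in range(depth + 1)
               if log_pmf(k) <= observed + 1e-9)
    return min(1.0, tail)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def putative_somatic(calls, fdr=0.05, min_af=0.10, max_occurrences=2):
    """Keep calls that reject the 50% null, pass the allele-fraction floor,
    and are not recurrent in the cohort (hypothetical record layout)."""
    p_values = [binomial_two_sided_p(c["alt_reads"], c["depth"]) for c in calls]
    q_values = benjamini_hochberg(p_values)
    # Assumption: recurrence is counted over all input calls.
    occurrences = Counter(c["variant"] for c in calls)
    kept = []
    for call, q in zip(calls, q_values):
        allele_fraction = call["alt_reads"] / call["depth"]
        if (q <= fdr and allele_fraction >= min_af
                and occurrences[call["variant"]] <= max_occurrences):
            kept.append(call)
    return kept
```

As a usage example, the DNMT3A p.R882H site reported on this poster (136 reads, 27.9% mutant, about 38 reads) gives `binomial_two_sided_p(38, 136)` well below 0.05, so the 50% null is rejected, whereas a call with 68 of 136 mutant reads is fully consistent with a heterozygous germline variant.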
In the process of searching for drivers of clonal hematopoiesis, we uncovered an underlying issue related to sequencing capture kits. The mutation profile of our discovered somatic mutations failed to match the expected profile seen by Genovese et al. Our results contained a large number of A to C (and T to G) base changes which, upon visual examination, displayed extremely high strand bias. This pointed to the need to further filter the data based on strand bias. After extensive filtering, our final gene lists of putative drivers in the ClinSeq whole-exome data included almost all of the candidate driver genes noted by Genovese et al. and Jaiswal et al. However, we also observed numerous other genes, not previously noted as candidate drivers, with an even larger number of somatic mutations.
1. Workflow
The pipeline for this project follows the steps taken by Genovese et al. to identify putative somatic mutations. For each variant, a binomial test was performed on the null hypothesis that the allelic fraction is 50%, the Benjamini-Hochberg method was applied for multiple-testing correction, and variants appearing more than twice in the cohort or with an allelic fraction of less than 10% were removed.
2. ClinSeq Cohort
The ClinSeq cohort includes males and females primarily between the ages of 45 and 65, a somewhat narrower range than that of the Genovese cohort. The ClinSeq cohort of 1,001 individuals provides the opportunity to obtain follow-up samples in the future from any individuals who exhibit somatic mutations in genes previously known to cause clonal expansion.
3. Mutation Profiles
Figure S5 from Supplementary Data of Genovese et al.
“Clonal Hematopoiesis and Blood-Cancer Risk Inferred from
Blood DNA Sequence”. New England Journal of Medicine.
26 Nov. 2014; 371:2477-87.
Specific tissues are known to exhibit characteristic mutation profiles. Our original mutation profile (above left), for both driver genes and all genes, was most similar to the inclusive somatic Wave1 profile seen by Genovese et al., which is not what we expected for our data: Wave1 was excluded from the Genovese analysis because of sequencing error. After investigating the cause of the potential error in our data, and after subsequent filtering to a strand bias of less than 15, the final mutation profile (above) matches what is expected for somatic mutations in DNA from blood.
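A mutation profile of the kind compared here can be computed by counting base-change classes after the strand-bias filter. The sketch below is illustrative only and assumes a hypothetical record layout (`ref`, `alt`, `strand_bias`); it pairs each base change with its reverse complement (A>C with T>G, C>T with G>A), mirroring the pairs plotted on this poster, and keeps variants with a strand bias below 15.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def base_change_class(ref, alt):
    """Collapse a base change and its strand complement into one label,
    e.g. both A>C and T>G map to 'A>C/T>G'."""
    forward = f"{ref}>{alt}"
    reverse = f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"
    return "/".join(sorted((forward, reverse)))


def mutation_profile(variants, max_strand_bias=15.0):
    """Fraction of single-base changes in each class after strand-bias filtering.

    `variants` is a list of dicts with hypothetical keys 'ref', 'alt' and
    'strand_bias'. Indels (ref or alt longer than one base) are skipped.
    """
    kept = [v for v in variants
            if len(v["ref"]) == 1 and len(v["alt"]) == 1
            and v["strand_bias"] < max_strand_bias]
    counts = Counter(base_change_class(v["ref"], v["alt"]) for v in kept)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}
```

Running the function once with `max_strand_bias=float("inf")` and once with the default gives the before and after profiles, which makes an excess of one class, such as A>C/T>G, easy to spot.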
4. Capture Kit Analysis
The skewed mutation profile led to further investigation of the effects of the different capture kits used on the ClinSeq cohort. Three capture kit types were used in sequencing the 1,001 samples: ICGC and Index, Exon, and Truseq V1 and V2. For each capture kit, the strand bias values of each base change were plotted against the number of times they occurred in the cohort. It became evident that one of the capture kits was prone to error, because many of its A to C and T to G base changes appeared on only one strand. For the other capture kits and the other base changes, the majority of strand bias values are extremely low. Because of this finding, all variants with a strand bias higher than 15 were removed from the final analysis. The graphs exclude strand bias values higher than 50.
5. Total Somatic Mutations
The total number of somatic mutations in most samples is very low. The graph on the left excludes 6 samples that have more than 50 somatic mutations. All final somatic mutations were called at a false discovery rate cutoff of 0.05 and a strand bias cutoff of less than 15. On the right are the genes mutated in the greatest number of samples (counting a sample once if it contains at least one mutation in the given gene), along with the driver genes seen by Genovese et al. We found a total of 4,600 genes exhibiting at least one somatic mutation. The remaining five of the 14 putative drivers discovered by Genovese et al. were not observed in the ClinSeq samples.
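The two summaries in this section, somatic mutations per sample and genes ranked by the number of samples carrying at least one mutation, can be tallied as in the following sketch. The input layout, a list of `(sample, gene)` pairs with one pair per final somatic mutation, and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def mutations_per_sample(mutations):
    """Count somatic mutations per sample.

    Samples with no mutations do not appear in the input and are therefore
    absent from the result.
    """
    return Counter(sample for sample, _gene in mutations)


def genes_ranked_by_samples(mutations):
    """Rank genes by the number of distinct samples with >= 1 mutation."""
    samples_by_gene = defaultdict(set)
    for sample, gene in mutations:
        samples_by_gene[gene].add(sample)
    ranked = [(gene, len(samples)) for gene, samples in samples_by_gene.items()]
    # Most samples first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

A sample with several mutations in the same gene counts once for that gene, which matches the "at least one mutation in the given gene" definition used for the gene ranking.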
6. Mutations in Driver Genes
Next-generation read data showing a somatic DNMT3A p.R882H mutation. This mutation was covered by 136 reads, 27.9% of which displayed the mutant allele. DNMT3A p.R882H mutations are found frequently in acute myeloid leukemia (AML) and are associated with shorter overall survival [Ley et al., NEJM, "DNMT3A Mutations in Acute Myeloid Leukemia", 2010].
Read data showing a somatic frameshift insertion in the TET2 gene. This mutation, which causes a frameshift at p.C262, was covered by 63 reads, 30.2% of which displayed the altered allele. Truncating mutations in TET2 have been found in roughly 15% of a variety of malignant myeloid disorders [Delhommeau et al., NEJM, "Mutation in TET2 in Myeloid Cancers", 2009].
Future Directions
Future work will include ensuring that genes with a large number of somatic mutations are not subject to copy number variation, and searching known driver regions for additional low-level variants that may have been missed by our LoFreq analysis. Additional work will involve following up with ClinSeq individuals who have somatic mutations in the putative driver genes, obtaining new DNA samples if possible, and analyzing them for the presence of any newly acquired mutations.
Figure: workflow diagram with the following boxes: aligned ClinSeq BAM files; LoFreq variant calling; error-prone region filtering; sample annotation; binomial hypothesis testing; false discovery rate correction; allele fraction filtering; filter mutations observed 3 or more times in cohort; putative clonal hematopoiesis drivers.
Figure: ClinSeq Cohort Age & Gender. Histogram of number of patients by age, split by gender (male, female).
Figure: Total Somatic Mutations per ClinSeq Sample. Histogram of number of samples by number of somatic mutations (0 to 50).
Figure: Top Genes with Somatic Mutations. Mutation count by gene name: TTN, SYNE1, DNMT3A, LYST, MUC16, then (after an axis break) TET2, ATM, ASXL1, CBL, JAK2, TP53, SF3B1, MYD88.
Figure: Strand bias by capture kit. Six histograms of count versus strand bias: A>C and T>G base changes for the ICGC and Index, Exon, and Truseq V1 and V2 capture kits, and C>T and G>A base changes for the same three capture kits.