final_presentation

Biology 101
and 102, 202, 324, and 404...

DNA Structure
[Figure labels: Chromosomes, Genes, Base pairs; Dad’s, Mom’s]

Cancer Biology Refresher
• Germline mutations: inherited mutations, present since birth.
  • Present in both normal and tumor cells.
• Somatic mutations: arise over the course of an individual’s lifetime.
  • Present in tumor cells, but not in normal cells.

Position:  1 2 3 4 5 6 7 8 9 10 11 12
Reference: A T C G A T C G A T  C  G
Tumor:     A T G G A T C G C T  C  G
Normal:    A T G G A T C G A T  C  G
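
The comparison of the three sequences above can be sketched in Python. This is only an illustration of the germline/somatic definitions, not the pipeline used in this project; the function name `classify_sites` and the label strings are assumptions made for the example.

```python
def classify_sites(reference, tumor, normal):
    """Label each 1-based position where the tumor differs from the reference.

    A tumor variant that is also present in the normal sample is called
    germline; a tumor variant absent from the normal sample is called somatic.
    """
    calls = {}
    for pos, (ref, tum, nor) in enumerate(zip(reference, tumor, normal), start=1):
        if tum == ref:
            continue  # tumor matches the reference: nothing to report
        calls[pos] = "germline" if nor == tum else "somatic"
    return calls


reference = "ATCGATCGATCG"
tumor = "ATGGATCGCTCG"
normal = "ATGGATCGATCG"

print(classify_sites(reference, tumor, normal))  # {3: 'germline', 9: 'somatic'}
```

Position 3 differs from the reference in both samples (germline), while position 9 differs only in the tumor (somatic).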

A Statement of Facts
• Breast cancer is the leading type of cancer in women, accounting for 25% of all reported cases.
• In 2012, there were 1.68 million cases and 522,000 deaths due to breast cancer.
Source: http://www.fanpop.com/clubs/breast-cancer-awareness/images/372389/title/pink-ribbonA

Goal
Given paired normal-tumor sequencing data
from a breast cancer patient, identify the
somatic mutations present in the cancer
genome.

Example of Raw Sequencing Data

Approach
• Can we use deep-sequenced data from other cancer patients to classify the current patient’s SNPs?
• Use 9 machine learning algorithms to predict which SNPs are truly somatic.

(1) Number of reads covering or bridging the site
(2) Number of reference Q13 bases on the forward strand
(3) Number of reference Q13 bases on the reverse strand
(4) Number of non-reference Q13 bases on the forward strand
(5) Number of non-reference Q13 bases on the reverse strand
(6) Sum of reference base qualities
(7) Sum of squares of reference base qualities
(8) Sum of non-reference base qualities
(9) Sum of squares of non-reference base qualities
(10) Sum of reference mapping qualities
(11) Sum of squares of reference mapping qualities
(12) Sum of non-reference mapping qualities
(13) Sum of squares of non-reference mapping qualities
(14) Sum of tail distances for reference bases
(15) Sum of squares of tail distance for reference bases
(16) Sum of tail distances for non-reference bases
(17) Sum of squares of tail distance for non-reference bases
(18) P(D | G_i = aa), Phred-scaled, i.e. x is transformed to −10 log10(x)
(19) max over G_i ≠ aa of P(D | G_i), Phred-scaled
(20) Σ over G_i ≠ aa of P(D | G_i), Phred-scaled
(Q13 means base quality greater than or equal to Phred score 13; D represents the three-dimensional vector (depth, number of reference bases, number of non-reference bases) at the current site; G_i ∈ {aa, ab, bb} is the genotype at site i, where a, b ∈ {A, C, T, G}, a is the reference allele, and b is the non-reference allele.)
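
As a small sketch of the Phred scaling mentioned in the feature list above: the transform is Q = −10 log10(p), which is the standard Phred definition. The function names `phred` and `phred_to_prob` are assumptions made for this example and are not part of SAMtools or GATK.

```python
import math


def phred(p):
    """Phred-scale a probability: Q = -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_prob(q):
    """Invert the Phred transform: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


# An error probability of 0.05 corresponds to roughly Q13,
# the base-quality cutoff used by the Q13 features.
print(round(phred(0.05), 2))
print(round(phred_to_prob(13), 4))
```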
(41) QUAL: Phred-scaled probability of the call given data
(42) Allele count for non-ref allele in genotypes
(43) AF: allele frequency for each non-ref allele
(44) Total number of alleles in called genotypes
(45) Total (unfiltered) depth over all samples
(46) Fraction of reads containing spanning deletions
(47) HRun: largest contiguous homopolymer run of variant allele in either direction
(48) HaplotypeScore: estimate of the probability that the reads at this locus come from no more than 2 local haplotypes
(49) MQ: root mean square mapping quality
(50) MQ0: total number of reads with mapping quality zero
(51) QD: variant confidence/unfiltered depth
(52) SB: strand bias (the variation being seen on only the forward or only the reverse strand)
(53) SumGLbyD
(54) Allelic depths for the ref-allele
(55) Allelic depths for the non-ref allele
(56) DP: read depth (only filtered reads used for calling)
(57) GQ: genotype quality computed based on the genotype likelihood
(58) P(D | G_i = aa), Phred-scaled
(59) P(D | G_i = ab), Phred-scaled
(60) P(D | G_i = bb), Phred-scaled
(98) Forward strand non-reference base ratio F24/F4
(99) Reverse strand non-reference base ratio F25/F5
(100) Sum of non-reference base quality ratio F28/F8
(101) Sum of squares of non-reference base quality ratio F29/F9
(102) Sum of non-reference mapping quality ratio F32/F12
(103) Sum of squares of non-reference mapping quality ratio F33/F13
(104) Sum of non-reference tail distance ratio F36/F16
(105) Sum of squares of non-reference tail distance ratio F37/F17
(106) Non-reference allele depth ratio F75/F55
From SAMtools: x1–x20 for normal, x21–x40 for tumor
From GATK: x41–x60 for normal, x61–x80 for tumor
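
Given the index ranges above, each ratio feature (98) to (106) divides a tumor feature by its normal counterpart. A minimal sketch follows; the function name `ratio_features` is hypothetical, and returning NaN for a zero denominator is an assumption of this example, not something stated in the slides.

```python
import math

# (tumor feature, normal feature) index pairs, copied from the ratio list above
RATIO_PAIRS = {
    98: (24, 4),
    99: (25, 5),
    100: (28, 8),
    101: (29, 9),
    102: (32, 12),
    103: (33, 13),
    104: (36, 16),
    105: (37, 17),
    106: (75, 55),
}


def ratio_features(x):
    """Compute features 98-106 from a dict mapping feature index to value."""
    ratios = {}
    for index, (tumor_idx, normal_idx) in RATIO_PAIRS.items():
        denominator = x[normal_idx]
        # Assumption: a zero denominator yields NaN so it can be imputed later.
        ratios[index] = x[tumor_idx] / denominator if denominator else math.nan
    return ratios


x = {i: float(i) for i in range(1, 81)}  # made-up values for x1..x80
print(ratio_features(x)[98])  # 24.0 / 4.0 = 6.0
```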

Feature Extraction
• Source: “Feature-based classifiers for somatic mutation detection in tumor-normal paired sequencing data” by Jiarui Ding et al.
• Selected 106 features computed from SAMtools and GATK (two popular genomics toolkits).
• Random Forest, SVM, and Logistic Tree have already achieved good accuracy using these features.

Feature Selection and Merging
• Merge all the features for normal and tumor SNPs detected by SAMtools and GATK (4 → 1).
• Delete the uninformative features (e.g., number of non-reference Q13 bases on the tumor) and features with too much missing data.
  • 62 features left!
• Approximate missing data by substituting the mean of the data present (“mean imputation”).
[Figure: the four call sets (normal SAMtools, tumor SAMtools, normal GATK, tumor GATK) merged by SNP position, with missing data marked.]
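
The merge and mean-imputation steps on this slide can be sketched with NumPy. This is an illustration under stated assumptions, not the project's code: the helper names `merge_call_sets` and `mean_impute`, the toy positions, and the one-feature-per-call-set layout are all made up for the example.

```python
import numpy as np


def merge_call_sets(call_sets):
    """Join per-tool feature dicts on SNP position, filling gaps with NaN.

    call_sets: list of (n_features, {position: [feature values]}) pairs.
    Returns the sorted positions and a matrix with one row per position.
    """
    positions = sorted(set().union(*(calls for _, calls in call_sets)))
    blocks = []
    for n_features, calls in call_sets:
        block = np.full((len(positions), n_features), np.nan)
        for row, pos in enumerate(positions):
            if pos in calls:
                block[row] = calls[pos]
        blocks.append(block)
    return positions, np.hstack(blocks)


def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


# Toy call sets: normal SAMtools, tumor SAMtools, normal GATK, tumor GATK
call_sets = [
    (1, {100: [30.0], 250: [12.0]}),
    (1, {100: [28.0], 250: [40.0], 900: [35.0]}),
    (1, {100: [50.0]}),
    (1, {100: [60.0], 900: [20.0]}),
]
positions, X = merge_call_sets(call_sets)
X_imputed = mean_impute(X)
print(positions)
print(X_imputed)
```

A column with no observed values would stay NaN here, which is one reason to drop features with too much missing data before imputing.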

Normalization
• Normalization prevents large-valued features from dominating the principal components.
• Method 1: mean-center the data, then divide by the standard deviation.
  • Used to detect machine and experimental errors.
• Method 2: divide by the maximum absolute value of the data.
  • Used to normalize the input for the machine learning algorithms.
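
Both normalization methods can be sketched in a few lines of NumPy. The sketch assumes each feature (column) is scaled independently, which the slide does not state explicitly; the function names and the handling of constant columns are likewise assumptions.

```python
import numpy as np


def standardize(X):
    """Method 1: subtract each column's mean, then divide by its standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # assumption: constant columns are left at zero
    return (X - mean) / std


def max_abs_scale(X):
    """Method 2: divide each column by its maximum absolute value."""
    max_abs = np.abs(X).max(axis=0)
    max_abs[max_abs == 0] = 1.0  # assumption: all-zero columns are left unchanged
    return X / max_abs


X = np.array([[1.0, -200.0], [3.0, 100.0], [5.0, 400.0]])
print(standardize(X))
print(max_abs_scale(X))
```

After Method 2 every value lies in [-1, 1], so no single large-valued feature dominates the classifier input.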

Principal Component Analysis (PCA)
• Identifies an orthonormal basis that captures the greatest variance in our data.
• Reduce the dimension to the top 10 principal components.
  • These account for 81.5% of the variance in our data.
• These principal components serve as “super feature” inputs for our machine learning algorithms.
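
A minimal PCA sketch via the singular value decomposition is shown below. It runs on synthetic stand-in data (860 rows and 62 columns, chosen only to echo the sizes mentioned in these slides), so the variance fraction it prints is unrelated to the 81.5% reported above; the function name `pca` is an assumption.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its top principal components.

    Returns the projected data, the components (orthonormal rows), and the
    fraction of total variance captured by each kept component.
    """
    Xc = X - X.mean(axis=0)  # PCA requires mean-centered data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    components = Vt[:n_components]
    return Xc @ components.T, components, explained[:n_components]


rng = np.random.default_rng(0)
# Synthetic data whose columns have decreasing spread
X = rng.normal(size=(860, 62)) * np.linspace(5.0, 0.5, 62)
scores, components, explained = pca(X, n_components=10)
print(scores.shape)  # (860, 10)
print(round(float(explained.sum()), 3))
```

The `scores` matrix plays the role of the 10 “super features” passed on to the classifiers.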

Initial classification by SAMtools and GATK:
[Figure: plot legend with two classes, somatic and germline.]

9 Machine Learning Algorithms
1. Use the first 10 principal components as features.
2. Run the algorithms with training data from another patient (860 samples).
3. Pass each SNP through every algorithm, tracking whether it is classified as somatic or non-somatic.
4. Select a threshold (we used 8).
5. If more than the threshold number of algorithms classify that SNP as somatic, we assign it a final label of somatic!
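
The voting rule in steps 3 to 5 can be sketched as follows. The function name `ensemble_label` and the label strings are assumptions; the comparison follows the wording of step 5 ("more than the threshold").

```python
def ensemble_label(votes, threshold):
    """Return 'somatic' if more than `threshold` classifiers voted somatic."""
    n_somatic = sum(1 for vote in votes if vote == "somatic")
    return "somatic" if n_somatic > threshold else "non-somatic"


unanimous = ["somatic"] * 9
one_dissent = ["somatic"] * 8 + ["non-somatic"]

print(ensemble_label(unanimous, threshold=8))    # somatic
print(ensemble_label(one_dissent, threshold=8))  # non-somatic
```

Read literally, "more than 8" out of 9 algorithms means the vote must be unanimous; if "at least 8" was intended, the comparison would be `>=` instead of `>`.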

QDA, RBF SVM, Linear SVM, Random Forest, Naïve Bayes, Decision Tree, Nearest Neighbors, LDA
+ Neural Network (processing sped up by parallel programming)

More on Neural Networks
OpenMP, Open MPI, CUDA

How did we do?
• Cross-reference our somatic SNPs against several databases (gene function, disease + phenotype association, etc.).
• Compile a list of known breast cancer driver genes on chromosome 1 and search for them among our results:
  1. RAP1A [151 SNPs]
  2. PARP1 [234 SNPs]
  3. TACSTD2 [13 SNPs]
• In total:
  • 8,660 associations with breast cancer in other research studies were found.

MAP1A
• microtubule assembly protein
• blocking dynamic instability of microtubules used as a cancer treatment, preventing cell migration
Source: http://www.proteinatlas.org/images_dictionary/microtubules__1__6376__1_blue_green.jpg
PARP1
• involved in DNA damage repair
• interacts with BRCA1 and BRCA2 (two of the most cited breast cancer driver genes) during homologous recombination
Source: http://en.wikipedia.org/wiki/PARP1
TACSTD2
• tumor-associated calcium signal transducer
Source: www.sinobiological.com

Future Work
• Deep-sequenced data is expensive to produce. For best results, we need the data to come from individuals of similar backgrounds (gender, ethnicity, etc.).
• Building a repository of data with high coverage for each cancer type would increase our training set size and ward off the perils of over-fitting.
• Learn how to overcome the sequencing errors that each sequencing technology is prone to.

References
1. Jiarui Ding, Ali Bashashati, et al., “Feature-based classifiers for somatic mutation detection in tumor-normal paired sequencing data”, Bioinformatics (2012), vol. 28, pp. 167-175
2. Xindong Wu, Vipin Kumar, et al., “Top 10 Algorithms in Data Mining”, Knowledge and Information Systems (2008) 14:1–37
3. Jonathan Shlens, “A Tutorial on Principal Component Analysis”, Google Research (2014)
4. Christoforides, A., J. Carpten, et al., “Identification of somatic mutations in cancer through Bayesian-based analysis of sequenced genome pairs”, BMC Genomics (2013) 14 (1): 302
5. Shiraishi, Y., Y. Sato, et al., “An empirical Bayesian framework for somatic mutation detection from cancer genome sequencing data”, Nucleic Acids Research (2013)
6. SciKit: http://scikit-learn.org/stable/
7. ANN: http://takinginitiative.wordpress.com/2008/04/23/basic-neural-network-tutorial-c-implementation-and-source-code/
