Biology 101
and 102, 202, 324, and 404...
DNA Structure
Chromosomes Genes Base pairs
Dad’s Mom’s
Cancer Biology Refresher
• Germline mutations: inherited mutations, present since birth.
  • Present in both normal and tumor cells.
• Somatic mutations: arise over the course of an individual’s lifetime.
  • Present in tumor cells, but not in normal cells.
Reference:   A  T  C  G  A  T  C  G  A  T  C  G
Tumor:       A  T  G  G  A  T  C  G  C  T  C  G
Normal:      A  T  G  G  A  T  C  G  A  T  C  G
Position:    1  2  3  4  5  6  7  8  9 10 11 12
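The toy alignment above can be read off mechanically. The sketch below is only an illustration of the germline/somatic definitions, not the method used in the project: the function name `classify_sites` and the rule "normal differs from reference and tumor matches normal means germline; tumor differs from normal means somatic" are assumptions made for this example.

```python
def classify_sites(reference, normal, tumor):
    """Label each 1-based position that carries a variant.

    Assumed rule (illustrative only):
      - germline: normal differs from the reference and tumor matches normal
      - somatic:  tumor differs from normal
    Positions where all three sequences agree are left out.
    """
    calls = {}
    for pos, (ref, nrm, tum) in enumerate(zip(reference, normal, tumor), start=1):
        if nrm != ref and tum == nrm:
            calls[pos] = "germline"
        elif tum != nrm:
            calls[pos] = "somatic"
    return calls


# The three sequences from the slide, written without spaces.
calls = classify_sites("ATCGATCGATCG", "ATGGATCGATCG", "ATGGATCGCTCG")
print(calls)
```

On the slide's sequences, position 3 (G in both samples, C in the reference) comes out as germline and position 9 (C only in the tumor) as somatic.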
A Statement of Facts
• Breast cancer is the leading type of cancer in women, accounting for 25% of all reported cases.
• In 2013, there were 1.68 million cases and 522,000 deaths due to breast cancer.
Source: http://www.fanpop.com/clubs/breast-cancer-awareness/images/372389/title/pink-ribbonA
Goal
Given paired normal-tumor sequencing data
from a breast cancer patient, identify the
somatic mutations present in the cancer
genome.
Example of Raw Sequencing Data
Approach
• Can we use deep-sequenced data from other cancer patients to classify the current patient’s SNPs?
• Use 9 machine learning algorithms to predict which SNPs are truly somatic.
(1) Number of reads covering or bridging the site
(2) Number of reference Q13 bases on the forward strand
(3) Number of reference Q13 bases on the reverse strand
(4) Number of non-reference Q13 bases on the forward strand
(5) Number of non-reference Q13 bases on the reverse strand
(6) Sum of reference base qualities
(7) Sum of squares of reference base qualities
(8) Sum of non-reference base qualities
(9) Sum of squares of non-reference base qualities
(10) Sum of reference mapping qualities
(11) Sum of squares of reference mapping qualities
(12) Sum of non-reference mapping qualities
(13) Sum of squares of non-reference mapping qualities
(14) Sum of tail distances for reference bases
(15) Sum of squares of tail distance for reference bases
(16) Sum of tail distances for non-reference bases
(17) Sum of squares of tail distance for non-reference bases
(18) P(D∣Gi=aa), phred-scaled, i.e. x is transformed to −10 log10(x)
(19) max over Gi≠aa of P(D∣Gi), phred-scaled
(20) ∑ over Gi≠aa of P(D∣Gi), phred-scaled
(Q13 means a base quality greater than or equal to Phred score 13; D is the three-dimensional vector (depth, number of reference bases, number of non-reference bases) at the current site; Gi∈{aa, ab, bb} is the genotype at site i, where a, b∈{A, C, T, G}, a is the reference allele, and b is the non-reference allele.)
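Features 18 to 20 are reported on the Phred scale. As a small worked example of that transform, the sketch below converts between probabilities and Phred scores; the helper names `phred_scale` and `phred_to_prob` are made up for this illustration and do not come from SAMtools.

```python
import math


def phred_scale(p):
    """Convert a probability p in (0, 1] to the Phred scale: -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_prob(q):
    """Invert the transform: a Phred score q corresponds to 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


print(phred_scale(0.001))   # a probability of 1 in 1000 is about Phred 30
print(phred_to_prob(13))    # the Q13 cutoff is an error probability of about 0.05
```

Under this scale, the Q13 threshold used in features 2 to 5 keeps bases whose estimated error probability is at most roughly 5%.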
(41) QUAL: phred-scaled probability of the call given data
(42) Allele count for non-ref allele in genotypes
(43) AF: allele frequency for each non-ref allele
(44) Total number of alleles in called genotypes
(45) Total (unfiltered) depth over all samples
(46) Fraction of reads containing spanning deletions
(47) HRun: largest contiguous homopolymer run of variant allele in either direction
(48) HaplotypeScore: estimate the probability that the reads at this locus are coming from no more than 2 local haplotypes
(49) MQ: root mean square mapping quality
(50) MQ0: total number of reads with mapping quality zero
(51) QD: variant confidence/unfiltered depth
(52) SB: strand bias (the variation being seen on only the forward or only the reverse strand)
(53) SumGLbyD
(54) Allelic depths for the ref-allele
(55) Allelic depths for the non-ref allele
(56) DP: read depth (only filtered reads used for calling)
(57) GQ: genotype quality computed based on the genotype likelihood
(58) P(D∣Gi=aa), phred-scaled
(59) P(D∣Gi=ab), phred-scaled
(60) P(D∣Gi=bb), phred-scaled
(98) Forward strand non-reference base ratio F24/F4
(99) Reverse strand non-reference base ratio F25/F5
(100) Sum of non-reference base quality ratio F28/F8
(101) Sum of squares of non-reference base quality ratio F29/F9
(102) Sum of non-reference mapping quality ratio F32/F12
(103) Sum of squares of non-reference mapping quality ratio F33/F13
(104) Sum of non-reference tail distance ratio F36/F16
(105) Sum of squares of non-reference tail distance ratio F37/F17
(106) Non-reference allele depth ratio F75/F55
From SAMtools: x1 - x20 for normal, x21 - x40 for tumor
From GATK: x41 - x60 for normal, x61 - x80 for tumor
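Features 98 to 106 are tumor-over-normal ratios of earlier features (for example F24 is the tumor counterpart of the normal feature F4). A minimal sketch of how such ratios could be derived is shown here. The dictionary layout, the function names, and the choice to return `None` when the denominator is zero or absent are assumptions for illustration; the slides do not say how undefined ratios were handled.

```python
# (ratio feature index) -> (numerator feature index, denominator feature index),
# copied from the list of features 98-106.
RATIO_DEFINITIONS = {
    98: (24, 4), 99: (25, 5), 100: (28, 8), 101: (29, 9), 102: (32, 12),
    103: (33, 13), 104: (36, 16), 105: (37, 17), 106: (75, 55),
}


def ratio_feature(features, numerator, denominator):
    """Return features[numerator] / features[denominator], or None if undefined."""
    num = features.get(numerator)
    den = features.get(denominator)
    if num is None or den is None or den == 0:
        return None  # assumption: treated as missing data downstream
    return num / den


def add_ratio_features(features):
    """Return a copy of `features` (index -> value) with features 98-106 added."""
    out = dict(features)
    for index, (num, den) in RATIO_DEFINITIONS.items():
        out[index] = ratio_feature(features, num, den)
    return out


example = add_ratio_features({4: 2, 24: 6, 5: 0, 25: 3})
print(example[98], example[99])
```

Returning `None` keeps undefined ratios visible as missing values, so they can be filtered or imputed later instead of silently becoming zero or infinity.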
Feature Extraction
• Source: “Feature-based classifiers for somatic mutation detection in tumor-normal paired sequencing data” by Jiarui Ding et al.
• Selected 106 features that are computed from SAMtools and GATK (two popular genomics toolkits).
• Random Forest, SVM, and Logistic Tree have already achieved good accuracy using these features.
Feature Selection and Merging
• Merge all the features for normal and tumor SNPs detected by SAMtools and GATK (4 → 1: four feature sets merged into one).
• Delete the uninformative features (e.g., number of non-reference Q13 bases on the tumor) and features with too much missing data.
• 62 features left!
• Approximate missing data by substituting the mean of the data present (“mean imputation”).
[Figure: the normal SAMtools, tumor SAMtools, normal GATK, and tumor GATK feature sets merged by SNP position; gaps in the merged table are labeled “Missing Data”.]
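The merge, filter, and imputation steps on this slide can be sketched as follows. This is a simplified illustration, not the project's code: the table layout (`{position: {feature: value}}`), the function names, and the 50% missing-data cutoff are all assumptions, since the slides do not state the cutoff that was used.

```python
def merge_by_position(*tables):
    """Outer-join tables of {position: {feature: value}} on SNP position."""
    merged = {}
    for table in tables:
        for pos, feats in table.items():
            merged.setdefault(pos, {}).update(feats)
    return merged


def drop_sparse_features(merged, max_missing_fraction=0.5):
    """Drop features missing at more than `max_missing_fraction` of positions.

    The 0.5 default is a hypothetical cutoff chosen for this example.
    """
    names = {name for feats in merged.values() for name in feats}
    keep = set()
    for name in names:
        missing = sum(1 for feats in merged.values() if feats.get(name) is None)
        if missing / len(merged) <= max_missing_fraction:
            keep.add(name)
    return {pos: {k: feats.get(k) for k in keep} for pos, feats in merged.items()}


def impute_means(merged):
    """Mean imputation: fill each missing value with the mean of the values present."""
    names = sorted({name for feats in merged.values() for name in feats})
    means = {}
    for name in names:
        present = [f[name] for f in merged.values() if f.get(name) is not None]
        means[name] = sum(present) / len(present) if present else None
    return {
        pos: {n: (f.get(n) if f.get(n) is not None else means[n]) for n in names}
        for pos, f in merged.items()
    }


normal = {100: {"n_depth": 10.0}, 200: {"n_depth": 30.0}}
tumor = {100: {"t_depth": 12.0}, 300: {"t_depth": 8.0}}
table = impute_means(drop_sparse_features(merge_by_position(normal, tumor)))
print(table[300])
```

Position 300 was only seen in the tumor table, so its `n_depth` is filled with the mean of the two observed normal depths.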
Normalization
• Normalization prevents large-valued features from dominating the principal components.
• Method 1: mean-center each feature, then divide by its standard deviation.
  • Used to detect machine and experimental errors.
• Method 2: divide each feature by the maximum absolute value of its data.
  • Used to normalize the input to the machine learning algorithms.
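Both normalization methods are short enough to write out for a single feature column. The sketch below is illustrative: the function names are invented here, and the use of the population standard deviation (dividing by n) and the all-zeros result for constant columns are assumptions, as the slides do not specify either detail.

```python
import math


def standardize(column):
    """Method 1: subtract the mean, then divide by the standard deviation."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)  # population variance
    std = math.sqrt(variance)
    if std == 0:
        return [0.0 for _ in column]  # constant feature: nothing to scale
    return [(x - mean) / std for x in column]


def max_abs_scale(column):
    """Method 2: divide by the largest absolute value, mapping data into [-1, 1]."""
    peak = max(abs(x) for x in column)
    if peak == 0:
        return [0.0 for _ in column]
    return [x / peak for x in column]


print(standardize([1.0, 2.0, 3.0]))
print(max_abs_scale([-4.0, 2.0, 1.0]))
```

Method 1 expresses each value in standard deviations from the mean, which makes outliers easy to spot; Method 2 preserves zeros and signs while bounding every feature to the same range.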
Principal Component Analysis (PCA)
• Identifies an orthonormal basis that captures the greatest variance in our data.
• Reduce the dimension to the top 10 principal components.
  • These account for 81.5% of the variance in our data.
• These principal components serve as “super feature” inputs for our machine learning algorithms.
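To make the "top k components and their share of the variance" idea concrete, here is a small dependency-free sketch. It uses power iteration with deflation on the sample covariance matrix, which is a stand-in chosen for readability; it is not how the project computed its components (the references point to scikit-learn), and the function names and the fixed iteration count are assumptions.

```python
import math


def covariance(rows):
    """Return (sample covariance matrix, mean-centered rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def top_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and eigenvector of a symmetric PSD matrix."""
    d = len(matrix)
    vec = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-symmetric start vector
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            return 0.0, vec  # no variance left
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d))
                for i in range(d))
    return value, vec


def pca(rows, k):
    """Return (scores, components, explained variance ratios) for the top k components."""
    cov, centered = covariance(rows)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    components, ratios = [], []
    for _ in range(k):
        value, vec = top_eigenpair(cov)
        components.append(vec)
        ratios.append(value / total if total else 0.0)
        # Deflation: remove the component just found before looking for the next one.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
              for c in centered]
    return scores, components, ratios


rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
scores, components, ratios = pca(rows, 2)
print(ratios)
```

In this toy data the second column is exactly twice the first, so essentially all of the variance falls on the first component. In the project, the analogous cumulative ratio for the first 10 components was reported as 81.5%.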
Initial classification by SAMtools and GATK:
[Figure: SNPs plotted with “somatic” and “germline” labels.]
9 Machine Learning Algorithms
1. Use first 10 principal components as features
2. Train the algorithms on data from another patient (860 samples).
3. Pass each SNP through every algorithm, tracking
whether it is classified as somatic or non-somatic.
4. Select a threshold (we used 8).
5. If more than the threshold number of algorithms classify
that SNP as somatic, we assign it a final label of
somatic!
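Steps 3 to 5 amount to a vote across the classifiers. The sketch below implements the rule exactly as worded on the slide ("more than the threshold"); the function names and the input layout are assumptions for illustration. Note that with nine classifiers and a threshold of 8, a strict "more than" requires all nine to agree; if "at least 8" was intended, `>` would become `>=`.

```python
def ensemble_label(votes, threshold=8):
    """Combine per-classifier votes (True means 'somatic') into one label.

    Follows the slide's wording literally: the SNP is somatic only if
    MORE than `threshold` classifiers vote somatic.
    """
    somatic_votes = sum(1 for vote in votes if vote)
    return "somatic" if somatic_votes > threshold else "non-somatic"


def label_snps(predictions, threshold=8):
    """predictions: {snp_id: [one boolean vote per classifier]}."""
    return {snp: ensemble_label(votes, threshold)
            for snp, votes in predictions.items()}


# Hypothetical votes from nine classifiers for two SNPs.
predictions = {
    "snp_a": [True] * 9,
    "snp_b": [True] * 8 + [False],
}
print(label_snps(predictions))
```

A high threshold trades sensitivity for precision: fewer SNPs are called somatic, but each call is backed by near-unanimous agreement.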
• QDA
• RBF SVM
• Linear SVM
• Random Forest
• Naïve Bayes
• Decision Tree
• Nearest Neighbors
• LDA
• + Neural Network (processing sped up by parallel programming: OpenMP, OpenMPI, CUDA)
More on Neural Networks
How did we do?
• Cross-reference our somatic SNPs against several databases (gene function, disease and phenotype association, etc.).
• Compile a list of known breast cancer driver genes on chromosome 1 and search for them among our results:
  1. RAP1A [151 SNPs]
  2. PARP1 [234 SNPs]
  3. TACSTD2 [13 SNPs]
• In total:
  • 8,660 associations with breast cancer in other research studies were found.
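The per-gene SNP counts above come from checking which predicted somatic SNPs fall inside known genes. A minimal sketch of that lookup is given here; the gene names and coordinates are placeholders invented for the example (they are not real annotations), and the function name is hypothetical.

```python
def count_snps_per_gene(snp_positions, gene_intervals):
    """Count SNP positions falling inside each gene's [start, end] interval.

    gene_intervals: {gene_name: (start, end)}, inclusive coordinates on one chromosome.
    """
    counts = {gene: 0 for gene in gene_intervals}
    for pos in snp_positions:
        for gene, (start, end) in gene_intervals.items():
            if start <= pos <= end:
                counts[gene] += 1
    return counts


# Placeholder intervals and positions, for illustration only.
genes = {"GENE_A": (1, 20), "GENE_B": (100, 200)}
print(count_snps_per_gene([5, 10, 15, 120, 999], genes))
```

For a whole chromosome's worth of SNPs, sorting the intervals and using binary search or an interval tree would avoid scanning every gene for every SNP.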
MAP1A
• microtubule assembly protein
• blocking the dynamic instability of microtubules is used as a cancer treatment, preventing cell migration
PARP1
• involved in DNA damage repair
• interacts with BRCA1 and BRCA2 (two of the most cited breast cancer driver genes) during homologous recombination
TACSTD2
• tumor-associated calcium signal transducer
Sources:
http://www.proteinatlas.org/images_dictionary/microtubules__1__6376__1_blue_green.jpg
http://en.wikipedia.org/wiki/PARP1
www.sinobiological.com
Future Work
• Deep-sequenced data is expensive to produce. For best results, we need the data to come from individuals of similar backgrounds (gender, ethnicity, etc.).
• Building a repository of high-coverage data for each cancer type would increase our training set size and ward off the perils of over-fitting.
• Learn how to overcome the sequencing errors that each sequencing technology is prone to.
References
1. Jiarui Ding, Ali Bashashati, et al., “Feature-based classifiers for somatic mutation detection in tumor-normal paired sequencing data”, Bioinformatics (2012), vol. 28, pp. 167–175
2. Xindong Wu, Vipin Kumar, et al., “Top 10 Algorithms in Data Mining”, Knowledge and Information Systems (2008) 14:1–37
3. Jonathan Shlens, “A Tutorial on Principal Component Analysis”, Google Research (2014)
4. Christoforides, A., J. Carpten, et al., “Identification of somatic mutations in cancer through Bayesian-based analysis of sequenced genome pairs”, BMC Genomics (2012) 14 (1): 302
5. Shiraishi, Y., Y. Sato, et al., “An empirical Bayesian framework for somatic mutation detection from cancer genome sequencing data”, Nucleic Acids Research (2013)
6. SciKit: http://scikit-learn.org/stable/
7. ANN: http://takinginitiative.wordpress.com/2008/04/23/basic-neural-network-tutorial-c-implementation-and-source-code/
Thank You!


Editor's Notes

  1. Thank Hau Man. Thank Binghang + BGI. Thank Tim + RIPS-HK.
  2. Imagine this scenario:
  3. In somatic, mention true positives and false negatives.