The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

In this talk at The Hive Think Tank, Professor Jian Ma introduces machine learning methods that can help tackle some of the most intriguing questions in genomics and biomedicine. He discusses his group's research on genome structure and function, including algorithms that use probabilistic graphical models and deep neural networks to unravel complex genomic aberrations in cancer genomes and the gene regulatory principles encoded in our genome. The knowledge obtained from such computational methods can greatly enhance our ability to understand disease genomes.


  1. 1. Machine Learning Applications in Computational Genomics: Some new algorithms for understanding cancer genomes. Jian Ma, Computational Biology Department, School of Computer Science
  2. 2. [Slide background: a block of raw genomic DNA sequence]
      Fundamental question: How do changes in genome sequences give rise to phenotypic differences (e.g., disease states)?
      • When they got into the genome and how they have evolved
      • Their roles in genome organization and gene regulation for human biology
      • Their implications in human diseases such as cancer
      Our goal: from base pairs to bedside
  3. 3. Why Computational Genomics?
  4. 4. Why Computational Genomics?
      • Key to personalized precision medicine, especially for cancer
      [Slide image labeled: David Patterson]
      • Cancer research has become big data science
      • How to store and manage data efficiently
      • How to analyze data in a distributed environment
      • How to enhance data security but reduce barriers for sharing
      • How to extract meaningful patterns
      • How to identify mechanisms to help treatment
      • …
  5. 5. The human genome: the “blueprint” of our body
      [Slide images: an excerpt of raw DNA sequence; labels: James Watson, Francis Crick; dates: February 15, 2001 and March, 2011]
  6. 6. DNA, Chromosome, and Genome
      [Textbook figures from Chapter 4: DNA, Chromosomes, and Genomes]
      • Figure 4-72, Chromatin packing: this model shows some of the many levels of chromatin packing postulated to give rise to the highly condensed mitotic chromosome. Levels labeled in the figure: "beads-on-a-string" form of chromatin; 30-nm chromatin fiber of packed nucleosomes; section of chromosome in extended form (300 nm); condensed section of chromosome (700 nm); entire mitotic chromosome (1400 nm).
      • DNA double helix figure: building blocks of DNA (sugar, phosphate, base, nucleotide), double-stranded DNA, sugar-phosphate backbone, hydrogen-bonded base pairs, 5' and 3' ends.
      • Excerpt: "This complementary base-pairing enables the base pairs to be packed in the energetically most favorable arrangement in the interior of the double helix. In this arrangement, each base pair is of similar width, thus holding the sugar-phosphate backbones an equal distance apart along the DNA molecule. To maximize the efficiency of base-pair packing, the two sugar-phosphate backbones [...]"
  7. 7. DNA, RNA, Protein
      • Central Dogma in molecular biology: DNA, RNA, Protein
      • In general, proteins do most of the work, and are encoded by subsequences of DNA, known as genes.
      • However, less than 2% of the human genome codes for proteins.
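The DNA, RNA, protein flow on this slide can be illustrated with a minimal Python sketch: a coding-strand DNA string is transcribed to mRNA and then translated with the standard genetic code. The function names and the toy sequence are illustrative assumptions, not material from the talk.

```python
# Toy illustration of DNA -> RNA -> protein (standard genetic code).
# Function names and the example sequence are hypothetical.
from itertools import product

BASES = "UCAG"
# Amino acids for all 64 codons in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

gene = "ATGGCCTAA"           # start codon, alanine codon, stop codon
mrna = transcribe(gene)      # "AUGGCCUAA"
protein = translate(mrna)    # "MA"
```

Only the short stretch between the start and stop codons ends up in the protein, which is the sense in which a gene is a "subsequence of DNA".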
  8. 8. Most of the genome is non-coding
      [Pie chart of human genome composition; source: Nat Rev Genet, 2005, www.nature.com/reviews/genetics, © 2005 Nature Publishing Group]
      • LINEs: 20.4%
      • SINEs: 13.1%
      • Protein-coding genes: 1.5%
      • Introns: 25.9%
      • Miscellaneous unique sequences: 11.6%
      • Miscellaneous heterochromatin: 8%
      • Segmental duplications: 5%
      • Simple sequence repeats: 3%
      • DNA transposons: 2.9%
      • LTR retrotransposons: 8.3%
      Accompanying excerpt (clipped on the slide): Class II elements transpose directly from DNA to DNA, and include DNA transposons and miniature inverted-repeat transposable elements (MITEs). Transposable elements (and especially their extinct remnants) make up a large portion of the human genome, with some (for example, the SINE Alu element) present in more than a million copies.
  9. 9. Most functional information is non-coding
      • 5% highly conserved, but only 1.5% encodes proteins
      [Figure: UCSC Genome Browser views of chr2 (q31.1), positions 172,660,000 to 172,675,000, around DLX1 and DLX2. Tracks: UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics; Vertebrate Multiz Alignment & PhastCons Conservation (28 Species), from chimp to medaka. A zoomed panel shows a base-level alignment of a DLX1 coding segment that is nearly identical across species.]
      What do they do?
  10. 10. Annotating the non-coding regions
      [Figure: genome browser view of chr2 (hg19, 10 kb scale, 20,090,000 to 20,100,000, near TTC32) with NKI LADs (Tig3) and LaminB1 (Tig3) tracks and ChIP-seq signal tracks for GM78 and K562 cells (CHD2, Pol2, Rad2, TBP, Z274)]
      [Figure: ChromHMM chromatin-state annotation of GM12878 on chr4 (50 kb scale, 103,650,000 to 103,750,000; RefSeq genes NFKB1 and MANBA). States: 1_Active_Promoter, 2_Weak_Promoter, 3_Poised_Promoter, 4_Strong_Enhancer, 5_Strong_Enhancer, 6_Weak_Enhancer, 7_Weak_Enhancer, 8_Insulator, 9_Txn_Transition, 10_Txn_Elongation, 11_Weak_Txn, 12_Repressed, 13_Heterochrom/lo, 14_Repetitive/CNV, 15_Repetitive/CNV. Panels a, b, c: emission parameters, transition parameters and GM12878 fold enrichments. Marks: CTCF, H3K27me3, H3K36me3, H4K20me1, H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac, WCE. Enrichment categories: Genome (%), RefSeq TSS, CpG island, RefSeq TSS 2kb, RefSeq exon, RefSeq gene, RefSeq TES, Conserved, Lamina.]
      ChromHMM: Ernst and Kellis, Nature Methods 2012
      Excerpts from the ChromHMM paper shown on the slide (partly clipped):
      • ChromHMM models the observed combination of chromatin marks using independent Bernoulli random variables, which enables learning of complex patterns of many chromatin marks.
      • As input, it receives a list of aligned reads for each mark, which are automatically converted into presence or absence calls for each mark across the genome, based on a background distribution. Alternatively, the user can input files that contain calls from a peak caller.
      • "Alternatively a model can be learned by a virtual ‘stacking’ of all marks across cell types, or independent models can be learned in each cell type. Lastly, ChromHMM supports the comparison of models with different number of chromatin states based on correlations in their emission parameters (Supplementary Fig. 4). We wrote the software in Java, which allows it to be run on virtually any computer. ChromHMM and additional documentation is freely available at http://compbio.mit.edu/ChromHMM/."
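The ChromHMM excerpt on this slide describes modelling the observed combination of chromatin marks in each state with independent Bernoulli random variables. A minimal sketch of that emission model follows; the two states, the choice of three marks and all probabilities are made-up values for illustration, and the transitions between states that a full HMM would use are ignored here.

```python
# Emission probability of one genomic bin under a multivariate
# Bernoulli chromatin state. All numbers below are hypothetical.

def emission_prob(marks, mark_probs):
    """P(observed presence/absence calls | state), marks independent."""
    prob = 1.0
    for present, p in zip(marks, mark_probs):
        prob *= p if present else (1.0 - p)
    return prob

# Hypothetical per-state probabilities that each mark is present,
# in the order (H3K4me3, H3K27ac, H3K27me3).
STATES = {
    "promoter_like": [0.9, 0.8, 0.05],
    "repressed_like": [0.05, 0.02, 0.7],
}

def best_state(marks):
    """Most likely state for one bin, ignoring HMM transitions."""
    return max(STATES, key=lambda s: emission_prob(marks, STATES[s]))

bin_calls = [1, 1, 0]  # H3K4me3 and H3K27ac present, H3K27me3 absent
label = best_state(bin_calls)
```

In the real tool the emission and transition parameters are learned from data; here they are fixed by hand only to show how a presence/absence vector is scored.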
  11. 11. Cancer genomics workflow
  12. 12. Each type of cancer is different
      [Figure: Number of nonsynonymous mutations in representative human cancers, detected by genome-wide sequencing studies. Y-axis: non-synonymous mutations per tumor (median +/- one quartile). Groups: Mutagens, Adult solid tumors, Liquid, Pediatric. Cancer types in figure order: Colorectal (MSI), Lung (SCLC), Lung (NSCLC), Melanoma, Esophageal (ESCC), Non-Hodgkin lymphoma, Colorectal (MSS), Head and neck, Esophageal (EAC), Gastric, Endometrial (endometrioid), Pancreatic adenocarcinoma, Ovarian (high-grade serous), Prostate, Hepatocellular, Glioblastoma, Breast, Endometrial (serous), Lung (never smoked NSCLC), Chronic lymphocytic leukemia, Acute myeloid leukemia, Glioblastoma, Neuroblastoma, Acute lymphoblastic leukemia, Medulloblastoma, Rhabdoid.]
      Vogelstein et al. Science 2013
  13. 13. Each individual tumor is different
      • Data from TCGA’s analyses show that most cancer types have a great number of mutations that occur at a low frequency.
      • Long-tail distribution
      [Figure from the Supplementary Information of Stephens et al. Nature 2009, doi: 10.1038/nature08645. Supplementary Figure 1: Haploid physical coverage of breast cancer samples. Physical coverage indicates the number of DNA fragments of which both ends have been sequenced that on average overlie any position in the genome. Supplementary Figure 2: Genome-wide circos plots of somatic rearrangements in all 24 breast cancers in the study.]
  14. 14. Analyzing gene expression data: supervised learning and unsupervised learning [Figure: expression matrix with axes genes and samples]
  15. 15. How to deal with high dimension? Identify the most important genes
      DawnRank (Hou and Ma, Genome Med 2014)
      [Diagram labels: Gene Network; Gene Expression; Somatic Alteration Data (SNP, CNV, etc.); DawnRank; Ranks of Genes; Personalized Driver Alterations]
      ▪ A ranking framework based on PageRank that considers the impact of genes in the network
      ▪ Impact includes connectivity and the extent to which downstream genes are differentially expressed
      ▪ A dynamic damping factor is used to improve on the original PageRank in ranking genes
      $$r_j^{t+1} = (1 - d_j)\, f_j + d_j \sum_{i=1}^{N} \frac{A_{ji}\, r_i^{t}}{\deg_i}, \qquad \deg_i = \sum_{j=1}^{N} A_{ji}$$
      • $d_j$ is the damping factor, a parameter representing the extent to which the ranking depends on the structure of the graph.
      • $f_j$ is the prior probability of the gene, which we set to the absolute differential expression.
      • $\deg_i$ is the in-degree of $i$.
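The ranking update on this slide, r_j^{t+1} = (1 - d_j) f_j + d_j Σ_i A_ji r_i^t / deg_i with deg_i = Σ_j A_ji, can be iterated directly. The sketch below is a toy illustration under stated assumptions: the three-gene adjacency matrix, the uniform prior and the constant damping factor are hypothetical, and the dynamic, per-gene damping that DawnRank uses is not reproduced.

```python
# Toy iteration of r_j <- (1 - d_j) * f_j + d_j * sum_i A[j][i] * r_i / deg_i
# The 3-gene network, prior f and damping factors d are hypothetical.

def rank_genes(A, f, d, iterations=100):
    n = len(f)
    # deg_i = sum_j A[j][i], as defined on the slide
    deg = [sum(A[j][i] for j in range(n)) for i in range(n)]
    r = list(f)  # start from the prior
    for _ in range(iterations):
        r = [
            (1 - d[j]) * f[j]
            + d[j] * sum(A[j][i] * r[i] / deg[i] for i in range(n) if deg[i] > 0)
            for j in range(n)
        ]
    return r

A = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
f = [1 / 3, 1 / 3, 1 / 3]   # uniform prior (stand-in for differential expression)
d = [0.85, 0.85, 0.85]      # constant damping, unlike the dynamic factor on the slide
ranks = rank_genes(A, f, d)
```

With this matrix, gene 0 collects rank from both other genes and ends up ranked highest, while genes 1 and 2 are symmetric and tie.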
  16. 16. Tumor heterogeneity vs. gene networks
      • NCIS: Liu et al. BMC Bioinfo 2014
      • C3: Hou et al. Bioinformatics 2016
      • LDGM: Tian, Gu, and Ma, Nucleic Acids Res 2016
      [Figure 5: Differential networks on the estrogen signaling pathway reconstructed based on gene expression data from breast cancer Luminal A and Basal-like subtypes. (A) The degree of ESR1 in estimated differential networks with increased number of interactions. (B) The rank of ESR1 by its degree in differential networks. The number of interactions is up to 1,000 in (A) and (B). (C) A differential network estimated by LDGM with a tuning parameter of 0.362 (the parameter symbol is lost in the transcript). Node size is proportional to the node's degree. Width of an interaction between genes i and j is proportional to the absolute value of its score. The origin of interactions in the differential network is inferred by a principle of majority approach based on Glasso (see Supplementary Text). Methods compared: LDGM, Glasso, JGL, CNJGL. Legend: % degree from Luminal A, % degree from Basal-like.]
  17. 17. Deep learning applications: from more traditional machine learning applications to deep learning applications
      [Figure 1 of Angermueller et al.: Machine learning and representation learning. (A) The classical machine learning workflow can be broken down into four steps: data pre-processing, feature extraction, model learning and model evaluation. (B) Supervised machine learning methods relate input features x to an output label y, whereas unsupervised methods learn factors about x without observed labels. (C) Raw input data are often high-dimensional and related to the corresponding label in a complicated way, which is challenging for many classical machine learning algorithms (left plot). Alternatively, higher-level features extracted using a deep model may be able to better discriminate between classes (right plot). (D) Deep networks use a hierarchical structure to learn increasingly abstract feature representations from the raw data.]
      Panel B examples. Supervised: linear regression, logistic regression, Random Forest, SVM, … Unsupervised: PCA, factor analysis, clustering, outlier detection, …
      Source: Christof Angermueller et al., Deep learning for computational biology, Molecular Systems Biology, published online July 29, 2016 (Angermueller et al., Mol Sys Bio 2016)
  18. 18. DeepBind
      [Figure 1: DeepBind’s input data, training procedure and applications. 1. The sequence specificities of DNA- and RNA-binding proteins can now be measured by several types of high-throughput assay, including PBM, SELEX, and ChIP- and CLIP-seq techniques. 2. DeepBind captures these binding specificities from raw sequence data by jointly discovering new sequence motifs along with rules for combining them into a predictive binding score. Graphics processing units (GPUs) are used to automatically train high-quality models, with expert tuning allowed but not required. 3. The resulting DeepBind models can then be used to identify binding sites in test sequences and [...]]
      Alipanahi et al. Nat Biotech 2015
      (Other methods: DeepSEA, Zhou & Troyanskaya, Nat Methods 2015; DanQ, Quang & Xie, Nucleic Acids Res 2016)
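The DeepBind caption speaks of discovering sequence motifs and combining them into a binding score. The sketch below shows the basic building block such models rely on, a single convolutional motif detector applied to a one-hot encoded sequence, followed by rectification and max pooling. It is an illustration only, not DeepBind's implementation: the filter weights are set by hand rather than learned, and all names are hypothetical.

```python
# One convolutional motif detector over a one-hot encoded DNA sequence.
# The filter weights are hand-set for illustration, not learned.

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode each base as a 4-element 0/1 vector in ACGT order."""
    return [[1.0 if base == b else 0.0 for b in ALPHABET] for base in seq]

def scan(seq, motif_filter):
    """Slide the filter along the sequence; return rectified scores."""
    x = one_hot(seq)
    width = len(motif_filter)
    scores = []
    for start in range(len(x) - width + 1):
        s = sum(
            motif_filter[k][c] * x[start + k][c]
            for k in range(width)
            for c in range(4)
        )
        scores.append(max(0.0, s))  # rectification
    return scores

def binding_score(seq, motif_filter):
    """Max-pool the position-wise scores into a single sequence score."""
    return max(scan(seq, motif_filter))

# Hypothetical filter that rewards the 3-mer "TGA" and penalizes mismatches.
TGA_FILTER = [
    [-1.0, -1.0, -1.0, 1.0],  # position 1 prefers T
    [-1.0, -1.0, 1.0, -1.0],  # position 2 prefers G
    [1.0, -1.0, -1.0, -1.0],  # position 3 prefers A
]

with_site = binding_score("CCTGACC", TGA_FILTER)     # contains TGA
without_site = binding_score("CCCCCCC", TGA_FILTER)  # no match
```

A trained model would learn many such filters jointly and feed the pooled scores into further layers to produce the final prediction.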
  19. 19. Cancer genome [Image: MCF-7, http://www.path.cam.ac.uk/~pawefish/]
  20. 20. Structural variations (SVs) in cancer genomes
      • SV types shown: inversion, translocation, gain, loss, duplication
      • Data: whole genome sequencing
      • Methods: DELLY, Meerkat, BreakDancer, CREST, CNVnator, CONSERTING, and many others
  21. 21. Aneuploidy: common feature of cancer cells
      [Image: MCF-7, http://www.path.cam.ac.uk/~pawefish/]
      • Allele-specific copy number (ASCN) tools: ABSOLUTE, ASCAT, Patchwork
      • SVs can further modify the aneuploid cancer genome into a mixture of genomic segments with an extensive range of CNAs
      • We need methods that combine SV and ASCN
      • How do SVs interact with ASCNs? How do different SVs interact with each other?
      [Figure from Zack et al. Nature Genetics 2013: purity and ploidy across lineages (LUAD, LUSC, HNSC, KIRC, BRCA, BLCA, CRC, UCEC, GBM, OV); percent of samples with WGD; samples grouped as near diploid, 1 WGD, 2+ WGD; amplification and deletion panels]
  22. 22. Goal: Quantify allele-specific SVs
  23. 23. Weaver: algorithm overview (Li et al. Cell Systems 2016)
      • Input: SV list, BAM file, 1KGP haplotypes, SNP list
      • Derived signals: coverage from read mapping, mappability, GC content, SNP linkage, SNP LD
      • Model: Cancer Genome Graph, converted to a probabilistic graphical model (Markov Random Field)
      • Output: Purity, ASCNG, ASCNS, Timing of SV, Phasing
      [Figure panels (A) to (C): hypothetical chromosomes chrA and chrB split into regions R1 to R21, with SVs labeled m, n, p, q, s, t of types interchr, intrachr, del and dup]
  24. 24. Weaver outputs: Purity, ASCNG, ASCNS, Timing of SV, Phasing
      [Figure: chr9 (9p21.3), 21,850,000 to 22,250,000 (100 kb scale), covering MTAP, C9orf53, CDKN2A, CDKN2B-AS1 and CDKN2B, with a coverage track (0 to 142) and annotated events: LOH & first amplification, Deletion (Del1, Del2), Second amplification]
      Figure 1: (A) Schema diagram for Weaver. Dark green boxes show the different types of analyses, unique to Weaver, that are not dealt with by other methods, while light green ones show ‘by-products’ of Weaver shown to have an improvement over existing methods. (B) An example demonstrating a Weaver output focused on ASCNS and Timing of SV. Dark blue segments (two copies) and light blue segment (one copy) represent a portion of the MCF-7 genome that originated from the same allele on chr9. The other allele was lost during tumorigenesis, resulting in LOH. The predicted evolution of this region [...]
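The caption above ties together copy number per allele, coverage and loss of heterozygosity (LOH). As a rough illustration of how these quantities relate, the sketch below computes the expected read coverage and minor-allele fraction of a region from its allele-specific copy numbers and tumor purity. This is a simplified textbook mixing model, not Weaver's actual likelihood; the per-copy coverage of 10 and all function names are assumptions.

```python
# Expected read coverage and minor-allele fraction for a region with
# allele-specific copy numbers (cn_a, cn_b). Simplified textbook model;
# per_copy_cov and the example values are assumptions.

def expected_coverage(cn_a, cn_b, per_copy_cov=10.0, purity=1.0):
    """Tumor cells carry cn_a + cn_b copies; normal cells carry 2."""
    mean_copies = purity * (cn_a + cn_b) + (1.0 - purity) * 2
    return per_copy_cov * mean_copies

def minor_allele_fraction(cn_a, cn_b, purity=1.0):
    """Fraction of reads expected from the less abundant allele."""
    a = purity * cn_a + (1.0 - purity) * 1
    b = purity * cn_b + (1.0 - purity) * 1
    total = a + b
    return min(a, b) / total if total > 0 else 0.0

def is_loh(cn_a, cn_b):
    """Loss of heterozygosity: one allele absent, the other retained."""
    return min(cn_a, cn_b) == 0 and max(cn_a, cn_b) > 0

cov = expected_coverage(2, 1)       # 30.0
maf = minor_allele_fraction(2, 1)   # one third
loh = is_loh(2, 0)                  # True
```

A region with two copies of one allele and none of the other has the same total copy number as a normal diploid region, so only the allele fraction reveals the LOH; this is why allele-specific rather than total copy number is estimated.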
  25. 25. ! MRF: • genome node, cancer node, genome edge, cancer edge
Figure 5 (panel titles: Cancer Genome Graph, MRF representation; panels (D) Inputs and (E) Outputs; time axis Pre-/Post-; chrA, chrB): (A) Hypothetical cancer chromosomes with rearrangement structure hidden. Orange and blue segments represent paternal/maternal alleles. Red dashed lines represent linkages by SVs. (B) The Cancer Genome Graph, constructed from (A), with nodes (boxes) representing genomic regions and edges representing reference (solid lines) or cancer (dashed ...
Figure 5 toy inputs. Genomic regions (label, cov, allele_freq): R1 30 0.33; R2 40 0.5; R5 20 0; R7 10 0; R12 20 0.5; R17 10 0. SVs (label, L_pos, R_pos): m +R2 -R2; n +R6 -R10; p +R4 -R16; q -R12 -R21; s +R14 +R18; t +R16 -R18. Outputs: CN_1 and CN_2 per region; L_allele, R_allele and CN per SV. Panel parameters: 𝜇0 = 0; 𝜇1 = 1; b = 10.
... 2 million from Weaver based on various datasets, depending on the size of loss of heterozygosity (LOH) regions and the number of SVs. The rationale behind the segmentation step with SV ... is known that most of the time ASCNG boundaries coincide with SV breakpoints [14]. Our segmentation approach using SV boundaries has the advantage of providing base-level ASCNG boundaries as compared to existing segmentation methods in copy number analysis, which typically use a fixed segmentation size.
Given the segmentation of the genome and SV set $C$, we then build the Cancer Genome Graph $G := \{R, E\}$ (Fig. 5(B)), with nodes representing genomic region sets ($R$) and edges representing reference adjacencies ($E_r$) (solid lines in the figure) if two nodes are adjacent in the normal genome, or cancer adjacencies ($E_c$) (dashed lines in the figure) if two nodes are adjacent in the cancer genome by SV ... configurations $E$ between node $R_i$ and $R_j$ can be represented as a pair of signed region ends, each sign in $\{+, -\}$, with $+$ and $-$ representing the tail (right) and head (left) of a given genomic region $R$, e.g., $(+R_i \sim -R_{i+1}) \in E_r$ if $R_i$ and $R_{i+1}$ are adjacent regions from the same chromosome in the normal genome.
We then convert the original Cancer Genome Graph $G := \{R, E\}$ into a Markov Random Field (MRF, $M := \{R, R_c, E_r, E_c\}$), which is a widely used probabilistic graphical model to ... joint probabilities. The MRF can be viewed as an undirected graph, and the aggregated inference problem in Weaver given sequencing data can be viewed as a maximum a posteriori (MAP) problem with hidden states and observations explained in the following sections.
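The graph construction described above can be sketched in a few lines. This is a minimal illustration, not Weaver's implementation: the function names (`parse_end`, `reference_edges`, `cancer_edges`) are hypothetical, and the assignment of R1 to R10 to chrA and R11 to R21 to chrB is an assumption made only so the toy example has two chromosomes. The SV endpoints are the ones listed in the Figure 5 toy SV table.

```python
from typing import Dict, List, Tuple

# A region end is (sign, region): "+" is the tail (right end), "-" is the head (left end).
End = Tuple[str, str]
Edge = Tuple[End, End]


def parse_end(token: str) -> End:
    """Split a signed region label such as '+R2' into ('+', 'R2')."""
    sign, region = token[0], token[1:]
    if sign not in ("+", "-") or not region:
        raise ValueError(f"expected a signed region label, got {token!r}")
    return sign, region


def reference_edges(chromosomes: Dict[str, List[str]]) -> List[Edge]:
    """Reference adjacencies E_r: (+R_i ~ -R_{i+1}) for neighbours on one chromosome."""
    edges: List[Edge] = []
    for regions in chromosomes.values():
        for left, right in zip(regions, regions[1:]):
            edges.append((("+", left), ("-", right)))
    return edges


def cancer_edges(svs: Dict[str, Tuple[str, str]]) -> Dict[str, Edge]:
    """Cancer adjacencies E_c: one edge per SV, joining its two signed breakpoints."""
    return {label: (parse_end(l), parse_end(r)) for label, (l, r) in svs.items()}


# Assumed layout (hypothetical): chrA = R1..R10, chrB = R11..R21.
chromosomes = {
    "chrA": [f"R{i}" for i in range(1, 11)],
    "chrB": [f"R{i}" for i in range(11, 22)],
}

# SV table from the Figure 5 toy example: label -> (L_pos, R_pos).
svs = {
    "m": ("+R2", "-R2"),
    "n": ("+R6", "-R10"),
    "p": ("+R4", "-R16"),
    "q": ("-R12", "-R21"),
    "s": ("+R14", "+R18"),
    "t": ("+R16", "-R18"),
}

E_r = reference_edges(chromosomes)
E_c = cancer_edges(svs)
print(len(E_r), "reference edges;", len(E_c), "cancer edges")
```

Keeping the sign on each end, rather than storing plain region pairs, is what lets the same structure express a tandem duplication (m joins the tail of R2 back to its own head) and an inversion-like link (s joins two tails).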
Unlike conventional methods for estimating copy number changes based on hidden Markov models (HMMs), which are designed for sequential data and only consider the dependencies between ‘local’ variables, the MAP solution of the MRF model provides the most probable configuration of aneuploid cancer genomes with complex SVs, involving ‘global’ variable dependencies defined by long-range SVs. Detailed steps are described in Supplementary Note 6. In the following sections, we describe hidden states, observations, and formal function of the MRF MAP problem. Details on potential functions on nodes and edges are provided in the Supplementary Note.
Hidden states $H$ ...
$$LD(G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1}) = \frac{N_{LD}(G^a_i, G^a_{i+1}) \times N_{LD}(G^b_i, G^b_{i+1})}{N_{LD}(G^a_i, G^a_{i+1}) \times N_{LD}(G^b_i, G^b_{i+1}) + N_{LD}(G^a_i, G^b_{i+1}) \times N_{LD}(G^b_i, G^a_{i+1})}$$
where $N_{LD}(G^a_i, G^a_{i+1})$ is the number of phased haplotypes (total number $1092 \times 2$ in phase 1) in 1KGP with genotype $(G^a_i, G^a_{i+1})$. Other genotype configurations can be similarly calculated.
(ii) Similarly, we define the read linkage score for the phasing $G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1}$ as:
$$RL(G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1}) = \frac{N_{RL}(G^a_i, G^a_{i+1}) + N_{RL}(G^b_i, G^b_{i+1})}{N_{RL}(R_i, R_{i+1})}$$
where $N_{RL}(R_i, R_{i+1})$ is the total number of reads covering genomic regions $(R_i, R_{i+1})$ and $N_{RL}(G^a_i, G^a_{i+1})$ is the total number of reads covering $(G^a_i, G^a_{i+1})$. If there are no reads covering $(R_i, R_{i+1})$, i.e. $N_{RL}(R_i, R_{i+1}) = 0$, then $RL = 0$.
Therefore, we define genotype linkage as
$$GL(G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1}) = \log\left(LD(G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1}) \cdot RL(G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1})\right)$$
In real data application, we have found that RL and LD correlate very well. For example, in the MCF-7 analysis, when we chose SNP pairs with 100% RL support as gold standard, we found AUC = 0.9964 using LD scores.
Markov random field model $M$
After we convert $G$ into MRF $M$ using steps in Supplementary Note 6, the MRF MAP problem is given by
$$\hat{H} = \arg\max_{H} \left\{ \sum_{i \in R} \Theta_R(O \mid H_i) + \sum_{c \in C} \Theta_C(O \mid H_c) + \sum_{i \in R} \Psi_R(O \mid H_i, H_{i+1}) + \sum_{c \in C} \sum_{i \in N(c)} \Psi_C(H_i, H_c) \right\}$$
The four terms are, in order: the genome node potential function, the cancer node potential function, the genome edge potential function, and the cancer edge potential function.
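The LD, RL and GL linkage scores on this slide reduce to simple count arithmetic, sketched below. This is an illustration under stated assumptions rather than Weaver's code: the function names and the example counts are hypothetical, the LD numerator is assumed to pair the (a,a) and (b,b) haplotype counts, the logarithm is assumed to be natural, and returning negative infinity when the product is zero is a choice made here because the slide does not say how that case is handled.

```python
import math


def ld_score(n_aa: int, n_bb: int, n_ab: int, n_ba: int) -> float:
    """LD support for phasing (a,a)/(b,b) from phased-haplotype counts.

    n_aa, n_bb: reference-panel haplotypes carrying the same-haplotype pairs.
    n_ab, n_ba: haplotypes carrying the alternative (swapped) pairs.
    """
    same = n_aa * n_bb
    swapped = n_ab * n_ba
    total = same + swapped
    # Assumption: with no informative haplotypes at all, report zero support.
    return same / total if total > 0 else 0.0


def rl_score(reads_aa: int, reads_bb: int, reads_total: int) -> float:
    """Read-linkage support: fraction of spanning reads consistent with the phasing."""
    if reads_total == 0:
        return 0.0  # stated on the slide: RL = 0 when no reads cover both regions
    return (reads_aa + reads_bb) / reads_total


def gl_score(ld: float, rl: float) -> float:
    """Genotype linkage GL = log(LD * RL); -inf when the product is zero (assumption)."""
    product = ld * rl
    return math.log(product) if product > 0 else float("-inf")


# Hypothetical counts for one pair of adjacent heterozygous sites.
ld = ld_score(n_aa=900, n_bb=1000, n_ab=50, n_ba=20)
rl = rl_score(reads_aa=18, reads_bb=20, reads_total=40)
gl = gl_score(ld, rl)
print(f"LD={ld:.4f} RL={rl:.2f} GL={gl:.4f}")
```

Because GL is the log of a product, either source of evidence can veto a phasing on its own: a pair with strong LD support but no spanning reads gets RL = 0 and therefore the lowest possible GL in this sketch.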
  26. 26. ASCN and SVs in MCF-7 ! 83% of SVs have copy number > 1 ! 68% of the regions have imbalanced copy number ! We found 276 SVs after whole-chromosome duplication ! We have used physical mapping to validate the results
  27. 27. ASCN and SVs in HeLa ! WGS reads obtained from Adey et al. Nature 2013 ! ASCNG are 97% consistent with Adey et al. (Fosmid seq)
Excerpt (Adey et al. Nature, 2013, Research Letter): Structural variants were identified by clustering discordantly mapped reads from 40-kb and 3-kb mate-pair libraries (Supplementary Fig. 8). Twenty interchromosomal links were identified, including links for marker chromosomes M11 (9q33–11p14) and M14 (13q21–19p13). In addition, 209 HeLa-specific deletions and 8 inversions were found (Supplementary Figs 9 and 11, and Supplementary Table 10). Only two genes that are impacted by HeLa-specific structural rearrangements (Supplementary Table 11) intersected with SCGC (STK11 (ref. 18), FHIT), both of which are recurrently deleted in cervical ... (refs 18, 19).
Figure (Adey et al. Nature, 2013): (a) HeLa chromosomes 1-22 and X, showing HPV integration, linked positions, marker chromosome names, supporting sequence data, tandem duplications, probable contiguity, and suspected haplotype by colour (Haplotype A, Haplotype B). (b) Window ratios and copy-number calls by genomic position (Chr18 / S3 window ratios, CCL-2 window ratios, S3 copy-number calls, S3-specific differences, LOH).
  28. 28. Application to TCGA Data ! Inter-chromosomal chromothripsis ! Breakage-fusion-bridge amplifications
Supplementary Figure 14 (sample TCGA-36-1571; coverage scale 1X to 62X; chromosome labels 4, 22, 14, 6; gene label FOXG1): (A) Overview of the genomic landscape of a TCGA ovarian cancer sample (TCGA-36-1571). (B) Most of the SVs linking chr4 and chr22 have copy number one. (C) Three high-coverage fold-back inversions are observed at the boundaries of the highly amplified region of chr14, indicating that many rounds of BFB cycles occurred. Gene FOXG1 is on the fold-back inversion boundary and is highly amplified.
