Learning Biologically Relevant Features Using Convolutional Neural Networks for DNA Sequence Analysis
22/01/2018
Invited Research Talk @ Bayer, Ghent, Belgium
Jasper Zuallaert, Wesley De Neve
¹ IDLab, ELIS, Ghent University, Ghent, Belgium
² Center for Biotech Data Science, Ghent University Global Campus (GUGC), Songdo, Korea

Introduction
Convolutional neural networks for DNA analysis
Visualization of biologically relevant features
Conclusions & future work

Introduction

Automatic genome annotation
* Which parts of the genome correspond to which functionalities?
* Which anomalies in the genome correspond to diseases?
* Can we manipulate the genome to avoid or cure diseases?
→ The first step in mapping functionality to the genome is to structure it
Open questions: primary structure? secondary structure? tertiary structure? binding sites? exons and introns? genes?

Expert knowledge on translation initiation & splice sites
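Figure: gene structure (exon, intron, exon) annotated with the translation initiation site, donor splice site, branch point, polypyrimidine tract, acceptor splice site and stop codon (TAA, TAG, TGA).
Sequence fragments in the figure include G C C G C C … C C A T G G at the translation initiation site and … A G G T … A G T at the donor splice site; the polypyrimidine tract is drawn as a run of C/T positions.
Length annotations in the figure: 10s to 100s, 10s to 10 000s, < 20, ~ 20.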

Dataset composition
Datasets with true and pseudo splice / translation initiation sites
Fixed-length sequences (~ 200 - 400 nucleotides), each containing an annotated site
…AGCGGCATCCAGGTAAGTTCTTCAACCTGTAAGGGAGGCTTCAGTTAAAGCCATCCGA…
…AGCGGCATCCAGGTAAATGTCTTCAA…
…CATCCAGATGAAGTTCTTCAACCTAT…
…TGTCTTCAACCTGTAAGGGAGGCTTC…
…AGGGAGGCTTCAGTTAAAGCCATCCG…
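
A minimal Python sketch of how fixed-length samples of this kind could be cut from a longer sequence. The function name, the flank sizes, and the choice to treat every GT dinucleotide as a candidate donor site are illustrative assumptions, not the authors' pipeline.

```python
def extract_windows(sequence, true_sites, upstream=10, downstream=10, motif="GT"):
    """Cut fixed-length windows around every occurrence of `motif`.

    Returns (window, label) pairs; label is 1 for annotated (true) sites and
    0 for pseudo sites. Candidates too close to either end are skipped so
    that every window has the same length.
    """
    samples = []
    start = sequence.find(motif)
    while start != -1:
        left, right = start - upstream, start + len(motif) + downstream
        if left >= 0 and right <= len(sequence):
            samples.append((sequence[left:right], 1 if start in true_sites else 0))
        start = sequence.find(motif, start + 1)
    return samples


# Longest sequence fragment from the slide. Marking the GT at index 12
# (inside ...CCAGGTAAG...) as the true site is an assumption for illustration.
fragment = "AGCGGCATCCAGGTAAGTTCTTCAACCTGTAAGGGAGGCTTCAGTTAAAGCCATCCGA"
samples = extract_windows(fragment, true_sites={12})
```

Every window places the candidate dinucleotide at the same offset, so a model sees true and pseudo sites in an identical frame of reference.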

Convolutional neural networks for DNA analysis

The success of Deep Learning
ImageNet classification competition: 1.2 million images, 1000 classes
Introduction of Deep Learning
Figure: an image represented as a grid of pixel values

Neural networks
Input Output
Self-learning, black-box systems
Feedback → update parameters
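
The "feedback → update parameters" loop can be illustrated with a single sigmoid neuron trained by gradient descent. This is a hedged sketch: the learning rate, the starting weights and the toy input are arbitrary assumptions, and real networks repeat the same idea across many layers and examples.

```python
import math


def predict(weights, bias, inputs):
    """Forward pass of a single sigmoid neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def update(weights, bias, inputs, target, learning_rate=0.5):
    """One feedback step: move the parameters against the loss gradient."""
    # For a sigmoid output with cross-entropy loss, dLoss/dz = prediction - target.
    error = predict(weights, bias, inputs) - target
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


weights, bias = [0.0, 0.0], 0.0
inputs, target = [1.0, 2.0], 1.0
before = predict(weights, bias, inputs)
for _ in range(20):
    weights, bias = update(weights, bias, inputs, target)
after = predict(weights, bias, inputs)
```

After the repeated updates, `after` is closer to the target than `before`, which is all the feedback loop is meant to achieve.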

Convolutional Neural Networks for images
Input image → Lines and shapes → Structures → Concepts

Convolutional Neural Networks for DNA sequences
One-hot encoded input sequence (columns: A, C, G, T)
A 1 0 0 0
G 0 0 1 0
T 0 0 0 1
T 0 0 0 1
C 0 1 0 0
A 1 0 0 0
G 0 0 1 0
G 0 0 1 0
T 0 0 0 1
A 1 0 0 0
G 0 0 1 0
C 0 1 0 0
C 0 1 0 0
T 0 0 0 1
Pattern detector weights for CAGGT (columns: A, C, G, T)
C 0 2 0 0
A 1 0 0 0
G 0 0 2 0
G 0 0 3 0
T 0 0 0 1
Pattern detector weights for (A/G)(T/C)(T/C)(T/C) (columns: A, C, G, T)
A/G 1 0 2 0
T/C 0 2 0 1
T/C 0 1 0 1
T/C 0 1 0 1
Pattern detection → Combination of patterns from previous layers → True splice site / False splice site
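
The pattern-detection step can be sketched in plain Python using the 14-nucleotide example sequence and the five CAGGT weight rows from this slide. The A, C, G, T column order is inferred from the one-hot rows. The sketch leaves out bias terms, activation functions and learned weights, so it illustrates only the sliding-window idea, not the trained network.

```python
ALPHABET = "ACGT"  # column order of the one-hot encoding


def one_hot(sequence):
    """Encode each nucleotide as a 4-element row in A, C, G, T order."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in sequence]


def scan(encoded, kernel):
    """Slide the kernel along the sequence; return one score per window."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        scores.append(sum(x * w
                          for row, weights in zip(window, kernel)
                          for x, w in zip(row, weights)))
    return scores


# Weights copied from the slide: the rows respond to C, A, G, G, T.
caggt_kernel = [
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 3, 0],
    [0, 0, 0, 1],
]

scores = scan(one_hot("AGTTCAGGTAGCCT"), caggt_kernel)
best = max(range(len(scores)), key=scores.__getitem__)
```

The highest score occurs at the window that starts at the C of CAGGT, which is the position where the pattern is "detected".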

Results on splice site prediction
Outperforms the state of the art on various datasets with different sizes, class imbalance and sequence lengths
Dataset sizes: 200 to 15 000 positives, 1000 to 75 000 negatives
Compared methods (results shown separately for donors and acceptors):
A Degroeve et al., 2005 (SVM)
B Lee et al., 2015 (DBN)
C Our approach (CNN)
Figure annotation: 240 x

Visualization of biologically relevant features

Visualization of neural networks
Goal → which parts of the input impact the prediction, and why?
Step 1: Forward propagation
Calculate predictions (example output: 0.74 and 0.26)
Step 2: Backpropagation
Calculate contribution scores per input
Example input values:
120 206 55 75 85
128 155 23 178 164
250 216 223 217 64
23 54 54 237 253
16 24 101 132 177
Example contribution scores per input:
0.05 0.01 0.06 0.07 -0.05
0.12 0.68 0.98 0.84 0.06
0.23 0.55 0.84 -0.06 -0.12
0.08 0.21 -0.06 -0.22 -0.23
0.02 0.06 -0.26 -0.83 -0.55
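
The two steps can be illustrated on a single sigmoid neuron. Note the swap: this sketch uses a plain gradient-times-input score, which is a simpler stand-in and not the DeepLIFT method used in the talk. The weights and inputs are hypothetical.

```python
import math


def forward(weights, bias, inputs):
    """Step 1: forward propagation, returns the predicted probability."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def contributions(weights, bias, inputs):
    """Step 2: backpropagate the prediction to the inputs.

    For a sigmoid neuron, d(prediction)/d(input_i) = p * (1 - p) * w_i.
    Multiplying by the input value gives a gradient-times-input score.
    """
    p = forward(weights, bias, inputs)
    return [p * (1.0 - p) * w * x for w, x in zip(weights, inputs)]


# Hypothetical weights and inputs, chosen only for illustration.
weights, bias = [0.8, -0.5, 0.0], 0.1
inputs = [1.0, 1.0, 1.0]
prediction = forward(weights, bias, inputs)
scores = contributions(weights, bias, inputs)
```

Inputs with a positive score push the prediction up, inputs with a negative score push it down, and an input with zero weight contributes nothing.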

Visualization of neural networks
Images: saliency map for prediction: cockatoo
Source: Visualizing Deep Neural Network Decisions: Prediction Difference Analysis, Zintgraf et al., 2017
Genomic data: (part of) saliency map for TIS prediction → how to interpret it?

Making sense of DNA saliency maps
1. Calculate* contribution scores per nucleotide ($cs_i$), for each sequence
2. Normalize scores
$$wcs_{ij} = 100 \cdot m \cdot \frac{cs_{ij}}{\sum_{p=1}^{m} \sum_{q=1}^{n} cs_{pq}}$$
This gives a more interpretable meaning to the contribution scores, and it normalizes scores for different datasets on the same scale
3. Evaluation, e.g., by averaging over multiple sequences
* Using DeepLIFT: Learning Important Features Through Propagating Activation Differences, Shrikumar et al., 2017
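
A small Python sketch of the normalization on this slide, wcs_ij = 100 * m * cs_ij divided by the sum of all cs_pq. The slide does not define m and n; the sketch assumes m is the number of sequences and n the number of positions per sequence. The input scores are made up.

```python
def weighted_contribution_scores(cs):
    """Apply wcs_ij = 100 * m * cs_ij / (sum over p, q of cs_pq).

    `cs` is a list of m rows (assumed: one per sequence), each holding n
    contribution scores (assumed: one per nucleotide position).
    """
    m = len(cs)
    total = sum(sum(row) for row in cs)
    if total == 0:
        raise ValueError("contribution scores sum to zero; cannot normalize")
    factor = 100.0 * m / total
    return [[factor * value for value in row] for row in cs]


# Made-up scores for two sequences of two positions each.
wcs = weighted_contribution_scores([[1.0, 3.0], [2.0, 2.0]])
```

With this scaling, the scores of all sequences together sum to 100 times m, so an average sequence sums to 100 regardless of the dataset.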

Visualization example: acceptor sites
- Branch point detection (CTNA)
- AG exclusion zone
- Polypyrimidine tract (Cs and Ts)
- Acceptor motif CAGG(T)
Presence of a donor pattern (CAGGTAAG):
- Expected towards the end of the sequence
→ most exons are <200 nucleotides long

Visualization example: donor sites
Donor motif CAGGTAAGT
Presence of another donor pattern (CAGGTAAG):
- Not expected in the sequence, as this would imply an unlikely short intron + exon

Visualization example: translation initiation sites
Translation initiation site motif (GCCACCATGGCG)
Presence of a donor motif (CAGGTAAG):
- Expected after the site, as the end of the first exon
- Not expected in front of the site
Presence of a stop codon (TGA, TAA, TAG):
- Not expected at any third position, i.e., in frame with the site (because translation proceeds in codons)
- At other positions, no influence
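
The reading-frame argument can be made concrete with a short Python sketch that separates in-frame from out-of-frame stop codons. The function name and the toy sequences are hypothetical, and only positions downstream of the start codon are scanned as a simplifying assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_codons_downstream(sequence, tis):
    """Split stop codons after the start codon at index `tis` by reading frame.

    Returns (in_frame, out_of_frame) lists of start positions.
    """
    in_frame, out_of_frame = [], []
    for pos in range(tis + 3, len(sequence) - 2):
        if sequence[pos:pos + 3] in STOP_CODONS:
            (in_frame if (pos - tis) % 3 == 0 else out_of_frame).append(pos)
    return in_frame, out_of_frame


# Toy sequences, not taken from the datasets in the talk.
hit = stop_codons_downstream("GCCACCATGGCGTAAGG", tis=6)
miss = stop_codons_downstream("ATGGTAAGCC", tis=0)
```

In the first sequence, TAA is read as a full codon two codons after ATG and would end translation. In the second, the same TAA straddles two codons and is never read as a stop.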

Conclusions & future work

Conclusions
Pattern detection with Convolutional Neural Networks is very effective for splice site / TIS prediction
- End-to-end prediction system
- No manual feature engineering needed
→ Without any prior knowledge, biologically relevant features are learnt
Publications:
- DLB2H 2017: Interpretable Convolutional Neural Networks for Effective Translation Initiation Site Prediction
- Bioinformatics (under revision): Interpretable Convolutional Neural Networks for Improved Splice Site Prediction

Future work
Improving predictions using extra forms of data
- Spatial properties, physicochemical properties, …
- Visualization of which inputs the network uses
Automating the workflow for determining biologically relevant features
- Automatically group similar patterns with similar scores
- Look further into the network internals
- Verification on known problems (see previous slides) → detection of previously unknown patterns?
- Verification on an artificially created dataset, seeing if all induced features can be found

Thank you for your attention!
