Learning biologically relevant features using convolutional neural networks for dna sequence analysis
1. Learning Biologically Relevant Features Using Convolutional Neural Networks for DNA Sequence Analysis
22/01/2018
Invited Research Talk @ Bayer, Ghent, Belgium
Jasper Zuallaert, Wesley De Neve
¹ IDLab, ELIS, Ghent University, Ghent, Belgium
² Center for Biotech Data Science, Ghent University Global Campus (GUGC), Songdo, Korea
4. Automatic genome annotation
* Which parts of the genome correspond to which functionalities?
* Which anomalies in the genome correspond to diseases?
* Can we manipulate the genome to avoid or cure diseases?
→ The first step in mapping functionality to the genome is to structure it
Primary structure?
Tertiary structure?
Binding sites?
Exons? Introns?
Genes?
Secondary structure?
5. Expert knowledge on translation initiation & splice sites
[Figure: composition of translation initiation and splice sites along an exon / intron / exon structure. Labelled elements: translation initiation site, donor splice site, branch point, polypyrimidine tract, acceptor splice site, stop codon (TAA, TAG, TGA). Consensus nucleotides are shown around each site. Length annotations in the figure: 10s to 100s, 10s to 10 000s, < 20, ~ 20.]
6. Dataset composition
Datasets with true and pseudo splice / translation initiation sites
Each sample: a fixed-length sequence (~ 200 - 400 nucleotides) containing the annotated site
…AGCGGCATCCAGGTAAGTTCTTCAACCTGTAAGGGAGGCTTCAGTTAAAGCCATCCGA…
…AGCGGCATCCAGGTAAATGTCTTCAA…
…CATCCAGATGAAGTTCTTCAACCTAT…
…TGTCTTCAACCTGTAAGGGAGGCTTC…
…AGGGAGGCTTCAGTTAAAGCCATCCG…
11. Convolutional Neural Networks for DNA sequences
Input sequence, one-hot encoded (columns: A C G T):
A 1 0 0 0
G 0 0 1 0
T 0 0 0 1
T 0 0 0 1
C 0 1 0 0
A 1 0 0 0
G 0 0 1 0
G 0 0 1 0
T 0 0 0 1
A 1 0 0 0
G 0 0 1 0
C 0 1 0 0
C 0 1 0 0
T 0 0 0 1
Pattern detection, two example filters (columns: A C G T):
C 0 2 0 0
A 1 0 0 0
G 0 0 2 0
G 0 0 3 0
T 0 0 0 1

A/G 1 0 2 0
T/C 0 2 0 1
T/C 0 1 0 1
T/C 0 1 0 1
Combination of patterns from previous layers
Output: true splice site / false splice site
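As an illustration of the pattern-detection step, here is a minimal sketch in plain Python. It one-hot encodes the sequence from this slide and slides the CAGGT filter over it. The function names `one_hot` and `scan` are hypothetical, and a real convolutional layer would also add a bias and a non-linearity, which are left out here.

```python
# Column order A, C, G, T, as in the one-hot table on the slide.
ALPHABET = "ACGT"

def one_hot(sequence):
    """Return one row per nucleotide with a single 1 in the matching column."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in sequence]

def scan(encoded, filter_weights):
    """Slide the filter over the encoded sequence and return one score per start position."""
    width = len(filter_weights)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + k][c] * filter_weights[k][c]
            for k in range(width)
            for c in range(len(ALPHABET))
        )
        scores.append(score)
    return scores

# Filter weights copied from the slide (rows correspond to C, A, G, G, T).
CAGGT_FILTER = [
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 3, 0],
    [0, 0, 0, 1],
]

sequence = "AGTTCAGGTAGCCT"  # the example sequence from the slide
scores = scan(one_hot(sequence), CAGGT_FILTER)
best_start = scores.index(max(scores))  # 4, where CAGGT begins; the score there is 9
```

The highest score appears exactly where the motif matches, which is the sense in which a filter acts as a pattern detector.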
12. Results on splice site prediction
Beats state-of-the-art on various datasets with different sizes, class imbalance and sequence lengths
[Figure: results for donors and acceptors, methods A, B and C, annotated "240 x"]
A Degroeve et al, 2005 (SVM)
B Lee et al, 2015 (DBN)
C Our approach (CNN)
Dataset sizes:
Positives: 200 to 15 000
Negatives: 1000 to 75 000
15. Visualization of neural networks
Images: saliency map for prediction: cockatoo
Source: Visualizing Deep Neural Network Decisions: Prediction Difference Analysis, Zintgraf et al, 2017
Genomic data: (part of) saliency map for TIS prediction ??
16. Making sense of DNA saliency maps
1. Calculate* contribution scores per nucleotide ($cs_i$), for each sequence
2. Normalize scores
$$wcs_{ij} = 100 \cdot m \cdot \frac{cs_{ij}}{\sum_{p=1}^{m} \sum_{q=1}^{n} cs_{pq}}$$
This gives a more interpretable meaning to the contribution scores, and it normalizes scores for different datasets on the same scale
3. Evaluation, e.g., by averaging over multiple sequences
* Using DeepLIFT: Learning Important Features Through Propagating Activation Differences, Shrikumar et al., 2017
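Steps 2 and 3 can be sketched in a few lines of Python. This assumes that m is the number of sequences, n is the sequence length, and cs is an m x n table of contribution scores; the slide does not define these symbols explicitly. The denominator follows the slide's formula literally (a plain sum), and the function names are hypothetical.

```python
def weighted_contribution_scores(cs):
    """Normalize an m x n table of contribution scores following the slide's formula."""
    m = len(cs)
    total = sum(sum(row) for row in cs)  # double sum over all sequences and positions
    if total == 0:
        raise ValueError("contribution scores sum to zero; normalization is undefined")
    return [[100 * m * value / total for value in row] for row in cs]

def average_per_position(wcs):
    """Step 3: average the normalized scores over all sequences, per position."""
    m = len(wcs)
    return [sum(column) / m for column in zip(*wcs)]

# Toy contribution scores for m = 2 sequences of length n = 3 (made-up values).
cs = [
    [0.25, 0.25, 0.5],
    [0.5, 0.125, 0.375],
]
wcs = weighted_contribution_scores(cs)  # [[25.0, 25.0, 50.0], [50.0, 12.5, 37.5]]
profile = average_per_position(wcs)     # [37.5, 18.75, 43.75]
```

With this scaling the normalized scores of an average sequence add up to 100, so each value can be read roughly as a percentage of one sequence's total contribution.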
17. Visualization example: acceptor sites
- Branch point detection (CTNA)
- AG exclusion zone
- Polypyrimidine tract (Cs and Ts)
- Acceptor motif CAGG(T)
Presence of a donor pattern (CAGGTAAG):
- Expected towards the end of the sequence, as most exons are <200 nucleotides long
18. Visualization example: donor sites
Donor motif CAGGTAAGT
Presence of a donor pattern:
- Not expected elsewhere in the sequence, as this would imply an implausibly short intron + exon
19. Visualization example: translation initiation sites
Translation initiation site motif (GCCACCATGGCG)
Presence of a donor motif (CAGGTAAG):
- Expected after the site, as the end of the first exon
- Not expected in front of the site
Presence of a stop codon (TGA, TAA, TAG):
- Not expected at any third position (because of translation in codons)
- At other positions, no influence
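The "any third position" remark about stop codons can be made concrete with a small sketch. It assumes the reading frame is set by the ATG of the candidate site and only looks downstream of it; the function name `in_frame_stops` and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def in_frame_stops(sequence, tis_index):
    """Return start indices of stop codons in the reading frame set by the ATG at tis_index."""
    if sequence[tis_index:tis_index + 3] != "ATG":
        raise ValueError("no ATG at tis_index")
    return [
        i
        for i in range(tis_index + 3, len(sequence) - 2, 3)  # step in whole codons
        if sequence[i:i + 3] in STOP_CODONS
    ]

# Toy sequence: ATG at index 6, TGA in frame at index 12, TAA out of frame at index 17.
sequence = "GCCACCATGGCGTGACCTAAGG"
stops = in_frame_stops(sequence, 6)  # [12]
```

Only the TGA at index 12 is reported. The TAA at index 17 is shifted relative to the codon grid, which matches the observation that stop codons at other positions have no influence.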
21. Conclusions
Pattern detection with Convolutional Neural Networks = very effective for splice site / TIS prediction
- End-to-end prediction system
- No manual feature engineering needed
Without any prior knowledge, biologically relevant features are learnt
Publications:
DLB2H 2017 - Interpretable Convolutional Neural Networks for Effective Translation Initiation Site Prediction
Under revision:
Bioinformatics - Interpretable Convolutional Neural Networks for Improved Splice Site Prediction
22. Future work
Improving predictions using extra forms of data
- Spatial properties, physicochemical properties, …
- Visualization of which inputs the network uses
Automating the workflow for determining biologically relevant features
- Automatically group similar patterns with similar scores
- Look further into the network internals
- Verification on known problems (see previous slides) → detection of previously unknown patterns?
- Verification on an artificially created dataset, seeing if all induced features can be found
Automatic Genome Annotation
Goal: finding links between genome and functionality, to fight diseases, among other uses
Incredibly complex matter: the genome is not just a linear DNA sequence, but a 3D structure with many internal and external dependencies
To find dependencies, first we need to find some structure in the genome
Conventional approaches in the field require manual extraction of features based on human expertise. The image shows the typical composition of translation initiation and splice sites
The datasets we work with are built by extracting all possible splice sites (all canonical splice sites, i.e. with GT in the middle), each accompanied by a label indicating whether or not it is a true splice site
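The dataset construction described here can be sketched as follows. This is an illustration under stated assumptions, not the actual extraction pipeline: the function name `candidate_sites`, the short flank of 4 nucleotides (the real windows are ~ 200 - 400 long) and the choice of position 12 as the "true" site are all hypothetical.

```python
def candidate_sites(sequence, true_site_positions, dinucleotide="GT", flank=4):
    """Extract a fixed-length window around every occurrence of the dinucleotide.

    Returns (window, label) pairs; the label is True when the occurrence
    is listed in true_site_positions, and False for pseudo sites.
    """
    samples = []
    start = sequence.find(dinucleotide)
    while start != -1:
        lo = start - flank
        hi = start + len(dinucleotide) + flank
        if lo >= 0 and hi <= len(sequence):  # skip windows that would run off the ends
            samples.append((sequence[lo:hi], start in true_site_positions))
        start = sequence.find(dinucleotide, start + 1)
    return samples

# Sequence taken from the dataset slide; the true-site annotation is made up.
sequence = "AGCGGCATCCAGGTAAGTTCTTCAACCTGTAAGGGAGGCTTCAGTTAAAGCCATCCGA"
samples = candidate_sites(sequence, true_site_positions={12})
# Four GT occurrences give four windows of length 10; only the first is labeled True.
```

Every GT becomes a sample, so pseudo sites naturally outnumber true sites, which is where the class imbalance in the datasets comes from.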
Results on splice site prediction
The right graph summarizes the results of 24 x 10-fold cross-validation tests, on datasets with varying sizes (see table)
In contrast to perturbation-based approaches, backpropagation-based approaches involve only one forward propagation and one backpropagation per input. A variety of approaches are used, but they all work according to the principle of gradient calculation. We make use of DeepLIFT (Shrikumar et al, 2017), which is a backpropagation-based approach.
Using these approaches, a saliency map is produced for each input.
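To make the cost difference tangible, here is a toy comparison on a linear scoring model. It is not DeepLIFT, which propagates activation differences relative to a reference input; it uses plain gradient-times-input as a stand-in for the backpropagation family and single-position occlusion as a stand-in for the perturbation family. All names and weights are hypothetical.

```python
ALPHABET = "ACGT"

def one_hot(sequence):
    """Return one row per nucleotide with a single 1 in the matching column."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in sequence]

def score(weights, encoded):
    """Forward pass of a toy linear model: sum of weight * input over all cells."""
    return sum(w * x for w_row, x_row in zip(weights, encoded) for w, x in zip(w_row, x_row))

def perturbation_saliency(weights, encoded):
    """Occlude one position at a time; needs 1 + L forward passes for length L."""
    baseline = score(weights, encoded)
    saliency = []
    for i in range(len(encoded)):
        occluded = [[0] * len(row) if k == i else row for k, row in enumerate(encoded)]
        saliency.append(baseline - score(weights, occluded))
    return saliency

def gradient_times_input(weights, encoded):
    """For a linear model the gradient of the score w.r.t. each input cell is its weight,
    so all contributions follow from a single pass over the weights."""
    return [sum(w * x for w, x in zip(w_row, x_row)) for w_row, x_row in zip(weights, encoded)]

# Hypothetical weights: one row per position, columns A, C, G, T.
weights = [
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 3, 0],
    [0, 0, 0, 1],
]
encoded = one_hot("CAGAT")
by_perturbation = perturbation_saliency(weights, encoded)  # [2, 1, 2, 0, 1]
by_gradient = gradient_times_input(weights, encoded)       # [2, 1, 2, 0, 1]
```

For this linear toy model both routes give the same saliency map, but the occlusion route needs one extra forward pass per nucleotide.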
When looking at saliency maps for images, we immediately understand what our network is looking at.
However, for DNA sequences, this is not the case. It is very hard to make sense of saliencies of different nucleotides.