1/18
Towards reading genomic data using
deep learning-driven NLP techniques
Jasper Zuallaert
Mijung Kim
Wesley De Neve
Data Science Lab @ Ghent University – iMinds, Belgium
Center for Biotech Data Science @ Ghent University Global Campus, Korea
BIOINFO 2016 – Precision Bioinformatics & Machine Learning, Incheon Global Campus, 19/08/2016
2/18
Automatic genome annotation
Conventional approaches
Our approach → deep learning
Extra → word representations & data augmentation
Current experimental results
Conclusions & future research
Outline
3/18
Automatic genome annotation
?
• Which parts of a genome correspond to which functionalities?
• Which anomalies in a genome correspond to which diseases?
• Can we manipulate a genome so as to cure or prevent diseases?
→ The first step in mapping the functionality of a genome is to identify its building blocks
We want to develop an end-to-end learning system that is able to automatically
discover where genes start and how these genes are split into introns and exons
+ understand the biological motivation, if any, behind the decisions taken
4/18
Genome structure
CAGACTATATCGACTAATATATCTCATCTACAGATACTGACTAGCATCGATATTATG
… ||||||||||||||||||||||||||||||||||||||||||||||||||||||||| …
GTCTGATATAGCTGATTATATAGAGTAGATGTCTATGACTGATCGTAGCTATAATAC
[Figure: genes are located along the double-stranded DNA; each gene is split into exons and introns, separated by splice sites, and is ultimately translated into a protein]
5/18
Conventional approaches
→ Use features that have been manually identified by human experts
Our approach
→ Uses features that have been automatically learned by deep learning
Splice site prediction
[Figure: an exon-intron-exon region with the donor site (consensus GT…) at the exon-intron boundary and the acceptor site (consensus …AG) at the intron-exon boundary; intron lengths range from 10 to 10 000 nucleotides, and a branch site followed by a pyrimidine (C/T) tract lies fewer than 20 nucleotides upstream of the acceptor site]
6/18
He et al., “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, arXiv, 2015
Ioffe et al., “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, arXiv, 2015
The success of deep learning
[Chart: benchmark performance over the years, annotated with the introduction of deep learning]
7/18
Convolutional neural networks
Very successful in image processing because of the detection of visual patterns
lines → shapes → structures → objects
Currently also very successful for character-level natural language processing
→ this observation inspired us to use CNNs for genomic data analysis
→ condition: need for a proper one-hot encoding (numerical representation)
A → 1 0 0 0
C → 0 1 0 0
G → 0 0 1 0
T → 0 0 0 1
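As a minimal sketch (an assumed helper, not the authors' code), such an encoding can be computed in Python/NumPy, using the channel order A, C, G, T shown above:

import numpy as np

NUCLEOTIDES = "ACGT"  # channel order: A, C, G, T

def one_hot_encode(sequence):
    # Turn a DNA string into a 4 x L matrix (rows A, C, G, T); unknown bases stay all-zero.
    encoding = np.zeros((4, len(sequence)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        if base in NUCLEOTIDES:
            encoding[NUCLEOTIDES.index(base), position] = 1.0
    return encoding

print(one_hot_encode("ATATCG"))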
8/18
Convolutional neural networks
Input = one-hot encoding of ATATCG (rows A, C, G, T):
A  1 0 1 0 0 0
C  0 0 0 0 1 0
G  0 0 0 0 0 1
T  0 1 0 1 0 0
[Figure: two small filters slide over the input; their filtered outputs (2 2 -1 0 and 7 -4 4 -1) are reduced by max-pooling with a window of 2 to 2 2 0 and 7 4 4]
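A minimal NumPy sketch of the operation illustrated above; only the input encoding is taken from the slide, the filter values are illustrative assumptions:

import numpy as np

# One-hot encoding of ATATCG (rows A, C, G, T), as on this slide
x = np.array([[1, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 1],
              [0, 1, 0, 1, 0, 0]], dtype=float)

# An illustrative 4 x 3 filter (rows A, C, G, T)
f = np.array([[3, 2, 1],
              [1, 2, -1],
              [0, -3, -2],
              [0, -2, 0]], dtype=float)

def conv1d_valid(one_hot, filt):
    # Slide the filter over the sequence ('valid' convolution, stride 1)
    width = filt.shape[1]
    steps = one_hot.shape[1] - width + 1
    return np.array([np.sum(one_hot[:, i:i + width] * filt) for i in range(steps)])

def max_pool(values, pool_size=2, stride=1):
    # Max-pooling over the filter responses
    return np.array([values[i:i + pool_size].max()
                     for i in range(0, len(values) - pool_size + 1, stride)])

responses = conv1d_valid(x, f)      # filtered output
print(responses, max_pool(responses))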
9/18
Our architecture
CGTT...AGGGCGCCATATCGAGCATGTTATCTGCGTA...CAGT
[Figure: network architecture with convolutional layers of 30 and 40 filters, components of size 64, fully connected layers of 512, 256, and 128 units, and a final Yes | No output]
• Look for different filter sizes
• Introduce some translational invariance
• Extra focus on the middle part
• Combine results and make the final prediction
• Use of recurrent Long Short-Term Memory networks
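A hedged Lasagne/Theano sketch of an architecture in this spirit; the input length, filter widths, and pooling size are assumptions for illustration, not the exact topology of the figure:

from lasagne.layers import (InputLayer, Conv1DLayer, MaxPool1DLayer,
                            DenseLayer, ConcatLayer, FlattenLayer)
from lasagne.nonlinearities import rectify, softmax

SEQ_LEN = 400  # assumed input length: 4 channels x SEQ_LEN one-hot positions

def build_network(input_var=None):
    # Parallel convolutional branches with different filter sizes, merged into dense layers
    network_input = InputLayer(shape=(None, 4, SEQ_LEN), input_var=input_var)
    branches = []
    for filter_size in (5, 9, 15):                      # assumed filter widths
        conv = Conv1DLayer(network_input, num_filters=40,
                           filter_size=filter_size, nonlinearity=rectify)
        pooled = MaxPool1DLayer(conv, pool_size=4)      # introduces translational invariance
        branches.append(FlattenLayer(pooled))
    merged = ConcatLayer(branches)                      # combine the branch outputs
    dense = DenseLayer(merged, num_units=512, nonlinearity=rectify)
    dense = DenseLayer(dense, num_units=256, nonlinearity=rectify)
    dense = DenseLayer(dense, num_units=128, nonlinearity=rectify)
    return DenseLayer(dense, num_units=2, nonlinearity=softmax)  # Yes | No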
10/18
Very important in Natural Language Processing
Each word is represented by a vector of floating-point values
wi = [0.25, 0.30, -0.18, 0.45, …]
Each word vector is trained on its neighbours in a vast collection of texts → trained to predict its neighbours
The quick brown fox jumps over the lazy dog
Result = powerful word embeddings
Word2Vec representation
King - Man + Woman ≈ Queen
[Figure: 2-D projection of the word vectors for Man, King, Woman, and Queen]
11/18
Word2Vec algorithm applied to protein/DNA sequences
Each n-gram is represented by a vector
e.g., GTC = [0.359, 0.211, -0.492, …, 0.129]
Each word vector is trained on its neighbours in a genome → trained to predict its neighbours
... ATG TGT GTC TCA CAC …
ProtVec representation
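As an illustrative sketch (not the authors' pipeline), a ProtVec-style embedding can be obtained by splitting sequences into overlapping 3-mers and training a standard Word2Vec model on them, e.g. with gensim (on gensim 3.x the dimensionality argument is called size instead of vector_size):

from gensim.models import Word2Vec

def to_kmers(sequence, k=3):
    # Split a sequence into overlapping k-mers, e.g. ATGTCAC -> ATG TGT GTC TCA CAC
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Hypothetical corpus: one 'sentence' of 3-mers per DNA sequence
sequences = ["ATGTCACGTTACG", "CGGATATCGATCG"]
corpus = [to_kmers(seq) for seq in sequences]

# Train skip-gram embeddings of the 3-mers
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv["GTC"][:5])  # first components of the vector for the 3-mer GTC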
12/18
→ Generate extra (more general) data by introducing uncertainty
Basic string
CGATTATATCATCGCGGCGCATCGCGACTCGAGAATATCATCGGCGACGTACTCATCATGCAA
After mutation (random positions replaced by N)
CGATTATNTCATCNCNGCGCATNGCGACTCGAGAATATCATCNGCGACGTACTNATCATGNAA
After insertion (N inserted at random positions)
TTATNTCNATCNCNGCNGCATNNGCGACTCGAGAATATCNATCNGCGACNGTANCTNATCATN
N = unknown = ¼ A, ¼ C, ¼ G, ¼ T
Data augmentation
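A minimal sketch of such augmentation, assuming illustrative mutation and insertion rates; in the one-hot encoding, N would then be represented as [¼, ¼, ¼, ¼]:

import random

def mutate(sequence, rate=0.05):
    # Replace a random fraction of positions by the unknown symbol N
    return "".join("N" if random.random() < rate else base for base in sequence)

def insert_unknowns(sequence, rate=0.05):
    # Insert N at random positions, then trim to the original length to keep a fixed input size
    out = []
    for base in sequence:
        if random.random() < rate:
            out.append("N")
        out.append(base)
    return "".join(out)[:len(sequence)]

original = "CGATTATATCATCGCGGCGCATCGCGACTCGAGAATATCATCGGCGACGTACTCATCATGCAA"
print(mutate(original))
print(insert_unknowns(original))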
13/18
Experimental setup
Code written in Python, using the Theano and Lasagne packages
Four GTX Titan X GPUs for training
5-fold cross-validation → train/validate/test = 3/1/1
!! Class imbalance !!
=> weighted cost function (cross entropy), prioritizing positive samples (see the sketch after the table below)
Publication:   Degroeve et al., 2005 | Lee et al., 2015 | Pashaei et al., 2016
Dataset:       Arabidopsis thaliana  | UCSC-HG19/38     | HS3D (Homo sapiens)
# positives:   ~ 9 028               | ~ 160 000        | ~ 2 800
# negatives:   ~ 240 000             | ~ 800 000        | ~ 28 000
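A hedged Theano/Lasagne sketch of such a weighted cross entropy; the positive-class weight is an assumed value, roughly reflecting the class ratio:

import theano.tensor as T
import lasagne

POSITIVE_WEIGHT = 25.0  # assumed weight, e.g. roughly #negatives / #positives

predictions = T.vector('predictions')  # predicted probability of being a splice site
targets = T.vector('targets')          # 1 for a true splice site, 0 otherwise

# Per-sample cross entropy, with positive samples weighted more heavily
per_sample_loss = lasagne.objectives.binary_crossentropy(predictions, targets)
sample_weights = targets * POSITIVE_WEIGHT + (1.0 - targets)
loss = lasagne.objectives.aggregate(per_sample_loss, weights=sample_weights,
                                    mode='normalized_sum')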
14/18
Experimental results (1):
proposed architecture + simple one-hot encoding
15/18
Experimental results (2):
proposed architecture + ProtVec + data augmentation
16/18
Experimental results (2):
proposed architecture + ProtVec + data augmentation
17/18
Deep learning for automatic genome annotation
Facilitates automatic feature learning
CNNs allow for pattern detection in genomic data
Outperforms current techniques for splice site prediction
Future research
Optimization → Find better network topologies
→ Further exploration of word representations and data augmentation
Visualization → Get biological insights into what the network learns
Generalization → Find a generic architecture for modeling arbitrary noisy character sequences
Conclusions & future research
18/18
Jasper Zuallaert: jasper.zullaert@ugent.be
Mijung Kim: mijung.kim@ugent.be
Wesley De Neve: wesley.deneve@ugent.be
