1/18
Towards reading genomic data using
deep learning-driven NLP techniques
Jasper Zuallaert
Mijung Kim
Wesley De Neve
Data Science Lab @ Ghent University – iMinds, Belgium
Center for Biotech Data Science @ Ghent University Global Campus, Korea
BIOINFO 2016 – Precision Bioinformatics & Machine Learning, Incheon Global Campus, 19/08/2016
2/18
Automatic genome annotation
Conventional approaches
Our approach → deep learning
Extra → word representations & data augmentation
Current experimental results
Conclusions & future research
Outline
3/18
Automatic genome annotation
?
• Which parts of a genome correspond to which functionalities?
• Which anomalies in a genome correspond to which diseases?
• Can we manipulate a genome so as to cure or prevent diseases?
→ The first step in mapping the functionality of a genome is to identify its building blocks
We want to develop an end-to-end learning system that is able to automatically
discover where genes start and how these genes are split into introns and exons
+ understand the biological motivation, if any, behind the decisions taken
4/18
Genome structure
CAGACTATATCGACTAATATATCTCATCTACAGATACTGACTAGCATCGATATTATG
… ||||||||||||||||||||||||||||||||||||||||||||||||||||||||| …
GTCTGATATAGCTGATTATATAGAGTAGATGTCTATGACTGATCGTAGCTATAATAC
[Figure: genes are located along the double-stranded DNA; each gene is split into exons and introns, separated by splice sites, and is ultimately translated into a protein]
5/18
Conventional approaches
→ Use features that have been manually identified by human experts
Our approach
→ Uses features that have been automatically learned by deep learning
Splice site prediction
[Figure: an exon-intron-exon region with the donor site (consensus GT…) at the exon-intron boundary and the acceptor site (consensus …AG) at the intron-exon boundary; intron lengths range from 10 to 10 000 nucleotides, and a branch site followed by a pyrimidine (C/T) tract lies fewer than 20 nucleotides upstream of the acceptor site]
6/18
He et al., “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, arXiv, 2015
Ioffe et al., “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, arXiv, 2015
The success of deep learning
[Chart: benchmark performance over the years, annotated with the introduction of deep learning]
7/18
Convolutional neural networks
Very successful in image processing because of the detection of visual patterns
lines → shapes → structures → objects
Currently also very successful for character-level natural language processing
→ this observation inspired us to use CNNs for genomic data analysis
→ condition: need for a proper one-hot encoding (numerical representation)
A → 1 0 0 0
C → 0 1 0 0
G → 0 0 1 0
T → 0 0 0 1
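As a minimal sketch (an assumed helper, not the authors' code), such an encoding can be computed in Python/NumPy, using the channel order A, C, G, T shown above:

import numpy as np

NUCLEOTIDES = "ACGT"  # channel order: A, C, G, T

def one_hot_encode(sequence):
    # Turn a DNA string into a 4 x L matrix (rows A, C, G, T); unknown bases stay all-zero.
    encoding = np.zeros((4, len(sequence)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        if base in NUCLEOTIDES:
            encoding[NUCLEOTIDES.index(base), position] = 1.0
    return encoding

print(one_hot_encode("ATATCG"))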
8/18
Convolutional neural networks
Input = one-hot encoding of ATATCG (rows A, C, G, T):
A  1 0 1 0 0 0
C  0 0 0 0 1 0
G  0 0 0 0 0 1
T  0 1 0 1 0 0
[Figure: two small filters slide over the input; their filtered outputs (2 2 -1 0 and 7 -4 4 -1) are reduced by max-pooling with a window of 2 to 2 2 0 and 7 4 4]
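A minimal NumPy sketch of the operation illustrated above; only the input encoding is taken from the slide, the filter values are illustrative assumptions:

import numpy as np

# One-hot encoding of ATATCG (rows A, C, G, T), as on this slide
x = np.array([[1, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 1],
              [0, 1, 0, 1, 0, 0]], dtype=float)

# An illustrative 4 x 3 filter (rows A, C, G, T)
f = np.array([[3, 2, 1],
              [1, 2, -1],
              [0, -3, -2],
              [0, -2, 0]], dtype=float)

def conv1d_valid(one_hot, filt):
    # Slide the filter over the sequence ('valid' convolution, stride 1)
    width = filt.shape[1]
    steps = one_hot.shape[1] - width + 1
    return np.array([np.sum(one_hot[:, i:i + width] * filt) for i in range(steps)])

def max_pool(values, pool_size=2, stride=1):
    # Max-pooling over the filter responses
    return np.array([values[i:i + pool_size].max()
                     for i in range(0, len(values) - pool_size + 1, stride)])

responses = conv1d_valid(x, f)      # filtered output
print(responses, max_pool(responses))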
9/18
Our architecture
CGTT...AGGGCGCCATATCGAGCATGTTATCTGCGTA...CAGT
[Figure: network architecture with convolutional layers of 30 and 40 filters, components of size 64, fully connected layers of 512, 256, and 128 units, and a final Yes | No output]
• Look for different filter sizes
• Introduce some translational invariance
• Extra focus on the middle part
• Combine results and make the final prediction
• Use of recurrent Long Short-Term Memory networks
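A hedged Lasagne/Theano sketch of an architecture in this spirit; the input length, filter widths, and pooling size are assumptions for illustration, not the exact topology of the figure:

from lasagne.layers import (InputLayer, Conv1DLayer, MaxPool1DLayer,
                            DenseLayer, ConcatLayer, FlattenLayer)
from lasagne.nonlinearities import rectify, softmax

SEQ_LEN = 400  # assumed input length: 4 channels x SEQ_LEN one-hot positions

def build_network(input_var=None):
    # Parallel convolutional branches with different filter sizes, merged into dense layers
    network_input = InputLayer(shape=(None, 4, SEQ_LEN), input_var=input_var)
    branches = []
    for filter_size in (5, 9, 15):                      # assumed filter widths
        conv = Conv1DLayer(network_input, num_filters=40,
                           filter_size=filter_size, nonlinearity=rectify)
        pooled = MaxPool1DLayer(conv, pool_size=4)      # introduces translational invariance
        branches.append(FlattenLayer(pooled))
    merged = ConcatLayer(branches)                      # combine the branch outputs
    dense = DenseLayer(merged, num_units=512, nonlinearity=rectify)
    dense = DenseLayer(dense, num_units=256, nonlinearity=rectify)
    dense = DenseLayer(dense, num_units=128, nonlinearity=rectify)
    return DenseLayer(dense, num_units=2, nonlinearity=softmax)  # Yes | No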
10/18
Very important in Natural Language Processing
Each word is represented by a vector of floating-point values
wi = [0.25, 0.30, -0.18, 0.45, …]
Each word vector is trained on its neighbours in a vast collection of texts → trained to predict its neighbours
The quick brown fox jumps over the lazy dog
Result = powerful word embeddings
Word2Vec representation
King - Man + Woman ≈ Queen
[Figure: 2-D projection of the word vectors for Man, King, Woman, and Queen]
11/18
Word2Vec algorithm applied to protein/DNA sequences
Each n-gram is represented by a vector
e.g., GTC = [0.359, 0.211, -0.492, …, 0.129]
Each word vector is trained on its neighbours in a genome → trained to predict its neighbours
... ATG TGT GTC TCA CAC …
ProtVec representation
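As an illustrative sketch (not the authors' pipeline), a ProtVec-style embedding can be obtained by splitting sequences into overlapping 3-mers and training a standard Word2Vec model on them, e.g. with gensim (on gensim 3.x the dimensionality argument is called size instead of vector_size):

from gensim.models import Word2Vec

def to_kmers(sequence, k=3):
    # Split a sequence into overlapping k-mers, e.g. ATGTCAC -> ATG TGT GTC TCA CAC
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Hypothetical corpus: one 'sentence' of 3-mers per DNA sequence
sequences = ["ATGTCACGTTACG", "CGGATATCGATCG"]
corpus = [to_kmers(seq) for seq in sequences]

# Train skip-gram embeddings of the 3-mers
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv["GTC"][:5])  # first components of the vector for the 3-mer GTC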
12/18
→ Generate extra (more general) data by introducing uncertainty
Basic string
CGATTATATCATCGCGGCGCATCGCGACTCGAGAATATCATCGGCGACGTACTCATCATGCAA
After mutation (random positions replaced by N)
CGATTATNTCATCNCNGCGCATNGCGACTCGAGAATATCATCNGCGACGTACTNATCATGNAA
After insertion (N inserted at random positions)
TTATNTCNATCNCNGCNGCATNNGCGACTCGAGAATATCNATCNGCGACNGTANCTNATCATN
N = unknown = ¼ A, ¼ C, ¼ G, ¼ T
Data augmentation
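A minimal sketch of such augmentation, assuming illustrative mutation and insertion rates; in the one-hot encoding, N would then be represented as [¼, ¼, ¼, ¼]:

import random

def mutate(sequence, rate=0.05):
    # Replace a random fraction of positions by the unknown symbol N
    return "".join("N" if random.random() < rate else base for base in sequence)

def insert_unknowns(sequence, rate=0.05):
    # Insert N at random positions, then trim to the original length to keep a fixed input size
    out = []
    for base in sequence:
        if random.random() < rate:
            out.append("N")
        out.append(base)
    return "".join(out)[:len(sequence)]

original = "CGATTATATCATCGCGGCGCATCGCGACTCGAGAATATCATCGGCGACGTACTCATCATGCAA"
print(mutate(original))
print(insert_unknowns(original))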
13/18
Experimental setup
Code written in Python, using the Theano and Lasagne packages
Four GTX Titan X GPUs for training
5-fold cross-validation → train/validate/test = 3/1/1
!! Class imbalance !!
=> weighted cost function (cross entropy), prioritizing positive samples (see the sketch after the table below)
Publication:   Degroeve et al., 2005 | Lee et al., 2015 | Pashaei et al., 2016
Dataset:       Arabidopsis thaliana  | UCSC-HG19/38     | HS3D (Homo sapiens)
# positives:   ~ 9 028               | ~ 160 000        | ~ 2 800
# negatives:   ~ 240 000             | ~ 800 000        | ~ 28 000
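A hedged Theano/Lasagne sketch of such a weighted cross entropy; the positive-class weight is an assumed value, roughly reflecting the class ratio:

import theano.tensor as T
import lasagne

POSITIVE_WEIGHT = 25.0  # assumed weight, e.g. roughly #negatives / #positives

predictions = T.vector('predictions')  # predicted probability of being a splice site
targets = T.vector('targets')          # 1 for a true splice site, 0 otherwise

# Per-sample cross entropy, with positive samples weighted more heavily
per_sample_loss = lasagne.objectives.binary_crossentropy(predictions, targets)
sample_weights = targets * POSITIVE_WEIGHT + (1.0 - targets)
loss = lasagne.objectives.aggregate(per_sample_loss, weights=sample_weights,
                                    mode='normalized_sum')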
14/18
Experimental results (1):
proposed architecture + simple one-hot encoding
15/18
Experimental results (2):
proposed architecture + ProtVec + data augmentation
16/18
Experimental results (2):
proposed architecture + ProtVec + data augmentation
17/18
Deep learning for automatic genome annotation
Facilitates automatic feature learning
CNNs allow for pattern detection in genomic data
Outperforms current techniques for splice site prediction
Future research
Optimization → Find better network topologies
→ Further exploration of word representations and data augmentation
Visualization → Get biological insights into what the network learns
Generalization → Find a generic architecture for modeling arbitrary noisy character sequences
Conclusions & future research
18/18
Jasper Zuallaert: jasper.zullaert@ugent.be
Mijung Kim: mijung.kim@ugent.be
Wesley De Neve: wesley.deneve@ugent.be
