Splice site recognition among different organisms

INTERDEPARTMENTAL POSTGRADUATE PROGRAM
"INFORMATION TECHNOLOGIES IN MEDICINE AND BIOLOGY"
MASTER THESIS
Splice site recognition among
different organisms
Despoina I. Kalfakakou
Supervisors:
Stavros Perantonis, Research Director, NCSR Demokritos
George Paliouras, Research Director, NCSR Demokritos
Anastasia Krithara, Post-Doctoral Researcher, NCSR Demokritos

Structure
• RNA Splicing
• Motivation
• Transfer Learning
• Proposed Approaches
• Conclusion

Central Dogma of Molecular Biology

RNA Splicing process
snRNPs: small nuclear ribonucleoproteins
Spliceosome: complex formed from snRNPs that catalyzes the splicing process

RNA Splicing process
Donor, Acceptor: splice sites, the boundaries between exons and introns
Donor: GU dinucleotide. Acceptor: AG dinucleotide.

Importance of Accurate Splice Site Prediction
• A typical mammalian gene has 7-8 exons spread out over ~16 kb.
• Splice site prediction leads to identification of these exons.
• Exon identification is the first step to accurate genome annotation.
• Currently, hundreds of genomes have been annotated, but
thousands more remain unknown.
• Moreover, many of the already annotated genomes are incorrectly
annotated.

Existing Splice Site Prediction Techniques
• Models based on SVMs, HMMs, and artificial neural networks.
• Various DNA sequence representations, most using a large
neighborhood (~150 nt) around the donor and acceptor dimers.
• Existing techniques using traditional machine learning
methods perform well.

Issues of Existing Methods
• Ab initio splice site prediction is a time-consuming and costly
process.
• Poorly annotated genomes.
• Lack of labeled data.
• Idea: Transfer knowledge from already annotated genomes of
other organisms.
This kind of knowledge transfer is used every day by biologists during their experiments.
In machine learning it is called transfer learning.

Transfer Learning
• Introduced in 1995.
• Goal: to reduce the need to collect and label new training data.
• Applications: Sentiment classification, speech recognition, machine
vision etc.

Transfer Learning Categorization
Category                        | Source Domain Labels | Target Domain Labels
Inductive Transfer Learning     | Available            | Available
Transductive Transfer Learning  | Available            | Unavailable
Unsupervised Transfer Learning  | Unavailable          | Unavailable

• Transferring the knowledge of instances: Importance sampling.
• Transferring the knowledge of feature representation: Find “good” feature representations
to minimize domain divergence and classification error.

Proposed Approach
• Bioinformatics Analysis in order to extract the most significant
patterns between organisms.
• Four DNA sequence representations.
• Evaluation of DNA sequence representations using traditional
machine learning.
• Development of two transfer learning models.

Data – Evaluation methods
Organisms: A. Thaliana, C. Elegans, D. Melanogaster, D. Rerio, H. Sapiens
In each classification experiment:
• Training data: 10000 decoy and 5000 true splice sites
• Test data: 10000 decoy and 5000 true splice sites
Evaluation methods: Accuracy, Area Under the Receiver Operating Characteristic curve (auROC)
For the statistical analysis, we used the DNA sequences of the splice sites of each organism’s
complete genome.
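The two evaluation measures can be sketched in a few lines of plain Python. This is an illustrative sketch, not the thesis code: the function names `accuracy` and `auroc` and the toy labels and scores are assumptions, and auROC is computed by the pairwise definition (the probability that a true site is scored above a decoy, with ties counted as one half).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts how often a true site (label 1) outscores a decoy (label 0);
    ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 2 true sites, 3 decoys.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred))
print(auroc(y_true, scores))
```

The pairwise loop is quadratic in the number of examples, so for test sets of the size listed above a rank-based formulation would be preferable in practice.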

PPM and Consensus Calculation
• Features based on bioinformatics analysis of the sequences of
the true splice sites.
• Calculation of Position Probability Matrices (PPMs) and
Consensus sequences for each organism in order to extract
patterns.
• PPM calculation: M_{k,j} = \frac{1}{N} \sum_{i=1}^{N} I(X_{i,j} = k)
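As a minimal sketch of the PPM calculation, reading N as the number of aligned true splice site sequences, X_{i,j} as the nucleotide at position j of sequence i, and I as the indicator function: the helper names and the toy sequences below are hypothetical. The consensus here is simply the most probable nucleotide per position; the ambiguous entries that appear in the consensus tables of these slides (such as AT or X) are not modelled.

```python
from collections import Counter

NUCLEOTIDES = "ATGC"


def position_probability_matrix(sequences):
    """Return ppm[k][j]: fraction of sequences with nucleotide k at position j."""
    n = len(sequences)
    length = len(sequences[0])
    ppm = {k: [0.0] * length for k in NUCLEOTIDES}
    for j in range(length):
        counts = Counter(seq[j] for seq in sequences)
        for k in NUCLEOTIDES:
            ppm[k][j] = counts[k] / n
    return ppm


def consensus(ppm):
    """Most probable nucleotide per position (ties resolved in NUCLEOTIDES order)."""
    length = len(ppm["A"])
    return "".join(
        max(NUCLEOTIDES, key=lambda k: ppm[k][j]) for j in range(length)
    )


# Toy aligned sequences (hypothetical), all of equal length.
toy_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGA", "AGGTAAGT"]
ppm = position_probability_matrix(toy_sites)

print(ppm["A"][0])     # share of A at the first position
print(consensus(ppm))
```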

Important Positions
• For the next steps, we consider as “important” positions, the
positions in the neighborhood around the splice site dimer
where a nucleotide occurs with a probability > 0.3.
• For the donor splice site the important positions are in a
neighborhood of 11 nt around the donor dimer, with the latter
being at positions 3 and 4 of the neighborhood.
• For the acceptor splice site the important positions are in a
neighborhood of 21 nt around the acceptor dimer, with the
latter being at positions 19 and 20 of the neighborhood.
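The window layout described above can be sketched as follows. The function names and the 0-based `dimer_start` index convention are assumptions made for illustration; the window lengths (11 and 21 nt) and the 1-based dimer positions (3-4 and 19-20) come from the slide.

```python
def splice_window(sequence, dimer_start, window_len, dimer_pos):
    """Cut the neighborhood around a splice site dimer.

    dimer_start: 0-based index of the dimer's first nucleotide in `sequence`.
    dimer_pos:   1-based position the dimer's first nucleotide takes in the window.
    Returns None when the window would extend past either end of the sequence.
    """
    start = dimer_start - (dimer_pos - 1)
    end = start + window_len
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


def donor_window(sequence, gt_index):
    # 11 nt window, GT dimer at positions 3 and 4
    return splice_window(sequence, gt_index, window_len=11, dimer_pos=3)


def acceptor_window(sequence, ag_index):
    # 21 nt window, AG dimer at positions 19 and 20
    return splice_window(sequence, ag_index, window_len=21, dimer_pos=19)


# Toy sequence (hypothetical) with a GT dimer starting at index 4.
seq = "CCAGGTAAGTATTCC"
window = donor_window(seq, seq.index("GT"))
print(window, window[2:4])
```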

PPMs
C. Elegans Donor PPM (%)
Pos.   1   2   3   4   5   6   7   8   9  10  11
A     55  19   0   0  58  66  10  20  27  27  28
T     21  16   0 100  17  16  12  61  52  50  49
G     11  59 100   0  24  11  77  10  14  14  12
C     15   7   0   1   2   8   4  11   9  11  13

A. Thaliana Donor PPM (%)
Pos.   1   2   3   4   5   6   7   8   9  10  11
A     65   9   0   0  68  55  21  23  31  27  26
T     17  10   0  99  17  27  20  52  41  44  43
G      8  79 100   0  12   6  51  11  12  10  10
C     11   4   0   2   5  14  10  16  18  21  24

PPMs
A. Thaliana Acceptor PPM (%)
Pos.   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
A     24  24  22  21  20  19  20  19  19  19  21  21  21  20  20  16  27   6 100   0  26
T     46  47  48  49  50  51  52  53  53  52  48  50  51  51  53  64  28  28   0   0  12
G     17  16  18  17  17  17  17  16  16  17  19  16  16  18  15  11  39   1   0 100  54
C     15  15  15  15  15  14  14  14  14  14  15  15  13  13  14  11   9  67   0   0  10

C. Elegans Acceptor PPM (%)
Pos.   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
A     40  43  47  51  50  44  38  34  34  35  38  41  43  29   6   1   9   4 100   0  44
T     37  36  33  31  32  37  40  43  43  42  40  41  43  58  89  98  68  14   0   0  13
G     10   9   8   8   7   8   9  10  10   9   8   8   7   7   3   1   8   1   0 100  29
C     15  15  14  12  13  14  15  15  15  16  16  12   8   8   4   2  17  84   0   0  15

Consensus Sequences
Rows: AT = A. Thaliana, CE = C. Elegans, DM = D. Melanogaster, DR = D. Rerio, HS = H. Sapiens

Donor (GT dinucleotide at positions 3-4 of the examined sequence):
pos   1   2   3   4   5   6   7   8   9  10  11
AT    A   G   G   T   A   A   G   T  AT   T   T
CE    A   G   G   T   A   A   G   T   T   T   T
DM    A   G   G   T   A   A   G   T  AT  AT  AT
DR    A   G   G   T   A   A   G   T   A  AT  AT
HS    A   G   G   T   A   A   G   T   X   X   X

Acceptor (AG dinucleotide at positions 19-20 of the examined sequence):
pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
AT    T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   G   C   A   G   G
CE   AT  AT  AT  AT  AT  AT  AT  AT  AT  AT  AT  AT  AT   T   T   T   T   C   A   G   A
DM   AT  AT  AT   T   T   T   T   T   T   T   T   T   T   T   T   T   T   C   A   G   A
DR    T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   C   A   G   G
HS    T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   X   C   A   G   G

DNA Sequence Representation
• Per se representation:
  f(x_i) = \begin{cases} 0, & x_i = A \\ 1, & x_i = T \\ 2, & x_i = G \\ 3, & x_i = C \end{cases}
• Binary representation, where N_i is the consensus nucleotide at position i:
  f_{N_i}(x_i) = I(x_i = N_i)
• Score Matrix representation, where M_i is the other nucleotide of the same family as N_i:
  f_{N_i}(x_i) = \begin{cases} 2, & x_i = N_i \\ 1, & x_i = M_i \\ 0, & \text{else} \end{cases}
  Score Matrix:
      A  T  C  G
  A   2  0  0  1
  T   0  2  1  0
  C   0  1  2  0
  G   1  0  0  2
  A and G belong to the purine family; T and C belong to the pyrimidine family.
• Weighted representation:
  f(x_i) = p(x_i | i)

Examples, given the PPM
  Pos.  1    2    3    4    5    6    7    8    9
  A    0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
  C    0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
  G    0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
  T    0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6
the consensus TAGGTAAGT and the sequence ATGGTCGTT:
  Per se representation         0    1    2    2    1    3    2    1    1
  Binary representation         0    0    1    1    1    0    0    0    1
  Score Matrix representation   0    0    2    2    2    0    1    0    2
  Weighted representation       0.3  0.1  0.7  1.0  1.0  0.2  0.1  0.2  0.6
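The four representations can be sketched directly from their definitions, using the PPM, consensus and sequence of the slide's example. The function names and the `FAMILY` lookup are illustrative assumptions, and the sketch assumes one consensus nucleotide per position. Following the score matrix, a match scores 2, so the last position (T against consensus T) scores 2.

```python
PER_SE = {"A": 0, "T": 1, "G": 2, "C": 3}
FAMILY = {"A": "purine", "G": "purine", "T": "pyrimidine", "C": "pyrimidine"}


def per_se(seq):
    """Map each nucleotide to a fixed integer code."""
    return [PER_SE[x] for x in seq]


def binary(seq, cons):
    """1 where the nucleotide matches the consensus, 0 otherwise."""
    return [1 if x == n else 0 for x, n in zip(seq, cons)]


def score_matrix(seq, cons):
    """2 for a match, 1 for a same-family mismatch, 0 otherwise."""
    scores = []
    for x, n in zip(seq, cons):
        if x == n:
            scores.append(2)
        elif FAMILY[x] == FAMILY[n]:
            scores.append(1)
        else:
            scores.append(0)
    return scores


def weighted(seq, ppm):
    """Probability of the observed nucleotide at each position."""
    return [ppm[x][i] for i, x in enumerate(seq)]


# PPM, consensus and sequence taken from the example on this slide.
ppm = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.6, 0.7, 0.2, 0.1],
    "C": [0.2, 0.2, 0.1, 0.0, 0.0, 0.2, 0.1, 0.1, 0.2],
    "G": [0.1, 0.1, 0.7, 1.0, 0.0, 0.1, 0.1, 0.5, 0.1],
    "T": [0.4, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.2, 0.6],
}
cons = "TAGGTAAGT"
sequence = "ATGGTCGTT"

print(per_se(sequence))
print(binary(sequence, cons))
print(score_matrix(sequence, cons))
print(weighted(sequence, ppm))
```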

Feature Evaluation
• Traditional machine learning classification using SVM
and kNN.
• Parameter values were selected experimentally.
• SVM: Linear kernel.
• kNN: 5 neighbors, Manhattan distance.
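A minimal sketch of the kNN setting listed above (5 neighbors, Manhattan distance), written in plain Python so that it is self-contained. The function names and the toy feature vectors are hypothetical; the linear-kernel SVM is not shown here.

```python
from collections import Counter


def manhattan(a, b):
    """Sum of absolute coordinate differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=5):
    """Majority vote among the k training vectors closest to `query`."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: manhattan(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy binary-encoded examples (hypothetical): label 1 = true site, 0 = decoy.
train_X = [
    [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1],
    [0, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0],
]
train_y = [1, 1, 1, 0, 0, 0, 0]

print(knn_predict(train_X, train_y, [1, 1, 1, 1]))
print(knn_predict(train_X, train_y, [0, 0, 0, 0]))
```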

Feature Evaluation Results
Donor Splice Site
[Figure: two bar charts of accuracy (0-100%) per representation (Per Se, Binary, Score Matrix, Weights), one for SVM and one for kNN. Test organism: D. Melanogaster. One series per organism: A Thaliana, C Elegans, D Melanogaster, D Rerio, H Sapiens.]

Feature Evaluation Results
Acceptor Splice Site
[Figure: two bar charts of accuracy (0-100%) per representation, one for kNN and one for SVM. Test organism: D. Melanogaster. One series per organism: A Thaliana, C Elegans, D Melanogaster, D Rerio, H Sapiens.]

Proposed Models: kNN based
• Iterative algorithm.
• Train a kNN classifier with the data of an organism, e.g. train on
C. Elegans (source domain) and predict on A. Thaliana (target
domain).
• In each iteration recalculate the features for both the source
and the target domain based on the predicted true splice sites.
• Objective: Both source and target domain features approach
target domain distribution.

Proposed Models: kNN based
Algorithm 1. kNN based approach
- Represent all sequences in one of the three representations,
based on the source domain data.
- Repeat
  • Train a kNN classifier with the source domain data.
  • Classify the target domain data.
  • Recalculate the PPM and/or the consensus.
  • Represent all sequences based on the new PPM or consensus.
- Until convergence or a maximum number of iterations.
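The loop of Algorithm 1 can be sketched as follows for the binary representation. This is an illustrative sketch under stated assumptions rather than the thesis implementation: the consensus is re-estimated from the target sequences predicted as true sites, the loop stops when the predictions stop changing, and all names and the toy sequences are hypothetical.

```python
from collections import Counter


def consensus(seqs):
    """Most frequent nucleotide per position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def binary(seq, cons):
    return [int(x == n) for x, n in zip(seq, cons)]


def knn_predict(train_X, train_y, query, k):
    """Majority vote of the k nearest training vectors (Manhattan distance)."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda p: sum(abs(a - b) for a, b in zip(p[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def knn_transfer(src_seqs, src_labels, tgt_seqs, k=5, max_iter=4):
    # Initial features are based on the source domain's true splice sites.
    cons = consensus([s for s, y in zip(src_seqs, src_labels) if y == 1])
    preds = None
    for _ in range(max_iter):
        src_X = [binary(s, cons) for s in src_seqs]
        tgt_X = [binary(s, cons) for s in tgt_seqs]
        new_preds = [knn_predict(src_X, src_labels, x, k) for x in tgt_X]
        predicted_true = [s for s, y in zip(tgt_seqs, new_preds) if y == 1]
        converged = new_preds == preds
        preds = new_preds
        if converged or not predicted_true:
            break
        # Re-estimate the consensus from the predicted true target sites.
        cons = consensus(predicted_true)
    return preds, cons


# Toy data (hypothetical): source sites end in A, target sites end in G.
src_seqs = ["TCAGA", "TCAGA", "ACAGA", "GGAGC", "CAAGT", "GTAGC"]
src_labels = [1, 1, 1, 0, 0, 0]
tgt_seqs = ["TCAGG", "TCAGG", "TCAGG", "GAAGC", "CTAGT"]

preds, final_cons = knn_transfer(src_seqs, src_labels, tgt_seqs, k=3)
print(preds, final_cons)
```

In this toy run the consensus starts as the source pattern and is replaced by the target pattern after the first pass, which is the behaviour the objective above describes.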

Proposed Models: kMeans based
• Iterative algorithm.
• Initialize the target centroids to be the same as the source centroids
and predict on the target domain organism.
• In each iteration, recalculate the features for the target domain
based on the predicted true splice sites and recalculate the
target domain centroids.
• The source domain centroids remain fixed and contribute a given
percentage to the classification.
• Objective: The target centroids are “moved” closer to the
target domain data.

Proposed Models: kMeans based
Algorithm 2. kMeans based approach
- Represent all sequences in one of the three representations, based on the source
domain data.
- Compute the source domain centroids.
- Initialize the target domain centroids to be the same as the source domain centroids.
- Repeat
  • Classify the target domain data based on a function of the source and target domain centroids.
  • Recalculate the PPM and/or the consensus from the target domain instances that are
  classified as true splice sites.
  • Represent the target domain sequences based on the new PPM or consensus.
  • Calculate the new target domain centroids.
- Until convergence or a maximum number of iterations.
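A sketch of Algorithm 2 for the binary representation is given below. The classification function is not recoverable from the slides, so the one used here is an assumption: each class is scored by `alpha` times the Euclidean distance to its source centroid plus `1 - alpha` times the distance to its target centroid, and the closer class wins. `alpha = 0` corresponds to no source contribution after the first iteration. All names and the toy data are hypothetical.

```python
from collections import Counter


def consensus(seqs):
    """Most frequent nucleotide per position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def binary(seq, cons):
    return [int(x == n) for x, n in zip(seq, cons)]


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def kmeans_transfer(src_seqs, src_labels, tgt_seqs, alpha=0.0, max_iter=4):
    cons = consensus([s for s, y in zip(src_seqs, src_labels) if y == 1])
    src_X = [binary(s, cons) for s in src_seqs]
    src_c = {c: centroid([x for x, y in zip(src_X, src_labels) if y == c])
             for c in (0, 1)}
    tgt_c = dict(src_c)  # target centroids start at the source centroids
    preds = None
    for _ in range(max_iter):
        tgt_X = [binary(s, cons) for s in tgt_seqs]
        new_preds = []
        for x in tgt_X:
            # Assumed classification function: weighted centroid distances.
            cost = {c: alpha * dist(x, src_c[c]) + (1 - alpha) * dist(x, tgt_c[c])
                    for c in (0, 1)}
            new_preds.append(min(cost, key=cost.get))
        if new_preds == preds:
            break
        preds = new_preds
        if len(set(preds)) < 2:  # one class is empty: centroids undefined
            break
        # Re-estimate features and target centroids from the new predictions.
        cons = consensus([s for s, y in zip(tgt_seqs, preds) if y == 1])
        tgt_X = [binary(s, cons) for s in tgt_seqs]
        tgt_c = {c: centroid([x for x, y in zip(tgt_X, preds) if y == c])
                 for c in (0, 1)}
    return preds, cons


# Toy data (hypothetical): source sites end in A, target sites end in G.
src_seqs = ["TCAGA", "TCAGA", "ACAGA", "GGAGC", "CAAGT", "GTAGC"]
src_labels = [1, 1, 1, 0, 0, 0]
tgt_seqs = ["TCAGG", "TCAGG", "TCAGG", "GAAGC", "CTAGT"]

preds, final_cons = kmeans_transfer(src_seqs, src_labels, tgt_seqs, alpha=0.0)
print(preds, final_cons)
```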

Evaluation on Proposed Approaches
• When the consensus sequences and the PPMs of the source and the
target domain data are similar, the transfer learning algorithms
bring little gain.
• When the consensus sequences differ substantially, both
approaches considerably increase auROC and accuracy.
• The kMeans based algorithm performs better than the kNN based
algorithm.
• In particular, the best results are obtained when the source domain
centroids do not contribute at all after the first iteration.

Evaluation: Binary Sequence Representation
• Accurate and stable representation.
• The consensus sequence extracted from the classified data converges
to the target data consensus.
• Example, when trained with C. Elegans data and tested on A.
Thaliana data:
C. Elegans Consensus:
AT AT AT AT AT AT AT AT AT AT AT AT AT T T T T C A G A
A. Thaliana Consensus:
T T T T T T T T T T T T T T T T G C A G G
Extracted Consensus:
T T T T T T T T T T T T T T T T T C A G G

Evaluation: Binary Sequence Representation
Acceptor splice site, train organism: C. Elegans
[Figure: four line charts of AuROC per iteration (1st to 4th) for the target organisms A Thaliana, D Melanogaster, D Rerio and H Sapiens. Panels: kNN based classification; kMeans based classification with 0% source data contribution; two further panels labeled by source data contribution.]

Evaluation: Score Matrix Sequence Representation
• Accurate and stable representation as well.
• Performs better than binary representation.
• The consensus sequence extracted from the classified data converges to the target
data consensus.
• Example, when trained with C. Elegans data and tested on A. Thaliana
data:
C. Elegans Consensus:
AT AT AT AT AT AT AT AT AT AT AT AT AT T T T T C A G A
A. Thaliana Consensus:
T T T T T T T T T T T T T T T T G C A G G
Extracted Consensus:
T T T T T T T T T T T T T T T T T C A G G

Evaluation: Score Matrix Sequence Representation
Acceptor splice site, train organism: C. Elegans
[Figure: four line charts of AuROC per iteration (1st to 4th); three of the panels are labeled by source data contribution.]

Evaluation: Weights Sequence Representation
• Although it seemed promising in the first set of experiments, it does not perform well with the transfer
learning methods.
• The PPM extracted from the classified data does not converge to the target data PPM.
• This was expected, as the extracted PPM was constructed using a subset of the available data.
C. Elegans - A. Thaliana Acceptor extracted PPM (%)
Pos.   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
A     27  26  24  24  24  24  25  23  24  24  24  25  25  24  24  14  22  19 100   0  28
T     39  39  41  40  41  41  40  42  41  43  39  40  41  40  42  67  28  25   0   0  22
G     19  19  18  19  19  18  18  19  19  18  20  19  19  21  17  10  37  12   0 100  35
C     16  16  17  17  16  16  17  16  16  15  16  17  15  15  16   9  13  44   0   0  15

A. Thaliana Acceptor target PPM (%)
Pos.   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
A     24  24  22  21  20  19  20  19  19  19  21  21  21  20  20  16  27   6 100   0  26
T     46  47  48  49  50  51  52  53  53  52  48  50  51  51  53  64  28  28   0   0  12
G     17  16  18  17  17  17  17  16  16  17  19  16  16  18  15  11  39   1   0 100  54
C     15  15  15  15  15  14  14  14  14  14  15  15  13  13  14  11   9  67   0   0  10

Evaluation: Weights Sequence Representation
Acceptor splice site, train organism: C. Elegans
[Figure: four line charts of AuROC per iteration (1st to 4th); three of the panels are labeled by source data contribution.]

Summary
• The common patterns in the sequences of the five studied
organisms were extracted using bioinformatics analysis.
• For the classification task, only the “important” positions of the
neighborhood were used.
• Four DNA sequence representations were proposed, namely:
  • Sequence per se.
  • Binary representation.
  • Score Matrix representation.
  • Weighted representation.
• Binary, Score Matrix and Weighted representations perform well
even when using traditional machine learning.

• Two transfer learning algorithms are proposed:
  • kNN based.
  • kMeans based.
• When the patterns in the sequences are similar between
organisms, transfer learning contributes little, as the
results are already good.
• When the patterns differ a lot, the kMeans based algorithm with no
source data contribution after the first iteration helps reduce the
gap.
• Best performance: Score Matrix representation.

Future Steps
• More experiments using all the available data.
• Study more organisms.
• Perform detailed comparison with other approaches.
• Variations of the proposed transfer learning models that add to the
training set, in each iteration, the most confidently classified target
data from the previous iteration.
