INTERDEPARTMENTAL POSTGRADUATE PROGRAM
"INFORMATION TECHNOLOGIES IN MEDICINE AND BIOLOGY"
MASTER THESIS
Splice site recognition among
different organisms
Despoina I. Kalfakakou
Supervisors:
Stavros Perantonis, Research Director, NCSR Demokritos
George Paliouras, Research Director, NCSR Demokritos
Anastasia Krithara, Post-Doctoral Researcher, NCSR Demokritos
Structure
• RNA Splicing
• Motivation
• Transfer Learning
• Proposed Approaches
• Conclusion
Central Dogma of Molecular Biology
RNA Splicing process
snRNPs:
small nuclear
ribonucleoproteins
Spliceosome:
Complex formed from
snRNPs, which catalyzes
the splicing process
RNA Splicing process
Donor, Acceptor:
Splice sites, boundaries
between exons and introns
Donor: GU dinucleotide; Acceptor: AG dinucleotide
Importance of Accurate Splice Site Prediction
• A typical mammalian gene has 7-8 exons spread out over ~16 kb.
• Splice site prediction leads to identification of these exons.
• Exon identification is the first step to accurate genome annotation.
• Currently, hundreds of genomes have been annotated, but
thousands more remain unannotated.
• Moreover, many of the already annotated genomes are incorrectly
annotated.
Existing Splice Site Prediction Techniques
• Models based on SVMs, HMMs, and artificial neural networks.
• Various DNA sequence representations, most using a large
neighborhood (~150 nt) around the donor and acceptor dimers.
• Existing techniques using traditional machine learning
methods perform well.
Issues of Existing Methods
• Ab initio splice site prediction is a time-consuming and costly
process.
• Poorly annotated genomes.
• Lack of labeled data.
• Idea: Transfer knowledge from already annotated genomes of
other organisms.
This kind of knowledge transfer is used every day by biologists during their experiments.
In machine learning it is called transfer learning.
Transfer Learning
• Introduced in 1995.
• Goal: to reduce the need to collect and label new training data.
• Applications: Sentiment classification, speech recognition, machine
vision etc.
Transfer Learning Categorization
Category                         Source Domain Labels   Target Domain Labels
Inductive Transfer Learning      Available              Available
Transductive Transfer Learning   Available              Unavailable
Unsupervised Transfer Learning   Unavailable            Unavailable
• Transferring the knowledge of instances: Importance sampling.
• Transferring the knowledge of feature representation: Find “good” feature representations
to minimize domain divergence and classification error.
Proposed Approach
• Bioinformatics analysis to extract the most significant
patterns across organisms.
• Four DNA sequence representations.
• Evaluation of DNA sequence representations using traditional
machine learning.
• Development of two transfer learning models.
Data – Evaluation methods
Organisms: A. Thaliana, C. Elegans, D. Melanogaster, D. Rerio, H. Sapiens
In each classification experiment:
• Training data: 10,000 decoy and 5,000 true splice sites.
• Test data: 10,000 decoy and 5,000 true splice sites.
Evaluation metrics: Accuracy and Area Under the Receiver Operating Characteristic curve (AuROC).
For the statistical analysis, we used the DNA sequences of the splice sites of each organism’s
complete genome.
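The two evaluation metrics can be illustrated with a minimal sketch. The function names and the toy labels and scores below are assumptions made for illustration, not data from the thesis. AuROC is computed here through its rank interpretation (the probability that a true splice site scores higher than a decoy), which is quadratic in the number of samples and therefore only suited to small examples.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def au_roc(y_true, scores):
    """Probability that a random true site outscores a random decoy (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = true splice site, 0 = decoy.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
y_pred = [int(s > 0.45) for s in scores]  # the 0.45 threshold is arbitrary

print(accuracy(y_true, y_pred))  # 0.6
print(au_roc(y_true, scores))    # 5 of 6 (true, decoy) pairs are ranked correctly
```

Accuracy depends on the chosen threshold, while AuROC summarizes the ranking over all thresholds, which is useful with the 2:1 decoy-to-true ratio used in the experiments.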
PPM and Consensus Calculation
• Features based on bioinformatics analysis of the sequences of
the true splice sites.
• Calculation of Position Probability Matrices (PPMs) and
Consensus sequences for each organism in order to extract
patterns.
• PPM calculation: $M_{k,j} = \frac{1}{N} \sum_{i=1}^{N} I(X_{i,j} = k)$
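The PPM formula counts, for each position j, how often nucleotide k occurs among the N aligned sequences. A minimal sketch follows; the function names and the toy sequences are assumptions for illustration. The consensus here keeps a single most probable nucleotide per position, which is a simplification of the slides' consensus tables, where some cells hold more than one symbol.

```python
from collections import Counter

NUCLEOTIDES = "ATGC"

def position_probability_matrix(sequences):
    """Return {nucleotide: [probability per position]} for equal-length sequences."""
    n = len(sequences)
    length = len(sequences[0])
    ppm = {k: [0.0] * length for k in NUCLEOTIDES}
    for j in range(length):
        counts = Counter(seq[j] for seq in sequences)
        for k in NUCLEOTIDES:
            # M[k][j] = (1/N) * sum over i of I(X[i][j] == k)
            ppm[k][j] = counts[k] / n
    return ppm

def consensus(ppm):
    """Most probable nucleotide per position (ties go to the first in NUCLEOTIDES)."""
    length = len(ppm["A"])
    return "".join(max(NUCLEOTIDES, key=lambda k: ppm[k][j]) for j in range(length))

# Hypothetical toy donor-like sequences, not real data.
toy_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGA", "AGGTAAGT"]
ppm = position_probability_matrix(toy_sites)
print(ppm["A"][0])     # 0.75
print(consensus(ppm))  # AGGTAAGT
```

Each column of the matrix sums to 1, so a column can be read as a probability distribution over the four nucleotides at that position.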
Important Positions
• For the next steps, we consider a position in the neighborhood
around the splice site dimer “important” if some nucleotide
occurs there with a probability > 0.3.
• For the donor splice site the important positions are in a
neighborhood of 11 nt around the donor dimer, with the latter
being at positions 3 and 4 of the neighborhood.
• For the acceptor splice site the important positions are in a
neighborhood of 21 nt around the acceptor dimer, with the
latter being at positions 19 and 20 of the neighborhood.
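The selection rule above can be sketched as a simple filter over the PPM columns. The function name and the toy matrix are assumptions for illustration; positions are reported 1-based, as on the slides.

```python
def important_positions(ppm, threshold=0.3):
    """1-based positions where some nucleotide has probability strictly above threshold."""
    length = len(next(iter(ppm.values())))
    return [j + 1 for j in range(length)
            if max(row[j] for row in ppm.values()) > threshold]

# Hypothetical toy PPM with four positions.
toy_ppm = {
    "A": [0.25, 0.6, 0.0, 0.3],
    "T": [0.25, 0.2, 0.0, 0.3],
    "G": [0.25, 0.1, 1.0, 0.2],
    "C": [0.25, 0.1, 0.0, 0.2],
}
print(important_positions(toy_ppm))  # [2, 3]
```

Position 4 is excluded in this toy case because its highest probability equals the threshold rather than exceeding it.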
PPMs
[Figure: donor PPMs for C. Elegans and A. Thaliana. Bar charts of nucleotide frequency (%) for A, T, G and C at positions 1-11.]
PPMs
[Figure: acceptor PPMs for A. Thaliana and C. Elegans. Bar charts of nucleotide frequency (%) for A, T, G and C at positions 1-21.]
Consensus Sequences
Donor (GT dinucleotide at positions 3-4 of the examined sequence):
pos              1  2  3  4  5  6  7  8  9  10 11
A. Thaliana      A  G  G  T  A  A  G  T  AT T  T
C. Elegans       A  G  G  T  A  A  G  T  T  T  T
D. Melanogaster  A  G  G  T  A  A  G  T  AT AT AT
D. Rerio         A  G  G  T  A  A  G  T  A  AT AT
H. Sapiens       A  G  G  T  A  A  G  T  X  X  X
Acceptor (AG dinucleotide at positions 19-20 of the examined sequence):
pos              1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21
A. Thaliana      T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  G  C  A  G  G
C. Elegans       AT AT AT AT AT AT AT AT AT AT AT AT AT T  T  T  T  C  A  G  A
D. Melanogaster  AT AT AT T  T  T  T  T  T  T  T  T  T  T  T  T  T  C  A  G  A
D. Rerio         T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  C  A  G  G
H. Sapiens       T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  X  C  A  G  G
DNA Sequence Representation
• Per se representation:
  $f(x_i) = \begin{cases} 0, & x_i = A \\ 1, & x_i = T \\ 2, & x_i = G \\ 3, & x_i = C \end{cases}$
• Binary representation:
  $f_{N_i}(x_i) = I(N_i = x_i)$
• Score Matrix representation:
  $f_{N_i}(x_i) = \begin{cases} 2, & x_i = N_i \\ 1, & x_i = M_i \\ 0, & \text{else} \end{cases}$
Score Matrix:
     A  T  C  G
  A  2  0  0  1
  T  0  2  1  0
  C  0  1  2  0
  G  1  0  0  2
A and G belong to the purine family.
T and C belong to the pyrimidine family.
• Weighted representation:
  $f(x_i) = p(x_i \mid i)$
Examples, given the PPM
  pos  1    2    3    4    5    6    7    8    9
  A    0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
  C    0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
  G    0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
  T    0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6
the consensus TAGGTAAGT
and the sequence ATGGTCGTT:
  Per se representation        0    1    2    2    1    3    2    1    1
  Binary representation        0    0    1    1    1    0    0    0    1
  Score Matrix representation  0    0    2    2    2    0    1    0    2
  Weighted representation      0.3  0.1  0.7  1.0  1.0  0.2  0.1  0.2  0.6
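The four representations can be reproduced on the example sequence with a short sketch. The function names are illustrative assumptions, and the Score Matrix rule is read from the score table: 2 for a match with the consensus nucleotide, 1 for a mismatch within the same family (purine or pyrimidine), 0 otherwise. Under that reading, position 9 scores 2, since the sequence and the consensus both carry T there.

```python
# PPM, consensus and sequence as given in the example on the slide.
PPM = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.6, 0.7, 0.2, 0.1],
    "C": [0.2, 0.2, 0.1, 0.0, 0.0, 0.2, 0.1, 0.1, 0.2],
    "G": [0.1, 0.1, 0.7, 1.0, 0.0, 0.1, 0.1, 0.5, 0.1],
    "T": [0.4, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.2, 0.6],
}
CONSENSUS = "TAGGTAAGT"
PER_SE_CODE = {"A": 0, "T": 1, "G": 2, "C": 3}
FAMILY = {"A": "purine", "G": "purine", "T": "pyrimidine", "C": "pyrimidine"}

def per_se(seq):
    """Fixed integer code per nucleotide."""
    return [PER_SE_CODE[x] for x in seq]

def binary(seq, consensus):
    """1 where the nucleotide equals the consensus nucleotide, else 0."""
    return [int(x == n) for x, n in zip(seq, consensus)]

def score_matrix(seq, consensus):
    """2 for a match, 1 for same-family mismatch, 0 otherwise."""
    return [2 if x == n else 1 if FAMILY[x] == FAMILY[n] else 0
            for x, n in zip(seq, consensus)]

def weighted(seq, ppm):
    """Probability of the observed nucleotide at each position."""
    return [ppm[x][i] for i, x in enumerate(seq)]

seq = "ATGGTCGTT"
print(per_se(seq))                   # [0, 1, 2, 2, 1, 3, 2, 1, 1]
print(binary(seq, CONSENSUS))        # [0, 0, 1, 1, 1, 0, 0, 0, 1]
print(score_matrix(seq, CONSENSUS))  # [0, 0, 2, 2, 2, 0, 1, 0, 2]
print(weighted(seq, PPM))            # [0.3, 0.1, 0.7, 1.0, 1.0, 0.2, 0.1, 0.2, 0.6]
```

Only the per se representation is independent of any organism; the other three depend on a consensus or PPM, which is what the transfer learning models later re-estimate.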
Feature Evaluation
• Traditional machine learning classification using SVM
and kNN.
• Parameter values were selected
experimentally.
• SVM: Linear kernel.
• kNN: 5 neighbors, Manhattan distance.
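The kNN setting above (5 neighbors, Manhattan distance) can be sketched in plain Python. This is not the implementation used in the thesis; the function names and the toy feature vectors are assumptions, and the SVM is not sketched here.

```python
from collections import Counter

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=5):
    """Majority label among the k training vectors closest to query."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: manhattan(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical binary-representation vectors: label 1 = true site, 0 = decoy.
train_X = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 1, 1, 1],
           [0, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
train_y = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_X, train_y, [1, 1, 0, 1]))  # 1
print(knn_predict(train_X, train_y, [0, 0, 0, 1]))  # 0
```

With integer-valued features such as the binary and Score Matrix representations, the Manhattan distance simply counts (weighted) disagreements with the consensus pattern.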
Feature Evaluation Results
[Figure: Donor splice site. Accuracy (0-100%) per representation (Per Se, Binary, Score Matrix, Weights), test organism D. Melanogaster; panels: SVM and kNN; legend: A. Thaliana, C. Elegans, D. Melanogaster, D. Rerio, H. Sapiens.]
[Figure: Acceptor splice site. Accuracy (0-100%) per representation (Per Se, Binary, Score Matrix, Weights), test organism D. Melanogaster; panels: kNN and SVM; legend: A. Thaliana, C. Elegans, D. Melanogaster, D. Rerio, H. Sapiens.]
Proposed Models: kNN based
• Iterative algorithm.
• Train a kNN classifier with the data of an organism, e.g. train on
C. Elegans (source domain) and predict on A. Thaliana (target
domain).
• In each iteration recalculate the features for both the source
and the target domain based on the predicted true splice sites.
• Objective: Both source and target domain features approach
the target domain distribution.
Proposed Models: kNN based
Algorithm 1. kNN based approach
- Represent all sequences in one of the three representations,
based on the source domain data.
- Repeat
  • Train a kNN classifier with the source domain data.
  • Classify the target domain data.
  • Recalculate the PPM and/or the consensus from the target domain instances classified as true splice sites.
  • Represent all sequences based on the new PPM or consensus.
- Until convergence or a maximum number of iterations.
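The loop of Algorithm 1 can be sketched as follows for the binary representation. This is an illustrative reading, not the thesis code: the function names, the toy sequences, the choice of k=3 for the tiny data set, and the stopping rule (the consensus stops changing) are all assumptions.

```python
from collections import Counter

def consensus_of(seqs):
    """Most frequent nucleotide per position (first seen wins ties)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def binary(seq, cons):
    return [int(x == n) for x, n in zip(seq, cons)]

def knn_predict(train_X, train_y, query, k):
    order = sorted(range(len(train_X)),
                   key=lambda i: sum(abs(a - b) for a, b in zip(train_X[i], query)))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def knn_transfer(src_seqs, src_labels, tgt_seqs, k=5, max_iter=4):
    # Initial consensus comes from the labeled source true sites.
    cons = consensus_of([s for s, y in zip(src_seqs, src_labels) if y == 1])
    preds = []
    for _ in range(max_iter):
        # Both domains are re-represented with the current consensus.
        src_X = [binary(s, cons) for s in src_seqs]
        tgt_X = [binary(s, cons) for s in tgt_seqs]
        preds = [knn_predict(src_X, src_labels, x, k) for x in tgt_X]
        predicted_true = [s for s, p in zip(tgt_seqs, preds) if p == 1]
        if not predicted_true:
            break
        new_cons = consensus_of(predicted_true)
        if new_cons == cons:  # assumed convergence criterion
            break
        cons = new_cons
    return preds, cons

# Hypothetical toy data: the source ends in ...GA, the target in ...GG.
src_seqs = ["CAGA", "CAGA", "TAGA", "GTTC", "GCTT", "ATCC"]
src_labels = [1, 1, 1, 0, 0, 0]
tgt_seqs = ["CAGG", "CAGG", "TAGG", "GTCT", "ACTC"]

preds, cons = knn_transfer(src_seqs, src_labels, tgt_seqs, k=3)
print(preds)  # [1, 1, 1, 0, 0]
print(cons)   # CAGG
```

In this toy run the consensus moves from the source pattern CAGA to the target pattern CAGG after one update, mirroring the drift toward the target consensus described on the evaluation slides.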
Proposed Models: kMeans based
• Iterative algorithm.
• Initialize the target centroids to be the same as the source centroids
and predict on the target domain organism.
• In each iteration, recalculate the features for the target domain
based on the predicted true splice sites and recalculate the
target domain centroids.
• The source domain centroids remain fixed and contribute a
set percentage to the classification.
• Objective: The target centroids are “moved” closer to the
target domain data.
Proposed Models: kMeans based
Algorithm 2. kMeans based approach
- Represent all sequences in one of the three representations, based on the source
domain data.
- Compute source domain centers.
- Initialize target domain centers to be the same as the source domain centers.
- Repeat
  • Classify the target domain data based on a function of the source and target domain centroids.
  • Recalculate the PPM and/or the consensus from the target domain instances that are
  classified as true splice sites.
  • Represent the target domain sequences based on the new PPM or consensus.
  • Calculate the new target domain centroids.
- Until convergence or a maximum number of iterations.
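A minimal sketch of Algorithm 2 for the binary representation follows. The classification function is not given on the slide, so the weighted sum of distances to the source and target centroids used here is an assumption, as are the parameter name `source_weight`, the Euclidean distance, the toy data, and the stopping rule (predictions stop changing).

```python
from collections import Counter

def consensus_of(seqs):
    """Most frequent nucleotide per position (first seen wins ties)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def binary(seq, cons):
    return [int(x == n) for x, n in zip(seq, cons)]

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans_transfer(src_seqs, src_labels, tgt_seqs, source_weight=0.0, max_iter=4):
    cons = consensus_of([s for s, y in zip(src_seqs, src_labels) if y == 1])
    src_X = [binary(s, cons) for s in src_seqs]
    # Source centroids (one per class) stay fixed for the whole run.
    src_centers = {c: centroid([x for x, y in zip(src_X, src_labels) if y == c])
                   for c in (0, 1)}
    tgt_centers = dict(src_centers)  # target centroids start at the source ones
    preds = None
    for _ in range(max_iter):
        tgt_X = [binary(s, cons) for s in tgt_seqs]
        # Assumed decision rule: smallest weighted distance to the two centroid sets.
        new_preds = [
            min((0, 1), key=lambda c: source_weight * distance(x, src_centers[c])
                + (1 - source_weight) * distance(x, tgt_centers[c]))
            for x in tgt_X
        ]
        if new_preds == preds:  # assumed convergence criterion
            break
        preds = new_preds
        predicted_true = [s for s, p in zip(tgt_seqs, preds) if p == 1]
        if predicted_true:
            cons = consensus_of(predicted_true)
        tgt_X = [binary(s, cons) for s in tgt_seqs]
        for c in (0, 1):
            members = [x for x, p in zip(tgt_X, preds) if p == c]
            if members:
                tgt_centers[c] = centroid(members)
    return preds, cons

# Hypothetical toy data: the source ends in ...GA, the target in ...GG.
src_seqs = ["CAGA", "CAGA", "TAGA", "GTTC", "GCTT", "ATCC"]
src_labels = [1, 1, 1, 0, 0, 0]
tgt_seqs = ["CAGG", "CAGG", "TAGG", "GTCT", "ACTC"]

preds, cons = kmeans_transfer(src_seqs, src_labels, tgt_seqs, source_weight=0.0)
print(preds)  # [1, 1, 1, 0, 0]
print(cons)   # CAGG
```

Setting `source_weight=0.0` corresponds to the "0% source data contribution" setting, where the source centroids only matter through the initialization of the target centroids.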
Evaluation on Proposed Approaches
• When the consensus sequences and the PPMs of the source and
target domain data are similar, the transfer learning
algorithms bring little gain.
• When the consensus sequences differ substantially, both
approaches markedly increase AuROC and accuracy.
• The kMeans based algorithm performs better than the kNN based
algorithm.
• In particular, the best results are obtained when the source domain
centroids do not contribute at all after the first iteration.
Evaluation: Binary Sequence Representation
• Accurate and stable representation.
• The consensus sequence extracted from the classified data converges
to the target data consensus.
• Example, when trained with C. Elegans data and tested on A.
Thaliana data:
C. Elegans Consensus:
AT AT AT AT AT AT AT AT AT AT AT AT AT T T T T C A G A
A. Thaliana Consensus:
T T T T T T T T T T T T T T T T G C A G G
Extracted Consensus:
T T T T T T T T T T T T T T T T T C A G G
Evaluation: Binary Sequence Representation
[Figure: AuROC per iteration (1st-4th) for the acceptor splice site, train organism C. Elegans, target organisms A. Thaliana, D. Melanogaster, D. Rerio and H. Sapiens. Panels: kNN based classification; kMeans based classification with 0%, 40% and 80% source data contribution.]
Evaluation: Score Matrix Sequence Representation
• Accurate and stable representation as well.
• Performs better than binary representation.
• The consensus sequence extracted from the classified data converges to the target
data consensus.
• Example, when trained with C. Elegans data and tested on A. Thaliana
data:
C. Elegans Consensus:
AT AT AT AT AT AT AT AT AT AT AT AT AT T T T T C A G A
A. Thaliana Consensus:
T T T T T T T T T T T T T T T T G C A G G
Extracted Consensus:
T T T T T T T T T T T T T T T T T C A G G
Evaluation: Score Matrix Sequence Representation
[Figure: AuROC per iteration (1st-4th) for the acceptor splice site, train organism C. Elegans, target organisms A. Thaliana, D. Melanogaster, D. Rerio and H. Sapiens. Panels: kNN based classification; kMeans based classification with 0%, 40% and 80% source data contribution.]
Evaluation: Weights Sequence Representation
• Although it seemed promising in the first set of experiments, it does not perform well with the transfer
learning methods.
• The PPM extracted from the classified data does not converge to the target data PPM.
• This was expected, as the extracted PPM was constructed using a subset of the available data.
[Figure: acceptor PPM extracted when training on C. Elegans and testing on A. Thaliana, shown next to the A. Thaliana acceptor target PPM. Bar charts of nucleotide frequency (%) for A, T, G and C at positions 1-21.]
Evaluation: Weights Sequence Representation
[Figure: AuROC per iteration (1st-4th) for the acceptor splice site, train organism C. Elegans, target organisms A. Thaliana, D. Melanogaster, D. Rerio and H. Sapiens. Panels: kNN based classification; kMeans based classification with 0%, 40% and 80% source data contribution.]
Summary
• The common patterns in the sequences of the five studied
organisms were extracted using bioinformatics analysis.
• For the classification task, only the “important” positions of the
neighborhood were used.
• Four DNA sequence representations were proposed:
  - Sequence per se.
  - Binary representation.
  - Score Matrix representation.
  - Weighted representation.
• Binary, Score Matrix and Weighted representations perform well
even when using traditional machine learning.
• Two transfer learning algorithms are proposed:
  - kNN based.
  - kMeans based.
• When the patterns in the sequences are similar between
organisms, transfer learning contributes little, as the
results are already good.
• When the patterns differ substantially, the kMeans based algorithm with no
source data contribution after the first iteration helps reduce the
gap.
• Best performance: Score Matrix representation.
Future Steps
• More experiments using all the available data.
• Study more organisms.
• Perform detailed comparison with other approaches.
• A variation of the proposed transfer learning models that adds to the
training set, in each iteration, the most confidently classified target
data from the previous iteration.
Thank you!
