Computational Prediction of Gene Functions
through Machine Learning methods
and Multiple Validation Procedures
candidate: Davide Chicco davide.chicco@polimi.it
supervisor: Marco Masseroli
PhD Thesis Defense Dissertation
20th March 2014
“Computational Prediction of Gene Functions
through Machine Learning methods
and Multiple Validation Procedures”
1) Analyzed scientific problem
2) Machine learning methods used
3) Validation procedures
4) Main results
5) Annotation list correlation measures
6) Novelty indicator
7) Final list of likely predicted annotations
8) Conclusions
Biomolecular annotations
• The concept of annotation: association of nucleotide or amino
acid sequences with useful information describing their features
• The association of a gene and an information feature term
corresponds to a biomolecular annotation
• This information is expressed through controlled
vocabularies, sometimes structured as ontologies (e.g. Gene
Ontology), where every controlled term of the vocabulary is
associated with a unique alphanumeric code
[Diagram: Gene → Annotation (gene2bff) → Biological function feature]
Biomolecular annotations
• The association of an information/feature with a gene ID
constitutes an annotation
• Annotation example:
• Scientific fact: “the gene GD4 is present in the
mitochondrial membrane”
• Corresponds to the coupling:
<GD4, mitochondrial membrane>
The problem
• Many available annotations in different databanks
• However, available annotations are incomplete
• Only a few of them represent highly reliable, human–curated
information
• In vitro experiments are expensive (e.g. 1,000 € and 3 weeks)
• To support and speed up the time-consuming curation process, prioritized lists of computationally predicted annotations are extremely useful
• These lists could be generated by software based on Machine Learning algorithms
The problem
• Other scientists and researchers dealt with the problem in the
past by using:
• Support Vector Machines (SVM) [Barutcuoglu et al., 2006]
• k-nearest neighbor algorithm (kNN) [Tao et al., 2007]
• Decision trees [King et al., 2003]
• Hidden Markov models (HMM) [Mi et al., 2013]
• …
• These methods were all good at stating whether a predicted annotation was correct or not, but were not able to extrapolate, that is, to suggest new annotations absent from the input dataset
The software
[Pipeline diagram: Data reading → A input matrix → Statistical method → A~ output matrix → Predicted annotation lists]
BioAnnotationPredictor:
A pipeline of steps and tools to predict,
validate and analyze biomolecular
annotation lists
• The software reads the data from the GPDW database
• The software creates the input matrix:
Input annotation matrix A ∈ {0, 1}^(m × n)
m rows: genes
n columns: annotation features
A(i,j) = 1 if gene i is annotated to feature j or to any descendant of j in the considered ontology structure (true path rule)
A(i,j) = 0 otherwise (the annotation is unknown)
feat 1 feat 2 feat 3 feat 4 … feat n
gene 1 0 0 0 0 … 0
gene 2 0 1 1 0 … 1
… … … … … … …
gene m 0 0 0 0 … 0
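As a minimal sketch of how such a matrix could be assembled, the following Python builds A from direct gene-to-term annotations and propagates each annotation to every ancestor of the term, which is what the true path rule requires. The gene names, term names, and the `parents` mapping are hypothetical toy data; the actual software reads them from GPDW.

```python
def ancestors(term, parents):
    """Return all ancestors of `term`; `parents` maps a term to its parent terms."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen


def build_annotation_matrix(genes, terms, direct, parents):
    """Build the binary matrix A (one row per gene, one column per term)."""
    col = {t: j for j, t in enumerate(terms)}
    A = [[0] * len(terms) for _ in genes]
    for i, g in enumerate(genes):
        for t in direct.get(g, []):
            # True path rule: an annotation to t implies one to each ancestor of t.
            for u in {t} | ancestors(t, parents):
                if u in col:
                    A[i][col[u]] = 1
    return A


# Hypothetical toy ontology: mito_membrane is a child of membrane,
# which is a child of cell_component.
parents = {"mito_membrane": ["membrane"], "membrane": ["cell_component"]}
terms = ["cell_component", "membrane", "mito_membrane", "nucleus"]
genes = ["geneA", "geneB"]
direct = {"geneA": ["mito_membrane"], "geneB": []}

A = build_annotation_matrix(genes, terms, direct, parents)
print(A)  # [[1, 1, 1, 0], [0, 0, 0, 0]]
```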
• The software applies a statistical method (Truncated Singular Value Decomposition, Semantically Improved SVD with gene clustering, or Semantically Improved SVD with clustering and term-term similarity weights) to the binary input matrix A
• It returns a real-valued output matrix A~
• After the computation, every element Aij of the input matrix is compared with the corresponding element Aij~ of the output matrix
Input Aij:
0 0 0 0 … 0
0 1 1 0 … 1
… … … … … …
0 0 0 0 … 0
Output Aij~:
0.1 0.3 0.6 0.5 … 0.2
0.6 0.8 0.1 0.9 … 0.8
… … … … … …
0.3 0.2 0.4 0.6 … 0.8
Input Aij | Output Aij~ | Class | Analogue
1 (Yes) | > τ (Yes) | AC: Annotation Confirmed | TP
1 (Yes) | ≤ τ (No) | AR: Annotation to be Reviewed | FN
0 (No) | ≤ τ (No) | NAC: No Annotation Confirmed | TN
0 (No) | > τ (Yes) | AP: Annotation Predicted | FP
τ: the threshold that minimizes the sum APs + ARs
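The four classes and the choice of τ can be illustrated with a short sketch. It uses the visible entries of the example matrices from the slides; the grid of candidate thresholds is an assumption, since the slides only state that τ minimizes APs + ARs.

```python
def classify(a, a_tilde, tau):
    """Return AC, AR, NAC or AP for one input/output pair of matrix elements."""
    if a == 1:
        return "AC" if a_tilde > tau else "AR"
    return "AP" if a_tilde > tau else "NAC"


def count_classes(A, A_tilde, tau):
    """Count how many matrix elements fall into each of the four classes."""
    counts = {"AC": 0, "AR": 0, "NAC": 0, "AP": 0}
    for row, row_t in zip(A, A_tilde):
        for a, at in zip(row, row_t):
            counts[classify(a, at, tau)] += 1
    return counts


def best_threshold(A, A_tilde, candidates):
    """Pick the candidate tau with the smallest AP + AR (first one on ties)."""
    def cost(tau):
        c = count_classes(A, A_tilde, tau)
        return c["AP"] + c["AR"]
    return min(candidates, key=cost)


A = [[0, 0, 0, 0], [0, 1, 1, 0]]
A_tilde = [[0.1, 0.3, 0.6, 0.5], [0.6, 0.8, 0.1, 0.9]]

print(count_classes(A, A_tilde, 0.5))
# {'AC': 1, 'AR': 1, 'NAC': 3, 'AP': 3}
print(best_threshold(A, A_tilde, [0.0, 0.25, 0.5, 0.75, 1.0]))  # 0.75
```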
• The Annotations Predicted (AP, analogous to FP) are the annotations absent from the input and predicted by our software: we suggest them as present
• We record them in ranked lists:
Rank | Annotation ID | Likelihood value
1 | 218405 | 0.9742584
2 | 222571 | 0.8545574
… | … | …
n | 203145 | 0.1673128
Truncated Singular Value Decomposition (tSVD)
• An annotation prediction is performed by computing a reduced rank approximation A~ of the annotation matrix A:
A~ = U_k Σ_k V_k^T
(where 0 < k < r, with r the number of non-zero singular values of A, i.e. the rank of A, and U_k, Σ_k, V_k the factors of the SVD of A truncated to the k largest singular values)
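A minimal NumPy sketch of the rank-k reconstruction follows. The toy matrix and the function name are illustrative assumptions; the thesis software works on much larger gene-by-term matrices.

```python
import numpy as np


def truncated_svd_reconstruct(A, k):
    """Return the rank-k approximation of A built from its k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first k left singular vectors by their singular values,
    # then project back with the first k right singular vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


# Toy binary annotation matrix (4 genes x 4 terms), rank 3.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)

A_tilde = truncated_svd_reconstruct(A, k=2)
print(np.round(A_tilde, 2))  # real-valued scores, to be compared with a threshold
```

Entries of A~ that are large where A holds a 0 are the candidate predicted annotations.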
• Only the k most «important» components of the SVD of A (those with the largest singular values) are used for the reconstruction (where 0 < k < r, with r the number of non-zero singular values of A, i.e. the rank of A)
• In [P. Khatri et al., "A semantic analysis of the annotations of the human genome", Bioinformatics, 2005], the authors argued that the study of the matrix A shows the semantic relationships of the gene-function associations.
• A large value of a~ij suggests that gene i should be annotated to term j, whereas a value close to zero suggests the opposite.
• We started from this method, developed by Khatri et al. (2005) at Wayne State University, Detroit, and implemented it
• Improvement:
• Khatri et al. used a fixed SVD truncation level k = 500
• We developed a method for automated, data-driven selection of k based on the Receiver Operating Characteristic (ROC) curve
• We obtained better results, shown in several publications
Truncated SVD with gene clustering (SIM1)
• Semantically improved (SIM1) version of the Truncated SVD, based on gene clustering [P. Drineas et al., "Clustering large graphs via the singular value decomposition", Machine Learning, 2004]
• Inspiring idea: similar genes can be grouped in clusters that have different weights
1. We choose a number C of clusters, and completely discard the columns u_j of matrix U with j = C+1, ..., n (we have an algorithm for the choice of C)
2. Each remaining column u_c of the SVD matrix U represents a cluster, and the value U(i,c) indicates the membership of gene i in the c-th cluster.
3. For each cluster, we first generate W_c = diag(u_c), and then the modified gene-to-term matrix A_c = W_c A, in which the i-th row of A is weighted by the membership score of the corresponding gene in the c-th cluster.
4. Then, we compute T_c = A_c^T A_c and its SVD
5. Then, every row a_i~ of the A~ matrix is computed using the cluster c_i that minimizes the Euclidean norm distance between a_i~ and the original row vector a_i:
a_i~ = a_i V_{k,c_i} V_{k,c_i}^T
6. The output matrix is produced
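Steps 1 to 6 can be sketched in NumPy as follows. This is an illustrative reading of the slides rather than the thesis implementation: the function name, the toy matrix, and the choice of taking V_{k,c} as the first k right singular vectors of T_c are assumptions, and the algorithm for choosing C is not reproduced.

```python
import numpy as np


def sim1_reconstruct(A, k, C):
    """Sketch of truncated SVD with gene clustering (SIM1), steps 1-6."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    projectors = []
    for c in range(C):                    # steps 1-2: keep the first C columns of U
        w = U[:, c]                       # membership of each gene in cluster c
        A_c = w[:, None] * A              # step 3: A_c = diag(u_c) A
        T_c = A_c.T @ A_c                 # step 4: T_c = A_c^T A_c
        _, _, Vt_c = np.linalg.svd(T_c)   # SVD of the symmetric matrix T_c
        V_kc = Vt_c[:k, :].T              # first k right singular vectors (n x k)
        projectors.append(V_kc @ V_kc.T)
    A_tilde = np.empty_like(A, dtype=float)
    for i, a_i in enumerate(A):           # step 5: best cluster for each gene row
        candidates = [a_i @ P for P in projectors]
        errors = [np.linalg.norm(a_i - cand) for cand in candidates]
        A_tilde[i] = candidates[int(np.argmin(errors))]
    return A_tilde                        # step 6: output matrix


A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)

print(np.round(sim1_reconstruct(A, k=2, C=2), 2))
```

Each row is projected onto the term subspace of the cluster that reproduces it best, so genes in different clusters can be reconstructed from different sets of term patterns.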
Truncated SVD with gene clustering and term-similarity weights (SIM2)
• Semantically improved (SIM2) version of the Truncated SVD, based on gene clustering and term-term similarity weights [P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy", arXiv.org, 1995]
• Inspiring idea: functionally similar terms should be annotated to the same genes
To the algorithm shown before, we add the following step:
6. a) Furthermore, to obtain a more accurate clustering, we compute the eigenvectors of the matrix G~ = A S A^T, where the real n × n matrix S is the term similarity matrix.
Given a pair of ontology terms, j1 and j2, the term functional similarity S(j1, j2) can be calculated using different methods.
Here the similarity is based on the Resnik measure [P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy", arXiv.org, 1995]
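The extra SIM2 step can be sketched as an eigendecomposition of G~. The sketch assumes S is symmetric, which holds for a pairwise similarity such as Resnik's, and the function name and toy similarity values are hypothetical. With S equal to the identity, G~ reduces to A A^T, whose eigenvectors are the columns of U used by SIM1.

```python
import numpy as np


def similarity_weighted_gene_vectors(A, S, C):
    """Return the C largest eigenvalues of G~ = A S A^T and their eigenvectors."""
    G = A @ S @ A.T
    G = (G + G.T) / 2                     # remove tiny asymmetries from rounding
    eigvals, eigvecs = np.linalg.eigh(G)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]     # sort them in descending order
    return eigvals[order][:C], eigvecs[:, order][:, :C]


A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)

# Hypothetical term-term similarity matrix (symmetric, ones on the diagonal).
S = np.array([[1.0, 0.6, 0.1, 0.0],
              [0.6, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.7],
              [0.0, 0.1, 0.7, 1.0]])

values, vectors = similarity_weighted_gene_vectors(A, S, C=2)
print(vectors.shape)  # (4, 2)
```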
Other methods
With some colleagues at Politecnico di Milano we also
implemented other methods (not included in this thesis):
• Probabilistic Latent Semantic Analysis (pLSA)
• Latent Dirichlet Allocation with Gibbs sampling (LDA)
And with some colleagues at University of California
Irvine we have been trying to design and implement
other models:
• Auto-Encoder Deep Neural Network
ROC Analysis Validation
• These four classes can be considered analogous to TP, FN, TN, FP:
AC: Annotation Confirmed (TP)
AR: Annotation to be Reviewed (FN)
NAC: No Annotation Confirmed (TN)
AP: Annotation Predicted (FP)
• The software depicts ROC curves using:
AC rate = AC / (AC + AR)
AP rate = AP / (AP + NAC)
• Ten-fold cross validation
• The software depicts the ROC curve (AC rate against AP rate)
• Compute the Area Under the Curve (AUC)
• If AUC ≥ 2/3 (≈ 66.67%), the matrix reconstruction is considered good
• Otherwise, the matrix reconstruction is considered bad
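A small sketch of the ROC computation follows, using the AC and AP rates defined for the four classes and the trapezoidal rule for the AUC. The threshold grid and the toy matrices are assumptions, and the ten-fold cross validation is not reproduced here.

```python
def roc_points(A, A_tilde, thresholds):
    """Return sorted (AP rate, AC rate) points, one per threshold."""
    points = []
    for tau in thresholds:
        ac = ar = nac = ap = 0
        for row, row_t in zip(A, A_tilde):
            for a, at in zip(row, row_t):
                if a == 1:
                    if at > tau:
                        ac += 1
                    else:
                        ar += 1
                elif at > tau:
                    ap += 1
                else:
                    nac += 1
        ac_rate = ac / (ac + ar) if ac + ar else 0.0
        ap_rate = ap / (ap + nac) if ap + nac else 0.0
        points.append((ap_rate, ac_rate))
    return sorted(points)


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


A = [[0, 0, 0, 0], [0, 1, 1, 0]]
A_tilde = [[0.1, 0.3, 0.6, 0.5], [0.6, 0.8, 0.1, 0.9]]

# One threshold below every score, plus each distinct score.
thresholds = [-1.0] + sorted({v for row in A_tilde for v in row})
area = auc(roc_points(A, A_tilde, thresholds))
print("good reconstruction" if area >= 2 / 3 else "bad reconstruction")
```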
Database Validation
Since more recent database versions contain better data and information:
• Compute the annotation predictions on an older database version (e.g. July 2009)
• Compare these predictions with a newer version of that database (e.g. March 2013)
• The more Annotations Predicted are found in the new version, the better the predictions
• Accuracy is reported as the percentage of Annotations Predicted found in the newer version (July 2009 -> March 2013)
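The database validation measure can be sketched as a set intersection. Representing an annotation as a (gene, term) pair, and the identifiers below, are assumptions made for illustration only.

```python
def confirmed_percentage(predicted, newer_db):
    """Percentage of predicted annotations that appear in the newer database version."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return 100.0 * len(predicted & set(newer_db)) / len(predicted)


# Hypothetical predictions computed on the older version (e.g. July 2009).
predicted = [("G1", "T1"), ("G2", "T2"), ("G3", "T3"), ("G4", "T4")]
# Hypothetical annotations present in the newer version (e.g. March 2013).
newer_db = {("G1", "T1"), ("G3", "T3"), ("G9", "T9")}

print(confirmed_percentage(predicted, newer_db))  # 50.0
```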
Two main issues:
- Mapping the annotation IDs of the former database version to those of the updated database version;
- Managing duplicate annotations (i.e. annotations that differ only in their evidence code)
Text Mining and Web Tool Validation
Literature text mining and web tools validation procedure
Databanks may not be up to date, so we manually searched for the predicted annotations through:
• Literature resources such as PubMed
• Web tools such as AmiGO and GeneCards
Results
ROC Curves
Validation ROC curves for the Homo sapiens CC dataset. SVD-Khatri has k = 500;
SVD-us, SIM1, SIM2 have k = 378; SIM1 and SIM2 use C = 2,
and SIM2 uses Resnik measure.
Results on the following annotation datasets:
• Homo sapiens genes and CC feature terms
• Homo sapiens genes and MF feature terms
• Homo sapiens genes and BP feature terms
• Homo sapiens genes and CC+MF+BP feature terms
The literature review allowed us to confirm some
additional predicted annotations
List Comparison Measures
Comparing methods and parameters
• When different methods or parameters produce different lists of predicted annotations, how similar are these lists?
• Answering this question helps us understand how the method parameters behave
[Example: two predicted annotation lists to be compared. List 1: 10,000, 20,000, …, 90,000. List 2: 40,000, 10,000, …, 90,000]
How similar are these lists?
• Spearman's rank correlation coefficient: based on the total sum of the position differences of each element between the two lists (e.g. 3rd position - 1st position = 2)
[Example lists. List 1: 10,000, 20,000, 30,000, … List 2: 30,000, 10,000, 40,000, …]
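The sum of position differences described for the Spearman measure can be sketched as below, next to the classical Spearman coefficient for rankings without ties, which uses squared differences. Both functions assume the two lists contain exactly the same elements; the extended version for lists with different elements is not reproduced, and the function names are illustrative.

```python
def position_differences(list_a, list_b):
    """Return |position in list_a - position in list_b| for every element of list_a."""
    if set(list_a) != set(list_b) or len(list_a) != len(list_b):
        raise ValueError("both lists must contain the same elements")
    pos_b = {x: i for i, x in enumerate(list_b)}
    return [abs(i - pos_b[x]) for i, x in enumerate(list_a)]


def footrule_distance(list_a, list_b):
    """Total sum of the position differences between the two lists."""
    return sum(position_differences(list_a, list_b))


def spearman_rho(list_a, list_b):
    """Classical Spearman coefficient: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    d = position_differences(list_a, list_b)
    n = len(d)
    if n < 2:
        raise ValueError("at least two elements are needed")
    return 1 - 6 * sum(x * x for x in d) / (n * (n * n - 1))


first = [10000, 20000, 30000]
second = [30000, 10000, 20000]
print(footrule_distance(first, second))  # 4
print(spearman_rho(first, second))       # -0.5
```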
How similar are these lists?
• Kendall tau distance: the total number of bubble-sort swaps (exchanges of adjacent elements) needed to make one list equal to the other
[Example lists. List 1: 10,000, 20,000, …, 90,000. List 2: 20,000, 10,000, …, 90,000]
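The Kendall tau distance can be sketched by counting the pairs of elements that appear in opposite order in the two lists, which equals the number of adjacent swaps bubble sort would perform. The sketch assumes both lists contain the same elements; the function name is illustrative.

```python
def kendall_tau_distance(list_a, list_b):
    """Number of element pairs ordered differently in the two lists."""
    pos_b = {x: i for i, x in enumerate(list_b)}
    seq = [pos_b[x] for x in list_a]   # list_a expressed as positions in list_b
    swaps = 0
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            if seq[i] > seq[j]:        # this pair is inverted
                swaps += 1
    return swaps


print(kendall_tau_distance([10000, 20000, 90000], [20000, 10000, 90000]))  # 1
```

The double loop is quadratic in the list length; for long annotation lists a merge-sort based inversion count would give the same result faster.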
Extended Kendall distance
Extended Spearman coefficient
[Example. Prediction 1: AP list 10,000, 20,000, 30,000, ...; NAC list 70,000, 80,000, 90,000, ... Prediction 2: AP list 30,000, 10,000, 40,000, ...; NAC list 70,000, 20,000, 90,000, ...]
• We assign a high penalty if an element is absent from one of the lists
• And a low penalty if an element is absent from one of the AP lists but present in its NAC list
Significant patterns:
• The more similar the SVD truncation levels are, the lower the Extended Kendall distance is, and so the more similar the lists are.
• Lists generated by predictions that produced similar AUCs have similarly low Extended Spearman coefficients. This means that lists from predictions with similar AUC percentages differ very little in their elements.
Example: DAG tree of the Molecular Function terms predicted for the Homo sapiens gene P2RY14.
Black circles: terms already present in the database.
Blue hexagons: predicted terms.
Novelty Indicator
Schlicker rate based on DAG
An indicator to express the “novelty” rate of a
prediction in a gene tree
• Statistical rate
• Visual DAG viewer
Example: DAG tree of the Molecular Function terms predicted for the Homo sapiens gene CCR2.
Black circles: terms already present in the database.
Blue hexagons: predicted terms.
Final predictions
We finally get a list of the most likely predicted annotations, which have the following characteristics:
- predicted by all three methods (tSVD, SIM1, SIM2)
- ranked in the first 50% of the prediction list
- having at least one validated parent.
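The three selection criteria can be sketched as a filter over the ranked lists. The data layout is an assumption: each method yields a list of (gene, term) pairs ordered from most to least likely, and "validated parent" is read here as a parent term of the predicted term that is already validated for the same gene. All identifiers are hypothetical.

```python
def final_predictions(ranked_lists, validated, parents):
    """Keep predictions in the top half of every method's list with a validated parent."""
    # Criteria 1 and 2: present in the first 50% of the list of every method.
    top_halves = [set(lst[: len(lst) // 2]) for lst in ranked_lists.values()]
    common = set.intersection(*top_halves)
    # Criterion 3: at least one parent term already validated for the same gene.
    return {
        (gene, term)
        for gene, term in common
        if any((gene, p) in validated for p in parents.get(term, []))
    }


a, b, c, d = ("G1", "T2"), ("G2", "T3"), ("G3", "T4"), ("G4", "T5")
ranked_lists = {
    "tSVD": [a, b, c, d],
    "SIM1": [b, a, d, c],
    "SIM2": [a, b, c, d],
}
validated = {("G1", "T1")}               # hypothetical validated annotations
parents = {"T2": ["T1"], "T3": ["T1"]}   # hypothetical ontology edges

print(final_predictions(ranked_lists, validated, parents))  # {('G1', 'T2')}
```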
Gene symbol | Feature term
PPME1 | Organelle organization [BP]
CHST14 | Chondroitin sulfate proteoglycan biosynthetic process [BP]
CHST14 | Biopolymer biosynthetic process [BP]
ROPN1B | Microtubule-based flagellum [CC]
CHST14 | Dermatan sulfate proteoglycan biosynthetic process [BP]
CPA2 | Proteolysis involved in cellular protein catabolic process [BP]
PPME1 | Chromosome organization [BP]
CNOT2 | Positive regulation of cellular metabolic process [BP]
Recap
Truncated SVD with the automatically chosen truncation level showed better results (percentage of predicted annotations found in the updated database version) than the previous version of the method with fixed parameters.
New methods (SIM1 and SIM2) outperformed Truncated SVD.
The ROC analysis, database version, and text mining and web tool validation procedures proved very effective.
Extended Kendall and Spearman coefficients showed interesting patterns that would otherwise be invisible.
The novelty indicator rate proved very useful in identifying the most interesting prediction trees, showing relevant research paths.
Future
Future developments:
• Integrate the software as a web application into the Search Computing platform
• Implement and test the Auto-Encoder Deep Neural Network algorithm
• Develop an automated text mining validation procedure
• Add statistical tools to analyze the ROC curves
Doctoral Thesis Dissertation 2014-03-20 @PoliMi

More Related Content

What's hot

CCC-Bicluster Analysis for Time Series Gene Expression Data
CCC-Bicluster Analysis for Time Series Gene Expression DataCCC-Bicluster Analysis for Time Series Gene Expression Data
CCC-Bicluster Analysis for Time Series Gene Expression Data
IRJET Journal
 
Discovery of Jumping Emerging Patterns Using Genetic Algorithm
Discovery of Jumping Emerging Patterns Using Genetic AlgorithmDiscovery of Jumping Emerging Patterns Using Genetic Algorithm
Discovery of Jumping Emerging Patterns Using Genetic Algorithm
IJCSIS Research Publications
 
Image translated data analysis
Image translated data analysisImage translated data analysis
Image translated data analysis
YOUNGSEOPKIM
 
Automatic Feature Subset Selection using Genetic Algorithm for Clustering
Automatic Feature Subset Selection using Genetic Algorithm for ClusteringAutomatic Feature Subset Selection using Genetic Algorithm for Clustering
Automatic Feature Subset Selection using Genetic Algorithm for Clustering
idescitation
 
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems -  ACM RecSys 2013 tutorialLearning to Rank for Recommender Systems -  ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Alexandros Karatzoglou
 
Dynamic Symbolic Database Application Testing
Dynamic Symbolic Database Application TestingDynamic Symbolic Database Application Testing
New c sharp3_features_(linq)_part_v
New c sharp3_features_(linq)_part_vNew c sharp3_features_(linq)_part_v
New c sharp3_features_(linq)_part_v
Nico Ludwig
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
cscpconf
 
Sql Patterns
Sql PatternsSql Patterns
Sql Patterns
phanleson
 
Incorporating Diversity in a Learning to Rank Recommender System
Incorporating Diversity in a Learning to Rank Recommender SystemIncorporating Diversity in a Learning to Rank Recommender System
Incorporating Diversity in a Learning to Rank Recommender System
Jacek Wasilewski
 
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
Rudradityo Saha
 
Ml programming with python
Ml programming with pythonMl programming with python
Ml programming with python
Kumud Arora
 
Kaggle kenneth
Kaggle kennethKaggle kenneth
Kaggle kenneth
kenluck2001
 
Graph Based Methods For The Representation And Analysis Of.
Graph Based Methods For The Representation And Analysis Of.Graph Based Methods For The Representation And Analysis Of.
Graph Based Methods For The Representation And Analysis Of.
legal2
 
Tracking the tracker: Time Series Analysis in Python from First Principles
Tracking the tracker: Time Series Analysis in Python from First PrinciplesTracking the tracker: Time Series Analysis in Python from First Principles
Tracking the tracker: Time Series Analysis in Python from First Principles
kenluck2001
 

What's hot (15)

CCC-Bicluster Analysis for Time Series Gene Expression Data
CCC-Bicluster Analysis for Time Series Gene Expression DataCCC-Bicluster Analysis for Time Series Gene Expression Data
CCC-Bicluster Analysis for Time Series Gene Expression Data
 
Discovery of Jumping Emerging Patterns Using Genetic Algorithm
Discovery of Jumping Emerging Patterns Using Genetic AlgorithmDiscovery of Jumping Emerging Patterns Using Genetic Algorithm
Discovery of Jumping Emerging Patterns Using Genetic Algorithm
 
Image translated data analysis
Image translated data analysisImage translated data analysis
Image translated data analysis
 
Automatic Feature Subset Selection using Genetic Algorithm for Clustering
Automatic Feature Subset Selection using Genetic Algorithm for ClusteringAutomatic Feature Subset Selection using Genetic Algorithm for Clustering
Automatic Feature Subset Selection using Genetic Algorithm for Clustering
 
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems -  ACM RecSys 2013 tutorialLearning to Rank for Recommender Systems -  ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
 
Dynamic Symbolic Database Application Testing
Dynamic Symbolic Database Application TestingDynamic Symbolic Database Application Testing
Dynamic Symbolic Database Application Testing
 
New c sharp3_features_(linq)_part_v
New c sharp3_features_(linq)_part_vNew c sharp3_features_(linq)_part_v
New c sharp3_features_(linq)_part_v
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
 
Sql Patterns
Sql PatternsSql Patterns
Sql Patterns
 
Incorporating Diversity in a Learning to Rank Recommender System
Incorporating Diversity in a Learning to Rank Recommender SystemIncorporating Diversity in a Learning to Rank Recommender System
Incorporating Diversity in a Learning to Rank Recommender System
 
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...
 
Ml programming with python
Ml programming with pythonMl programming with python
Ml programming with python
 
Kaggle kenneth
Kaggle kennethKaggle kenneth
Kaggle kenneth
 
Graph Based Methods For The Representation And Analysis Of.
Graph Based Methods For The Representation And Analysis Of.Graph Based Methods For The Representation And Analysis Of.
Graph Based Methods For The Representation And Analysis Of.
 
Tracking the tracker: Time Series Analysis in Python from First Principles
Tracking the tracker: Time Series Analysis in Python from First PrinciplesTracking the tracker: Time Series Analysis in Python from First Principles
Tracking the tracker: Time Series Analysis in Python from First Principles
 

Viewers also liked

Integration of Bioinformatics Web Services through the Search Computing Techn...
Integration of Bioinformatics Web Services through the Search Computing Techn...Integration of Bioinformatics Web Services through the Search Computing Techn...
Integration of Bioinformatics Web Services through the Search Computing Techn...
Davide Chicco
 
Brian Durkin Resume - July 2016
Brian Durkin Resume - July 2016Brian Durkin Resume - July 2016
Brian Durkin Resume - July 2016
Brian Durkin
 
Curriculum Vitae
Curriculum VitaeCurriculum Vitae
Curriculum Vitae
Kathy Georgiadis
 
DANNY POIRIER RESUME 2014 -1 copy 3
DANNY POIRIER RESUME 2014 -1 copy 3DANNY POIRIER RESUME 2014 -1 copy 3
DANNY POIRIER RESUME 2014 -1 copy 3
Danny Poirier
 
Perkataan Berlawan
Perkataan BerlawanPerkataan Berlawan
Perkataan Berlawan
Bibie
 
Resume-Manish_Agrahari_IBM_BPM
Resume-Manish_Agrahari_IBM_BPMResume-Manish_Agrahari_IBM_BPM
Resume-Manish_Agrahari_IBM_BPM
Manish Agrahari
 

Viewers also liked (6)

Integration of Bioinformatics Web Services through the Search Computing Techn...
Integration of Bioinformatics Web Services through the Search Computing Techn...Integration of Bioinformatics Web Services through the Search Computing Techn...
Integration of Bioinformatics Web Services through the Search Computing Techn...
 
Brian Durkin Resume - July 2016
Brian Durkin Resume - July 2016Brian Durkin Resume - July 2016
Brian Durkin Resume - July 2016
 
Curriculum Vitae
Curriculum VitaeCurriculum Vitae
Curriculum Vitae
 
DANNY POIRIER RESUME 2014 -1 copy 3
DANNY POIRIER RESUME 2014 -1 copy 3DANNY POIRIER RESUME 2014 -1 copy 3
DANNY POIRIER RESUME 2014 -1 copy 3
 
Perkataan Berlawan
Perkataan BerlawanPerkataan Berlawan
Perkataan Berlawan
 
Resume-Manish_Agrahari_IBM_BPM
Resume-Manish_Agrahari_IBM_BPMResume-Manish_Agrahari_IBM_BPM
Resume-Manish_Agrahari_IBM_BPM
 

Similar to Doctoral Thesis Dissertation 2014-03-20 @PoliMi

презентация за варшава
презентация за варшавапрезентация за варшава
презентация за варшава
Valeriya Simeonova
 
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Johann Petrak
 
BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...
IJAEMSJORNAL
 
Experiments on Design Pattern Discovery
Experiments on Design Pattern DiscoveryExperiments on Design Pattern Discovery
Experiments on Design Pattern Discovery
Tim Menzies
 
MS Thesis
MS ThesisMS Thesis
MS Thesis
Jatin Agarwal
 
MS Thesis
MS ThesisMS Thesis
MS Thesis
Jatin Agarwal
 
Presentation
PresentationPresentation
Presentation
butest
 
Optimal rule set generation using pso algorithm
Optimal rule set generation using pso algorithmOptimal rule set generation using pso algorithm
Optimal rule set generation using pso algorithm
csandit
 
Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)
Dmitry Grapov
 
Information Integration and Knowledge Acquisition from Semantically Heterogen...
Information Integration and Knowledge Acquisition from Semantically Heterogen...Information Integration and Knowledge Acquisition from Semantically Heterogen...
Information Integration and Knowledge Acquisition from Semantically Heterogen...
Jie Bao
 
Comparative analysis of dynamic programming
Comparative analysis of dynamic programmingComparative analysis of dynamic programming
Comparative analysis of dynamic programming
eSAT Publishing House
 
Comparative analysis of dynamic programming algorithms to find similarity in ...
Comparative analysis of dynamic programming algorithms to find similarity in ...Comparative analysis of dynamic programming algorithms to find similarity in ...
Comparative analysis of dynamic programming algorithms to find similarity in ...
eSAT Journals
 
Svd filtered temporal usage clustering
Svd filtered temporal usage clusteringSvd filtered temporal usage clustering
Svd filtered temporal usage clustering
Liang Xie, PhD
 
Missing Data imputation
Missing Data imputationMissing Data imputation
Missing Data imputation
Наталя Шаховська
 
May workshop
May workshopMay workshop
May workshop
Fahadahammed2
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
IJRES Journal
 
May 15 workshop
May 15  workshopMay 15  workshop
May 15 workshop
Fahadahammed2
 
Unstructured data processing webinar 06272016
Unstructured data processing webinar 06272016Unstructured data processing webinar 06272016
Unstructured data processing webinar 06272016
George Roth
 
A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC ...
A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC ...A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC ...
A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC ...
Davide Chicco
 
Gene Expression Data Analysis
Gene Expression Data AnalysisGene Expression Data Analysis
Gene Expression Data Analysis
Jhoirene Clemente
 

Doctoral Thesis Dissertation 2014-03-20 @PoliMi

  • 1. Computational Prediction of Gene Functions through Machine Learning methods and Multiple Validation Procedures candidate: Davide Chicco davide.chicco@polimi.it supervisor: Marco Masseroli PhD Thesis Defense Dissertation 20th March 2014
  • 2. “Computational Prediction of Gene Functions through Machine Learning methods and Multiple Validation Procedures” 1) Analyzed scientific problem 2) Machine learning methods used 3) Validation procedures 4) Main results 5) Annotation list correlation measures 6) Novelty indicator 7) Final list of likely predicted annotations 8) Conclusions
  • 3. Biomolecular annotations • The concept of annotation: association of nucleotide or amino acid sequences with useful information describing their features • The association of a gene with an information feature term corresponds to a biomolecular annotation • This information is expressed through controlled vocabularies, sometimes structured as ontologies (e.g. Gene Ontology), where every controlled term of the vocabulary is associated with a unique alphanumeric code [Diagram: Gene, Annotation (gene2bff), Biological function feature]
  • 4. Biomolecular annotations • The association of an information/feature with a gene ID constitutes an annotation • Annotation example: • Scientific fact: “the gene GD4 is present in the mitochondrial membrane” • Corresponds to the coupling: <GD4, mitochondrial membrane>
  • 5. The problem • Many annotations are available in different databanks • However, the available annotations are incomplete • Only a few of them represent highly reliable, human-curated information • In vitro experiments are expensive (e.g. 1,000 € and 3 weeks) • To support and speed up the time-consuming curation process, prioritized lists of computationally predicted annotations are extremely useful • These lists could be generated by software based on Machine Learning algorithms
  • 6. The problem • Other scientists and researchers dealt with the problem in the past by using: • Support Vector Machines (SVM) [Barutcuoglu et al., 2006] • k-nearest neighbor algorithm (kNN) [Tao et al., 2007] • Decision trees [King et al., 2003] • Hidden Markov models (HMM) [Mi et al., 2013] • … • These methods were all good at stating whether a predicted annotation was correct or not, but were not able to make extrapolations, that is, to suggest new annotations absent from the input dataset
  • 7. The software • BioAnnotationPredictor: a pipeline of steps and tools to predict, validate and analyze biomolecular annotation lists [Pipeline diagram: Data reading → input matrix A → Statistical method → output matrix A~ → Predicted annotation lists]
  • 8. • The software reads the data from the GPDW database • The software creates the input annotation matrix A ∈ {0, 1}^(m × n), with m rows (genes) and n columns (annotation features) • A(i,j) = 1 if gene i is annotated to feature j or to any descendant of j in the considered ontology structure (true path rule) • A(i,j) = 0 otherwise (the annotation is unknown)
              feat 1  feat 2  feat 3  feat 4  …  feat N
      gene 1    0       0       0       0     …    0
      gene 2    0       1       1       0     …    1
      …         …       …       …       …     …    …
      gene M    0       0       0       0     …    0
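The construction of A described on this slide can be sketched as follows. This is only a minimal illustration, not the BioAnnotationPredictor code: the function names, the `parents` dictionary that encodes the ontology, and the toy genes and terms are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every ancestor of `term` in the ontology DAG (term itself excluded)."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def build_annotation_matrix(
    genes: Sequence[str],
    terms: Sequence[str],
    direct: Dict[str, Set[str]],
    parents: Dict[str, Set[str]],
) -> List[List[int]]:
    """Build the binary gene-by-term matrix A, applying the true path rule."""
    column = {term: j for j, term in enumerate(terms)}
    A = [[0] * len(terms) for _ in genes]
    for i, gene in enumerate(genes):
        for term in direct.get(gene, ()):
            # A gene annotated to a term is also annotated to all its ancestors.
            for t in {term} | ancestors(term, parents):
                if t in column:
                    A[i][column[t]] = 1
    return A


# Hypothetical toy ontology: child term -> set of parent terms.
parents = {
    "mitochondrial membrane": {"membrane"},
    "membrane": {"cellular component"},
    "nucleus": {"cellular component"},
}
genes = ["gene 1", "gene 2"]
terms = ["cellular component", "membrane", "mitochondrial membrane", "nucleus"]
direct = {"gene 2": {"mitochondrial membrane"}}  # gene 1 has no known annotation

A = build_annotation_matrix(genes, terms, direct, parents)
for gene, row in zip(genes, A):
    print(gene, row)
```

A zero in this matrix means "unknown", not "known to be absent", which is why zeros are the cells where new annotations can later be predicted.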
  • 9. • The software applies a statistical method (Truncated Singular Value Decomposition, Semantically Improved SVD with gene clustering, or Semantically Improved SVD with clustering and term-term similarity weights) to the binary input matrix A • It returns a real-valued output matrix A~ • Every element of the A matrix is compared to its corresponding element of the A~ matrix
  • 10. • After the computation, we compare each element Aij of the input matrix to the corresponding element Aij~ of the output matrix, using a threshold τ:
      Input Aij   Output Aij~   Class                               Analogue
      1 (yes)     > τ (yes)     AC: Annotation Confirmed            TP
      1 (yes)     ≤ τ (no)      AR: Annotation to be Reviewed       FN
      0 (no)      ≤ τ (no)      NAC: No Annotation Confirmed        TN
      0 (no)      > τ (yes)     AP: Annotation Predicted            FP
      • τ is chosen so that it minimizes the sum APs + ARs
      Example input Aij:         Example output Aij~:
      0  0  0  0  …  0           0.1  0.3  0.6  0.5  …  0.2
      0  1  1  0  …  1           0.6  0.8  0.1  0.9  …  0.8
      …  …  …  …  …  …           …    …    …    …    …  …
      0  0  0  0  …  0           0.3  0.2  0.4  0.6  …  0.8
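The four-way comparison and the choice of τ can be illustrated with a short sketch. It is not the thesis implementation: the function names and the grid of candidate thresholds are assumptions, and the toy values are the first four columns of the first two rows of the example matrices on this slide.

```python
from collections import Counter
from typing import List, Sequence


def classify(a: int, a_tilde: float, tau: float) -> str:
    """Map one (input, output) pair to AC, AR, NAC or AP."""
    if a == 1:
        return "AC" if a_tilde > tau else "AR"
    return "AP" if a_tilde > tau else "NAC"


def count_classes(A: List[List[int]], A_tilde: List[List[float]], tau: float) -> Counter:
    """Count how many cells fall into each of the four classes."""
    return Counter(
        classify(a, s, tau)
        for row_a, row_s in zip(A, A_tilde)
        for a, s in zip(row_a, row_s)
    )


def best_threshold(
    A: List[List[int]], A_tilde: List[List[float]], candidates: Sequence[float]
) -> float:
    """Pick the candidate tau that minimizes APs + ARs (first one on ties)."""

    def cost(tau: float) -> int:
        counts = count_classes(A, A_tilde, tau)
        return counts["AP"] + counts["AR"]

    return min(candidates, key=cost)


A = [[0, 0, 0, 0], [0, 1, 1, 0]]
A_tilde = [[0.1, 0.3, 0.6, 0.5], [0.6, 0.8, 0.1, 0.9]]

counts = count_classes(A, A_tilde, tau=0.5)
tau_best = best_threshold(A, A_tilde, [i / 10 for i in range(10)])  # assumed grid
print(counts, tau_best)
```

The slide only states the criterion (minimize APs + ARs); how the candidate thresholds are enumerated is a design choice of this sketch.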
  • 11. • AC: Annotation Confirmed; AR: Annotation to be Reviewed; NAC: No Annotation Confirmed; AP: Annotation Predicted • The Annotations Predicted (AP, analogous to FP) are the annotations absent from the input and predicted by our software: we suggest them as present • We record them in ranked lists:
      Rank   Annotation ID   Likelihood value
      1      218405          0.9742584
      2      222571          0.8545574
      …      …               …
      n      203145          0.1673128
  • 12. Truncated Singular Value Decomposition (tSVD) • An annotation prediction is performed by computing a reduced rank-k approximation A~ of the annotation matrix A (where 0 < k < r, with r the number of non-zero singular values of A, i.e. the rank of A)
  • 13. Truncated Singular Value Decomposition (tSVD) • Only the k most «important» components of the decomposition of A (the k largest singular values and their singular vectors) are used for the reconstruction (where 0 < k < r, with r the number of non-zero singular values of A, i.e. the rank of A) • In [P. Khatri et al., "A semantic analysis of the annotations of the human genome", Bioinformatics, 2005], the authors argued that the study of the matrix A shows the semantic relationships of the gene-function associations. • A large value of a~ij suggests that gene i should be annotated to term j, whereas a value close to zero suggests the opposite.
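As a rough illustration of the rank-k reconstruction described on the two tSVD slides, the sketch below uses NumPy's `numpy.linalg.svd`. It assumes the standard truncated SVD, A~ = U_k Σ_k V_k^T, which equals A V_k V_k^T; the function name and the toy matrix are made up for the example, and the actual software may differ in detail.

```python
import numpy as np


def truncated_svd_reconstruction(A: np.ndarray, k: int) -> np.ndarray:
    """Return the rank-k approximation of A built from its k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > 1e-10))  # numerical rank of A
    if not 0 < k < r:
        raise ValueError(f"k must satisfy 0 < k < r, got k={k}, r={r}")
    # Scale the first k left singular vectors by the singular values,
    # then project back with the first k right singular vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


# Toy binary annotation matrix (4 genes x 4 terms), invented for the example.
A = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 1],
    ],
    dtype=float,
)

A_tilde = truncated_svd_reconstruction(A, k=2)
print(np.round(A_tilde, 2))
```

Cells where A is 0 but A~ is large are the candidate predicted annotations.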
  • 14. Truncated Singular Value Decomposition (tSVD) • We started from the method developed by Khatri et al. (2005) at Wayne State University, Detroit, and implemented it • Improvement: • Khatri et al. used a fixed SVD truncation level k = 500 • We developed a method for automated, data-driven selection of k based on the Receiver Operating Characteristic (ROC) curve • We obtained better results, shown in several publications
  • 15. Truncated SVD with gene clustering (SIM1) • Semantically improved (SIM1) version of the Truncated SVD, based on gene clustering [P. Drineas et al., "Clustering large graphs via the singular value decomposition", Machine Learning, 2004] • Inspiring idea: similar genes can be grouped in clusters that have different weights
  • 16. Truncated SVD with gene clustering (SIM1) 1. We choose a number C of clusters (we have an algorithm for the choice of C) and completely discard the columns j = C+1, ..., n of the SVD matrix U. 2. Each remaining column uc of U represents a cluster, and the value U(i,c) indicates the membership of gene i in the c-th cluster. 3. For each cluster, we first generate Wc = diag(uc), and then the modified gene-to-term matrix Ac = Wc A, in which the i-th row of A is weighted by the membership score of the corresponding gene in the c-th cluster.
  • 17. Truncated SVD with gene clustering (SIM1) 4. Then, we compute Tc = Ac^T Ac and its SVD(Tc). 5. Then, every row ai~ of the A~ matrix is computed using the c-th cluster that minimizes the Euclidean distance between ai~ and the original row ai: ai~ = ai * V(k,c,i) * V(k,c,i)^T 6. The output matrix is produced.
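Steps 1 to 6 can be read as the following NumPy sketch. It is one possible interpretation of the slides, not the authors' code: it assumes that the matrix V(k,c,i) in step 5 holds the first k right singular vectors of Tc for the cluster c selected for gene i, and the function name and toy matrix are invented for the example.

```python
import numpy as np


def sim1_reconstruction(A: np.ndarray, k: int, C: int) -> np.ndarray:
    """Cluster-weighted truncated SVD reconstruction (interpretation of SIM1)."""
    m, _ = A.shape
    # Steps 1-2: the first C columns of U are treated as gene clusters.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    candidates = []
    for c in range(C):
        # Step 3: weight each gene (row) by its membership in cluster c.
        W_c = np.diag(U[:, c])
        A_c = W_c @ A
        # Step 4: term-by-term matrix of the weighted data and its SVD.
        T_c = A_c.T @ A_c
        _, _, Vt_c = np.linalg.svd(T_c)
        V_kc = Vt_c[:k, :].T  # n x k, first k right singular vectors
        # Candidate reconstruction of every row under cluster c.
        candidates.append(A @ V_kc @ V_kc.T)
    stacked = np.stack(candidates)  # shape (C, m, n)
    # Step 5: for each gene keep the cluster whose reconstruction is
    # closest (Euclidean distance) to the original row.
    distances = np.linalg.norm(stacked - A, axis=2)  # shape (C, m)
    best_cluster = np.argmin(distances, axis=0)  # shape (m,)
    # Step 6: assemble the output matrix row by row.
    return stacked[best_cluster, np.arange(m), :]


A = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 1],
    ],
    dtype=float,
)
A_tilde = sim1_reconstruction(A, k=2, C=2)
print(np.round(A_tilde, 2))
```

Because each candidate is an orthogonal projection of the original row, its distance to that row can never exceed the norm of the row itself, which gives a simple sanity check for the sketch.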
  • 18. Truncated SVD with gene clustering and term-similarity weights (SIM2) • Semantically improved (SIM2) version of the Truncated SVD, based on gene clustering and term-term similarity weights [P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy", arXiv.org, 1995] • Inspiring idea: functionally similar terms should be annotated to the same genes
  • 19. Truncated SVD with gene clustering and term-similarity weights (SIM2) • To the algorithm shown before, we add the following step: 6. a) Furthermore, to obtain a more accurate clustering, we compute the eigenvectors of the matrix G~ = A S A^T, where the real n × n matrix S is the term similarity matrix. • Given a pair of ontology terms, j1 and j2, the term functional similarity S(j1, j2) can be calculated using different methods. • Here the similarity is based on the Resnik measure [P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy", arXiv.org, 1995]
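A small sketch of the extra SIM2 ingredient follows. It assumes the usual definition of the Resnik measure (the information content, -log p, of the most informative common ancestor of two terms), estimates p from the annotation matrix itself, and assigns information content 0 to terms with no annotated gene; these choices, the function name and the toy data are assumptions of the example, not details given in the slides.

```python
import math
from typing import List, Set

import numpy as np


def resnik_similarity_matrix(A: np.ndarray, ancestors_of: List[Set[int]]) -> np.ndarray:
    """Build the n x n term similarity matrix S with the Resnik measure.

    A is the binary gene-by-term matrix (true path rule already applied);
    ancestors_of[j] holds the column indices of term j and all its ancestors.
    """
    m, n = A.shape
    p = A.sum(axis=0) / m  # fraction of genes annotated to each term
    # Information content; unannotated terms get 0 (assumption of this sketch).
    ic = [-math.log(p_j) if p_j > 0 else 0.0 for p_j in p]
    S = np.zeros((n, n))
    for j1 in range(n):
        for j2 in range(n):
            common = ancestors_of[j1] & ancestors_of[j2]
            S[j1, j2] = max((ic[c] for c in common), default=0.0)
    return S


# Toy ontology: 0 = root, 1 = child of 0, 2 = child of 1, 3 = child of 0.
ancestors_of = [{0}, {0, 1}, {0, 1, 2}, {0, 3}]
A = np.array(
    [
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 1],
        [1, 0, 0, 1],
    ],
    dtype=float,
)

S = resnik_similarity_matrix(A, ancestors_of)
G = A @ S @ A.T  # gene-by-gene matrix weighted by term similarity
eigenvalues, eigenvectors = np.linalg.eigh(G)  # G is symmetric
print(np.round(S, 3))
```

`numpy.linalg.eigh` is used because G~ is symmetric whenever S is symmetric, which holds for the Resnik measure.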
  • 20. Other methods • With some colleagues at Politecnico di Milano we also implemented other methods (not included in this thesis): • Probabilistic Latent Semantic Analysis (pLSA) • Latent Dirichlet Allocation with Gibbs sampling (LDA) • With some colleagues at University of California Irvine we have been trying to design and implement other models: • Auto-Encoder Deep Neural Network
  • 21. • After the computation, we compare each element Aij of the input matrix to the corresponding element Aij~ of the output matrix, using a threshold τ:
      Input Aij   Output Aij~   Class                               Analogue
      1 (yes)     > τ (yes)     AC: Annotation Confirmed            TP
      1 (yes)     ≤ τ (no)      AR: Annotation to be Reviewed       FN
      0 (no)      ≤ τ (no)      NAC: No Annotation Confirmed        TN
      0 (no)      > τ (yes)     AP: Annotation Predicted            FP
      • τ is chosen so that it minimizes the sum APs + ARs
      Example input Aij:         Example output Aij~:
      0  0  0  0  …  0           0.1  0.3  0.6  0.5  …  0.2
      0  1  1  0  …  1           0.6  0.8  0.1  0.9  …  0.8
      …  …  …  …  …  …           …    …    …    …    …  …
      0  0  0  0  …  0           0.3  0.2  0.4  0.6  …  0.8
  • 22. ROC Analysis Validation • These four classes can be considered analogous to TP, FN, TN, FP: AC: Annotation Confirmed (TP); AR: Annotation to be Reviewed (FN); NAC: No Annotation Confirmed (TN); AP: Annotation Predicted (FP) • The software plots ROC curves, with AC rate = AC / (AC + AR) and AP rate = AP / (AP + NAC)
  • 23. ROC Analysis Validation • Ten-fold cross validation • The software plots the ROC curve, with AC rate = AC / (AC + AR) and AP rate = AP / (AP + NAC) • Compute the Area Under the Curve (AUC) • If AUC ≥ 2/3 (66.67%), the matrix reconstruction is considered good • Otherwise, the matrix reconstruction is considered bad
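The ROC construction and the AUC check can be sketched as below. This is a simplified illustration under stated assumptions: it evaluates a single pair of matrices rather than the ten folds of a cross validation, sweeps τ over the observed output values, integrates with the trapezoidal rule, and uses invented function names and toy data containing both annotated and non-annotated cells.

```python
from typing import List, Tuple


def roc_points(A: List[List[int]], A_tilde: List[List[float]]) -> List[Tuple[float, float]]:
    """Return (AP rate, AC rate) points obtained by sweeping the threshold tau."""
    labels = [a for row in A for a in row]
    scores = [s for row in A_tilde for s in row]
    positives = labels.count(1)  # AC + AR, constant for every tau
    negatives = labels.count(0)  # AP + NAC, constant for every tau
    # One threshold below every score gives (1, 1); the largest score gives (0, 0).
    thresholds = [min(scores) - 1.0] + sorted(set(scores))
    points = []
    for tau in thresholds:
        ac = sum(1 for a, s in zip(labels, scores) if a == 1 and s > tau)
        ap = sum(1 for a, s in zip(labels, scores) if a == 0 and s > tau)
        points.append((ap / negatives, ac / positives))
    return sorted(points)


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2.0
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


A = [[1, 0, 0], [0, 1, 0]]
A_tilde = [[0.9, 0.2, 0.4], [0.1, 0.7, 0.8]]

area = auc(roc_points(A, A_tilde))
is_good_reconstruction = area >= 2 / 3  # threshold stated on the slide
print(area, is_good_reconstruction)
```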
  • 24. Database Validation (July 2009 -> March 2013) • More recent database versions contain better data and information, so we: • Compute the prediction of annotations on a former database version (e.g. July 2009) • Compare these predictions to a newer version of that database (e.g. March 2013) • The more predicted annotations (APs) are found in the new version, the better the predictions • Percentage of accuracy: the share of predicted annotations found in the newer version
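The database validation boils down to a set lookup, sketched below. The representation of an annotation as a (gene, term) pair, the function name and the toy data are assumptions of this example; the real procedure also has to map annotation IDs between database versions and handle duplicates, which the sketch ignores.

```python
from typing import List, Set, Tuple

Annotation = Tuple[str, str]  # (gene, feature term), hypothetical representation


def database_validation(predicted: List[Annotation], newer_db: Set[Annotation]) -> float:
    """Percentage of predicted annotations found in the newer database version."""
    if not predicted:
        raise ValueError("the list of predicted annotations is empty")
    found = sum(1 for annotation in predicted if annotation in newer_db)
    return 100.0 * found / len(predicted)


# Invented toy data: predictions made on the older version of the database.
predicted = [
    ("gene A", "term 1"),
    ("gene A", "term 2"),
    ("gene B", "term 1"),
    ("gene C", "term 3"),
]
# Annotations present in the newer version of the database.
newer_db = {("gene A", "term 1"), ("gene C", "term 3"), ("gene D", "term 4")}

percentage = database_validation(predicted, newer_db)
print(f"{percentage:.1f}% of the predicted annotations were found")
```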
  • 25. Database Validation • Two main issues: - Retrieving the annotation IDs of the former database version so that they can be used in the updated database version - Managing duplicate annotations (i.e. annotations having different evidence codes)
  • 26. Text Mining and Web Tool Validation • Literature text mining and web tools validation procedure • Databanks may not be up to date, so we manually searched for the predicted annotations through: • Literature resources such as PubMed • Web tools such as AmiGO and GeneCards
  • 27. Results: ROC Curves • [Figure: ROC curves for the Homo sapiens CC dataset.] SVD-Khatri has k = 500; SVD-us, SIM1 and SIM2 have k = 378; SIM1 and SIM2 use C = 2, and SIM2 uses the Resnik measure.
  • 28. Results • Results on the following annotation datasets: • Homo sapiens genes and CC feature terms • Homo sapiens genes and MF feature terms • Homo sapiens genes and BP feature terms • Homo sapiens genes and CC+MF+BP feature terms
  • 29. Results • The literature review allowed us to confirm some additional predicted annotations
  • 30. List Comparison Measures • Comparing methods and parameters • When we have different lists of predicted annotations, we want to know: how similar are they? • Answering this question will help us to understand how the method parameters behave • Example lists of annotation IDs: (10,000; 20,000; …; 90,000) and (40,000; 10,000; …; 90,000)
  • 31. List Comparison Measures • How similar are these lists? • Spearman's rank correlation coefficient: the total sum of the differences between the positions of each element in the two lists (e.g. 3rd position – 1st position = 2) • Example lists of annotation IDs: (10,000; 20,000; 30,000; …) and (30,000; 10,000; 40,000; …)
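The position-difference sum described on this slide can be computed as in the sketch below. As described, the quantity is what is commonly called Spearman's footrule distance. The sketch assumes that both lists contain exactly the same elements (the extended measures deal with missing elements), and the function name and toy lists are chosen for the example.

```python
from typing import Hashable, Sequence


def spearman_footrule(list_a: Sequence[Hashable], list_b: Sequence[Hashable]) -> int:
    """Sum of the absolute position differences of each element in two rankings."""
    if set(list_a) != set(list_b) or len(list_a) != len(list_b):
        raise ValueError("both lists must rank the same elements")
    position_in_b = {item: index for index, item in enumerate(list_b)}
    return sum(abs(index - position_in_b[item]) for index, item in enumerate(list_a))


list_1 = [10_000, 20_000, 30_000]
list_2 = [30_000, 10_000, 20_000]

distance = spearman_footrule(list_1, list_2)
print(distance)
```

In this toy case annotation 30,000 moves from the 3rd to the 1st position and contributes 2, as in the example on the slide, while the other two elements contribute 1 each.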
  • 32. List Comparison Measures • How similar are these lists? • Kendall tau distance: the total number of bubble-sort swaps needed to make one list equal to the other • Example lists of annotation IDs: (10,000; 20,000; …; 90,000) and (20,000; 10,000; …; 90,000)
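The bubble-sort view of the Kendall tau distance can be made concrete with the following sketch. It assumes that both lists rank exactly the same elements; the function name and toy lists are chosen for the example.

```python
from typing import Hashable, Sequence


def kendall_tau_distance(list_a: Sequence[Hashable], list_b: Sequence[Hashable]) -> int:
    """Number of adjacent swaps bubble sort needs to turn list_b into list_a."""
    if set(list_a) != set(list_b) or len(list_a) != len(list_b):
        raise ValueError("both lists must rank the same elements")
    position_in_a = {item: index for index, item in enumerate(list_a)}
    # Rewrite list_b as positions in list_a; sorting it reproduces list_a.
    sequence = [position_in_a[item] for item in list_b]
    swaps = 0
    for end in range(len(sequence) - 1, 0, -1):
        for i in range(end):
            if sequence[i] > sequence[i + 1]:
                sequence[i], sequence[i + 1] = sequence[i + 1], sequence[i]
                swaps += 1
    return swaps


list_1 = [10_000, 20_000, 30_000]
list_2 = [30_000, 10_000, 20_000]

distance = kendall_tau_distance(list_1, list_2)
print(distance)
```

The swap count equals the number of element pairs that appear in opposite order in the two lists; the quadratic bubble sort is kept here only because it mirrors the wording of the slide.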
  • 33. List Comparison Measures • Extended Kendall distance and Extended Spearman coefficient • We assign a high penalty if an element is absent from one of the lists, and a low penalty if an element is absent from one of the AP lists but present in its NAC list
      Example (annotation IDs):
                  First output                    Second output
      AP list     10,000  20,000  30,000  ...     30,000  10,000  40,000  ...
      NAC list    70,000  80,000  90,000  ...     70,000  20,000  90,000  ...
  • 34. List Comparison Measures • Significant patterns: • Extended Kendall distances show that the more similar the SVD truncation levels are, the lower the Extended Kendall distance is, and so the more similar the lists are. • Lists generated by predictions that produced similar AUC values have similarly low Extended Spearman coefficients. This means that lists from predictions with similar AUC percentages differ very little in the positions of their elements.
  • 35. Novelty Indicator • An indicator to express the “novelty” rate of a prediction in a gene tree: Schlicker rate based on DAG • Statistical rate • Visual DAG viewer • Example: DAG of the Molecular Function terms predicted for the Homo sapiens gene P2RY14. Black balls: terms already present in the database. Blue hexagons: predicted terms.
  • 36. Novelty Indicator • An indicator to express the “novelty” rate of a prediction for a gene: Schlicker rate based on DAG • Statistical rate • Visual DAG viewer • Example: DAG of the Molecular Function terms predicted for the Homo sapiens gene CCR2. Black balls: terms already present in the database. Blue hexagons: predicted terms.
  • 37. Final predictions • We finally get a list of the most likely predicted annotations, which have the following characteristics: - predicted by all three methods (tSVD, SIM1, SIM2) - prediction ranked in the first 50% of the list - having at least one validated parent
      Gene symbol   Feature term
      PPME1         Organelle organization [BP]
      CHST14        Chondroitin sulfate proteoglycan biosynthetic process [BP]
      CHST14        Biopolymer biosynthetic process [BP]
      ROPN1B        Microtubule-based flagellum [CC]
      CHST14        Dermatan sulfate proteoglycan biosynthetic process [BP]
      CPA2          Proteolysis involved in cellular protein catabolic process [BP]
      PPME1         Chromosome organization [BP]
      CNOT2         Positive regulation of cellular metabolic process [BP]
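The three selection criteria can be combined as in the sketch below. It is only an illustration under assumptions: "the first 50% of the list" is read as the top half of each method's ranked list, the set of annotations with a validated parent is taken as given, and the function name and toy identifiers are invented.

```python
from typing import Dict, List, Set


def final_predictions(
    ranked_lists: Dict[str, List[str]], has_validated_parent: Set[str]
) -> List[str]:
    """Keep annotations in the top half of every method's list with a validated parent."""
    # Annotations ranked in the first 50% of each method's list.
    top_halves = [set(ranked[: len(ranked) // 2]) for ranked in ranked_lists.values()]
    # Being in every top half implies being predicted by all methods.
    common = set.intersection(*top_halves)
    return sorted(a for a in common if a in has_validated_parent)


# Invented toy ranked lists (most likely prediction first) for the three methods.
ranked_lists = {
    "tSVD": ["a1", "a2", "a3", "a4"],
    "SIM1": ["a2", "a1", "a4", "a3"],
    "SIM2": ["a1", "a2", "a3", "a5"],
}
has_validated_parent = {"a1", "a3"}  # hypothetical outcome of the validation step

selected = final_predictions(ranked_lists, has_validated_parent)
print(selected)
```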
  • 38. Recap • Truncated SVD with the automatically chosen truncation level showed better results (percentage of predicted annotations found in the updated database version) than the previous version of the method with fixed parameters. • The new methods (SIM1 and SIM2) outperformed Truncated SVD. • The ROC analysis, database version, and text mining and web tool validation procedures proved very effective. • Extended Kendall and Spearman coefficients showed interesting patterns that would otherwise be invisible. • The novelty indicator rate proved very useful in showing which prediction trees are the most interesting, pointing to relevant research paths.
  • 39. Future • Future developments: • Integrate the software as a web application into the Search Computing platform • Implement and test the Auto-Encoder Deep Neural Network algorithm • Develop an automated text mining validation procedure • Add statistical tools to analyze the ROC curves