Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Slides of my doctoral thesis dissertation talk, given on 20 March 2014 at Politecnico di Milano. Title: "Computational prediction of gene functions through machine learning methods and multiple validation procedures"



  1. Computational Prediction of Gene Functions through Machine Learning Methods and Multiple Validation Procedures. Candidate: Davide Chicco (davide.chicco@polimi.it). Supervisor: Marco Masseroli. PhD Thesis Defense Dissertation, 20 March 2014.
  2. “Computational Prediction of Gene Functions through Machine Learning Methods and Multiple Validation Procedures”. Outline: 1) Analyzed scientific problem 2) Machine learning methods used 3) Validation procedures 4) Main results 5) Annotation list correlation measures 6) Novelty indicator 7) Final list of likely predicted annotations 8) Conclusions
  3. Biomolecular annotations • The concept of annotation: the association of a nucleotide or amino acid sequence with useful information describing its features • The association of a gene and an information feature term constitutes a biomolecular annotation • This information is expressed through controlled vocabularies, sometimes structured as ontologies (e.g. the Gene Ontology), where every controlled term of the vocabulary is associated with a unique alphanumeric code (Diagram: gene → biological function feature, i.e. the gene2bff annotation)
  4. Biomolecular annotations • The association of an information feature with a gene ID constitutes an annotation • Annotation example: the scientific fact “the gene GD4 is present in the mitochondrial membrane” corresponds to the coupling <GD4, mitochondrial membrane>
  5. The problem • Many annotations are available in different databanks • However, the available annotations are incomplete • Only a few of them represent highly reliable, human-curated information • In vitro experiments are expensive (e.g. 1,000 € and 3 weeks) • To support and speed up the time-consuming curation process, prioritized lists of computationally predicted annotations are extremely useful • These lists can be generated by software based on machine learning algorithms
  6. The problem • Other scientists and researchers have addressed the problem in the past using: • Support Vector Machines (SVM) [Barutcuoglu et al., 2006] • the k-nearest neighbors algorithm (kNN) [Tao et al., 2007] • decision trees [King et al., 2003] • hidden Markov models (HMM) [Mi et al., 2013] • … • These methods were all good at stating whether a predicted annotation was correct or not, but were not able to make extrapolations, that is, to suggest new annotations absent from the input dataset
  7. The software • BioAnnotationPredictor: a pipeline of steps and tools to predict, validate, and analyze biomolecular annotation lists • Pipeline: data reading → statistical method (input matrix A → output matrix A~) → predicted annotation lists
  8. Statistical method • The software reads the data from the GPDW database • The software creates the input annotation matrix A ∈ {0, 1}^(m x n), with m rows (genes) and n columns (annotation features) • A(i,j) = 1 if gene i is annotated to feature j or to any descendant of j in the considered ontology structure (true path rule) • A(i,j) = 0 otherwise (the annotation is unknown) • Example (rows: genes, columns: features): gene 2 → [0, 1, 1, 0, …, 1]
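The matrix construction with true-path-rule propagation can be sketched as follows; the toy ontology, gene names, and the function `build_annotation_matrix` are illustrative assumptions, not the thesis code or the GPDW schema:

```python
import numpy as np

# Hypothetical toy ontology: child term -> list of parent terms.
# The true path rule propagates an annotation from a term to all ancestors.
PARENTS = {
    "membrane": [],
    "mitochondrial membrane": ["membrane"],
    "organelle": [],
    "mitochondrion": ["organelle"],
}

def ancestors(term):
    """All ancestors of a term in the DAG (not including the term itself)."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        t = stack.pop()
        if t not in found:
            found.add(t)
            stack.extend(PARENTS[t])
    return found

def build_annotation_matrix(direct_annotations, genes, terms):
    """A(i, j) = 1 if gene i is annotated to term j or to any descendant
    of j, i.e. each direct annotation is propagated upward in the DAG."""
    A = np.zeros((len(genes), len(terms)), dtype=int)
    row = {g: i for i, g in enumerate(genes)}
    col = {t: j for j, t in enumerate(terms)}
    for gene, term in direct_annotations:
        A[row[gene], col[term]] = 1
        for anc in ancestors(term):
            A[row[gene], col[anc]] = 1
    return A

genes = ["GD4", "GX1"]
terms = ["membrane", "mitochondrial membrane", "organelle", "mitochondrion"]
A = build_annotation_matrix([("GD4", "mitochondrial membrane"),
                             ("GX1", "mitochondrion")], genes, terms)
```

Note how the annotation of GD4 to "mitochondrial membrane" also sets the "membrane" column, while the unrelated "organelle" branch stays 0.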
  9. Statistical method • The software applies a statistical method (truncated Singular Value Decomposition; Semantically Improved SVD with gene clustering; Semantically Improved SVD with clustering and term-term similarity weights) to the binary input matrix A • It returns a real-valued output matrix A~ • Every element of the A matrix is compared to its corresponding element of the A~ matrix
  10. Statistical method • After the computation, we compare each element Aij of the binary input matrix to the corresponding element Aij~ of the real-valued output matrix • if Aij = 1 & Aij~ > τ: AC (TP) • if Aij = 1 & Aij~ ≤ τ: AR (FN) • if Aij = 0 & Aij~ ≤ τ: NAC (TN) • if Aij = 0 & Aij~ > τ: AP (FP) • AC: Annotation Confirmed; AR: Annotation to be Reviewed; NAC: No Annotation Confirmed; AP: Annotation Predicted • τ: the threshold that minimizes the sum APs + ARs
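A minimal sketch of this comparison step, and of choosing τ to minimize APs + ARs, on toy matrices (the function names and data are illustrative, not the thesis implementation):

```python
import numpy as np

def classify(A, A_tilde, tau):
    """Classify each (gene, term) cell by comparing the binary input A
    with the real-valued reconstruction A~ against threshold tau:
    AC (confirmed), AR (to be reviewed), NAC (no annotation), AP (predicted)."""
    AC  = (A == 1) & (A_tilde >  tau)   # TP
    AR  = (A == 1) & (A_tilde <= tau)   # FN
    NAC = (A == 0) & (A_tilde <= tau)   # TN
    AP  = (A == 0) & (A_tilde >  tau)   # FP
    return AC, AR, NAC, AP

def best_tau(A, A_tilde, candidates):
    """Pick the tau that minimizes APs + ARs, as stated on the slide."""
    def cost(t):
        _, AR, _, AP = classify(A, A_tilde, t)
        return AP.sum() + AR.sum()
    return min(candidates, key=cost)

A = np.array([[1, 0], [0, 1]])
A_tilde = np.array([[0.9, 0.2], [0.6, 0.8]])
AC, AR, NAC, AP = classify(A, A_tilde, tau=0.5)
```

With τ = 0.5 the cell (1, 0), absent in input but scored 0.6, becomes an Annotation Predicted.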
  11. Predicted annotation lists • AC: Annotation Confirmed; AR: Annotation to be Reviewed; NAC: No Annotation Confirmed; AP: Annotation Predicted • The Annotations Predicted (AP, i.e. FP) are the annotations absent in input and predicted by our software: we suggest them as present • We record them in ranked lists, e.g.: rank 1: annotation ID 218405, likelihood 0.9742584; rank 2: annotation ID 222571, likelihood 0.8545574; …; rank n: annotation ID 203145, likelihood 0.1673128
  12. Truncated Singular Value Decomposition (tSVD) • An annotation prediction is performed by computing a reduced rank-k approximation A~ of the annotation matrix A (where 0 < k < r, with r the number of non-zero singular values of A, i.e. the rank of A)
  13. Truncated Singular Value Decomposition (tSVD) • Only the k most «important» components of the SVD of A are used for the reconstruction (where 0 < k < r, with r the number of non-zero singular values of A, i.e. the rank of A) • In [P. Khatri et al., "A semantic analysis of the annotations of the human genome", Bioinformatics, 2005], the authors argued that the study of the matrix A reveals the semantic relationships of the gene-function associations • A large value of a~ij suggests that gene i should be annotated to term j, whereas a value close to zero suggests the opposite
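The tSVD reconstruction can be sketched with NumPy (toy matrix; the function name is an illustrative assumption):

```python
import numpy as np

def truncated_svd_reconstruction(A, k):
    """Rank-k approximation A~ = U_k S_k V_k^T of the annotation matrix A,
    with 0 < k < rank(A)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy binary gene-to-term matrix (rows: genes, columns: terms).
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = truncated_svd_reconstruction(A, k=2)
```

The zero cell A[3, 3] receives a clearly positive score in A~ because gene 4 behaves like gene 3, which is annotated to both of the last two terms; this is exactly the kind of cell the pipeline proposes as an Annotation Predicted.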
  14. Truncated Singular Value Decomposition (tSVD) • We started from this method, developed by Khatri et al. (2005) at Wayne State University, Detroit, and re-implemented it • Improvement: Khatri et al. used a fixed SVD truncation level k = 500, while we developed a method for the automated, data-driven selection of k based on the Receiver Operating Characteristic (ROC) curve • We obtained better results, shown in several publications
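The slide does not detail the ROC-based criterion for k, so the following is a simplified stand-in under stated assumptions: hold out a fraction of the known annotations, reconstruct at each candidate k, and keep the k whose reconstruction best ranks the held-out 1s above the true 0s (AUC computed via the Mann-Whitney statistic). All names here are illustrative, not the thesis procedure:

```python
import numpy as np

def auc(pos_scores, neg_scores):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def choose_k(A, ks, rng):
    """Hold out a fifth of the known annotations, reconstruct at each
    candidate k, and keep the k whose reconstruction best ranks the
    held-out 1s above the 0s (measured by AUC)."""
    ones = np.argwhere(A == 1)
    held = ones[rng.choice(len(ones), size=max(1, len(ones) // 5),
                           replace=False)]
    A_train = A.copy()
    A_train[held[:, 0], held[:, 1]] = 0      # mask the held-out annotations
    zeros = np.argwhere(A == 0)
    U, s, Vt = np.linalg.svd(A_train, full_matrices=False)
    best_k, best_auc = None, -1.0
    for k in ks:
        A_tilde = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
        score = auc(A_tilde[held[:, 0], held[:, 1]],
                    A_tilde[zeros[:, 0], zeros[:, 1]])
        if score > best_auc:
            best_k, best_auc = k, score
    return best_k

A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
k_best = choose_k(A, ks=[1, 2, 3], rng=np.random.default_rng(0))
```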
  15. Truncated SVD with gene clustering (SIM1) • Semantically improved (SIM1) version of the truncated SVD, based on gene clustering [P. Drineas et al., "Clustering large graphs via the singular value decomposition", Machine Learning, 2004] • Inspiring idea: similar genes can be grouped into clusters, which have different weights
  16. Truncated SVD with gene clustering (SIM1) 1. We choose a number C of clusters, and completely discard the columns of matrix U with index j = C+1, ..., n (we have an algorithm for the choice of C) 2. Each column uc of the SVD matrix U represents a cluster, and the value U(i,c) indicates the membership of gene i in the c-th cluster 3. For each cluster, we first generate Wc = diag(uc), and then the modified gene-to-term matrix Ac = Wc A, in which the i-th row of A is weighted by the membership score of the corresponding gene in the c-th cluster
  17. Truncated SVD with gene clustering (SIM1) 4. Then, we compute Tc = Ac^T Ac and its SVD 5. Then, every row ai~ of the A~ matrix is computed considering the c-th cluster that minimizes the Euclidean distance from the original vector: ai~ = ai Vk,c Vk,c^T 6. The output matrix is produced
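A minimal NumPy sketch of steps 1 to 6, under the assumption that Vk,c collects the top-k right singular vectors of Ac (equivalently, the leading eigenvectors of Tc); this is an illustrative reading of the slides, not the thesis code:

```python
import numpy as np

def sim1_reconstruction(A, C, k):
    """SIM1 sketch: weight genes by the first C left singular vectors of A,
    build a weighted matrix A_c per cluster, take the top-k right singular
    vectors V_{k,c} of A_c (eigenvectors of T_c = A_c^T A_c), and reconstruct
    each gene row with the cluster whose projection a_i V V^T is closest
    to the original row a_i."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    projectors = []
    for c in range(C):
        W_c = np.diag(U[:, c])                 # gene memberships of cluster c
        A_c = W_c @ A
        # right singular vectors of A_c = eigenvectors of T_c = A_c^T A_c
        _, _, Vct = np.linalg.svd(A_c, full_matrices=False)
        Vk = Vct[:k, :].T                      # n x k
        projectors.append(Vk @ Vk.T)           # n x n projector
    A_tilde = np.zeros_like(A, dtype=float)
    for i in range(A.shape[0]):
        a_i = A[i]
        recons = [a_i @ P for P in projectors]
        # pick the cluster whose projection is closest to the original row
        best = min(range(C), key=lambda c: np.linalg.norm(a_i - recons[c]))
        A_tilde[i] = recons[best]
    return A_tilde

A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = sim1_reconstruction(A, C=2, k=2)
```

On this block-structured toy matrix the two clusters align with the two gene groups, so each row is reconstructed by its own cluster's projector.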
  18. Truncated SVD with gene clustering and term-similarity weights (SIM2) • Semantically improved (SIM2) version of the truncated SVD, based on gene clustering and term-term similarity weights [P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy", arXiv.org, 1995] • Inspiring idea: functionally similar terms should be annotated to the same genes
  19. Truncated SVD with gene clustering and term-similarity weights (SIM2) • In the algorithm shown before, we add the following step: 6a) To achieve a more accurate clustering, we compute the eigenvectors of the matrix G~ = A S A^T, where the real n x n matrix S is the term similarity matrix • Starting from a pair of ontology terms j1 and j2, the term functional similarity S(j1, j2) can be calculated with different methods; here the similarity is based on the Resnik measure [P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy", arXiv.org, 1995]
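A minimal sketch of the Resnik measure on a hypothetical toy ontology (the term names and annotation counts are illustrative): the similarity of two terms is the information content, -log p(t), of their most informative common ancestor:

```python
import math

# Hypothetical toy ontology (child -> parents) and annotation counts
# (term -> number of genes annotated to the term or to any descendant).
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "protein binding": ["binding"],
    "ion binding": ["binding"],
}
COUNTS = {"molecular_function": 100, "binding": 40,
          "protein binding": 25, "ion binding": 10}
TOTAL = COUNTS["molecular_function"]

def ancestors_incl(term):
    """The term itself plus all of its ancestors in the DAG."""
    out = {term}
    stack = list(PARENTS[term])
    while stack:
        t = stack.pop()
        if t not in out:
            out.add(t)
            stack.extend(PARENTS[t])
    return out

def ic(term):
    """Information content: -log of the term's annotation probability."""
    return -math.log(COUNTS[term] / TOTAL)

def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors_incl(t1) & ancestors_incl(t2)
    return max(ic(t) for t in common)
```

Rare terms are more informative, so two terms meeting at a deep, rarely used ancestor score higher than two terms that only share the root (whose IC is 0).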
  20. Other methods • With some colleagues at Politecnico di Milano we also implemented other methods (not included in this thesis): Probabilistic Latent Semantic Analysis (pLSA); Latent Dirichlet Allocation with Gibbs sampling (LDA) • With some colleagues at the University of California, Irvine we have been designing and implementing other models: an auto-encoder deep neural network
  21. Validation • After the computation, we compare each element Aij of the input matrix to the corresponding element Aij~ of the output matrix • if Aij = 1 & Aij~ > τ: AC (TP); if Aij = 1 & Aij~ ≤ τ: AR (FN); if Aij = 0 & Aij~ ≤ τ: NAC (TN); if Aij = 0 & Aij~ > τ: AP (FP) • τ: the threshold that minimizes the sum APs + ARs
  22. ROC Analysis Validation • These four result classes can be considered analogous to TP, FN, TN, FP: AC: Annotation Confirmed (TP); AR: Annotation to be Reviewed (FN); NAC: No Annotation Confirmed (TN); AP: Annotation Predicted (FP) • The software draws ROC curves, with AC rate = AC / (AC + AR) and AP rate = AP / (AP + NAC)
  23. ROC Analysis Validation • Ten-fold cross validation • The software draws the ROC curve of the AC rate = AC / (AC + AR) against the AP rate = AP / (AP + NAC) • It computes the Area Under the Curve (AUC) • If AUC ≥ 66.67% = 2/3, the matrix reconstruction is considered good; otherwise, it is considered bad
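The ROC construction from these two rates can be sketched as follows (toy data; `roc_points` and `auc_trapezoid` are illustrative names, not the thesis code):

```python
import numpy as np

def roc_points(A, A_tilde, thresholds):
    """For each threshold tau, compute the ROC point (AP rate, AC rate):
    AC rate = AC / (AC + AR)   (true positive rate)
    AP rate = AP / (AP + NAC)  (false positive rate)."""
    points = []
    for tau in thresholds:
        AC  = ((A == 1) & (A_tilde >  tau)).sum()
        AR  = ((A == 1) & (A_tilde <= tau)).sum()
        AP  = ((A == 0) & (A_tilde >  tau)).sum()
        NAC = ((A == 0) & (A_tilde <= tau)).sum()
        points.append((AP / (AP + NAC), AC / (AC + AR)))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule over the sorted
    points, with the (0,0) and (1,1) endpoints added."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

A = np.array([[1, 0], [0, 1]])
A_tilde = np.array([[0.9, 0.2], [0.3, 0.8]])
pts = roc_points(A, A_tilde, thresholds=[0.1, 0.5, 0.95])
area = auc_trapezoid(pts)
```

On this toy example the reconstruction separates the known annotations perfectly, so the AUC is 1.0, well above the 2/3 acceptance threshold from the slide.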
  24. Database Validation • Since more recent database versions contain better data and information: • Compute the prediction of annotations on a former database version (e.g. July 2009) • Compare these predictions to a newer version of that database (e.g. March 2013) • The more Annotations Predicted found in the new version, the better the predictions • Outcome: a percentage of accuracy (July 2009 → March 2013)
  25. Database Validation • Two main issues: retrieving the annotation IDs of the former database version to be used in the updated database version; and managing duplicate annotations (i.e. annotations having different evidence codes)
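The database-version check itself reduces to a set membership test over (gene, term) pairs; the gene symbols and GO-style IDs below are illustrative placeholders, not thesis data:

```python
def validation_rate(predicted, new_db_annotations):
    """Fraction of predicted (gene, term) pairs that appear as curated
    annotations in the newer database version, plus the confirmed pairs."""
    confirmed = [p for p in predicted if p in new_db_annotations]
    return len(confirmed) / len(predicted), confirmed

# Hypothetical example: predictions made on a July 2009 snapshot,
# checked against a March 2013 snapshot.
predicted_2009 = [("GD4", "GO:0031966"), ("GX1", "GO:0005739"),
                  ("GX2", "GO:0005634")]
db_2013 = {("GD4", "GO:0031966"), ("GX1", "GO:0005739"),
           ("GY7", "GO:0005886")}
rate, confirmed = validation_rate(predicted_2009, db_2013)
```

Here two of the three predictions are confirmed by the newer snapshot, giving an accuracy of 2/3; in practice the duplicate-annotation issue mentioned above means the pairs must first be deduplicated across evidence codes.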
  26. Text Mining and Web Tool Validation • Literature text mining and web tool validation procedure • Databanks may not be up to date, so we manually searched for the predicted annotations through: • literature resources such as PubMed • web tools such as AmiGO and GeneCards
  27. Results: ROC Curves • ROC curves for the Homo sapiens CC dataset. SVD-Khatri has k = 500; SVD-us, SIM1, and SIM2 have k = 378; SIM1 and SIM2 use C = 2, and SIM2 uses the Resnik measure.
  28. Results • Results on the following annotation datasets: • Homo sapiens genes and CC feature terms • Homo sapiens genes and MF feature terms • Homo sapiens genes and BP feature terms • Homo sapiens genes and CC+MF+BP feature terms
  29. Results • The literature review allowed us to confirm some additional predicted annotations
  30. List Comparison Measures • Comparing methods and parameters: when we have different lists of predicted annotations, we want to know how similar or different they are • Answering this question helps us understand how the method parameters behave
  31. List Comparison Measures • How similar are these lists? • Spearman's rank correlation coefficient: based on the differences between the positions of each element in the two lists (e.g. 3rd position minus 1st position = 2)
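The slide's "difference position" description matches Spearman's footrule (sum of absolute rank differences), while the standard Spearman coefficient uses squared rank differences; a sketch of both on toy annotation-ID lists:

```python
def spearman_footrule(list_a, list_b):
    """Sum over elements of the absolute difference between their positions
    in the two lists (the slide's 'difference position' measure)."""
    pos_b = {x: i for i, x in enumerate(list_b)}
    return sum(abs(i - pos_b[x]) for i, x in enumerate(list_a))

def spearman_rho(list_a, list_b):
    """Spearman's rank correlation coefficient for two rankings of the
    same n items: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(list_a)
    pos_b = {x: i for i, x in enumerate(list_b)}
    d2 = sum((i - pos_b[x]) ** 2 for i, x in enumerate(list_a))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Two toy ranked lists of annotation IDs.
a = [10000, 20000, 30000, 40000]
b = [30000, 10000, 40000, 20000]
```

Identical lists give a footrule of 0 and a coefficient of 1; the shuffled list above happens to give a coefficient of exactly 0.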
  32. List Comparison Measures • How similar are these lists? • Kendall tau distance: the total number of bubble-sort swaps needed to transform one list into the other
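Counting bubble-sort swaps is equivalent to counting discordant pairs (inversions); a minimal sketch on toy annotation-ID lists:

```python
def kendall_tau_distance(list_a, list_b):
    """Number of adjacent transpositions (bubble-sort swaps) needed to turn
    one ranking into the other, i.e. the number of discordant pairs."""
    pos_b = {x: i for i, x in enumerate(list_b)}
    ranks = [pos_b[x] for x in list_a]
    # count inversions (discordant pairs) in the rank sequence
    return sum(1 for i in range(len(ranks))
                 for j in range(i + 1, len(ranks))
                 if ranks[i] > ranks[j])

a = [10000, 20000, 90000]
b = [20000, 10000, 90000]
```

The lists above differ by a single swap of the first two IDs, so their distance is 1; a fully reversed list of n items reaches the maximum of n(n-1)/2.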
  33. List Comparison Measures • Extended Kendall distance and Extended Spearman coefficient, comparing AP lists together with their NAC lists • We assign a high penalty if an element is absent from one of the lists, and a low penalty if an element is absent from one of the AP lists but present in its NAC list
  34. List Comparison Measures • Significant patterns: • The Extended Kendall distances show that the more similar the SVD truncation levels are, the lower the Extended Kendall distance is, and so the more similar the lists are • Lists generated by predictions that produced similar AUC percentages have similar, low Extended Spearman coefficients: such lists differ in very few elements
  35. Novelty Indicator • An indicator to express the "novelty" rate of a prediction in a gene tree: • a statistical rate (Schlicker rate based on the DAG) • a visual DAG viewer • Example: DAG tree of the Molecular Function terms predicted for the Homo sapiens gene P2RY14. Black balls: terms already present in the database. Blue hexagons: predicted terms.
  36. Novelty Indicator • An indicator to express the "novelty" rate of a prediction in a gene tree: • a statistical rate (Schlicker rate based on the DAG) • a visual DAG viewer • Example: DAG tree of the Molecular Function terms predicted for the Homo sapiens gene CCR2. Black balls: terms already present in the database. Blue hexagons: predicted terms.
  37. Final predictions • We finally get a list of the most likely predicted annotations, which have the following characteristics: • predicted by all three methods (tSVD, SIM1, SIM2) • prediction ranked in the first 50% of the list • having at least one validated parent • Examples (gene symbol: feature term): PPME1: organelle organization [BP]; CHST14: chondroitin sulfate proteoglycan biosynthetic process [BP]; CHST14: biopolymer biosynthetic process [BP]; ROPN1B: microtubule-based flagellum [CC]; CHST14: dermatan sulfate proteoglycan biosynthetic process [BP]; CPA2: proteolysis involved in cellular protein catabolic process [BP]; PPME1: chromosome organization [BP]; CNOT2: positive regulation of cellular metabolic process [BP]
  38. Recap • The truncated SVD with the automatically chosen truncation level showed better results (percentage of predicted annotations found in the updated database version) than the previous method version with fixed parameters • The new methods (SIM1 and SIM2) outperformed the truncated SVD • The ROC analysis, database version, and text mining and web tool validation procedures proved very effective • The Extended Kendall and Spearman coefficients revealed interesting patterns, otherwise invisible • The novelty indicator rate proved very useful in highlighting the most interesting prediction trees, showing relevant research paths
  39. Future • Future developments: • Integrate the software as a web application into the Search Computing platform • Implement and test the auto-encoder deep neural network algorithm • Develop an automated text mining validation procedure • Add statistical tools to analyze the ROC curves
