"Genome-Wide Annotation Prediction with SVD Truncation based on ROC Analysis" - Davide Chicco (PoliMi) @ ISCB ESCS 2012

Presentation at International Society of Computational Biology European Student Council Symposium in Basel, Switzerland. September 2012

    Presentation Transcript

    • ESCS 2012: ISCB European Student Council Symposium, September 8th 2012, Basel, Switzerland. Genome-Wide Annotation Prediction with SVD Truncation based on ROC Analysis. Davide Chicco, Marco Masseroli. davide.chicco@elet.polimi.it
    • Summary
      1. The context & the problem
         • Biomolecular annotations
         • Prediction of biomolecular annotations
         • SVD (Singular Value Decomposition)
         • SVD Truncation
      2. The proposed solution
         • ROC Area Under the Curve comparison
         • Truncation level choices
      3. Evaluation
         • Evaluation data set & results
      4. Conclusions
    • Biomolecular annotations
      • The concept of annotation: association of nucleotide or amino acid sequences with useful information describing their features
      • This information is expressed through controlled vocabularies, sometimes structured as ontologies, where every controlled term of the vocabulary is associated with a unique alphanumeric code
      • The association of such a code with a gene or protein ID constitutes an annotation
      [Diagram: a Gene / Protein linked to a Biological function feature by an Annotation (gene2bff)]
    • Biomolecular annotations (2)
      • The association of an information/feature with a gene or protein ID constitutes an annotation
      • Annotation example:
        − gene: GD4
        − feature: “is present in the mitochondrial membrane”
      [Diagram: a Gene / Protein linked to a Biological function feature by an Annotation (gene2bff)]
    • Prediction of biomolecular annotations
      • Many annotations are available in different databanks
      • However, the available annotations are incomplete
      • Only a few of them represent highly reliable, human-curated information
      • To support and speed up the time-consuming curation process, prioritized lists of computationally predicted annotations are extremely useful
      • These lists can be generated by software that implements machine learning algorithms
    • Annotation prediction through Singular Value Decomposition – SVD
      • Annotation matrix $A \in \{0, 1\}^{m \times n}$
        − m rows: genes / proteins
        − n columns: annotation terms
      • A(i,j) = 1 if gene / protein i is annotated to term j or to any descendant of j in the considered ontology structure (true path rule)
      • A(i,j) = 0 otherwise (it is unknown)

                 term01  term02  term03  term04  …  termN
        gene01     0       0       0       0     …    0
        gene02     0       1       1       0     …    1
        …          …       …       …       …     …    …
        geneM      0       0       0       0     …    0
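
As an illustration of the true path rule described on this slide, here is a minimal sketch in Python. The toy ontology, the gene and term names, and the helper functions `ancestors` and `build_annotation_matrix` are all hypothetical and not taken from the presentation; the only point is that a direct annotation to a term also sets the entries of every ancestor of that term.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "term_root": [],
    "term_a": ["term_root"],
    "term_b": ["term_a"],
    "term_c": ["term_root"],
}

def ancestors(term, parents=PARENTS):
    """Return the term itself plus all of its ancestors in the ontology."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

def build_annotation_matrix(direct, genes, terms):
    """A[i][j] = 1 if gene i is annotated to term j or to a descendant of j."""
    col = {t: j for j, t in enumerate(terms)}
    A = [[0] * len(terms) for _ in genes]
    for i, g in enumerate(genes):
        for t in direct.get(g, []):
            for anc in ancestors(t):
                A[i][col[anc]] = 1
    return A

genes = ["gene01", "gene02"]
terms = ["term_root", "term_a", "term_b", "term_c"]
direct = {"gene02": ["term_b"]}  # one direct annotation only
A = build_annotation_matrix(direct, genes, terms)
```

Here the single direct annotation of `gene02` to `term_b` fills the `term_a` and `term_root` columns as well, which is why total annotations outnumber direct ones.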
    • Singular Value Decomposition – SVD
      • Compute the SVD: $A = U \Sigma V^T$
      • Compute the reduced rank approximation: $A_k = U_k \Sigma_k V_k^T$, where $U_k$ and $V_k$ hold the first $k$ columns of $U$ and $V$, and $\Sigma_k$ holds the $k$ largest singular values
      • An annotation prediction is performed by computing a reduced rank approximation $A_k$ of the annotation matrix $A$ (where $0 < k < r$, with $r$ the number of non-zero singular values of $A$, i.e. the rank of $A$)
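
The reduced rank approximation can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation; the small matrix `A` and the function name `truncated_svd_prediction` are made up for the example.

```python
import numpy as np

def truncated_svd_prediction(A, k):
    """Return the rank-k approximation A_k = U_k S_k V_k^T of A."""
    # full_matrices=False gives the compact SVD (singular values sorted descending).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Hypothetical 5-gene x 4-term annotation matrix.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

r = np.linalg.matrix_rank(A)               # number of non-zero singular values
A_k = truncated_svd_prediction(A, k=2)     # real-valued prediction scores
```

Truncating at `k = r` reproduces `A` exactly, so only values `0 < k < r` can suggest annotations that are absent from the input.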
    • Singular Value Decomposition – SVD (2)
      • $A_k$ contains real-valued entries related to the likelihood that gene i should be annotated to term j
      • For a certain real threshold τ: if $A_k(i,j) > \tau$, gene i is predicted to be annotated to term j
        − The threshold τ can be chosen in order to obtain the best predicted annotations [Khatri et al., 2005]
    • Singular Value Decomposition – SVD (3)
      • It is possible to rewrite the SVD in an equivalent form, such that the predicted annotation profile is given by $a_{k,i}^T = a_i^T V_k V_k^T$, where $a_{k,i}^T$ is a row vector containing the predictions for gene i
      • Note that $V_k$ depends on the whole set of genes
      • Indeed, the columns of $V_k$ are a set of eigenvectors of the global term-to-term correlation matrix $T = A^T A$, estimated from the whole set of available annotations
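
The equivalence stated on this slide can be checked numerically. The sketch below uses a made-up matrix and NumPy; it compares the rank-k approximation with the projection of `A` onto the first k right singular vectors, and checks that those vectors are eigenvectors of `A^T A`.

```python
import numpy as np

# Hypothetical annotation matrix (genes x terms), for illustration only.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # reduced rank approximation

V_k = Vt[:k, :].T                             # n x k, first k right singular vectors
projected = A @ V_k @ V_k.T                   # row i is a_i^T V_k V_k^T

T = A.T @ A                                   # term-to-term correlation matrix
eigen_lhs = T @ V_k                           # should equal V_k scaled by s^2
eigen_rhs = V_k * s[:k] ** 2
```

Each row of `projected` depends on a single gene's profile `a_i` and on `V_k`, which is estimated from all genes.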
    • Evaluation of the prediction
      To evaluate the prediction, we compare each A(i,j) element to its corresponding $A_k(i,j)$ for each real threshold τ, with 0 ≤ τ ≤ 1.0:
      • if A(i,j) = 1 and $A_k(i,j) > \tau$: AC, Annotation Confirmed (AC <- AC+1)
      • if A(i,j) = 1 and $A_k(i,j) \le \tau$: AR, Annotation to be Reviewed (AR <- AR+1)
      • if A(i,j) = 0 and $A_k(i,j) \le \tau$: NAC, No Annotation Confirmed (NAC <- NAC+1)
      • if A(i,j) = 0 and $A_k(i,j) > \tau$: AP, Annotation Predicted (AP <- AP+1)
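
The four counters above translate directly into code. A minimal sketch, with a hypothetical function name `evaluate` and made-up input and score matrices:

```python
def evaluate(A, A_k, tau):
    """Count AC, AR, NAC and AP by comparing A(i,j) with A_k(i,j) > tau."""
    counts = {"AC": 0, "AR": 0, "NAC": 0, "AP": 0}
    for row_in, row_out in zip(A, A_k):
        for a, score in zip(row_in, row_out):
            predicted = score > tau
            if a == 1:
                counts["AC" if predicted else "AR"] += 1
            else:
                counts["AP" if predicted else "NAC"] += 1
    return counts

# Made-up input annotations and prediction scores.
A = [[1, 0, 1],
     [0, 0, 1]]
A_k = [[0.9, 0.6, 0.2],
       [0.1, 0.4, 0.7]]
counts = evaluate(A, A_k, tau=0.5)
```

The AP entries are the interesting output: pairs that are absent from the input but score above the threshold.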
    • SVD truncation
      • The main problem of truncated SVD: how to choose the truncation level?
      • Where to truncate? How to choose k?
    • New concept: Receiver Operating Characteristic (ROC) curve
      Starting from the annotation prediction evaluation categories just introduced:

        Category                           Input   Output
        AC: Annotation Confirmed           Yes     Yes
        AR: Annotation to be Reviewed      Yes     No
        NAC: No Annotation Confirmed       No      No
        AP: Annotation Predicted           No      Yes

      We can draw the Receiver Operating Characteristic curve for every prediction:
      • On the x axis, the annotation to be reviewed rate
      • On the y axis, the annotation predicted rate
    • New concept: Receiver Operating Characteristic (ROC) curve (2)
      [Plot: ROC curve; x axis: annotation to be reviewed rate; y axis: annotation predicted rate]
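
The formulas for the two rates are not preserved in this transcript. The sketch below therefore assumes AR rate = AR / (AC + AR) and AP rate = AP / (AP + NAC); both definitions, the added curve endpoints (0, 1) and (1, 0), and the trapezoidal integration are assumptions made for illustration, not taken from the slides. Under these assumptions a prediction that agrees perfectly with the input yields an area of 0, which is consistent with the talk's choice of the minimum AUC.

```python
def rates(A, A_k, tau):
    """Return (AR rate, AP rate) at threshold tau, under the assumed definitions."""
    ac = ar = nac = ap = 0
    for row_in, row_out in zip(A, A_k):
        for a, score in zip(row_in, row_out):
            if a == 1 and score > tau:
                ac += 1
            elif a == 1:
                ar += 1
            elif score > tau:
                ap += 1
            else:
                nac += 1
    ar_rate = ar / (ac + ar) if ac + ar else 0.0
    ap_rate = ap / (ap + nac) if ap + nac else 0.0
    return ar_rate, ap_rate

def roc_auc(A, A_k, thresholds):
    """Area under the (AR rate, AP rate) curve, by the trapezoidal rule."""
    points = {rates(A, A_k, t) for t in thresholds}
    points |= {(0.0, 1.0), (1.0, 0.0)}          # assumed curve endpoints
    # The curve is non-increasing: for equal x, visit the higher y first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

thresholds = [i / 10 for i in range(11)]        # 0.0, 0.1, ..., 1.0
A = [[1, 0], [0, 1]]
good_scores = [[0.95, 0.15], [0.25, 0.85]]      # agrees with A
bad_scores = [[0.05, 0.85], [0.75, 0.15]]       # disagrees with A
auc_good = roc_auc(A, good_scores, thresholds)
auc_bad = roc_auc(A, bad_scores, thresholds)
```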
    • SVD truncation choice
      Algorithm:
      1) Choose some possible truncation levels
      2) Compute the Receiver Operating Characteristic for each SVD prediction of those truncation levels
      3) Compute the Area Under the Curve (AUC) of each ROC
      4) Choose the truncation level whose ROC has the minimum AUC
    • SVD truncation choice (2): the same algorithm, annotated “Quite easy!”
    • SVD truncation choice (3): the same algorithm, annotated “Quite challenging!”
    • Minimum AUC among the ROCs of various truncation levels
      Step 1) Choose some possible truncation levels. We cannot compute the SVD, its ROC and its AUC for every truncation value, because that would be too expensive (in time and resources).
      Algorithm:
      1) Since the matrix A has m rows (genes) and n columns (annotation terms), we take p = min(m, n)
      2) Since r ≤ p is the number of non-zero singular values along the diagonal of Σ, the best truncation value is in the interval [1, r]
      3) We limited the range to [10% of r, 90% of r], to avoid truncation levels that, during the SVD reconstruction phase, would consider too few of the main singular values, or almost all the non-zero singular values of A
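
The range restriction above can be sketched in a few lines. The function name `truncation_search_range`, the tolerance used to decide that a singular value is non-zero, and the rounding of the 10% and 90% bounds are assumptions for the example; the slides do not specify them.

```python
def truncation_search_range(singular_values, tol=1e-10):
    """Return (r, low, high): the rank and the [10% r, 90% r] search bounds."""
    # r = number of non-zero singular values (tolerance is an assumption).
    r = sum(1 for s in singular_values if s > tol)
    low = max(1, (r + 9) // 10)        # 10% of r, rounded up, at least 1
    high = max(low, (9 * r) // 10)     # 90% of r, rounded down
    return r, low, high

# Made-up spectrum: 20 non-zero singular values followed by 5 zeros.
spectrum = [float(v) for v in range(20, 0, -1)] + [0.0] * 5
r, low, high = truncation_search_range(spectrum)
```

Integer arithmetic is used for the bounds to avoid floating-point rounding surprises at the edges of the interval.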
    • Minimum AUC among the ROCs of various truncation levels (2)
      4) We take 25% of r as the first possible truncation level, and compute the SVD for it and for the next four levels: q1, q2, q3, q4, q5
      5) We compute the ROC and its AUC for q1, q2, q3, q4, q5
      6) We take the level that has the minimum AUC
    • Minimum AUC among the ROCs of various truncation levels (3) If the minimum AUC among those of (q1, q2, q3, q4, q5) belongs to the middle element q3, it is taken as the best truncation value, and the algorithm finishes. This means we found a local minimum.
    • Minimum AUC among the ROCs of various truncation levels (4) If the minimum AUC among those of (q1, q2, q3, q4, q5) belongs to the 4th element q4, it is taken as the best truncation value, and the algorithm finishes. This means we found a local minimum.
    • Minimum AUC among the ROCs of various truncation levels (5) If the minimum AUC among those of (q1, q2, q3, q4, q5) belongs to the 2nd element q2, it is taken as the best truncation value, and the algorithm finishes. This means we found a local minimum.
    • Minimum AUC among the ROCs of various truncation levels (6) If the minimum among (q1, q2, q3, q4, q5) is q1, the first, the AUC values will probably decrease further moving to the left, so we move the truncation interval to the left.
    • Minimum AUC among the ROCs of various truncation levels (7) If the minimum among (q1, q2, q3, q4, q5) is q5, the last, the AUC values will probably decrease further moving to the right, so we move the truncation interval to the right.
    • Minimum AUC among the ROCs of various truncation levels (8) The levels are computed by adding 2*q5-q1 to each element of the first analysis
    • Minimum AUC among the ROCs of various truncation levels (9) On the new group of levels, we repeat the minimum computation and the choice. If the ROC of q7, q8 or q9 has the minimum AUC, the algorithm stops. If this local minimum is lower than the previous ones, it is considered the global minimum and chosen as the best truncation value.
    • Minimum AUC among the ROCs of various truncation levels (10) On the new group of levels, we repeat the minimum computation and the choice
    • Minimum AUC among the ROCs of various truncation levels (11) On the new group of levels, we repeat the minimum computation and the choice. The algorithm stops when:
      • one of the middle elements is chosen, or
      • the maximum number of attempts (e.g. 5) has been made
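
The window search described across these slides can be sketched as below. This is an illustrative reading, not the authors' code: `auc_of` is a hypothetical callable standing in for "compute the SVD at level k, build its ROC, return its AUC", the spacing `step` between levels is a free parameter, and the window is shifted so that consecutive windows are adjacent, which is an assumption (the slide gives the shift as 2*q5-q1).

```python
def search_truncation(auc_of, r, step, window=5, max_attempts=5):
    """Slide a window of truncation levels until an interior minimum is found."""
    low, high = max(1, (r + 9) // 10), (9 * r) // 10   # [10% r, 90% r]
    start = max(low, r // 4)                            # first level: 25% of r
    best_k, best_auc = None, float("inf")
    cache = {}                                          # avoid recomputing a level
    for _ in range(max_attempts):
        levels = [start + i * step for i in range(window)]
        levels = [k for k in levels if low <= k <= high]
        if not levels:
            break
        for k in levels:
            if k not in cache:
                cache[k] = auc_of(k)
        aucs = [cache[k] for k in levels]
        i_min = min(range(len(levels)), key=aucs.__getitem__)
        if aucs[i_min] < best_auc:                      # keep the lowest AUC seen
            best_k, best_auc = levels[i_min], aucs[i_min]
        if 0 < i_min < len(levels) - 1:
            break                                       # interior element: local minimum
        if i_min == 0:
            start = levels[0] - window * step           # move the interval left
        else:
            start = levels[-1] + step                   # move the interval right
    return best_k, best_auc

# Synthetic AUC curve with its minimum at k = 60, for illustration only.
best_k, best_auc = search_truncation(lambda k: (k - 60) ** 2 / 10000, r=100, step=5)
```

With the defaults, at most 5 windows of 5 levels are evaluated, so `auc_of` is called at most 25 times regardless of `r`.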
    • Evaluation data set
      • We considered the Gene Ontology annotations of the organisms Gallus gallus (Chicken) and Bos taurus (Cattle)
        − Excluding the less reliable Inferred Electronic Annotations
      • After this, the resulting data sets were:

        Organism        Ontology             Genes   Terms   Annotations (direct)
        Gallus gallus   Biological Process     275     527       738
        Gallus gallus   Cellular Component     260     148       478
        Gallus gallus   Molecular Function     309     225       509
        Bos taurus      Biological Process     512     930     1,557
        Bos taurus      Cellular Component     497     234       921
        Bos taurus      Molecular Function     543     422       934

      The total (true-path-rule) annotations are about 10 times as many as the direct annotations.
    • Results
      • To evaluate the performance of our method, we used annotations between:
        − terms: Biological process (BP), Cellular component (CC) and Molecular function (MF) GO features
        − genes of the organisms Gallus gallus and Bos taurus
      • These were available in July 2009 in an older version of the Gene Ontology Annotation (GOA) database ( http://GeneOntology.org/ ).
      • For example, by analyzing the Gallus gallus annotations between genes and BP terms (8,731 annotations; 275 genes; 610 BP terms), our method suggested k=77 as the best truncation value for the SVD.
    • Results (2)
      • This value of k led to a ROC curve with AUC=40.27%, while the 2nd best k value, 59, led to AUC=40.46%
      • From the 8,731 input annotations, with τ=0.4, the SVD method with the best truncation level k=77 predicted 44 annotations as APs.
      • Out of these, 28 (63.63%) turned out to be present among the GO annotations in a GOA database version 27 months more recent (Oct. 2011); these 28 APs included 14 annotations (50%) with GO evidence different from IEA or ND.
      • Other truncation levels led to worse results
    • Results (3)
      • Costs (time & resources): maximum number of SVD computations: 5 * 5 = 25 << min(#genes, #terms)
        − 5 * 5: maximum number of truncation intervals, times maximum number of elements in a truncation interval
        − min(#genes, #terms): maximum number of SVD computations if all the possible truncation levels were considered (in the previous table, from 148 to over)
    • Conclusions
      • Problem: choosing the SVD truncation level in the context of genomic annotation prediction
      • Proposed solution: finding the truncation level corresponding to the minimum AUC of the ROC curve
    • Conclusions (2)
      • To avoid computing the SVD for all the possible truncation levels (too expensive!), we proposed an algorithm for the search of local and global minima.
      • The best SVD truncation levels suggested by this algorithm for our dataset (annotations of Bos taurus and Gallus gallus genes, and GO terms) gave better results than other truncation levels, in a reasonable time.
    • Future developments
      • To obtain the best sampling, we could study the gradient variations in the distribution of the AUC values for different truncation levels and the histogram of the eigenvalues
      • Our approach is not limited to the Gene Ontology and can be applied to any controlled annotations
    • Genome-Wide Annotation Prediction with SVD Truncation based on ROC Analysis. Thanks for your attention! www.DavideChicco.it davide.chicco@elet.polimi.it