"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations" - Davide Chicco (PoliMi) @ Dei PoliMi PhDay 2012
 


Talk delivered by Davide Chicco at PhDay 2012 at Dipartimento di Elettronica e Informazione of Politecnico di Milano, Milan, September 2012.

    Presentation Transcript

    • 2012, DIPARTIMENTO DI ELETTRONICA E INFORMAZIONE
      Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations
      Davide Chicco, Pietro Pinoli, Marco Masseroli
      davide.chicco@elet.polimi.it
    • Summary
      1. The problem
        • Biomolecular annotations
        • Prediction of biomolecular annotations
      2. The methods
        • SVD – Singular Value Decomposition
        • pLSA – Probabilistic Latent Semantic Analysis
      3. Evaluation
        • Evaluation data set
        • Evaluation results
      4. Conclusions
      Davide Chicco @ PhDay2012
    • Biomolecular annotations
      • The concept of annotation: association of nucleotide or amino acid sequences with useful information describing their features
      • This information is expressed through controlled vocabularies, sometimes structured as ontologies, where every controlled term of the vocabulary is associated with a unique alphanumeric code
      • The association of such a code with a gene or protein ID constitutes an annotation
      [Diagram: Gene / Protein, linked by an Annotation (gene2bff) to a Biological function feature]
    • Biomolecular annotations (2)
      • The association of a piece of information or feature with a gene or protein ID constitutes an annotation
      • Annotation example:
        • gene: GD4
        • feature: “is present in the mitochondrial membrane”
      [Diagram: Gene / Protein, linked by an Annotation (gene2bff) to a Biological function feature]
    • Prediction of biomolecular annotations
      • Many annotations are available in different databanks
      • However, the available annotations are incomplete
      • Only a few of them represent highly reliable, human-curated information
      • To support and speed up the time-consuming curation process, prioritized lists of computationally predicted annotations are extremely useful
      • These lists could be generated by software that implements Machine Learning algorithms
    • Annotation prediction through Singular Value Decomposition – SVD
      • Annotation matrix $A \in \{0, 1\}^{m \times n}$
        − m rows: genes / proteins
        − n columns: annotation terms
      • $A(i,j) = 1$ if gene / protein i is annotated to term j or to any descendant of j in the considered ontology structure (true path rule)
      • $A(i,j) = 0$ otherwise (the annotation is unknown)

                 term01  term02  term03  term04   …   termN
        gene01      0       0       0       0     …     0
        gene02      0       1       1       0     …     1
        …           …       …       …       …     …     …
        geneM       0       0       0       0     …     0
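The true path rule on the slide above can be illustrated with a small sketch. The ontology, gene names and term names below are hypothetical, and the helper functions are illustrative only, not the authors' implementation: a direct annotation to a term is propagated to every ancestor of that term.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "term_a": [],
    "term_b": ["term_a"],
    "term_c": ["term_b"],
    "term_d": [],
}

def ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def annotation_matrix(direct, genes, terms, parents):
    """Build A with A[i][j] = 1 if gene i is annotated to term j
    or to any descendant of term j (true path rule), else 0."""
    column = {term: j for j, term in enumerate(terms)}
    A = [[0] * len(terms) for _ in genes]
    for i, gene in enumerate(genes):
        for term in direct.get(gene, []):
            for t in ancestors(term, parents):
                A[i][column[t]] = 1
    return A

GENES = ["gene01", "gene02"]
TERMS = ["term_a", "term_b", "term_c", "term_d"]
DIRECT = {"gene02": ["term_c"]}  # gene01 has no known annotation

A = annotation_matrix(DIRECT, GENES, TERMS, PARENTS)
```

Here the single direct annotation of gene02 to term_c also sets the entries for term_b and term_a, while all entries of gene01 stay at 0 (unknown).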
    • Singular Value Decomposition – SVD
      • Compute the SVD: $A = U \Sigma V^T$
      • Compute the reduced rank approximation: $A_k = U_k \Sigma_k V_k^T$
      • An annotation prediction is performed by computing a reduced rank-k approximation $A_k$ of the annotation matrix A (where 0 < k < r, with r the number of non-zero singular values of A, i.e. the rank of A)
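A minimal sketch of the rank-k approximation described on this slide, using NumPy. The toy matrix and the function name `truncated_svd_scores` are assumptions made for illustration; the talk does not say which software was used.

```python
import numpy as np

def truncated_svd_scores(A, k):
    """Return the rank-k approximation A_k = U_k Sigma_k V_k^T of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Hypothetical toy annotation matrix: 3 genes x 4 terms, rank 3.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

A_k = truncated_svd_scores(A, k=2)  # 0 < k < rank(A)
```

The entries of `A_k` are real-valued scores rather than 0/1 values, so a zero entry of A that receives a high score in `A_k` can be read as a candidate predicted annotation.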
    • Probabilistic Latent Semantic Analysis - pLSA
      pLSA:
        • An alternative to the SVD method
        • Based on Latent Semantic Indexing (LSI)
      Latent Semantic Indexing – LSI:
        • Identifies latent relationships between different elements in a certain class
          − e.g. between documents and the words within them
          − between genes and their biomolecular features described by controlled annotation terms
        • Maps class elements to a vector space of reduced dimensionality, and then analyzes it
    • Probabilistic Latent Semantic Analysis - pLSA (2)
      Suppose you have:
        • A set of genes G = {g1, …, gn} related to a set of feature terms F = {f1, …, fn} which, together, form a set of controlled biomolecular annotations
        • A set of class variables T = {t1, …, tn}, called topics, where every feature term $f \in F$ can be associated with a topic $t \in T$
      The pLSA statistical model associates an unobserved class variable (topic) with each observation (a feature term and gene pair)
    • Probabilistic Latent Semantic Analysis - pLSA (3)
      • P(f | t): probability of a feature term f being associated with a topic t
      • P(t | g): probability of getting a topic t by selecting a gene g
      • The following conditions hold:
        • $\forall t \in T, \; \sum_{f \in F} P(f \mid t) = 1$
        • $\forall g \in G, \; \sum_{t \in T} P(t \mid g) = 1$
      • The joint probability of g and f is given by: $P(g, f) = \sum_{t \in T} P(t) \, P(g \mid t) \, P(f \mid t)$
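The joint probability formula on this slide can be evaluated directly once the three factors are given. The sketch below uses randomly generated, hypothetical distributions (the sizes, the seed and the helper `random_columns` are all assumptions) and only shows how the sum over topics is carried out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_terms, n_topics = 4, 5, 2

def random_columns(rows, cols):
    """Random non-negative matrix whose columns each sum to 1."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0, keepdims=True)

P_t = np.array([0.6, 0.4])                        # P(t), sums to 1
P_g_given_t = random_columns(n_genes, n_topics)   # column t holds P(g | t)
P_f_given_t = random_columns(n_terms, n_topics)   # column t holds P(f | t)

# P(g, f) = sum over t of P(t) * P(g | t) * P(f | t)
P_gf = np.einsum("t,gt,ft->gf", P_t, P_g_given_t, P_f_given_t)
```

Because each factor is a proper distribution, the entries of `P_gf` sum to 1 over all gene and term pairs.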
    • Probabilistic Latent Semantic Analysis - pLSA (4)
      Model training
        • Aim: maximum likelihood estimation of P(f|t) by using the Expectation Maximization (EM) algorithm on a training set:
          $L = \sum_{g \in G} \sum_{f \in F} a(g, f) \log P(g, f)$   [1]
      Model validation
        • A validation set of genes and feature terms, with the same feature terms as the training set but completely different genes
        • Aim: maximize the formula in [1], using the P(f|t) calculated in the training phase and varying only the parameters P(t|g) related to the new genes in the validation set
    • Probabilistic Latent Semantic Analysis - pLSA (5)
      EM algorithm: it seeks a maximum likelihood estimate by iteratively applying:
        • Expectation step: the a posteriori probabilities of the latent variables t are computed, as $P(t \mid g, f) \propto P(t) \, P(g, f \mid t)$
        • Maximization step: the parameter values are updated in order to maximize the log-likelihood
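The two EM steps can be sketched as follows. This is an illustrative implementation of the standard symmetric pLSA updates, not the authors' code: it assumes that a(g, f) is the annotation matrix entry, that P(g, f | t) factorizes as P(g | t) P(f | t), and it uses a hypothetical toy matrix, random initialization and a small `eps` guard against division by zero.

```python
import numpy as np

def plsa_em(A, n_topics, n_iter=50, seed=0):
    """Fit P(t), P(g|t), P(f|t) to a non-negative matrix A by EM."""
    rng = np.random.default_rng(seed)
    n_genes, n_terms = A.shape
    eps = 1e-12  # numerical guard, an implementation assumption
    P_t = np.full(n_topics, 1.0 / n_topics)
    P_g_t = rng.random((n_genes, n_topics))
    P_g_t /= P_g_t.sum(axis=0, keepdims=True)
    P_f_t = rng.random((n_terms, n_topics))
    P_f_t /= P_f_t.sum(axis=0, keepdims=True)
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: P(t | g, f) proportional to P(t) P(g|t) P(f|t)
        joint = P_t[None, None, :] * P_g_t[:, None, :] * P_f_t[None, :, :]
        P_gf = joint.sum(axis=2)
        log_likelihoods.append(float(np.sum(A * np.log(P_gf + eps))))
        posterior = joint / (P_gf[:, :, None] + eps)
        # M-step: re-estimate the parameters from a(g, f) * P(t | g, f)
        weighted = A[:, :, None] * posterior
        topic_mass = weighted.sum(axis=(0, 1))
        P_g_t = weighted.sum(axis=1) / (topic_mass + eps)
        P_f_t = weighted.sum(axis=0) / (topic_mass + eps)
        P_t = topic_mass / topic_mass.sum()
    return P_t, P_g_t, P_f_t, log_likelihoods

# Hypothetical toy annotation matrix: 4 genes x 4 terms.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

P_t, P_g_t, P_f_t, log_likelihoods = plsa_em(A, n_topics=2)
```

The recorded log-likelihood values should not decrease from one iteration to the next, which is a convenient sanity check when experimenting with the number of topics.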
    • Probabilistic Latent Semantic Analysis - pLSA (5)
      In comparison to SVD:
        • $U_k = [P(g_i \mid t_k)]_{ik}$
        • $\Sigma_k = \mathrm{diag}[P(t_k)]_k$
        • $V_k = [P(f_j \mid t_k)]_{jk}$
        • $A_k = [P(g_i, f_j)]_{ij} = U_k \Sigma_k V_k^T$
    • Probabilistic Latent Semantic Analysis - pLSA (6)
      • The pLSA model imposes the constraint: $\forall g \in G, \; \sum_{f \in F} P(f \mid g) = 1$
      • This can bias the prediction, because the more annotations a gene has, the lower its average conditional probability is
      • To avoid such bias we propose a normalized extension of pLSA. $\forall g \in G$:
          i. Compute: $M = \max_{f \in F} P(f \mid g)$
          ii. Compute the normalized P(f | g) vector as: $P(f \mid g)_{norm} = \frac{1}{M} P(f \mid g)$
      • Thus, the feature terms with the highest conditional probability for a gene are always predicted as annotated to that gene
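A short sketch of the normalization step, assuming P(f | g) is stored as a matrix with one row per gene. The values and the function name `normalize_by_gene_max` are hypothetical and only illustrate the bias described on the slide.

```python
import numpy as np

def normalize_by_gene_max(P_f_given_g):
    """Divide each gene's row of P(f | g) by its maximum value M."""
    M = P_f_given_g.max(axis=1, keepdims=True)
    return P_f_given_g / M

# Hypothetical P(f | g) for two genes over four terms; each row sums to 1.
P_f_given_g = np.array([
    [0.70, 0.20, 0.10, 0.00],  # gene dominated by a single term
    [0.25, 0.25, 0.25, 0.25],  # gene with four equally likely terms
])

P_norm = normalize_by_gene_max(P_f_given_g)
```

Before normalization, every term of the second gene scores 0.25 and would fall below a threshold such as 0.5; after normalization each of them scores 1.0, the same as the top term of the first gene.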
    • Evaluation of the prediction
      To evaluate the prediction, we compare each element A(i,j) to its corresponding Ak(i,j) for each real threshold τ, with 0 ≤ τ ≤ 1.0:
        • if A(i,j) = 1 and Ak(i,j) > τ: AC, Annotation Confirmed (AC ← AC + 1)
        • if A(i,j) = 1 and Ak(i,j) ≤ τ: AR, Annotation to be Reviewed (AR ← AR + 1)
        • if A(i,j) = 0 and Ak(i,j) ≤ τ: NAC, No Annotation Confirmed (NAC ← NAC + 1)
        • if A(i,j) = 0 and Ak(i,j) > τ: AP, Annotation Predicted (AP ← AP + 1)
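The four counters can be computed for a given threshold as in the sketch below. The matrices are hypothetical toy values and the function name `evaluate` is an assumption for illustration.

```python
import numpy as np

def evaluate(A, A_k, tau):
    """Count AC, AR, NAC and AP entries for a threshold tau."""
    known = A == 1
    predicted = A_k > tau
    return {
        "AC": int(np.sum(known & predicted)),     # A = 1, A_k > tau
        "AR": int(np.sum(known & ~predicted)),    # A = 1, A_k <= tau
        "NAC": int(np.sum(~known & ~predicted)),  # A = 0, A_k <= tau
        "AP": int(np.sum(~known & predicted)),    # A = 0, A_k > tau
    }

# Hypothetical annotation matrix and prediction scores (2 genes x 3 terms).
A = np.array([[1, 0, 0],
              [1, 1, 0]])
A_k = np.array([[0.9, 0.6, 0.1],
                [0.3, 0.8, 0.5]])

counts = evaluate(A, A_k, tau=0.5)
```

Note that a score exactly equal to τ (the 0.5 entry here) is not counted as predicted, following the strict inequality on the slide.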
    • New concept: Receiver Operating Characteristic (ROC) curve
      Starting from the annotation prediction evaluation factors we just introduced:

                                          Input   Output
        AC:  Annotation Confirmed         Yes     Yes
        AR:  Annotation to be Reviewed    Yes     No
        NAC: No Annotation Confirmed      No      No
        AP:  Annotation Predicted         No      Yes

      We can draw the Receiver Operating Characteristic curve for every prediction:
        • On the x axis, the annotation to be reviewed rate: AR / (AC + AR)
        • On the y axis, the annotation predicted rate: AP / (AP + NAC)
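As a rough illustration, the two rates can be traced over a grid of thresholds to obtain the points of such a curve. The toy matrices, the threshold grid and the function name `rates` are assumptions; the sketch also assumes A contains both 1 and 0 entries, so that neither denominator is zero.

```python
import numpy as np

def rates(A, A_k, tau):
    """Return (AR rate, AP rate) for a threshold tau."""
    known = A == 1
    predicted = A_k > tau
    ac = np.sum(known & predicted)
    ar = np.sum(known & ~predicted)
    nac = np.sum(~known & ~predicted)
    ap = np.sum(~known & predicted)
    ar_rate = ar / (ac + ar)    # AR / (AC + AR)
    ap_rate = ap / (ap + nac)   # AP / (AP + NAC)
    return float(ar_rate), float(ap_rate)

# Hypothetical annotation matrix and prediction scores (2 genes x 3 terms).
A = np.array([[1, 0, 0],
              [1, 1, 0]])
A_k = np.array([[0.9, 0.6, 0.1],
                [0.3, 0.8, 0.5]])

thresholds = np.linspace(0.0, 1.0, 11)
curve = [rates(A, A_k, tau) for tau in thresholds]
```

At τ = 0 every positive score counts as predicted, so the AR rate is 0 and the AP rate is 1; at τ = 1 nothing is predicted and the two rates swap.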
    • Evaluation data set
      • We considered the Gene Ontology annotations of two organisms: Gallus gallus (Chicken) and Bos taurus (Cattle)
        − Excluding the less reliable annotations Inferred from Electronic Annotation (IEA)
      • After this, the organism data sets were:

        Organism        Ontology   Genes   Terms   Annotations (direct)
        Gallus gallus   BP           275     527                    738
        Gallus gallus   CC           260     148                    478
        Gallus gallus   MF           309     225                    509
        Bos taurus      BP           512     930                  1,557
        Bos taurus      CC           497     234                    921
        Bos taurus      MF           543     422                    934

        with total (true-path-rule) annotations about 10 times more than the direct annotations
    • Evaluation results
      [Figure: ROC curves of the annotation to be reviewed rate AR / (AC + AR) and the annotation predicted rate AP / (AP + NAC) for Bos taurus (Cattle): Cellular Component (top left), Molecular Function (top) and Biological Process (left), for SVD with the best truncation value (in red) and for pLSAnorm with the best number of topics (in green)]
    • Evaluation results (3)
      • As an aggregated indicator of prediction performance, we computed the Area Under the Curve (AUC) in the [0; 0.01] range of AP rate values
        − We are interested in the low range of the AP rate, since it corresponds to the top-ranked predictions of newly inferred annotations (AP), those with the highest score

      Area under ROC curves (AUC), in %, and execution time, in seconds:

        Organism        Ontology   AUC SVD   AUC pLSAnorm   Time SVD   Time pLSAnorm
        Bos taurus      BP           44.30          34.75         33          28,188
        Bos taurus      CC           53.03          27.31         36           4,674
        Bos taurus      MF           80.96          30.69         11           1,890
        Gallus gallus   BP           47.33          44.83         98           3,990
        Gallus gallus   CC           75.39          37.22         10             796
        Gallus gallus   MF           65.76          29.87          5             422
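A partial area under a curve, restricted to low AP rate values, could be computed along the lines of the sketch below. Several points are assumptions rather than facts from the talk: that the AP rate is the integration axis and the AR rate the integrated quantity, that the area is divided by the range width to express it as a fraction, and that the curve is piecewise linear with distinct, increasing AP rate values. The sample points are hypothetical.

```python
import numpy as np

def partial_auc(x, y, x_max):
    """Area under the piecewise-linear curve (x, y) for 0 <= x <= x_max,
    divided by x_max. Assumes the x values are distinct."""
    order = np.argsort(x)
    x = np.asarray(x, dtype=float)[order]
    y = np.asarray(y, dtype=float)[order]
    # Integration grid: 0, the curve points inside the range, and x_max.
    grid = np.concatenate(([0.0], x[(x > 0.0) & (x < x_max)], [x_max]))
    values = np.interp(grid, x, y)
    # Trapezoidal rule written out explicitly.
    area = np.sum(np.diff(grid) * (values[1:] + values[:-1]) / 2.0)
    return float(area / x_max)

# Hypothetical curve points: AP rate (x) and AR rate (y).
ap_rate = [0.0, 0.005, 0.02, 1.0]
ar_rate = [1.0, 0.6, 0.4, 0.0]

auc = partial_auc(ap_rate, ar_rate, x_max=0.01)
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the values in the table, although whether the reported figures were normalized in exactly this way is not stated in the slides.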
    • Conclusions
      • We proposed the pLSAnorm method as a novel contribution in the context of prediction of genomic ontological annotations
        − Our pLSAnorm method gives better predictions than the Singular Value Decomposition (SVD) method
        − The higher execution time of pLSAnorm compared with SVD calls for further optimization, and currently limits its use to off-line analysis or small data sets
    • Conclusions (2)
      • Our approach is not limited to the Gene Ontology considered here and can be applied to any controlled annotations
      • The increasingly available multiple annotations from different controlled vocabularies and ontologies could be considered jointly to further improve prediction reliability (both in SVD and pLSAnorm)
    • Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations
      Thank you for your attention