A Task-based Approach to Gene Ontology Evaluation


  1. A Task-Based Approach to Gene Ontology Evaluation. Erik Clarke, Benjamin Good, and Andrew Su, The Scripps Research Institute. Bio-Ontologies SIG, ISMB, July 2012.
  2. [Figure: word clouds of the top enriched GO terms for the same analysis in 2006 vs. 2012, including mitotic cell cycle, interphase, secretory pathway, nuclear division, ubiquitin cycle, RNA processing, interphase of mitotic cell cycle, vesicle-mediated transport, cell division, regulation of cell cycle, mitosis, intracellular protein transport, organelle fission, mRNA metabolic process, and angiogenesis.]
  3. What happened?
  4. [Chart: "% of terms in top* 100 of both years," percentage by year, 2004-2012, ranging from 11% to 59% across 2004-2011 and reaching 100% in 2012. *Top-ranked terms by lowest p-value.]
     Notes: This shows the percentage of terms in the top 100 each year (ranked by p-value) that appear in the top 100 for 2012. This is from a real dataset! Note the significant change occurring after 2010: we are clearly in a state of flux.
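A minimal Python sketch of the overlap metric behind this chart, assuming each year's enrichment results are available as a list of GO terms sorted by ascending p-value; the dict name and layout (ranked_terms_by_year) are assumptions for illustration, not from the talk:

    # Slide 4's metric: what percentage of a given year's top-100 enriched
    # terms (ranked by lowest p-value) also appear in 2012's top 100?
    # ranked_terms_by_year is an assumed dict:
    #   year -> [GO terms sorted by ascending p-value]
    def pct_top100_shared(ranked_terms_by_year, reference_year=2012, k=100):
        ref_top = set(ranked_terms_by_year[reference_year][:k])
        return {year: 100 * len(set(ranked[:k]) & ref_top) / k
                for year, ranked in sorted(ranked_terms_by_year.items())}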
  5. [Chart: "Human GO Annotations" by year, 2004-2012, two series (Total and IEAs), on a scale of 100,000 to 400,000.]
     Notes: The annotation effort has been run by UniProt since 2004 and has grown by more than 200k annotations since then. The GO itself has also been changing significantly during this time. These factors contribute to our researcher's differing results.
  6. The Problem: with all this work, are things improving? And how can we tell either way?
     Notes: Define improvement as the ability of the GO and annotations to give us relevant, *accurate* results when we use them.
  7. • Depth of terms?
     • Number of annotations?
     • Evidence codes?
     • Other "meta-analyses"?
     • Ex: GAQ [1]: annotation quality = evidence code x depth in ontology
     [1] Buza et al., Nucleic Acids Research, 2008. doi: 10.1093/nar/gkm1167
     Notes: The truth is that you could build a totally useless ontology that scores well with these ad hoc metrics.
  8. Instead...
  9. ...evaluate performance. [Diagram: Ontology -> Application -> performance results.] Porzel, R. and Malaka, R., "A Task-based Approach for Ontology Evaluation," 2004.
  10. Enrichment Analysis
  11. [Diagram: Gene Ontology + Gene Annotations -> Enrichment Analysis -> p-value scores, repeated for each year from 2004 to 2012.]
      Notes: Use the results from each year's GO + annotations to evaluate that year's relative performance.
  12. The method:
      1. Identify a term or area of interest.
      2. Find datasets that should express the term(s).
      3. Run an enrichment analysis for each version of the ontology and annotations under test.
      4. Plot the change in p-values over each version.
      [Chart: enrichment p-value (log scale, 1E+00 down to 1E-06) for a term of interest, plotted across versions of some ontology aspect under test.]
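As a rough illustration of steps 3 and 4, here is a minimal sketch of the per-release enrichment loop, using the one-sided hypergeometric test that standard GO enrichment tools are built on. The names (annotations_by_year, study_genes, background, track_term) are assumptions for illustration, not taken from the talk or the authors' code:

    # Sketch of the per-release enrichment loop (steps 3 and 4).
    # Assumed inputs: annotations_by_year maps year -> {GO term -> set of
    # annotated genes} for that year's GO + GOA release; study_genes is
    # the gene set derived from the dataset of interest; background is
    # the full gene universe.
    from scipy.stats import hypergeom

    def enrichment_pvalue(term_genes, study_genes, background):
        """One-sided hypergeometric test for a single GO term."""
        M = len(background)                  # population size
        n = len(term_genes & background)     # genes annotated to the term
        N = len(study_genes)                 # size of the study set
        k = len(term_genes & study_genes)    # annotated genes in the study set
        return hypergeom.sf(k - 1, M, n, N)  # P(X >= k)

    def track_term(term, annotations_by_year, study_genes, background):
        """P-value for one term of interest across each year's release."""
        return {year: enrichment_pvalue(annos.get(term, set()),
                                        study_genes, background)
                for year, annos in sorted(annotations_by_year.items())}

Plotting the values returned by track_term on a log scale gives the kind of per-version trace shown in the chart above.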
  13. Brain tumor dataset: GDS1962
      • Samples of different types of brain tumors
      • Glioblastomas are known to be highly angiogenic
      • Do we see "angiogenesis" as an enriched term with current GO + annotations?
      • Using GOAs from 2004-12, do we see improvement in p-values and/or rank?
  14. [Chart: "Enrichment of angiogenesis in glioblastomas subset," enrichment p-value (log scale, 1E+00 down to 1E-06) by year, 2004-2012.]
      Notes: Note the 100,000x difference in p-values. So we know that GOA is getting better at describing this dataset, and we can imagine pulling those terms for many datasets across many fields to get a broader picture.
  15. [Two-panel chart: enrichment p-values (log scale, 1E-01 down to 1E-25) by year, 2004-2012, for the terms top-ranked in 2012 (upper panel) versus those top-ranked in 2006 (lower panel). GDS1962: glioblastomas vs. rest.]
      Notes: Suggestion of a trend here: the decreasing p-values for the 2012 terms suggest that they are in fact more biologically accurate than those from 2006, or at least that the annotations and/or ontology structure is narrowing in on these particular terms.
  16. • We're doing a mass analysis of >200 GEO datasets
      • Task-based analysis across a representative sample of terms
      • Analyzing trends of top-ranked terms across time
      [Diagram: historical annotations -> enrichment analysis.]
      Notes: Do we see the same convergence towards 2012 p-values for many other datasets?
  17. A tool to evaluate potential annotations
  18. We can evaluate:
      • Natural language processing results
      • New methods of electronic inference
      • Crowdsourced annotations
  19. Example:
  20. [Chart: "% of terms in top 100 of both years" by year, 2004-2012, showing three curves: the baseline percentage, a higher curve with "helpful" candidate annotations, and a lower curve with "bad" candidate annotations.]
      Notes: We take our candidate annotations and insert them into a set of annotations from years past. Does it improve our coverage? You can imagine other ways of measuring its delta relative to 2012.
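A hypothetical sketch of this injection experiment, assuming a rank_terms function that runs the per-year enrichment loop sketched earlier and returns terms sorted by ascending p-value; every name here (past_annos, candidate_annos, evaluate_candidates) is illustrative rather than taken from the authors' tool:

    # Inject candidate annotations into a historical annotation set and
    # check whether its top-100 overlap with 2012 improves. Annotation
    # sets are assumed to be dicts of {GO term -> set of genes}.

    def top100_overlap(ranked_a, ranked_b, k=100):
        """Fraction of top-k terms shared between two rankings."""
        return len(set(ranked_a[:k]) & set(ranked_b[:k])) / k

    def evaluate_candidates(past_annos, candidate_annos, ranked_2012, rank_terms):
        """Top-100 overlap with 2012 before and after injecting candidates."""
        baseline = top100_overlap(rank_terms(past_annos), ranked_2012)
        # Merge candidate annotations into the historical set, term by term
        merged = {t: past_annos.get(t, set()) | candidate_annos.get(t, set())
                  for t in set(past_annos) | set(candidate_annos)}
        augmented = top100_overlap(rank_terms(merged), ranked_2012)
        return baseline, augmented  # "helpful" candidates should raise overlap

Under this scheme, useful candidate annotations would push the historical curve above the baseline, while noisy candidates would drag it below, as the chart illustrates.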
  21. • First method that evaluates the GO based on its effectiveness at a task
      • Demonstrated that the GO and human annotations are improving
      • Shown the sensitivity of enrichment analysis to gene set composition and ontology structure
      • Broad-scale analysis of the GO is underway
      • Created a tool to evaluate candidate annotations using historical enrichment analysis + GOA results
      With many thanks to Ben Good, Andrew Su, and the Su Lab at The Scripps Research Institute, and to BMC for travel support.
  22. Contact:
      • eclarke@scripps.edu
      • @pleiotrope (Twitter)
      • http://github.com/eclarke/go-historical-analysis
