Cultivating and mining the Gene Wiki for crowdsourced gene annotation

Keynote presentation at the ISMB Bio-ontologies SIG (Vienna, Austria) on July 15, 2011.

(Apologies, I occasionally use animations that obscure some slide content, so feel free to download the PowerPoint version to see what's underneath.)

  • We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren't biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% of genes have 5 or fewer references; 38% have one or no references.
  • If you believe that more than 1.5% of articles are relevant to gene function, then this points to a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  • Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  • Reverted four minutes later
  • Reverted four minutes later
  • Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  • 5-HT1A is a serotonin receptor. TODO: add real ontology identifiers
  • 5-HT1A is a serotonin receptor
  • TODO: update example?
  • Transduction accounts for 70% of the concept recognition problems
  • We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  • Cultivating and mining the Gene Wiki for crowdsourced gene annotation

    1. Cultivating and mining the Gene Wiki for crowdsourced gene annotation. ISMB Bio-Ontologies SIG, July 14, 2011. Andrew Su, Ph.D.
    2. Few genes are well annotated… [Chart: counts per gene for PubMed and Gene Ontology, genes sorted by decreasing counts; 23,278 protein-coding genes; callouts: 59%, 38%; labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE.] Data: NCBI gene2pubmed, August 2010.
    3. … because the literature is sparsely curated?
    4. … because the literature is sparsely curated? [Chart label: number of articles read by typical scientist.]
    5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations.
    6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
    7. The Long Tail is a prolific source of content. [Chart: content produced vs. contributors (sorted), Short Head vs. Long Tail.] Publishing: newspapers vs. blogs. Video: TV/Hollywood vs. YouTube. Product reviews: Consumer Reports vs. Amazon reviews. Food reviews: food critics vs. Yelp. Judging: Olympics vs. American Idol.
    8. Wikipedia is reasonably accurate.
    9. Wikipedia has breadth and depth. [Chart comparing Wikipedia and Britannica Online: articles, words (millions), words/article.] Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008.
    10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
    11. 10,000 gene “stubs” within Wikipedia. Each stub contains: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases.
    12. Wiki success depends on a positive feedback. [Diagram labels: Gene Wiki page utility; number of users (1, 100); number of contributors (2, 200).]
    13. Filtering, extracting, and summarizing PubMed. [Diagram labels: Documents, Concepts.]
    14. A review article for every gene is powerful. Reelin: 68 editors, 543 edits since July 2002. Heparin: 175 editors, 320 edits since June 2003. AMPK: 44 editors, 84 edits since March 2004. RNAi: 232 editors, 708 edits since October 2002. Each article provides references to the literature and hyperlinks to related concepts.
    15. Gene Wiki has a diverse critical mass of readers. Total: 5.0 million views / month. Rank 1-10 (general society): Insulin, Titin, Human chorionic gonadotropin, Vasopressin, ANKH, CLOCK, Catalase, Erythropoietin, Glucagon, Parathyroid hormone. Rank 101-110 (scientists): Tau protein, Interleukin 10, APC, C-Met, Factor V, Interleukin 8, CD44, Histamine H1 receptor, Kappa Opioid receptor, Dihydrofolate reductase. Rank 1001-1010 (specialists): CSDA, CNTNAP2, IGSF8, Adenosine A3 receptor, RYR1, ETV6, Small heterodimer partner, 5-HT1D receptor, TRPC6, Interleukin-6 receptor. [Diagram labels: Utility, Users, Contributors.]
    16. Readership is poised to grow. [Diagram labels: Utility, Users, Contributors.]
    17. The Gene Wiki has a critical mass of editors. In Jan-Jun 2010, 7474 edits were made by 2109 unique users; the total increase in text ≈ 20 PLoS Biology research articles. [Chart labels: Editors, Editor count, Edits, Edit count; diagram labels: Utility, Users, Contributors.]
    18. Making the Gene Wiki more reliable. Example text: “The company name is derived from old Greek, and means "destroyer of birds". Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …”
    19. Making the Gene Wiki more reliable. Same example text, annotated by author trust: high-trust author (36211 total edits) vs. low-trust author (36 total edits). http://www.wikitrust.net/
    20. Making the Gene Wiki more computable. [Diagram labels: Free text, Structured annotations.]
    21. Example text from 5-HT1A receptor. Snippet from article on 5-HT1A receptor: “…5-HT1A receptor agonists decrease blood pressure and heart rate or cause hypotension via a central mechanism, by inducing peripheral vasodilation, and by stimulating the vagus nerve…” Highlighted concepts: Agonists, Receptor, Blood pressure, Heart rate, Hypotension, Vasodilation, Vagus nerve.
    22. Example text from 5-HT1A receptor. [Diagram: 5-HT1A receptor with the concepts Agonists, Receptor, Blood pressure, Heart rate, Hypotension, Vasodilation, Vagus nerve.]
    23. (no text)
    24. Re-discovering common knowledge. [Labels: NCBI Entrez Gene: 3362; Gene Wiki mapping; Wikilink; GO exact synonym; GO:0004993; Candidate assertion.]
    25. Mining the most recent literature. [Labels: NCBI Entrez Gene: 57620; Gene Wiki mapping; Wikilink; GO related concept; GO:0030154; Candidate assertion.]
    26. Filling the gaps in gene annotation. [Labels: NCBI Entrez Gene: 334; Gene Wiki mapping; Wikilink; GO exact match; GO:0006897; Candidate assertion.]
    27. Disease associations mined from the Gene Wiki. Steps shown: Gene Wiki articles (10,271); filter out seeded text; NCBO Annotator; matched Disease Ontology terms (2983); compare to DO database. Comparison result: 23% exact match, 5% match parent, 2% match child, 70% have no match. Output: 2147 candidate annotations.
    28. Disease associations mined from the Gene Wiki. Expert curation. [Chart categories: Correct, Maybe, Incorrect; values: 86%, 10%, 4%.] Overall specificity: 90-93%.
    29. GO associations mined from the Gene Wiki. Steps shown: Gene Wiki articles (10,271); filter out seeded text; NCBO Annotator; matched Gene Ontology terms (11,022); compare to GO database. Comparison result: 17% exact match, 26% match parent, 2% match child, 55% have no match. Output: 6319 candidate annotations.
    30. GO associations mined from the Gene Wiki. Expert curation. [Chart categories: Correct, Maybe, Incorrect; values: 14%, 26%, 60%.] Overall specificity: 48-64%.
    31. Common sources of error in GO associations. 1) Incorrect concept recognition. OR2F1: “Olfactory receptors … are responsible for the recognition and G protein-mediated transduction of odorant signals.” Transduction (GO:0009293): The transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector. Signal transduction (GO:0007165): The cellular process in which a signal is conveyed to trigger a change in the activity or state of a cell. Signal transduction begins with reception of a signal, e.g. a ligand binding to a receptor or receptor activation by a stimulus such as light, and ends with regulation of a downstream cellular process…
    32. Common sources of error in GO associations. 2) Incorrect sentence context. MEF2C: “Several post translational modifications have been identified including phosphorylation on serine-59 …” [Diagram labels: MEF2C, Phosphorylation, Neurogenesis, Myelination.] Terms listed: Dephosphorylation, Excretion, Gene expression, Glycosylation, Localization, Methylation, Proteolysis, Secretion, Transport, Transcription, Translation.
    33. Is 48-64% specificity useful? Enrichment analysis for the GO term muscle contraction (GO:0006936). Inputs: a gene list (87 genes); PubMed abstracts (5449 articles) with concept recognition; plus Gene Wiki (87 articles). Linked genes by PubMed only: P = 1.0. Linked genes by PubMed + Gene Wiki: P = 1.22 E-09.
    34. GO associations improve enrichment analyses. [Scatter plot: p-value (PubMed + Gene Wiki) vs. p-value (PubMed only); labeled point: Muscle contraction.]
    35. “Like the image of the [mammoth] hairball, it is equally unhelpful in understanding the object’s properties. You can guess that the network is large and its connectivity is complex, but not more. At best, the visualization is merely decorative.” - Martin Krzywinski, http://mkweb.bcgsc.ca/linnet/talks/linnet-informatics2010.pdf
    36. TOP 100 GENES
    37. Mapping to many biomedical semantic groups.
    38. From text mining to a Semantic Gene Wiki. Semantic representation. [Comparison table. Columns: Community contributions, Semantics, Semantic querying. Rows: Home-grown wiki, Gene Wiki/Wikipedia, Semantic Gene Wiki. Cell marks in extraction order: ✗, ✓, ✓, ✓, ✓, ✗, ?, ✓, ✓, –.]
    39. Semantic Wiki Links. The Gene Wiki (based on MediaWiki) is mirrored and translated into the Semantic Gene Wiki (based on Semantic MediaWiki, SMW). Gene Wiki markup: [[apoptosis]] and {{SWL|target=apoptosis|type=promotes}}. Semantic Gene Wiki markup: [[repress::apoptosis]], [[promote::apoptosis]], [[modulate::apoptosis]]. Rendered text in all cases: apoptosis. The Semantic Gene Wiki enables semantic queries, RDF, etc.
    40. For community-based science, data is king. Data without structure is valuable, but structure without data is not.
    41. For community-based science, data is king. Data without structure is valuable, but structure without data is not. [Diagram labels: Domain expert; Information scientist; Wikipedia; Copy-editing; Figures; Structure; Citations; Provenance; WP:MCB, Boghog; Artists and illustrators; Wiki links, infoboxes; DOI bot, CitationBot; WikiTrust.]
    42. The Gene Wiki successfully harnesses the Long Tail of scientists for community annotation of gene function.
    43. Collaborators: Doug Howe, ZFIN; Salvatore Loguercio (*), TU Dresden; John Hogenesch, U Penn; Jon Huss, GNF; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors, WP:MCB Project. Group members: Erik Clarke, Ben Good (*), Ian Macleod, Chunlei Wu. WikiTrust (UCSC): Luca de Alfaro, Bo Adler, Ian Pye. (*) See talk on SNPedia mashup at 1:55 PM. Funding and Support: BioGPS: GM83924, Gene Wiki: GM089820; ISMB travel support. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su.
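
The comparison step on slides 27 and 29 (exact match, match parent, match child, no match) can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the talk: the toy ontology, the term names, and the reading of "parent" (the mined term is more general than an existing annotation) and "child" (the mined term is more specific) are all assumptions.

```python
from typing import Dict, Set

# Hypothetical toy ontology: term -> set of direct parent terms.
# These names are illustrative only and are not taken from GO or DO.
PARENTS: Dict[str, Set[str]] = {
    "signal transduction": {"cellular process"},
    "receptor signaling": {"signal transduction"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect every term reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def classify_candidate(candidate: str, known: Set[str],
                       parents: Dict[str, Set[str]]) -> str:
    """Compare a mined term against a gene's existing annotations.

    Returns 'exact', 'parent', 'child', or 'none' (assumed definitions,
    see the lead-in above).
    """
    if candidate in known:
        return "exact"
    # Mined term is an ancestor of (more general than) a known annotation.
    if any(candidate in ancestors(k, parents) for k in known):
        return "parent"
    # Mined term is a descendant of (more specific than) a known annotation.
    if ancestors(candidate, parents) & known:
        return "child"
    return "none"


known_annotations = {"signal transduction"}
for mined in ["signal transduction", "cellular process",
              "receptor signaling", "muscle contraction"]:
    print(mined, "->", classify_candidate(mined, known_annotations, PARENTS))
```

Terms that come back as 'none' are the ones with no counterpart in the reference database, which is where mined text can add annotations that curation has not yet captured.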

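
The enrichment analysis on slides 33 and 34 reports p-values without naming the test. A common choice for gene-list enrichment is the one-sided hypergeometric test, sketched below as an assumption rather than as the method actually used in the talk. The function name and the count of 100 annotated genes are hypothetical; 23,278 and 87 are the gene counts given on slides 2 and 33.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, overlap: int) -> float:
    """One-sided hypergeometric p-value: P(X >= overlap).

    population: total number of genes considered
    annotated:  genes in the population linked to the GO term
    sample:     size of the gene list being tested
    overlap:    genes in the list that are linked to the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# With no gene in the list linked to the term, the p-value is 1.0.
print(hypergeom_enrichment_p(23278, 100, 87, 0))

# Each additional linked gene lowers the p-value sharply.
print(hypergeom_enrichment_p(23278, 100, 87, 5))
```

When no gene in the list is linked to the term, the test cannot detect enrichment at all, which is consistent with the P = 1.0 reported for PubMed-only links. Adding gene-term links, even imperfect ones, gives the test something to count.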