The Gene Wiki: Synthesizing knowledge about human genes with Wikipedia. Benjamin Good, Feb. 26, 2013
Gene Wiki at Phenotype RCN annual meeting

  • 645,647 articles have been explicitly linked to human genes within the NCBI Gene database (gene2pubmed). Searching PubMed and Google will unearth many more that are clearly relevant but have not been linked yet.
  • More is produced every day.
  • The definition that best met my usage here was ... Oddly, it didn’t come from WordNet or even Wiktionary; it came from the glossary of a document describing a preschool curriculum. Not sure why I chose that one, but he might have had something to do with it..
  • Manual curation.
  • We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
  • Much knowledge is locked up in the biomedical literature; our goal is to make it computable. If you believe that more than 1.5% of articles have relevance to gene function, then there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  • In 2008, a group of genome database curators got together and wrote an article about the state of the art of biocuration. In it, they expressed deep concern about the amount of data that they were already processing and the knowledge that there would only be more of it coming. One of the things they said in this article was that ‘sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation’. The Gene Wiki and related efforts are an attempt to meet that need.
  • Now at more than 3.5 million articles
  • Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  • Feb. 14, 2011: 10,290 articles; > 75 megabytes of text content; > 1.3 million words; 35,997 PubMed citations (about 1 for every two sentences). In the past year: 34,839 edits by 3,599 editors; increase of 2.2 megabytes; 55 million page views.
  • Just looking at the citations in PubMed actually understates the situation dramatically.
  • It is much easier to start from a large community with a very high page rank than it is to start from scratch…
  • Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  • Evaluation categories for candidate annotations:
      Category 1: Yes, this would lead to a new annotation:
        1A: perfect match – the candidate annotation is exactly as it would be from a curator (e.g., Titin → Scleroderma)
        1B: not specific enough – the candidate annotation is correct but a more specific term should be used instead (e.g., Titin → Autoimmune disease)
        1C: too specific – the candidate annotation is close to correct, but is too specific given the evidence at hand (e.g., Titin → Pulmonary Systemic Sclerosis)
      Category 2: Maybe, but insufficient evidence:
        2A: evaluator could not find enough supporting evidence in the literature after about 10 minutes of looking (e.g., DUSP7 → cellular proliferation; there is literature indicating that DUSP7 is a phosphatase that dephosphorylates MAPK, and hence may play a role in regulating cell proliferation stimulated through MAPK. Although no direct evidence supporting this contention for human DUSP7 was found, it seems plausible.)
        2B: there is disagreement in the literature about the truth of this annotation
      Category 3: No, this candidate annotation is incorrect:
        3A: incorrect concept recognition (e.g., “Olfactory receptors share a 7-transmembrane domain structure with many neurotransmitter and hormone receptors and are responsible for the recognition and G protein-mediated transduction of odorant signals.” [24] The system incorrectly identifies ‘transduction’ (GO:0009293), which is defined as the transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector, a completely different concept from signal transduction as intended in the sentence.)
        3B: incorrect sentence context - the sentence is a negation or otherwise does not support the predicted annotation for the given gene (e.g., “The protein is composed of ~300 amino acid residues and has ~30 carbohydrate residues attached including 10 sialic acid residues, which are attached to the protein during posttranslational modification in the Golgi apparatus.” [25] Such sentences may lead to incorrect candidate annotations of ‘Golgi apparatus’ and ‘Posttranslational modification’.)
        3C: this sentence seems factually false (e.g., a hypothetical example: “Insulin injections have been shown to cure Parkinson’s disease and lead to the growth of additional toes”.)
  • GO terms are more common (we found more than twice as many occurrences), are more prone to polysemy, and are more likely to show up in contexts that don’t indicate a direct annotation.
  • Combines the open editing of a wiki and the robust community of editors at Wikipedia with the structured data model of a database
  • We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
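
The false-match category (3A) in the notes above is easy to reproduce with a toy version of label-based concept recognition. The sketch below is not the NCBO Annotator. It is a minimal stand-in that assumes a hand-written two-entry lexicon and plain case-insensitive phrase matching; the names `Candidate`, `LEXICON`, `find_candidates`, and the placeholder gene `EXAMPLE_GENE` are invented for illustration. The sentence is a shortened form of the olfactory receptor example, and the transduction ID is the one quoted in the notes.

```python
import re
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A candidate annotation: gene, ontology term ID, matched label."""
    gene: str
    term_id: str
    label: str


# Toy lexicon (assumption): two Gene Ontology labels and their IDs.
LEXICON = {
    "transduction": "GO:0009293",
    "golgi apparatus": "GO:0005794",
}


def find_candidates(gene: str, sentence: str) -> List[Candidate]:
    """Return one candidate per lexicon label that occurs in the sentence."""
    text = sentence.lower()
    hits = []
    for label, term_id in LEXICON.items():
        # Whole-phrase match only; no disambiguation of word sense.
        if re.search(r"\b" + re.escape(label) + r"\b", text):
            hits.append(Candidate(gene, term_id, label))
    return hits


sentence = ("Olfactory receptors are responsible for the recognition and "
            "G protein-mediated transduction of odorant signals.")
candidates = find_candidates("EXAMPLE_GENE", sentence)  # placeholder gene name
for c in candidates:
    print(c.gene, c.term_id, c.label)
```

The label match is exact, yet the sentence is about signal transduction rather than phage-mediated transfer of genetic material, which is why a candidate like this lands in category 3A.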

    1. The Gene Wiki: Synthesizing knowledge about human genes with Wikipedia. Benjamin Good, Feb. 26, 2013. http://www.slideshare.net/goodb
    2. “Knowledge about human genes”
    3. “Knowledge about human genes”: 1) There is a lot. 2) It is scattered.
    4. Biological knowledge is growing rapidly
       • More than 22 million articles indexed in PubMed
       • Growing at about 1 million/year and rising
    5. Scattered genomic knowledge is a problem
       • Scientists faced with new and unfamiliar genes on a daily basis (GNF robotics hits: IFITM3, TFE3, BEX1, ST8SIA1, TFEB, BEX2, SKP1A, ....)
       • Public faced with unfamiliar genes on a daily basis
    6. Knowledge synthesis: “the pulling together of ideas or information to develop a common framework for understanding”
    7. Knowledge synthesis in biology, aka biocuration
       • The production of structured data (unstructured → structured)
       Gene | Property | Value
       Fibronectin | Biological Process | Angiogenesis
       Fibronectin | Cellular Localization | Extracellular matrix
       Fibronectin | Related Disease | Glomerulopathy
    8. Gene Ontology: “Tool for the unification of biology” [1]
       • A shared, controlled vocabulary for describing gene function
       • Molecular Function, Biological Process, Cellular Component
       • > 10,550 citations in Google Scholar
       [1] Nature Genetics. 2000 May;25(1):25-9.
    9. Gene Ontology Annotation Database (‘GOA’)
       • Records gene function using Gene Ontology terms
       • Expert synthesis of the knowledge from thousands of articles
       Gene | Property | Value
       Fibronectin | Biological Process | Angiogenesis
       Fibronectin | Cellular Localization | Extracellular matrix
       Fibronectin | Related Disease | Glomerulopathy
    10. 33k articles become 31 gene annotations: Gene Ontology curators; 31 function annotations for human gene
    11. Great!
    12. BUT
    13. GO annotation is not complete
    14. Many genes are not thoroughly annotated. [Figure: GO annotation counts per gene; series: + Electronic annotation (IEA), Biological Process only; x-axis: genes, sorted by decreasing counts. Data: NCBI, February 2013]
    15. 1 million articles per year....
    16. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
    17. The Long Tail is a prolific source of content. [Figure: content produced vs. contributors (sorted), showing a short head and a long tail]
        Domain | Short head | Long tail
        News reporting | Newspapers | Blogs
        Video | TV/Hollywood | YouTube
        Product reviews | Consumer Reports | Amazon reviews
        Food reviews | Food critics | Yelp
        Gene annotation | Bio-curators | ????????????
    18. Wikipedia successfully harnesses the long tail
        • Within top 10 most visited websites
        • 14 million+ registered users
        [Figure: articles, words (millions), and words/article for Wikipedia vs. Britannica Online. http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]
    19. Wikipedia is reasonably accurate
    20. The Gene Wiki Hypothesis: “We can harness the Long Tail of scientists to directly participate in the gene annotation process.” -Andrew Su
    21. Goal of the Gene Wiki project
        • Enable the creation of a collaboratively written, continuously updated, high quality review article for every human gene.
    22. Filtering, extracting, and summarizing PubMed
    23. Success depends on a positive feedback loop. [Figure: feedback loop between value of service, number of contributors, and number of users; values shown: 1, 100, 2, 200]
    24. Gene “stubs” seed community contributions: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases
    25. A review article for every gene is powerful: 68 editors, 543 edits (as of July 2010); references to the literature; hyperlinks to related concepts
    26. The Gene Wiki project – 2010 stats
        • Value of service: 10,300 articles; 1.2 million words; 67MB text (about 1,000 PLoS Biology research articles)
        • Number of users: 55 million page views
        • Number of contributors: 3,500 editors; 17,000 edits
    27. Monthly growth of words in Gene Wiki articles, page views per month and edits per month between 1 September 2009 and 1 September 2011. Good B M et al. Nucl. Acids Res. 2012;40:D1255-D1261. © The Author(s) 2011. Published by Oxford University Press.
    28. Why is it working?
    29. Google loves Wikipedia
        • 1.86 million results from Google
        • courses
        • products
        • databases
        • ...
    30. The Gene Wiki hitches a ride on Wikipedia. CC photo by ff137 on flickr
    31. Take home messages
        • Success depends on a positive feedback loop (value, contributors, users)
        • Where possible, try to hitch a ride
    32. But still, many genes lack structured annotation… [Figure: GO annotation counts per gene; series: + Electronic annotation (IEA), Biological Process only; x-axis: genes, sorted by decreasing counts. Data: NCBI, February 2013]
    33. Can we generate structured annotations from the text of the Gene Wiki? Article text is great for people to read; structured data is great for building software for people to use.
        Gene | Property | Value
        Fibronectin | Biological Process | Angiogenesis
        Fibronectin | Cellular Localization | Extracellular matrix
        Fibronectin | Related Disease | Glomerulopathy
    34. Filtering, extracting, and summarizing PubMed: Documents, Concepts
    35. Document- and concept-centric text mining: Subject, Predicate, Object
    36. Simple text mining for gene annotations: NCBI Entrez Gene: 334; Gene Wiki mapping; Wikilink; candidate assertion; GO:0006897; GO exact match. Good, BMC Genomics, 2011.
    37. Finding concepts
        • NCBO Annotator Web Service
          – Gene Ontology
          – Human Disease Ontology
        • Annotator service selected for:
          – Speed, easy API, precision
        Clement Jonquet, Nigam H Shah, Mark A Musen (2009) The Open Biomedical Annotator. AMIA Summit on Translational Bioinformatics. 56-60. http://bioportal.bioontology.org/annotator
    38. Mining workflow: Gene Wiki articles (10,271) → filtering, cleanup → extract concepts (NCBO) → 11,022 matched Gene Ontology terms; 2,983 matched Disease Ontology terms
    39. Results: compared to current dbs; manual evaluation on random sample
        Comparison with current databases | DO | GO
        Exact match | 23% | 12%
        Match more specific term | 2% | 2%
        Match more general term | 5% | 28%
        No match | 70% | 58%
        [Figure: manual evaluation bar charts for DO and GO; axis labels not recoverable]
    40. GO problems. [Figure: bar chart; axis labels not recoverable]
        • False match (e.g., “Olfactory receptors .. are responsible for the transduction of odorant signals.” The system incorrectly identifies ‘transduction’ (GO:0009293), defined as the transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector.)
        • No support in sentence (e.g., “The protein is composed ... including 10 sialic acid residues, which are attached to the protein during posttranslational modification in the Golgi apparatus.” Such sentences may lead to incorrect annotations of ‘Golgi apparatus’ and ‘Posttranslational modification’.)
    41. Applications
        • Enrichment analysis
          • even with false positives, text-mined annotations can improve statistical analyses that are tolerant to noise.
        • GeneWiki+
    42. Gene Wiki+ for integrative queries: mwsync. Good, J Biomed Semantics, 2012. http://genewikiplus.org
    43. Dynamic queries across genes, diseases, SNPs. Good, J Biomed Semantics, 2012.
    44. Gene Wiki+ for integrative queries: mwsync, OMIM, PharmGKB. Good, J Biomed Semantics, 2012. http://genewikiplus.org
        {{#ask: [[Category:Human_proteins]] [[is_associated_with::<q>[[Category:Breast_cancer]]</q>]] [[HasSNP::<q>[[is_associated_with:: … <q>[[Category:Breast_cancer]]</q>]]</q>]]
    45. Gene Wiki+ for integrative queries: mwsync, OMIM, PharmGKB. Good, J Biomed Semantics, 2012. http://genewikiplus.org
    46. Text mining take home
        • Depends a lot on the ontology
          • (same text, same algorithm, completely different results)
        • Approach depends on corpus
          • concept-centric text has advantages
        • Approach depends on purpose
          • high false positive rates are common but may be acceptable – e.g. enrichment analysis
    47. Can we skip text mining? http://fiehnlab.ucdavis.edu/projects/Rice_metabolome/
    48. Wikidata: Provide a database of the world’s knowledge that anyone can edit. - Denny Vrandečić
    49. Wikidata: Q414043 (Reelin). http://www.wikidata.org/wiki/Q414043
        Property | Meaning | Value
        Property:P31 | is a | Protein (Q8054); Glycoprotein (Q187126)
        Property:P128 | regulates | Neural development (Q1345738)
        Property:P129 | interacts with | VLDL receptor (Q1979313); Amyloid precursor protein (Q423510)
    50. Wikidata: Q414043. http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
        Property | Value
        Property:P31 | Q8054; Q187126
        Property:P128 | Q1345738
        Property:P129 | Q1979313; Q423510
    51. Wikidata. http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
    52. Wikidata. http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
    53. “We can harness the Long Tail of scientists to directly participate in the gene annotation process.” -Andrew Su
    54. Gene Wiki acknowledgements: http://wordle.com; many Wikipedia editors; WP:MCB Project
        • “A gene wiki for community annotation of gene function” PLoS Biology 2008
        • “The Gene Wiki: community intelligence applied to human gene annotation” Nucleic Acids Research 2009
        • “Mining the Gene Wiki for Functional Genomic Knowledge” BMC Genomics 2011
        • “The Gene Wiki in 2011: community intelligence applied to human gene annotation” Nucleic Acids Research 2012
        • “Linking genes to diseases with a SNPedia-Gene Wiki mashup” Journal of Biomedical Semantics 2012
        • “Building a biomedical semantic network in Wikipedia with Semantic Wiki Links” Database: The Journal of Biological Databases and Curation 2012
    55. My sister Erin has a PhD in linguistics, lives in Raleigh and is looking for work in research or teaching. Help her out! Contact: bgood@scripps.edu, @bgood, i9606.blogspot.com, slideshare/goodb. Funding and Support: NIH / NIGMS (Gene Wiki: GM089820)
    56. Gene Wiki content improves enrichment analysis. [Figure: p-value (PubMed + GW) vs. p-value (PubMed only); regions labeled “More significant PubMed + GW” and “More significant PubMed only”; highlighted term: Muscle contraction] Good, BMC Genomics, 2011.
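
One way to read the comparison on slide 56: enrichment p-values are commonly computed with a hypergeometric (one-sided Fisher) test, and adding text-mined annotations changes the counts that feed it. The sketch below assumes that test, which the slides do not name, and every count in it is a made-up number chosen only to show the mechanics. The function name `hypergeom_pvalue` is likewise invented here.

```python
from math import comb


def hypergeom_pvalue(population: int, annotated: int,
                     sample: int, overlap: int) -> float:
    """Upper-tail hypergeometric probability P(X >= overlap).

    population: total number of genes considered
    annotated:  genes in the population annotated with the term
    sample:     size of the gene list being tested
    overlap:    genes in the list that carry the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts for one term and one 100-gene list (not from the source).
p_pubmed = hypergeom_pvalue(20000, 150, 100, 4)     # curated annotations only
p_combined = hypergeom_pvalue(20000, 200, 100, 9)   # plus text-mined annotations

print(f"PubMed only: {p_pubmed:.3g}")
print(f"PubMed + GW: {p_combined:.3g}")
```

With these invented counts the combined annotation set gives the smaller p-value, because the extra annotations raise the overlap with the gene list faster than they raise the background count. Real data can go either way for a given term.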

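
The `wbgetentities` URL on slide 50 returns a machine-readable description of the Reelin item. As a rough sketch of consuming it, the code below rebuilds that request URL and pulls target item IDs out of a claims structure. The `format=json` parameter is an addition not shown on the slide, the sample response is a hand-trimmed assumption about the JSON shape (entities, claims, mainsnak, datavalue) rather than captured output, and `entity_url` and `claim_targets` are names made up here. No network call is made.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode

API = "http://wikidata.org/w/api.php"


def entity_url(entity_id: str, language: str = "en") -> str:
    """Build a wbgetentities request URL for one item."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "languages": language,
        "format": "json",  # assumption: ask for JSON explicitly
    }
    return API + "?" + urlencode(params)


def claim_targets(response: Dict[str, Any], entity_id: str, prop: str) -> List[str]:
    """List the item IDs that `prop` points to for `entity_id`."""
    claims = response["entities"][entity_id].get("claims", {})
    targets = []
    for claim in claims.get(prop, []):
        value = claim["mainsnak"].get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "id" in value:
            targets.append(value["id"])
    return targets


# Hand-written sample in the assumed response shape, using the IDs from slide 50.
sample = {"entities": {"Q414043": {"claims": {
    "P31": [
        {"mainsnak": {"datavalue": {"value": {"id": "Q8054"}}}},
        {"mainsnak": {"datavalue": {"value": {"id": "Q187126"}}}},
    ],
    "P128": [{"mainsnak": {"datavalue": {"value": {"id": "Q1345738"}}}}],
}}}}

print(entity_url("Q414043"))
print(claim_targets(sample, "Q414043", "P31"))
```

Reading claims this way is what lets structured statements be queried directly, without the text-mining step discussed earlier in the deck.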