Gene Wiki at Phenotype RCN annual meeting
Presentation Transcript
1. The Gene Wiki: Synthesizing knowledge about human genes with Wikipedia. Benjamin Good, Feb. 26, 2013. http://www.slideshare.net/goodb
2. “Knowledge about human genes”
3. “Knowledge about human genes”: 1) There is a lot 2) It is scattered
4. Biological knowledge is growing, rapidly
• More than 22 million articles indexed in PubMed
• Growing at about 1 million/year and rising
5. Scattered genomic knowledge is a problem
• Scientists faced with new and unfamiliar genes on a daily basis
• Public faced with unfamiliar genes on a daily basis
[Figure labels: GNF Robotics; Hits: IFITM3, TFE3, BEX1, ST8SIA1, TFEB, BEX2, SKP1A, ....]
6. Knowledge synthesis: “the pulling together of ideas or information to develop a common framework for understanding”
7. Knowledge synthesis in biology, aka biocuration
• The production of structured data
[Figure labels: Unstructured, Structured]
Gene         Property               Value
Fibronectin  Biological Process     Angiogenesis
Fibronectin  Cellular Localization  Extracellular matrix
Fibronectin  Related Disease        Glomerulopathy
8. Gene Ontology: “Tool for the unification of biology”
A shared, controlled vocabulary for describing gene function
Molecular Function, Biological Process, Cellular Component
> 10,550 citations in Google Scholar
Nature Genetics. 2000 May;25(1):25-9.
9. Gene Ontology Annotation Database (‘GOA’)
• Records gene function using gene ontology terms
• Expert synthesis of the knowledge from thousands of articles
Gene         Property               Value
Fibronectin  Biological Process     Angiogenesis
Fibronectin  Cellular Localization  Extracellular matrix
Fibronectin  Related Disease        Glomerulopathy
10. 33k articles become 31 gene annotations
[Figure labels: Gene Ontology Curators; 31 function annotations for human gene]
13. GO annotation is not complete
14. Many genes are not thoroughly annotated
[Figure: GO Annotation Counts, + Electronic annotation (IEA), Biological Process only; x-axis: Genes, sorted by decreasing counts. Data: NCBI, February 2013]
15. 1 million articles per year....
16. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
17. The Long Tail is a prolific source of content
[Figure: Content produced vs. Contributors (sorted), with regions labeled Short Head and Long Tail]
                 Short Head        Long Tail
News reporting:  Newspapers        Blogs
Video:           TV/Hollywood      YouTube
Product reviews: Consumer reports  Amazon reviews
Food reviews:    Food critics      Yelp
Gene annotation: bio-curators      ????????????
18. Wikipedia successfully harnesses the long tail
• Within top 10 most visited websites
• 14 million+ registered users
[Figure: Wikipedia vs. Britannica Online, comparing Articles, Words (millions), and Words/article. http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]
19. Wikipedia is reasonably accurate
20. The Gene Wiki Hypothesis: “We can harness the Long Tail of scientists to directly participate in the gene annotation process.” -Andrew Su
21. Goal of the Gene Wiki project
• Enable the creation of a collaboratively written, continuously updated, high quality review article for every human gene.
Filtering, extracting, and summarizing PubMed
23. Success depends on a positive feedback loop
[Figure: feedback loop linking Value of service, Number of contributors, and Number of users]
24. Gene “stubs” seed community contributions
[Figure labels: Protein structure; Gene summary; Symbols and identifiers; Gene Ontology annotations; Protein interactions; Tissue expression pattern; Linked references; Links to structured databases]
25. A review article for every gene is powerful
[Figure labels: 68 editors, 543 edits (as of July 2010); References to the literature; Hyperlinks to related concepts]
26. The Gene Wiki project – 2010 stats
Value of service: 10,300 articles; 1.2 million words; 67MB text (about 1,000 PLoS Biology research articles)
Number of users: 55 million page views
Number of contributors: 3,500 editors; 17,000 edits
29. Google loves Wikipedia
• 1.86 million results from Google
• courses
• products
• databases
• ...
30. The Gene Wiki hitches a ride on Wikipedia (CC photo by ff137 on flickr)
31. Take home messages
• Success depends on a positive feedback loop
• Where possible, try to hitch a ride
32. But still, many genes lack structured annotation…
[Figure: GO Annotation Counts, + Electronic annotation (IEA), Biological Process only; x-axis: Genes, sorted by decreasing counts. Data: NCBI, February 2013]
33. Can we generate structured annotations from the text of the gene wiki?
[Figure labels: “Great for people to read” (text); “Great for building software for people to use” (structured table)]
Gene         Property               Value
Fibronectin  Biological Process     Angiogenesis
Fibronectin  Cellular Localization  Extracellular matrix
Fibronectin  Related Disease        Glomerulopathy
Filtering, extracting, and summarizing PubMed: Documents, Concepts
35. Document- and concept-centric text mining
[Figure labels: Predicate, Subject, Object]
36. Simple text mining for gene annotations
[Figure labels: NCBI Entrez Gene: 334; Gene Wiki mapping; Wikilink; Candidate assertion; GO:0006897; GO exact match]
Good, BMC Genomics, 2011.
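The exact-match step on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline from the BMC Genomics paper: the function name, the tiny vocabulary, and the example wikilinks are hypothetical. Only the Entrez Gene ID 334 and GO:0006897 (endocytosis) come from the slide.

```python
import re

# Hypothetical mini-vocabulary mapping lowercase term labels to GO IDs.
# A real run would load the full Gene Ontology; only this ID appears on the slide.
GO_TERMS = {
    "endocytosis": "GO:0006897",
}

def candidate_assertions(gene_id, wikilinks, vocabulary=GO_TERMS):
    """Return (gene_id, term_id) pairs for wikilinks whose title exactly matches a term label."""
    assertions = []
    for link in wikilinks:
        # Normalize case and whitespace so "Endocytosis" matches "endocytosis".
        key = re.sub(r"\s+", " ", link.strip().lower())
        if key in vocabulary:
            assertions.append((gene_id, vocabulary[key]))
    return assertions

# Wikilinks assumed to appear on the Gene Wiki article mapped to Entrez Gene 334.
print(candidate_assertions("334", ["Endocytosis", "Cell membrane"]))
```

Each returned pair is only a candidate assertion; it still needs the comparison and manual evaluation described on the following slides.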
Finding concepts
• NCBO Annotator Web Service
  – Gene Ontology
  – Human Disease Ontology
• Annotator service selected for:
  – Speed, easy API, precision
Clement Jonquet, Nigam H Shah, Mark A Musen (2009). The Open Biomedical Annotator. AMIA Summit on Translational Bioinformatics. 56-60. http://bioportal.bioontology.org/annotator
Results
[Figure panels: Compared to current dbs; Manual evaluation on random sample]
DO: match more specific term 2%; exact match 23%; match more general term 5%; no match 70%
GO: match more specific term 2%; exact match 12%; match more general term and no match: 58% and 28% (order as extracted)
GO problems
False match (e.g., “Olfactory receptors .. are responsible for the transduction of odorant signals.” The system incorrectly identifies ‘transduction’ (GO:0009293), defined as the transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector.)
No support in sentence (e.g., “The protein is composed ... including 10 sialic acid residues, which are attached to the protein during posttranslational modification in the Golgi apparatus.” Such sentences may lead to incorrect annotations of ‘Golgi apparatus’ and ‘Posttranslational modification’.)
Applications
• Enrichment analysis
  • even with false positives, text-mined annotations can improve statistical analyses that are tolerant to noise.
• GeneWiki+
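To make the enrichment-analysis application concrete, here is a minimal sketch of an over-representation test. It assumes a one-sided hypergeometric test, which is a common choice for this kind of analysis but is not confirmed by the slides as the method used; the function name and the example numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric p-value: P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes carrying the annotation term
    sample:     size of the gene list under study
    hits:       genes in the list that carry the term
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability of seeing `hits` or more annotated genes by chance.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 genes, 200 annotated to a term,
# and a 50-gene list in which 5 genes carry that term.
print(enrichment_p_value(20000, 200, 50, 5))
```

Because the test compares counts against a background rate, a few spurious annotations move the p-value only slightly, which is the sense in which such analyses tolerate noise.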
46. Text mining take home
• Depends a lot on the ontology
  • (same text, same algorithm, completely different results)
• Approach depends on corpus
  • concept-centric text has advantages
• Approach depends on purpose
  • high false positive rates are common but may be acceptable – e.g. enrichment analysis
Can we skip text mining? http://fiehnlab.ucdavis.edu/projects/Rice_metabolome/
48. Wikidata: Provide a database of the world’s knowledge that anyone can edit - Denny Vrandečić
49. Wikidata
Reelin (Q414043)
Property:P31 is a: Protein (Q8054), Glycoprotein (Q187126)
Property:P128 regulates: Neural development (Q1345738)
Property:P129 interacts with: VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)
http://www.wikidata.org/wiki/Q414043
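The statements on this slide can be read as subject-property-object triples. The sketch below stores them in plain Python structures. The item and property IDs are copied from the slide; the pairing of each object with its property is an assumption reconstructed from the slide layout, and the `describe` helper is hypothetical. No Wikidata API is called.

```python
from collections import defaultdict

# Labels as shown on the slide.
LABELS = {
    "Q414043": "Reelin",
    "Q8054": "Protein",
    "Q187126": "Glycoprotein",
    "Q1345738": "Neural development",
    "Q1979313": "VLDL receptor",
    "Q423510": "Amyloid precursor protein",
    "P31": "is a",
    "P128": "regulates",
    "P129": "interacts with",
}

# (subject, property, object) triples; the grouping is assumed from the slide layout.
STATEMENTS = [
    ("Q414043", "P31", "Q8054"),
    ("Q414043", "P31", "Q187126"),
    ("Q414043", "P128", "Q1345738"),
    ("Q414043", "P129", "Q1979313"),
    ("Q414043", "P129", "Q423510"),
]

def describe(item_id, statements=STATEMENTS, labels=LABELS):
    """Group an item's statements by property label, with readable object labels."""
    grouped = defaultdict(list)
    for subject, prop, obj in statements:
        if subject == item_id:
            grouped[labels[prop]].append(labels[obj])
    return dict(grouped)

print(describe("Q414043"))
```

Keeping identifiers separate from labels mirrors the point of the slide: software works with stable IDs while people read the labels.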
53. “We can harness the Long Tail of scientists to directly participate in the gene annotation process.” -Andrew Su
54. Gene Wiki acknowledgements.. http://wordle.com
Many Wikipedia editors; WP:MCB Project
“A gene wiki for community annotation of gene function” PLoS Biology 2008
“The Gene Wiki: community intelligence applied to human gene annotation” Nucleic Acids Research 2009
“Mining the Gene Wiki for Functional Genomic Knowledge” BMC Genomics 2011
“The Gene Wiki in 2011: community intelligence applied to human gene annotation” Nucleic Acids Research 2012
“Linking genes to diseases with a SNPedia-Gene Wiki mashup” Journal of Biomedical Semantics 2012
“Building a biomedical semantic network in Wikipedia with Semantic Wiki Links” Database: The Journal of Biological Databases and Curation 2012
55. My sister Erin has a PhD in linguistics, lives in Raleigh and is looking for work in research or teaching.. Help her out!
email@example.com, @bgood, i9606.blogspot.com, slideshare/goodb
Funding and Support: NIH / NIGMS (Gene Wiki: GM089820)
56. Gene Wiki content improves enrichment analysis
[Figure: p-value (PubMed + GW) vs. p-value (PubMed only); regions labeled “More significant PubMed only” and “More significant PubMed + GW”; highlighted term: Muscle contraction]
Good, BMC Genomics, 2011.