Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Usage Rights: CC Attribution-ShareAlike License

  • We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren't biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
  • If you believe that more than 1.5% of articles have relevance to gene function, then this says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  • For each resource: briefly describe the unstructured resource; describe the structuring approach.
  • Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  • Reverted four minutes later
  • Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  • Tried on 773 GO categories, significant in 356 cases (46%)
  • We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  • Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  • MODs and portals
  • Genetics resources
  • Literature resources
  • Protein resources
  • Pathway and expression databases
  • Empire state building
  • Question: how to interject biological knowledge in the feature selection process?

Presentation Transcript

  • Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org. Andrew Su, Ph.D. (@andrewsu, asu@scripps.edu, http://sulab.org). October 30, 2012
  • 2. Few genes are well annotated… [Figure: PubMed and Gene Ontology annotation counts for 23,278 protein-coding genes, sorted by decreasing counts. Labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE. Callouts: 59% (PubMed), 38% (Gene Ontology).] Data: NCBI gene2pubmed, August 2010
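The percentages on slide 2 are fractions of genes whose reference count falls at or below a cutoff. A minimal sketch of that calculation follows, assuming the input has already been reduced to (gene ID, PubMed ID) pairs plus a full gene list; the function names and the toy IDs are hypothetical and are not taken from the talk or from the real gene2pubmed file.

```python
def reference_counts(gene_pubmed_pairs, all_genes):
    """Count distinct PubMed IDs per gene; genes with no pairs get 0."""
    seen = {gene: set() for gene in all_genes}
    for gene_id, pubmed_id in gene_pubmed_pairs:
        if gene_id in seen:
            seen[gene_id].add(pubmed_id)
    return {gene: len(pmids) for gene, pmids in seen.items()}


def fraction_at_most(counts, k):
    """Fraction of genes whose reference count is <= k."""
    return sum(1 for c in counts.values() if c <= k) / len(counts)


# Toy data with made-up identifiers (illustration only).
pairs = [("g1", 101), ("g1", 102), ("g1", 103), ("g1", 104), ("g1", 105),
         ("g1", 106), ("g1", 107), ("g2", 101), ("g3", 108), ("g3", 109)]
genes = ["g1", "g2", "g3", "g4"]

counts = reference_counts(pairs, genes)
print(fraction_at_most(counts, 5))  # share of genes with 5 or fewer references
print(fraction_at_most(counts, 1))  # share of genes with one or no references
```

Passing the full gene list separately matters because a gene with no references never appears in a gene-to-article mapping, yet it still belongs in the denominator.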
  • 3. … because the literature is sparsely curated? [Figure: number of PubMed-indexed articles (0 to 1,000,000), 1979 to 2009]
  • 4. … because the literature is sparsely curated? [Figure: number of articles vs. the average reading capacity of a typical scientist, 1979 to 2009]
  • 5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
  • 6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  • 7. The Long Tail is a prolific source of content. [Figure: content produced vs. contributors (sorted), with Short Head and Long Tail regions.] Short Head vs. Long Tail examples: News: newspapers vs. blogs; Video: TV/Hollywood vs. YouTube; Product reviews: Consumer Reports vs. Amazon reviews; Food reviews: food critics vs. Yelp; Talent judging: Olympics vs. American Idol
  • 8. Wikipedia is reasonably accurate
  • 9. Wikipedia has breadth and depth. [Figure: articles, words (millions), and words per article for Wikipedia vs. Britannica Online.] Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
  • 10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
  • 11. From crowdsourcing to structured data: The Gene Wiki; Biological Games
  • Filtering, extracting, and summarizing PubMed: documents, concepts, review article
  • Filtering, extracting, and summarizing PubMed: documents, concepts
  • 14. Wiki success depends on a positive feedback. [Diagram: gene wiki page utility, number of users, number of contributors]
  • 15. 10,000 gene "stubs" within Wikipedia. Stub contents: protein structure, gene summary, symbols and identifiers, Gene Ontology annotations, protein interactions, tissue expression pattern, linked references, links to structured databases. (Huss, PLoS Biol, 2008)
  • 16. Gene Wiki has a critical mass of readers. Total: 5.0 million views / month. (Huss, PLoS Biol, 2008; Good, NAR, 2011)
  • 17. Gene Wiki has a critical mass of editors. [Figure: editor count and edit count.] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. (Good, NAR, 2011)
  • 18. A review article for every gene is powerful: hyperlinks to related concepts and references to the literature. Reelin: 98 editors, 703 edits since July 2002. Heparin: 358 editors, 654 edits since June 2003. AMPK: 109 editors, 203 edits since March 2004. RNAi: 394 editors, 994 edits since October 2002.
  • 19. The Gene Wiki is (reasonably) reliable. Good edits: per-edit probability 98.9%, average lifetime 115.4 d, probability by time 99.968%. Vandalism: per-edit probability 1.1%, average lifetime 3.4 d, probability by time 0.032%. (0.63% for WP overall.) [Figure: cumulative edits by date.] (Good, NAR, 2011)
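One plausible reading of slide 19's "probability by time" column is that each edit type is weighted by how long such an edit survives before it is changed. The slide does not state the formula, so the sketch below is an assumption, and `time_weighted_share` is a hypothetical helper name.

```python
def time_weighted_share(per_edit_prob, lifetime_days):
    """Weight each edit type by its average lifetime, then normalize to 1."""
    weighted = [p * t for p, t in zip(per_edit_prob, lifetime_days)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Rounded values from the slide: good edits first, vandalism second.
good, vandalism = time_weighted_share([0.989, 0.011], [115.4, 3.4])
print(f"good: {good:.5%}, vandalism: {vandalism:.5%}")
```

With the rounded inputs from the slide, the vandalism share comes out at roughly 0.03%, in line with the figure quoted there; the small difference is consistent with rounding of the inputs.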
  • 20. Making the Gene Wiki more reliable. Example text: "Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …" and "The company name is derived from old Greek, and means "destroyer of birds"."
  • 21. Making the Gene Wiki more reliable. Same example text, annotated by author trust: 36211 total edits vs. 36 total edits; high-trust author vs. low-trust author. http://www.wikitrust.net/
  • 22. Making the Gene Wiki more computable: from free text to structured annotations
  • 23. Filling the gaps in gene annotation. [Diagram: NCBI Entrez Gene: 334, Gene Wiki mapping, wikilink, GO exact match (GO:0006897), candidate assertion.] 6319 novel GO annotations; 2147 novel DO annotations
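The pipeline on slide 23 (gene page, wikilink, exact match to a GO term, candidate assertion) can be sketched roughly as below. This is an illustration under assumptions, not the Gene Wiki mining code: the regular expression, the function name, the one-entry term table, and the sample wikitext are all hypothetical. Only the gene ID 334 and the term GO:0006897 (endocytosis) come from the slide.

```python
import re

# Matches [[Target]] and [[Target|label]]; captures only the link target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]")


def candidate_go_assertions(gene_id, wikitext, go_name_to_id):
    """Return (gene, GO id) pairs for wikilinks that exactly match a GO term name."""
    candidates = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().lower()
        go_id = go_name_to_id.get(target)
        if go_id and (gene_id, go_id) not in candidates:
            candidates.append((gene_id, go_id))
    return candidates


# Hypothetical lookup table and made-up page text, for illustration only.
go_name_to_id = {"endocytosis": "GO:0006897"}
page = "The protein participates in [[endocytosis]] and binds [[heparin|heparin sulfate]]."

result = candidate_go_assertions("334", page, go_name_to_id)
print(result)
```

Each result is only a candidate: an exact name match says the page links to the concept, not that the relationship is a valid annotation, so some review step would still be needed.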
  • 24. TOP 100 GENES
  • 25. Gene Wiki content improves enrichment analysis. Example: axon guidance (GO:0007411). GO term gene list: 264 genes. PubMed abstracts with concept recognition: 811 articles. Enrichment analysis, genes linked through PubMed (rows) vs. membership in GO:0007411 (columns): linked and in term, 13; linked and not in term, 2; not linked and in term, 251; not linked and not in term, 12033. P = 1.55E-20
  • 26. Gene Wiki content improves enrichment analysis. Example: muscle contraction (GO:0006936). GO term gene list: 87 genes. PubMed abstracts with concept recognition: 251 articles, plus 87 Gene Wiki articles. Linked genes through PubMed: P = 1.0. Linked genes through PubMed + Gene Wiki: P = 1.22E-09
  • 27. Gene Wiki content improves enrichment analysis. [Figure: p-value (PubMed + GW) vs. p-value (PubMed only); regions labeled "more significant PubMed only" and "more significant PubMed + GW"; muscle contraction highlighted]
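The 2x2 table on slide 25 is the kind of input a one-sided enrichment test takes. The talk does not name the test, so the sketch below assumes a hypergeometric upper tail (equivalent to a one-sided Fisher's exact test), computed with exact integer arithmetic from the standard library; `hypergeom_upper_tail` is a hypothetical helper name.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K belong to the GO term."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Counts from slide 25: rows = linked through PubMed (yes/no),
# columns = annotated to GO:0007411 (yes/no).
a, b, c, d = 13, 2, 251, 12033

p = hypergeom_upper_tail(k=a, K=a + c, n=a + b, N=a + b + c + d)
print(f"P = {p:.3g}")
```

Under this assumption the result is on the order of 1E-20, the same order as the value quoted on the slide.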
  • 28. Gene Wiki+: Crowdsourced semantic database. Q: What genes are related to hemolytic anemia?
  • 29. The Long Tail of scientists is a valuable source of information on gene function
  • 30. From crowdsourcing to structured data: The Gene Wiki; Biological Games
  • 31. Gene databases are numerous and overlapping … and hundreds more …
  • 32. Community extensibility and user customizability. http://biogps.org
  • 33 to 38. Utility: A simple and universal plugin interface. Total of 389 gene-centric online databases registered as BioGPS plugins
  • 39. Users: BioGPS has critical mass. [Figure: daily pageviews.] > 5000 registered users; 13,500 unique visitors per month; 155,000 page views per week. Top 10 organizations: 1. Harvard, 2. NIH, 3. UCSD, 4. Scripps, 5. MIT, 6. Cambridge, 7. U Penn, 8. Stanford, 9. Wash U, 10. UNC
  • 40. Contributors: Explicit and implicit knowledge. 389 plugins registered (65% publicly shared) by over 75 users spanning 150+ domains
  • 41. Mining structured content from HTML
  • 42. Defining a data extraction template: TP53, TNF, APOE, IL6, VEGF, EGFR, TGFB1, …
  • 43. The BioGPS Semantic Annotator. http://50.112.124.237
  • 44. The Long Tail of bioinformaticians can collaboratively build a gene portal.
  • 45. From crowdsourcing to structured data: The Gene Wiki; Biological Games
  • 46. Seven million human hours. http://www.flickr.com/photos/archana3k1/4124330493/
  • 47. Twenty million human hours. http://www.flickr.com/photos/ableman/2171326385/
  • 48. 150 billion human hours per year. http://www.flickr.com/photos/rvp-cw/6243289302/
  • 49. Using games to fold proteins. Fold.it players have successfully: outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010); solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011); designed an improved protein folding algorithm (Khatib, PNAS, 2011); improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
  • 50. Using games to fold RNAs. http://eterna.cmu.edu/
  • 51. Using games to align sequences. http://phylo.cs.mcgill.ca
  • 52. Using games to annotate genes? http://genegames.org
  • 53. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimers disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease
  • 54. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimers disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility
  • 55. No good gene-disease annotation database. Query: Apolipoprotein E. Results, several marked with "?": Alzheimers disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases
  • 56. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimers disease (AD); Memory; Coronary Artery Disease; Neuropsychological Tests; Hypertension; Cognition Disorders; Mental Status Schedule; Psychiatric Status Rating Scales; Dementia; Cognition; Hyperlipidemias; Atrophy; Disease Progression; Dementia, Vascular; Cardiovascular Diseases; Parkinson Disease; Brain Injuries; Coronary Disease; Myocardial Infarction; Diabetes Mellitus, Type 2; Memory Disorders; … 477 diseases!
  • 57. Play Dizeez to annotate gene-disease links. 1. Read the clue (gene). 2. Click the related disease (only one is "right"). 3. If it's "right", you get points. 4. Then on to the next question… 5. Hurry! 6. Play to win!
  • 58. Dizeez players seem pretty smart… In total (since Dec 2011): 230 unique gamers; 1045 games played; 8525 guesses. Table columns: # Occurrences, Gene, Disease, Gene Wiki, OMIM, PharmGKB, PubMed (entries for the four source columns are not present in this transcript). Rows: 11, NBPF3, neuroblastoma; 11, SOX8, mental retardation; 9, ABL1, leukemia; 9, SSX1, synovial sarcoma; 8, APC, colorectal cancer; 8, FES, sarcoma; 8, RBP3, retinoblastoma; 8, GAST, gastrinoma; 8, DCC, colorectal cancer; 8, MAP3K5, cancer
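The "# Occurrences" column on slide 58 suggests that individual guesses are tallied per gene-disease pair. A minimal sketch of such a tally follows, assuming each guess is logged as a (gene, disease) tuple; the log format, the `min_count` cutoff, and the toy guesses are assumptions and not data from the game.

```python
from collections import Counter


def tally_pairs(guesses, min_count=2):
    """Count how often each (gene, disease) pair was chosen; keep frequent ones."""
    counts = Counter(guesses)
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Made-up guess log for illustration.
guesses = [
    ("ABL1", "leukemia"),
    ("ABL1", "leukemia"),
    ("APC", "colorectal cancer"),
    ("ABL1", "lymphoma"),
    ("APC", "colorectal cancer"),
    ("ABL1", "leukemia"),
]

top = tally_pairs(guesses)
for (gene, disease), n in top:
    print(n, gene, disease)
```

Pairs that many independent players agree on can then be compared against existing sources, which is what the Gene Wiki, OMIM, PharmGKB, and PubMed columns on the slide appear to do.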
  • 59. Using games to predict phenotype from genotype? http://genegames.org
  • 60. Classification problems in genome biology. Find patterns in training data (100s of samples, 100,000s of features, labeled cancer or normal), then classify new samples as cancer or normal. Methods: SVM, neural networks, Naïve Bayes, KNN, …
  • 61. Random forests: sample a subset of cases and features, then train a decision tree (100s of samples, 100,000s of features; cancer vs. normal)
  • 62. Random forests (100s of samples, 100,000s of features; cancer vs. normal)
  • 63. Random forests: classify new samples as cancer or normal. How to interject biological knowledge?
  • 64. Network-guided forests. Dutkowski & Ideker (2011). PLoS Computational Biology
  • 65. Network-guided forests: sample features by PPI network, then train a decision tree (100s of samples, 100,000s of features; cancer vs. normal)
  • 66. Human-guided forests: sample features by human intelligence, then train a decision tree (100s of samples, 100,000s of features; cancer vs. normal)
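Slides 61 to 66 vary one step of the random forest recipe: how features are sampled for each tree. The sketch below illustrates that idea under simplifying assumptions. One-split decision stumps stand in for full decision trees, labels are 0 (normal) or 1 (cancer), and the per-feature `weights` are a hypothetical stand-in for whatever a PPI network or a human player would supply. It is not the method from Dutkowski & Ideker or from the talk.

```python
import random
from collections import Counter


def train_stump(rows, labels, features):
    """Find the single-feature threshold rule that classifies the most rows correctly."""
    best_correct, best_rule = -1, None
    for f in features:
        for t in sorted({row[f] for row in rows}):
            for sign in (1, -1):
                preds = [1 if sign * (row[f] - t) > 0 else 0 for row in rows]
                correct = sum(p == y for p, y in zip(preds, labels))
                if correct > best_correct:
                    best_correct, best_rule = correct, (f, t, sign)
    return best_rule


def predict_stump(rule, row):
    f, t, sign = rule
    return 1 if sign * (row[f] - t) > 0 else 0


def train_guided_forest(rows, labels, weights, n_trees=25, n_features=2, seed=0):
    """Bagging over cases; features are drawn with probability proportional to weights."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of cases
        feats = sorted(set(rng.choices(range(len(weights)), weights=weights, k=n_features)))
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote across all stumps."""
    votes = Counter(predict_stump(rule, row) for rule in forest)
    return votes.most_common(1)[0][0]


# Toy data: feature 0 separates the classes, features 1 to 3 are noise.
rows = [[5.0, 0.1, 0.9, 0.4], [6.0, 0.8, 0.2, 0.5], [5.5, 0.3, 0.7, 0.1], [6.5, 0.6, 0.4, 0.9],
        [1.0, 0.2, 0.8, 0.3], [1.5, 0.7, 0.3, 0.6], [0.5, 0.4, 0.6, 0.2], [2.0, 0.5, 0.5, 0.8]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
weights = [10, 1, 1, 1]  # hypothetical prior: feature 0 is believed to matter most

forest = train_guided_forest(rows, labels, weights)
print(predict_forest(forest, [7.0, 0.5, 0.5, 0.5]), predict_forest(forest, [0.2, 0.5, 0.5, 0.5]))
```

Setting all weights equal recovers uniform feature sampling, the plain random forest of slide 61; the guided variants differ only in where the weights come from.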
  • 67
  • 68 to 73. The Cure: Genomic predictors for disease
  • 74. Human-guided forests: classify new samples as cancer or normal
  • 75. "Critical Assessment"-style challenge
  • 76. Preliminary results. 214 registered players: 50% declared knowledge of cancer biology; 40% self-identified as having Ph.D. Prediction results: 69% correct on survival concordance index; best scoring model was 72%
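Slide 76 reports scores on a survival concordance index without saying how it was computed. The sketch below assumes the common Harrell-style definition: among comparable pairs of patients, the fraction in which the patient with the earlier observed event was given the higher predicted risk. The function name and the toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Share of comparable pairs that the risk scores order correctly (ties count half)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Made-up follow-up times, event flags (1 = event observed, 0 = censored), and risk scores.
times = [2, 5, 7, 9]
events = [1, 1, 0, 1]
risks = [0.9, 0.6, 0.7, 0.1]

c = concordance_index(times, events, risks)
print(c)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives some context for the 69% and 72% figures on the slide.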
  • 77. The Long Tail of gamers can collaboratively build an accurate disease classifier.
  • 78. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Ben Good, Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su. Recruiting graduate students in quantitative biology! See http://education.scripps.edu/. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820)