Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)
 

  • We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren't biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% of genes have 5 or fewer references; 38% have one or no references.
  • If you believe that more than 1.5% of articles are relevant to gene function, then this says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  • For each resource: briefly describe the unstructured resource; describe the structuring approach.
  • Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  • Reverted four minutes later
  • Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  • Tried on 773 GO categories, significant in 356 cases (46%)
  • We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  • Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  • MODs and portals
  • Genetics resources
  • Literature resources
  • Protein resources
  • Pathway and expression databases
  • Empire state building
  • Question: how to interject biological knowledge in the feature selection process?

Presentation Transcript

  • 1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org. Andrew Su, Ph.D. (@andrewsu, asu@scripps.edu, http://sulab.org). Sanger/EBI, September 7, 2012
  • 2. Few genes are well annotated… [Figure: PubMed and Gene Ontology counts per gene for 23,278 protein-coding genes, sorted by decreasing counts. Labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE. Labeled percentages: 59%, 38%.] Data: NCBI gene2pubmed, August 2010
  • 3. … because the literature is sparsely curated? [Figure: number of PubMed-indexed articles per year, 1979 to 2009; y-axis 0 to 1,000,000.]
  • 4. … because the literature is sparsely curated? [Figure: average capacity of a typical human scientist, as number of articles read by scientist, 1979 to 2009; y-axis 0 to 20.]
  • 5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
  • 6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  • 7. The Long Tail is a prolific source of content. [Figure: content produced vs. contributors (sorted), with a Short Head and a Long Tail.] Short Head vs. Long Tail examples. News: newspapers vs. blogs. Video: TV/Hollywood vs. YouTube. Product reviews: Consumer Reports vs. Amazon reviews. Food reviews: food critics vs. Yelp. Talent judging: Olympics vs. American Idol.
  • 8. Wikipedia is reasonably accurate
  • 9. Wikipedia has breadth and depth. [Figure: articles, words (millions), and words/article for Wikipedia vs. Britannica Online.] Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
  • 10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
  • 11. From crowdsourcing to structured data: The Gene Wiki; Biological Games
  • 12. Filtering, extracting, and summarizing PubMed. [Diagram labels: Documents, Concepts.]
  • 13. Wiki success depends on positive feedback. [Diagram: feedback loop linking Gene Wiki page utility, number of users, and number of contributors.]
  • 14. 10,000 gene “stubs” within Wikipedia. Stub contents: protein structure, gene summary, symbols and identifiers, Gene Ontology annotations, protein interactions, tissue expression pattern, linked references, links to structured databases. [Diagram: Utility, Users, Contributors.] Huss, PLoS Biol, 2008
  • 15. Gene Wiki has a critical mass of readers. Total: 5.0 million views / month. [Diagram: Utility, Users, Contributors.] Huss, PLoS Biol, 2008; Good, NAR, 2011
  • 16. Gene Wiki has a critical mass of editors. [Figure: editor count and edit count.] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. [Diagram: Utility, Users, Contributors.] Good, NAR, 2011
  • 17. A review article for every gene is powerful. Reelin: 98 editors, 703 edits since July 2002. Heparin: 358 editors, 654 edits since June 2003. AMPK: 109 editors, 203 edits since March 2004. RNAi: 394 editors, 994 edits since October 2002. Callouts: hyperlinks to related concepts; references to the literature.
  • 18. Making the Gene Wiki more reliable. Example text: “Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …” and “The company name is derived from old Greek, and means "destroyer of birds".”
  • 19. Making the Gene Wiki more reliable. Same example text, marked up by author trust: 36211 total edits; 36 total edits. Legend: high-trust author, low-trust author. http://www.wikitrust.net/
  • 20. Making the Gene Wiki more computable: from free text to structured annotations
  • 21. Filling the gaps in gene annotation. [Diagram labels: NCBI Entrez Gene: 334; Gene Wiki mapping; wikilink; candidate assertion; GO:0006897; GO exact match.] 6319 novel GO annotations; 2147 novel DO annotations
  • 22. Top 100 genes
  • 23. Gene Wiki content improves enrichment analysis. Example GO term: axon guidance (GO:0007411); 811 articles, 264 genes. Pipeline: GO term, PubMed abstracts, concept recognition, gene list, enrichment analysis. Contingency table (rows: gene linked through PubMed, yes/no; columns: gene annotated with GO:0007411, yes/no): yes/yes 13, yes/no 2, no/yes 251, no/no 12033. P = 1.55E-20
  • 24. Gene Wiki content improves enrichment analysis. Example GO term: muscle contraction (GO:0006936); 251 articles, 87 genes; 87 Gene Wiki articles. Pipeline: GO term, PubMed abstracts + Gene Wiki, concept recognition, gene list, enrichment analysis. Linked genes through PubMed: P = 1.0. Linked genes through PubMed + Gene Wiki: P = 1.22E-09
  • 25. Gene Wiki content improves enrichment analysis. [Figure: p-value (PubMed + GW) plotted against p-value (PubMed only); regions labeled “More significant PubMed only” and “More significant PubMed + GW”; muscle contraction highlighted.]
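The 2x2 table on slide 23 can be turned into a p-value with a few lines of Python. The slides do not say which statistical test was used, so the sketch below assumes a one-sided Fisher's exact test (the upper tail of the hypergeometric distribution); the function name and argument names are illustrative, not taken from the talk. The printed value can be compared against the slide's P = 1.55E-20.

```python
from math import comb


def enrichment_p_value(both, linked_only, term_only, neither):
    """One-sided Fisher's exact test for a 2x2 gene table.

    both        -- genes linked through the literature AND annotated with the GO term
    linked_only -- genes linked through the literature but not annotated
    term_only   -- genes annotated but not linked
    neither     -- all remaining genes
    """
    total = both + linked_only + term_only + neither
    linked = both + linked_only      # size of the literature-derived gene list
    in_term = both + term_only       # genes annotated with the GO term
    max_overlap = min(linked, in_term)
    # Count the ways to draw `linked` genes with an overlap of `both` or more.
    tail = sum(
        comb(in_term, k) * comb(total - in_term, linked - k)
        for k in range(both, max_overlap + 1)
    )
    return tail / comb(total, linked)


# Counts as read from the slide 23 table (assumed test: one-sided Fisher).
p = enrichment_p_value(both=13, linked_only=2, term_only=251, neither=12033)
print(f"P = {p:.3g}")
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow for very small tail probabilities; only the final division produces a float.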
  • 26. Gene Wiki+: Crowdsourced semantic database. Q: What genes are related to hemolytic anemia?
  • 27. The Long Tail of scientists is a valuable source of information on gene function
  • 28. From crowdsourcing to structured data: The Gene Wiki; Biological Games
  • 29. Gene databases are numerous and overlapping … and hundreds more …
  • 30. Community extensibility and user customizability. http://biogps.org
  • 31-36. Utility: A simple and universal plugin interface. Total of 389 gene-centric online databases registered as BioGPS plugins. [Diagram: Utility, Contributors, Users.]
  • 37. Users: BioGPS has critical mass. > 4100 registered users; 4000 unique visitors per week; 40,000 page views per week. [Figure: daily pageviews.] Top 10 organizations: 1. Harvard, 2. NIH, 3. UCSD, 4. Scripps, 5. MIT, 6. Cambridge, 7. U Penn, 8. Stanford, 9. Wash U, 10. UNC. [Diagram: Utility, Contributors, Users.]
  • 38. Contributors: Explicit and implicit knowledge. 389 plugins registered (65% publicly shared) by over 75 users spanning 150+ domains. [Diagram: Utility, Contributors, Users.]
  • 39. Mining structured content from HTML
  • 40. Defining a data extraction template. [Example gene list: TP53, TNF, APOE, IL6, VEGF, EGFR, TGFB1, …]
  • 41. The BioGPS Semantic Annotator. http://50.112.124.237
  • 42. The Long Tail of bioinformaticians can collaboratively build a gene portal.
  • 43. From crowdsourcing to structured data: The Gene Wiki; Biological Games
  • 44. Seven million human hours. Image: http://www.flickr.com/photos/archana3k1/4124330493/
  • 45. Twenty million human hours. Image: http://www.flickr.com/photos/ableman/2171326385/
  • 46. 150 billion human hours per year. Image: http://www.flickr.com/photos/rvp-cw/6243289302/
  • 47. Using games to fold proteins. Fold.it players have successfully: outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010); solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011); designed an improved protein folding algorithm (Khatib, PNAS, 2011); improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
  • 48. Using games to fold RNAs. http://eterna.cmu.edu/
  • 49. Using games to align sequences. http://phylo.cs.mcgill.ca
  • 50. Using games to annotate genes? http://genegames.org
  • 51. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease
  • 52. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility
  • 53. No good gene-disease annotation database. Query: Apolipoprotein E. Results: ? Alzheimer's disease (AD); ? Lipoprotein glomerulopathy; ? Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; ? Macular degeneration, age-related; ? Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases
  • 54. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Memory; Coronary Artery Disease; Neuropsychological Tests; Hypertension; Cognition Disorders; Mental Status Schedule; Psychiatric Status Rating Scales; Dementia; Cognition; Hyperlipidemias; Atrophy; Disease Progression; Dementia, Vascular; Cardiovascular Diseases; Parkinson Disease; Brain Injuries; Coronary Disease; Myocardial Infarction; Diabetes Mellitus, Type 2; Memory Disorders; … 477 diseases!
  • 55. Play Dizeez to annotate gene-disease links. 1. Read the clue (gene). 2. Click the related disease (only one is “right”). 3. If it's 'right', you get points. 4. Then on to the next question… 5. Hurry! 6. Play to win!
  • 56. Dizeez players seem pretty smart… In total (since Dec 2011): 207 unique gamers; 1045 games played; 8525 guesses. Table columns: # Occurrences, Gene, Disease, Pubmed, OMIM, PharmGKB, Gene Wiki. Rows (# Occurrences, Gene, Disease): 7, GAST, gastrinoma; 7, RBP3, retinoblastoma; 7, SSX1, synovial sarcoma; 6, TG, Graves disease; 6, CRYGC, cataract; 6, SOX8, mental retardation; 6, WRN, Werner syndrome; 6, ABL1, leukemia; 6, MLL3, leukemia; 6, SNAI2, breast carcinoma
  • 57. Dizeez players seem pretty smart… In total (since Dec 2011): 207 unique gamers; 1045 games played; 8525 guesses. Table columns: # Occurrences, Gene, Disease, Pubmed, OMIM, PharmGKB, Gene Wiki. Rows (# Occurrences, Gene, Disease): 5, MECOM, sarcoma; 4, ATF7, cancer; 3, ABCB5, acute myeloid leukemia; 3, SART1, glioblastoma; 3, NCK1, leukemia; 3, NEK1, cancer
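The “# Occurrences” column in the Dizeez tables suggests that repeated player guesses for the same gene-disease pair are tallied. The slides do not describe the actual aggregation, so the following is only a minimal sketch of one way to do it: the guess log is invented for illustration (not real game data), and the `min_occurrences` threshold is a hypothetical parameter.

```python
from collections import Counter

# Hypothetical guess log: (gene shown as the clue, disease the player clicked).
guesses = [
    ("GAST", "gastrinoma"),
    ("GAST", "gastrinoma"),
    ("WRN", "Werner syndrome"),
    ("GAST", "gastrinoma"),
    ("WRN", "leukemia"),
    ("WRN", "Werner syndrome"),
]


def candidate_links(guess_log, min_occurrences=2):
    """Tally identical (gene, disease) guesses and keep the frequent ones."""
    counts = Counter(guess_log)
    return [
        (n, gene, disease)
        for (gene, disease), n in counts.most_common()  # highest count first
        if n >= min_occurrences
    ]


for n, gene, disease in candidate_links(guesses):
    print(n, gene, disease)
```

Pairs that many independent players agree on could then be checked against existing resources such as the database columns listed in the table.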
  • 58. Using games to predict phenotype from genotype? The Cure. http://genegames.org
  • 59. Classification problems in genome biology. [Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features; find patterns; classify new samples as cancer or normal.] Methods: SVM, neural networks, Naïve Bayes, KNN, …
  • 60. Random forests. Sample a subset of cases and features; train a decision tree. [Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features.]
  • 61. Random forests. [Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features.]
  • 62. Random forests. Classify new samples as cancer or normal. How to interject biological knowledge? [Diagram: data matrix of 100s of samples by 100,000s of features.]
  • 63. Network-guided forests. Dutkowski & Ideker (2011). PLoS Computational Biology
  • 64. Network-guided forests. Sample features by PPI network; train a decision tree. [Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features.]
  • 65. Human-guided forests. Sample features by human intelligence; train a decision tree. [Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features.]
  • 66.
  • 67-72. The Cure: Genomic predictors for disease
  • 73. Human-guided forests. Classify new samples as cancer or normal.
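The random forest, network-guided forest, and human-guided forest slides differ only in how features are sampled for each tree. The sketch below illustrates that one step under stated assumptions: per-feature weights stand in for whatever signal guides the sampling (a hypothetical tally of player choices here), weighted sampling without replacement uses the Efraimidis-Spirakis key trick, and decision-tree training itself is left out. None of the names or numbers come from the talk.

```python
import random


def sample_features(weights, k, rng):
    """Draw k distinct features, favoring those with larger weights."""
    # Efraimidis-Spirakis: key = u ** (1 / w); keep the k largest keys.
    keyed = [(rng.random() ** (1.0 / w), name) for name, w in weights.items() if w > 0]
    keyed.sort(reverse=True)
    return [name for _, name in keyed[:k]]


def bootstrap_cases(n_samples, rng):
    """Sample case indices with replacement, as in bagging."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def plan_forest(weights, n_samples, n_trees, k, seed=0):
    """Decide which cases and features each tree would be trained on."""
    rng = random.Random(seed)
    return [
        {"cases": bootstrap_cases(n_samples, rng),
         "features": sample_features(weights, k, rng)}
        for _ in range(n_trees)
    ]


# Uniform weights give the plain random-forest behavior.
uniform = {f"gene_{i}": 1.0 for i in range(1000)}
# Hypothetical guidance: players (or a network) favored two genes.
guided = dict(uniform, gene_7=50.0, gene_42=50.0)

plan = plan_forest(guided, n_samples=100, n_trees=200, k=10)
favored = sum("gene_7" in tree["features"] for tree in plan)
baseline = sum("gene_8" in tree["features"] for tree in plan)
print(f"gene_7 appears in {favored} trees, gene_8 in {baseline}")
```

Each entry of `plan` could be handed to any decision-tree learner; swapping the weight dictionary is all it takes to move between the uniform and guided variants.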
  • 74. “Critical Assessment”-style challenge. Will this work? Check our blog after October 15.
  • 75. The Long Tail of gamers can collaboratively build an accurate disease classifier.
  • 76. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Ben Good, Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su, @genegame. Recruiting graduate students in quantitative biology! See http://education.scripps.edu/. Funding and Support: BioGPS: GM83924, Gene Wiki: GM089820