
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

  • We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren't biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% of genes have 5 or fewer references; 38% have one or no references.
  • If you believe that more than 1.5% of articles are relevant to gene function, then this points to a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  • For each resource: briefly describe the unstructured resource, then describe the structuring approach.
  • Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
  • Reverted four minutes later
  • Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  • Tried on 773 GO categories, significant in 356 cases (46%)
  • We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  • Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  • MODs and portals
  • Genetics resources
  • Literature resources
  • Protein resources
  • Pathway and expression databases
  • Empire state building
  • Question: how to interject biological knowledge in the feature selection process?
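The enrichment result for axon guidance (GO:0007411) in this deck is reported with a 2×2 table (13, 2, 251, 12033) and P = 1.55E-20. As a rough illustration of how such a p-value can be obtained, the sketch below computes a one-sided Fisher's exact test from the hypergeometric distribution using only the standard library. That the slide used exactly this test variant is an assumption; the function and argument names are hypothetical.

```python
from math import comb


def enrichment_p_value(linked_in_term, linked_not_in_term,
                       unlinked_in_term, unlinked_not_in_term):
    """One-sided Fisher's exact test for over-representation.

    Returns P(overlap >= observed) under the hypergeometric null, where
    'linked' means a gene was linked to the GO term through text mining
    and 'in_term' means the gene is annotated with the GO term.
    """
    linked = linked_in_term + linked_not_in_term
    in_term = linked_in_term + unlinked_in_term
    total = linked + unlinked_in_term + unlinked_not_in_term
    max_overlap = min(linked, in_term)
    # Number of ways to pick `linked` genes with at least the observed overlap.
    favorable = sum(
        comb(in_term, k) * comb(total - in_term, linked - k)
        for k in range(linked_in_term, max_overlap + 1)
    )
    return favorable / comb(total, linked)


# Counts taken from the axon guidance table in the slides.
p = enrichment_p_value(13, 2, 251, 12033)
print(f"one-sided p-value: {p:.3g}")
```

With the counts from the slide, the result is on the order of 1e-20, consistent with the reported value. A two-sided test or a multiple-testing correction across all 773 GO terms would need extra steps not shown here.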

    1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org. Andrew Su, Ph.D. @andrewsu, asu@scripps.edu, http://sulab.org. Sanger/EBI, September 7, 2012.
    2. Few genes are well annotated… [Figure: PubMed and Gene Ontology counts for 23,278 protein-coding genes, sorted by decreasing counts. Labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE. Marked fractions: 59%, 38%.] Data: NCBI gene2pubmed, August 2010.
    3. … because the literature is sparsely curated? [Figure: number of PubMed-indexed articles per year, 1979 to 2009; y-axis 0 to 1,000,000.]
    4. … because the literature is sparsely curated? [Figure, 1979 to 2009, y-axis 0 to 20; labels: "Number of articles read by typical scientist", "Average capacity of human scientist".]
    5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations.
    6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
    7. The Long Tail is a prolific source of content. [Figure: content produced versus contributors (sorted), showing a short head and a long tail.] Short head versus long tail: news (newspapers vs. blogs); video (TV/Hollywood vs. YouTube); product reviews (Consumer Reports vs. Amazon reviews); food reviews (food critics vs. Yelp); talent judging (Olympics vs. American Idol).
    8. Wikipedia is reasonably accurate.
    9. Wikipedia has breadth and depth. [Table: articles, words (millions), and words/article for Wikipedia and Britannica Online; values not recovered.] Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008.
    10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
    11. From crowdsourcing to structured data: The Gene Wiki; Biological Games.
    12. Filtering, extracting, and summarizing PubMed. [Diagram labels: Documents, Concepts.]
    13. Wiki success depends on positive feedback. [Diagram: Gene Wiki page utility, number of contributors (1, 2), number of users (100, 200).]
    14. 10,000 gene "stubs" within Wikipedia. Stub contents: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases. (Huss, PLoS Biol, 2008)
    15. Gene Wiki has a critical mass of readers. Total: 5.0 million views / month. (Huss, PLoS Biol, 2008; Good, NAR, 2011)
    16. Gene Wiki has a critical mass of editors. [Figure: editor count and edit count.] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. (Good, NAR, 2011)
    17. A review article for every gene is powerful: hyperlinks to related concepts and references to the literature. Reelin: 98 editors, 703 edits since July 2002. Heparin: 358 editors, 654 edits since June 2003. AMPK: 109 editors, 203 edits since March 2004. RNAi: 394 editors, 994 edits since October 2002.
    18. Making the Gene Wiki more reliable. Example article text: Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), … The company name is derived from old Greek, and means "destroyer of birds".
    19. Making the Gene Wiki more reliable. Same example article text, annotated by author trust. Author edit counts shown: 36211 total edits and 36 total edits. Legend: high-trust author, low-trust author. http://www.wikitrust.net/
    20. Making the Gene Wiki more computable. Free text; structured annotations.
    21. Filling the gaps in gene annotation. [Diagram: NCBI Entrez Gene: 334; Gene Wiki mapping; wikilink; candidate assertion; GO:0006897; GO exact match.] 6319 novel GO annotations; 2147 novel DO annotations.
    22. TOP 100 GENES
    23. Gene Wiki content improves enrichment analysis. Workflow: GO term axon guidance (GO:0007411); gene list (264 genes); PubMed abstracts (811 articles); concept recognition; enrichment analysis. Contingency table (genes linked through PubMed versus GO:0007411 annotation): linked and annotated, 13; linked and not annotated, 2; not linked and annotated, 251; not linked and not annotated, 12033. P = 1.55E-20.
    24. Gene Wiki content improves enrichment analysis. Workflow: GO term muscle contraction (GO:0006936); gene list (87 genes); PubMed abstracts (251 articles) + Gene Wiki (87 articles); concept recognition; enrichment analysis. Linked genes through PubMed: P = 1.0. Linked genes through PubMed + Gene Wiki: P = 1.22E-09.
    25. Gene Wiki content improves enrichment analysis. [Figure: p-value (PubMed + GW) versus p-value (PubMed only), with regions labeled "more significant PubMed + GW" and "more significant PubMed only"; muscle contraction is highlighted.]
    26. Gene Wiki+: Crowdsourced semantic database. Q: What genes are related to hemolytic anemia?
    27. The Long Tail of scientists is a valuable source of information on gene function.
    28. From crowdsourcing to structured data: The Gene Wiki; Biological Games.
    29. Gene databases are numerous and overlapping … and hundreds more …
    30. Community extensibility and user customizability. http://biogps.org
    31. Utility: A simple and universal plugin interface.
    32. Utility: A simple and universal plugin interface.
    33. Utility: A simple and universal plugin interface.
    34. Utility: A simple and universal plugin interface.
    35. Utility: A simple and universal plugin interface.
    36. Utility: A simple and universal plugin interface. Total of 389 gene-centric online databases registered as BioGPS plugins.
    37. Users: BioGPS has critical mass. [Figure: daily pageviews.] > 4100 registered users; 4000 unique visitors per week; 40,000 page views per week. Top 10 organizations: 1. Harvard; 2. NIH; 3. UCSD; 4. Scripps; 5. MIT; 6. Cambridge; 7. U Penn; 8. Stanford; 9. Wash U; 10. UNC.
    38. Contributors: Explicit and implicit knowledge. 389 plugins registered (65% publicly shared) by over 75 users spanning 150+ domains.
    39. Mining structured content from HTML.
    40. Defining a data extraction template. TP53, TNF, APOE, IL6, VEGF, EGFR, TGFB1, …
    41. The BioGPS Semantic Annotator. http://50.112.124.237
    42. The Long Tail of bioinformaticians can collaboratively build a gene portal.
    43. From crowdsourcing to structured data: The Gene Wiki; Biological Games.
    44. Seven million human hours. (Photo: http://www.flickr.com/photos/archana3k1/4124330493/)
    45. Twenty million human hours. (Photo: http://www.flickr.com/photos/ableman/2171326385/)
    46. 150 billion human hours per year. (Photo: http://www.flickr.com/photos/rvp-cw/6243289302/)
    47. Using games to fold proteins. Fold.it players have successfully: outperformed state of the art protein folding algorithms (Cooper, Nature, 2010); solved a previously-intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011); designed an improved protein folding algorithm (Khatib, PNAS, 2011); improved enzyme activity of de novo designed enzyme (Eiben, Nat Biotechnol, 2011).
    48. Using games to fold RNAs. http://eterna.cmu.edu/
    49. Using games to align sequences. http://phylo.cs.mcgill.ca
    50. Using games to annotate genes? http://genegames.org
    51. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease.
    52. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility.
    53. No good gene-disease annotation database. Query: Apolipoprotein E. Results: ? Alzheimer's disease (AD); ? Lipoprotein glomerulopathy; ? Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; ? Macular degeneration, age-related; ? Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases.
    54. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Memory; Coronary Artery Disease; Neuropsychological Tests; Hypertension; Cognition Disorders; Mental Status Schedule; Psychiatric Status Rating Scales; Dementia; Cognition; Hyperlipidemias; Atrophy; Disease Progression; Dementia, Vascular; Cardiovascular Diseases; Parkinson Disease; Brain Injuries; Coronary Disease; Myocardial Infarction; Diabetes Mellitus, Type 2; Memory Disorders; … 477 diseases!
    55. Play Dizeez to annotate gene-disease links. 1. Read the clue (gene). 2. Click the related disease (only one is "right"). 3. If it's "right", you get points. 4. Then on to the next question… 5. Hurry! 6. Play to win!
    56. Dizeez players seem pretty smart… In total (since Dec 2011): 207 unique gamers; 1045 games played; 8525 guesses. Table columns: # occurrences, gene, disease, PubMed, OMIM, PharmGKB, Gene Wiki (database columns not recovered). Rows: 7, GAST, gastrinoma; 7, RBP3, retinoblastoma; 7, SSX1, synovial sarcoma; 6, TG, Graves disease; 6, CRYGC, cataract; 6, SOX8, mental retardation; 6, WRN, Werner syndrome; 6, ABL1, leukemia; 6, MLL3, leukemia; 6, SNAI2, breast carcinoma.
    57. Dizeez players seem pretty smart… In total (since Dec 2011): 207 unique gamers; 1045 games played; 8525 guesses. Table columns: # occurrences, gene, disease, PubMed, OMIM, PharmGKB, Gene Wiki (database columns not recovered). Rows: 5, MECOM, sarcoma; 4, ATF7, cancer; 3, ABCB5, acute myeloid leukemia; 3, SART1, glioblastoma; 3, NCK1, leukemia; 3, NEK1, cancer.
    58. Using games to predict phenotype from genotype? The Cure. http://genegames.org
    59. Classification problems in genome biology. [Diagram: training data of 100s of samples (cancer, normal) by 100,000s of features; find patterns; classify new samples as cancer or normal.] Methods: SVM, neural networks, naïve Bayes, KNN, …
    60. Random forests. Sample subset of cases and features; train decision tree. [Diagram: 100s of samples (cancer, normal) by 100,000s of features.]
    61. Random forests. [Diagram: 100s of samples (cancer, normal) by 100,000s of features.]
    62. Random forests. Classify new samples as cancer or normal. How to interject biological knowledge?
    63. Network-guided forests. Dutkowski & Ideker (2011). PLoS Computational Biology.
    64. Network-guided forests. Sample features by PPI network; train decision tree. [Diagram: 100s of samples (cancer, normal) by 100,000s of features.]
    65. Human-guided forests. Sample features by human intelligence; train decision tree. [Diagram: 100s of samples (cancer, normal) by 100,000s of features.]
    66. (no text)
    67. The Cure: Genomic predictors for disease.
    68. The Cure: Genomic predictors for disease.
    69. The Cure: Genomic predictors for disease.
    70. The Cure: Genomic predictors for disease.
    71. The Cure: Genomic predictors for disease.
    72. The Cure: Genomic predictors for disease.
    73. Human-guided forests. Classify new samples as cancer or normal.
    74. "Critical Assessment"-style challenge. Will this work? Check our blog after October 15.
    75. The Long Tail of gamers can collaboratively build an accurate disease classifier.
    76. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Ben Good; Max Nanis; Salvatore Loguercio; Chunlei Wu; Ian Macleod. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su, @genegame. Recruiting graduate students in quantitative biology! See http://education.scripps.edu/. Funding and support: BioGPS: GM83924; Gene Wiki: GM089820.
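The Dizeez tables on slides 56 and 57 rank gene-disease pairs by "# occurrences" and compare them against existing databases. The sketch below shows one way such a ranking could be produced, assuming an occurrence is one player guess linking a gene to a disease. The function name, the guess log, and the `known` set are all hypothetical and made up for illustration; they are not data from the game.

```python
from collections import Counter


def rank_gene_disease_links(guesses, known_links=frozenset(), min_occurrences=2):
    """Rank (gene, disease) pairs by how often players proposed them.

    Each result row is (occurrences, gene, disease, already_known), where
    already_known says whether the pair appears in `known_links`.
    """
    counts = Counter(guesses)
    ranked = []
    # most_common() yields pairs in order of decreasing count.
    for (gene, disease), n in counts.most_common():
        if n >= min_occurrences:
            ranked.append((n, gene, disease, (gene, disease) in known_links))
    return ranked


# Hypothetical guess log: one (gene, disease) tuple per player guess.
guesses = (
    [("GAST", "gastrinoma")] * 3
    + [("WRN", "Werner syndrome")] * 2
    + [("GAST", "cataract")]
)
# Hypothetical stand-in for a lookup against an existing annotation database.
known = {("WRN", "Werner syndrome")}

for row in rank_gene_disease_links(guesses, known):
    print(row)
```

Pairs that many players agree on but that are missing from `known_links` are the candidates for new annotations; the threshold `min_occurrences` is an arbitrary choice here.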

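Slides 60 to 65 describe random forests, network-guided forests, and human-guided forests as differing mainly in how features are sampled before each tree is trained. The toy sketch below illustrates that idea with weighted feature sampling: uniform weights give ordinary random feature sampling, while non-uniform weights stand in for guidance from a PPI network or from players. It is a simplification under stated assumptions, not the method of Dutkowski & Ideker or of The Cure: each "tree" is a single-split decision stump, and all names, weights, and data are hypothetical.

```python
import random
from collections import Counter


def train_stump(X, y, features):
    """Find the (feature, threshold, label_above) split with the best training accuracy."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for above in (0, 1):  # label predicted when row[f] > t
                correct = sum(
                    (above if row[f] > t else 1 - above) == label
                    for row, label in zip(X, y)
                )
                if best is None or correct > best[0]:
                    best = (correct, f, t, above)
    return None if best is None else best[1:]


def train_guided_forest(X, y, weights, n_trees=50, n_features=2, seed=0):
    """Train stumps on bootstrap samples, drawing candidate features by weight."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of cases
        # Guidance enters here: higher-weight features are drawn more often.
        feats = sorted(set(rng.choices(range(len(weights)), weights=weights, k=n_features)))
        stump = train_stump([X[i] for i in idx], [y[i] for i in idx], feats)
        if stump is not None:  # skip samples where no split was possible
            forest.append(stump)
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(above if row[f] > t else 1 - above for f, t, above in forest)
    return votes.most_common(1)[0][0]


# Toy data: feature 0 separates cancer (1) from normal (0); the rest is noise.
X = [[5.1, 0.2, 0.9, 0.4], [4.8, 0.7, 0.1, 0.6], [5.5, 0.5, 0.5, 0.1],
     [1.0, 0.3, 0.8, 0.5], [0.7, 0.6, 0.2, 0.9], [1.2, 0.4, 0.6, 0.3]]
y = [1, 1, 1, 0, 0, 0]
weights = [8, 1, 1, 1]  # hypothetical guidance favoring feature 0

forest = train_guided_forest(X, y, weights)
print(predict(forest, [5.0, 0.5, 0.5, 0.5]), predict(forest, [0.9, 0.5, 0.5, 0.5]))
```

In a real setting the weights would have to come from somewhere concrete, for example network proximity or the features players pick in the game, and full decision trees would replace the stumps.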