Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org

Given at DBMI seminar series at UCSD. http://dbmi.ucsd.edu/display/DBMI/Seminars

  • We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren't biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. No IEA.
  • If you believe that greater than 1.5% of articles have relevance to gene function, then it says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  • For each resource: briefly describe the unstructured resource; describe the structuring approach.
  • Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  • Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  • Tried on 773 GO categories, significant in 356 cases (46%)
  • We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  • Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  • Developer resources do not scale with usage. Practical effects: core developers' time is always the rate-limiting step; addition of new features and data always feels slow; eventually, new databases are created to fill the gap; 80% duplication for 20% innovation.
  • MODs and portals
  • Genetics resources
  • Literature resources
  • Protein resources
  • Pathway and expression databases
  • Empire state building
  • Question: how to interject biological knowledge in the feature selection process?
  • Kellogg School slide.pptx
  • Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
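The enrichment analysis mentioned in the notes above can be illustrated with a small calculation. This is a minimal sketch, not the authors' pipeline: it assumes the test is a one-sided hypergeometric (Fisher-style) test and that the flattened table on slide 21 reads as 13 genes both linked through PubMed and annotated with the GO term, 2 linked only, 251 annotated only, and 12033 neither. The function name and argument names are hypothetical.

```python
from math import comb


def enrichment_p_value(overlap, linked, annotated, universe):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X counts annotated genes among `linked` genes drawn without
    replacement from `universe` genes, `annotated` of which carry the term.
    """
    max_overlap = min(linked, annotated)
    favorable = sum(
        comb(annotated, k) * comb(universe - annotated, linked - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favorable / comb(universe, linked)


# Counts read from the slide 21 table (assumed layout, see the lead-in).
both, linked_only, term_only, neither = 13, 2, 251, 12033
p = enrichment_p_value(
    overlap=both,
    linked=both + linked_only,
    annotated=both + term_only,
    universe=both + linked_only + term_only + neither,
)
print(f"one-sided enrichment p-value: {p:.3g}")
```

Under these assumptions the result should land in the same range as the P value reported on slide 21; a different test or gene universe would shift it.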

    1. 1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org. Andrew Su, Ph.D. (@andrewsu, asu@scripps.edu, http://sulab.org). April 5, 2013, UCSD DBMI Seminar
    2. 2. Few genes are well annotated… [Figure: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts; callouts at 41% and 65%; labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Data: NCBI, February 2013]
    3. 3. … because the literature is sparsely curated? [Figure: number of PubMed-indexed articles per year, 1979 to 2009]
    4. 4. … because the literature is sparsely curated? [Figure: average capacity of a human scientist (number of articles read by a typical scientist), 1979 to 2009]
    5. 5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
    6. 6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
    7. 7. The Long Tail is a prolific source of content. [Figure: content produced vs. contributors (sorted), divided into Short Head and Long Tail.] Short Head vs. Long Tail: News: newspapers vs. blogs. Video: TV/Hollywood vs. YouTube. Product reviews: Consumer Reports vs. Amazon reviews. Food reviews: food critics vs. Yelp. Talent judging: Olympics vs. American Idol.
    8. 8. Wikipedia is reasonably accurate
    9. 9. Wikipedia has breadth and depth. [Figure: articles and words (millions), Wikipedia vs. Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]
    10. 10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
    11. 11. From crowdsourcing to structured data: The Gene Wiki; Biological Games
    12. 12. Filtering, extracting, and summarizing PubMed: Documents; Concepts; Review article
    13. 13. Filtering, extracting, and summarizing PubMed: Documents; Concepts
    14. 14. Wiki success depends on a positive feedback. [Diagram: Gene Wiki page utility, number of users, number of contributors]
    15. 15. 10,000 gene “stubs” within Wikipedia. Each stub includes: protein structure, symbols and identifiers, tissue expression pattern, Gene Ontology annotations, links to structured databases, gene summary, protein interactions, linked references. (Huss, PLoS Biol, 2008)
    16. 16. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. (Huss, PLoS Biol, 2008; Good, NAR, 2011)
    17. 17. Gene Wiki has a critical mass of editors. Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. (Good, NAR, 2011) [Figure: editor count and edit count]
    18. 18. A review article for every gene is powerful: references to the literature; hyperlinks to related concepts. Reelin: 98 editors, 703 edits since July 2002. Heparin: 358 editors, 654 edits since June 2003. AMPK: 109 editors, 203 edits since March 2004. RNAi: 394 editors, 994 edits since October 2002.
    19. 19. Making the Gene Wiki more computable: Structured annotations; Free text
    20. 20. Filling the gaps in gene annotation. [Diagram: a wikilink, resolved by Gene Wiki mapping to NCBI Entrez Gene: 334 and by GO exact match to GO:0006897, yields a candidate assertion.] 6319 novel GO annotations; 2147 novel DO annotations
    21. 21. Gene Wiki content improves enrichment analysis. [Diagram: GO term to gene list; PubMed abstracts to concept recognition; enrichment analysis.] Example: axon guidance (GO:0007411), 264 genes; linked genes through PubMed, 811 articles; P = 1.55 E-20. Contingency table (Yes/No by Yes/No): Yes/Yes 13, Yes/No 2, No/Yes 251, No/No 12033
    22. 22. Gene Wiki content improves enrichment analysis. [Diagram: GO term to gene list; PubMed abstracts + Gene Wiki to concept recognition; enrichment analysis.] Example: muscle contraction (GO:0006936), 87 genes. Linked genes through PubMed: P = 1.0. Linked genes through PubMed + Gene Wiki: P = 1.22 E-09. 251 articles; 87 articles
    23. 23. Gene Wiki content improves enrichment analysis. [Scatter plot: p-value (PubMed only) vs. p-value (PubMed + GW); regions labeled "more significant PubMed + GW" and "more significant PubMed only"; muscle contraction highlighted]
    24. 24. Making the Gene Wiki more computable: Structured annotations; Free text; Analyses
    25. 25. Making the Gene Wiki more computable: Structured annotations; Free text; Databases
    26. 26. Making the Gene Wiki more computable: Databases; Linked Data
    27. 27. The Long Tail of scientists is a valuable source of information on gene function
    28. 28. From crowdsourcing to structured data: The Gene Wiki; Biological Games
    29. 29. Gene databases are numerous and overlapping … and hundreds more …
    30. 30. Why is there so much redundancy? [Figure: users, requests, and resources over time; community development.] BioGPS emphasizes community extensibility
    31. 31. Why do developers define the gene report view? BioGPS emphasizes user customizability
    32. 32. Community extensibility and user customizability: http://biogps.org
    33. 33. Utility: A simple and universal plugin interface. URL template + gene entity = rendered URL. Example URL templates: KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}} STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}} PubMed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}
    34. 34. Utility: A simple and universal plugin interface
    35. 35. Utility: A simple and universal plugin interface
    36. 36. Utility: A simple and universal plugin interface
    37. 37. Utility: A simple and universal plugin interface
    38. 38. Utility: A simple and universal plugin interface
    39. 39. Utility: A simple and universal plugin interface. Total of > 540 gene-centric online databases registered as BioGPS plugins
    40. 40. Users: BioGPS has critical mass. > 6400 registered users; 14,000 unique visitors per month; 155,000 page views per month. Top 10 organizations: 1. Harvard, 2. NIH, 3. UCSD, 4. Scripps, 5. MIT, 6. Cambridge, 7. U Penn, 8. Stanford, 9. Wash U, 10. UNC. [Figure: daily pageviews]
    41. 41. Contributors: Explicit and implicit knowledge. 540 plugins registered (>300 publicly shared) by over 120 users, spanning 280+ domains
    42. 42. All resources should provide RDF…
    43. 43. Mining structured content from HTML
    44. 44. Defining a data extraction template: … TP53 TNF APOE IL6 VEGF … EGFR TGFB1
    45. 45. The BioGPS Semantic Annotator: http://54.244.135.254:8080
    46. 46. The Long Tail of bioinformaticians can collaboratively build a gene portal.
    47. 47. From crowdsourcing to structured data: The Gene Wiki; Biological Games
    48. 48. Seven million human hours (photo: http://www.flickr.com/photos/archana3k1/4124330493/)
    49. 49. Twenty million human hours (photo: http://www.flickr.com/photos/ableman/2171326385/)
    50. 50. 150 billion human hours per year (photo: http://www.flickr.com/photos/rvp-cw/6243289302/)
    51. 51. Using games to fold proteins. Fold.it players have successfully: outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010); solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011); designed an improved protein folding algorithm (Khatib, PNAS, 2011); improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
    52. 52. Using games to fold RNAs: http://eterna.cmu.edu/
    53. 53. Using games to align sequences: http://phylo.cs.mcgill.ca
    54. 54. Using games to diagnose malaria infection: http://biogames.ee.ucla.edu/
    55. 55. Using games to map neurons: http://eyewire.org
    56. 56. Using games to annotate genes? http://genegames.org
    57. 57. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimers disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease
    58. 58. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimers disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility
    59. 59. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimers disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases; ?????
    60. 60. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimers disease (AD); Neuropsychological Tests; Cognition Disorders; Dementia; Cognition; Disease Progression; Cardiovascular Diseases; Coronary Disease; Diabetes Mellitus, Type 2; Memory Disorders; Memory; Coronary Artery Disease; Hypertension; Mental Status Schedule; Psychiatric Status Rating Scales; Hyperlipidemias; Atrophy; Dementia, Vascular; Parkinson Disease; Brain Injuries; Myocardial Infarction; … 477 diseases!
    61. 61. Play Dizeez to annotate gene-disease links. 1. Read the clue (gene). 2. Click the related disease (only one is “right”). 3. If it's “right”, you get points. 4. Then on to the next question… 5. Hurry! 6. Play to win!
    62. 62. Dizeez players seem pretty smart… In total (since Dec 2011): 230 unique gamers; 1045 games played; 8525 guesses. Most frequent gene-disease pairs (# occurrences, gene, disease): 11 NBPF3 neuroblastoma; 11 SOX8 mental retardation; 9 ABL1 leukemia; 9 SSX1 synovial sarcoma; 8 APC colorectal cancer; 8 FES sarcoma; 8 RBP3 retinoblastoma; 8 GAST gastrinoma; 8 DCC colorectal cancer; 8 MAP3K5 cancer. [Legend: Gene Wiki, OMIM, PharmGKB, PubMed]
    63. 63. Using games to predict phenotype from genotype? http://genegames.org
    64. 64. Classification problems in genome biology. Data: 100s of samples (cancer vs. normal), 100,000s of features. Find patterns, then classify new samples as cancer or normal. Methods: SVM, neural networks, naïve Bayes, KNN, …
    65. 65. Random forests: sample a subset of cases and features, then train a decision tree. Data: 100s of samples (cancer vs. normal), 100,000s of features
    66. 66. Random forests. Data: 100s of samples (cancer vs. normal), 100,000s of features
    67. 67. Random forests: classify new samples as cancer or normal. Data: 100s of samples (cancer vs. normal), 100,000s of features. How to interject biological knowledge?
    68. 68. Network-guided forests. Dutkowski & Ideker (2011). PLoS Computational Biology
    69. 69. Network-guided forests: sample features by PPI network, then train a decision tree. Data: 100s of samples (cancer vs. normal), 100,000s of features
    70. 70. Human-guided forests: sample features by human intelligence, then train a decision tree. Data: 100s of samples (cancer vs. normal), 100,000s of features
    71. 71. 71
    72. 72. The Cure: Genomic predictors for disease
    73. 73. The Cure: Genomic predictors for disease
    74. 74. The Cure: Genomic predictors for disease
    75. 75. The Cure: Genomic predictors for disease
    76. 76. The Cure: Genomic predictors for disease
    77. 77. The Cure: Genomic predictors for disease
    78. 78. Human-guided forests: classify new samples as cancer or normal
    79. 79. “Critical Assessment”-style challenge
    80. 80. Results. 214 registered players: 50% declared knowledge of cancer biology; 40% self-identified as having a Ph.D. Prediction results: 70% correct on survival concordance index; best scoring model was 76%; player registrations still increasing!
    81. 81. The Long Tail of gamers can collaboratively build an accurate disease classifier.
    82. 82. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Katie Fisch, Ben Good, Salvatore Loguercio, Max Nanis, Chunlei Wu. Key group alumni: Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820). Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su
    83. 83. Doctoral Program in Chemical and Biological Sciences, CALIFORNIA. Office of Graduate Studies, 10550 N. Torrey Pines Road, La Jolla, CA 92037. Email: gradprgrm@scripps.edu. Phone: 858.784.8469. http://education.scripps.edu
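The plugin interface on slide 33 (URL template + gene entity = rendered URL) can be sketched in a few lines. This is an illustrative sketch only, not BioGPS code: the function name, the regular expression, and the gene record layout are assumptions, and only the KEGG template is used because the other templates on the slide are elided.

```python
import re

# Matches placeholders of the form {{FieldName}}, as shown on slide 33.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_plugin_url(template, gene):
    """Fill each {{Field}} in `template` from the `gene` record (a dict)."""
    def substitute(match):
        field = match.group(1)
        if field not in gene:
            raise KeyError(f"gene record has no field {field!r}")
        return str(gene[field])

    return PLACEHOLDER.sub(substitute, template)


# KEGG template copied from slide 33; the gene record is a hypothetical example.
kegg_template = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"
gene = {"Symbol": "TP53", "EntrezGene": 7157}

rendered = render_plugin_url(kegg_template, gene)
print(rendered)
```

Raising on a missing field, rather than leaving the placeholder in place, is a design choice made here so that a plugin requiring an identifier the gene lacks fails visibly.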