Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Given at DBMI seminar series at UCSD. http://dbmi.ucsd.edu/display/DBMI/Seminars

Published in: Technology
  • We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren't biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. No IEA (Inferred from Electronic Annotation).
  • If you believe that greater than 1.5% of articles have relevance to gene function, then it says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  • For each resource: briefly describe the unstructured resource; describe the structuring approach.
  • Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
  • Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  • Tried on 773 GO categories, significant in 356 cases (46%)
  • We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  • Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  • Developer resources do not scale with usage. Practical effects: core developers' time is always the rate-limiting step; addition of new features and data always feels slow; eventually, new databases are created to fill the gap. 80% duplication for 20% innovation.
  • MODs and portals
  • Genetics resources
  • Literature resources
  • Protein resources
  • Pathway and expression databases
  • Empire state building
  • Question: how to interject biological knowledge in the feature selection process?
  • Kellogg School slide.pptx
  • Transcript

    • 1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org. Andrew Su, Ph.D. @andrewsu, asu@scripps.edu, http://sulab.org. April 5, 2013, UCSD DBMI Seminar.
    • 2. Few genes are well annotated… Data: NCBI, February 2013. [Figure: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts; markers at 41% and 65%; labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF.]
    • 3. … because the literature is sparsely curated? [Figure: number of PubMed-indexed articles by year, 1979 to 2009; y-axis 0 to 1,000,000.]
    • 4. … because the literature is sparsely curated? [Figure: average capacity of a human scientist, shown as the number of articles read by a typical scientist, 1979 to 2009; y-axis 0 to 20.]
    • 5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations.
    • 6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
    • 7. The Long Tail is a prolific source of content. [Figure: content produced vs. contributors (sorted), split into Short Head and Long Tail.] Short Head vs. Long Tail examples: News: newspapers vs. blogs. Video: TV/Hollywood vs. YouTube. Product reviews: Consumer Reports vs. Amazon reviews. Food reviews: food critics vs. Yelp. Talent judging: Olympics vs. American Idol.
    • 8. Wikipedia is reasonably accurate.
    • 9. Wikipedia has breadth and depth. [Figure: articles and words (millions), Wikipedia vs. Britannica Online.] Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008.
    • 10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
    • 11. From crowdsourcing to structured data. [Section labels: The Gene Wiki; Biological Games.]
    • 12. Filtering, extracting, and summarizing PubMed. [Diagram labels: Documents; Concepts; Review article.]
    • 13. Filtering, extracting, and summarizing PubMed. [Diagram labels: Documents; Concepts.]
    • 14. Wiki success depends on a positive feedback. [Diagram labels: Gene wiki page utility; Number of users; Number of contributors.]
    • 15. 10,000 gene “stubs” within Wikipedia. Labeled page elements: protein structure; symbols and identifiers; tissue expression pattern; Gene Ontology annotations; links to structured databases; gene summary; protein interactions; linked references. (Huss, PLoS Biol, 2008) [Feedback diagram: Utility, Users, Contributors.]
    • 16. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. (Huss, PLoS Biol, 2008; Good, NAR, 2011) [Feedback diagram: Utility, Users, Contributors.]
    • 17. Gene Wiki has a critical mass of editors. Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. (Good, NAR, 2011) [Figure: editor count and edit count; feedback diagram: Utility, Users, Contributors.]
    • 18. A review article for every gene is powerful: references to the literature; hyperlinks to related concepts. Reelin: 98 editors, 703 edits since July 2002. Heparin: 358 editors, 654 edits since June 2003. AMPK: 109 editors, 203 edits since March 2004. RNAi: 394 editors, 994 edits since October 2002.
    • 19. Making the Gene Wiki more computable. [Diagram labels: Structured annotations; Free text.]
    • 20. Filling the gaps in gene annotation. [Diagram labels: Wikilink; GO exact match; Gene Wiki mapping; NCBI Entrez Gene: 334; Candidate assertion; GO:0006897.] 6319 novel GO annotations. 2147 novel DO annotations.
    • 21. Gene Wiki content improves enrichment analysis. [Workflow labels: GO term; Gene list; Concept recognition; PubMed abstracts; Enrichment analysis.] Example: axon guidance (GO:0007411), 264 genes; linked genes through PubMed; 811 articles; P = 1.55 E-20. 2×2 table (rows and columns both labeled Yes/No): Yes/Yes 13, Yes/No 2, No/Yes 251, No/No 12033.
    • 22. Gene Wiki content improves enrichment analysis. [Workflow labels: GO term; Gene list; Concept recognition; PubMed abstracts + Gene Wiki; Enrichment analysis.] Example: muscle contraction (GO:0006936), 87 genes. Linked genes through PubMed: P = 1.0. Linked genes through PubMed + Gene Wiki: P = 1.22 E-09. Article counts shown: 251 articles; 87 articles.
    • 23. Gene Wiki content improves enrichment analysis. [Scatter plot: p-value (PubMed only) vs. p-value (PubMed + GW); regions labeled “More significant PubMed + GW” and “More significant PubMed only”; muscle contraction highlighted.]
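The enrichment P value on slide 21 can be illustrated with a one-sided Fisher's exact test on that slide's 2×2 table. This is a minimal sketch, not necessarily the method used in the talk. It assumes the four counts mean: 13 genes both annotated with the GO term and linked through PubMed, 2 linked only, 251 annotated only, and 12033 neither. The function name `fisher_exact_greater` is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def fisher_exact_greater(both, linked_only, annotated_only, neither):
    """One-sided Fisher's exact test: P(overlap >= observed) under the
    hypergeometric null of no association between the two gene sets."""
    linked = both + linked_only          # genes linked through the literature
    annotated = both + annotated_only    # genes carrying the GO annotation
    total = both + linked_only + annotated_only + neither
    max_overlap = min(linked, annotated)
    denominator = comb(total, linked)
    # Sum the hypergeometric terms for every overlap at least as large as
    # the observed one. Integer arithmetic keeps the sum exact.
    numerator = sum(
        comb(annotated, k) * comb(total - annotated, linked - k)
        for k in range(both, max_overlap + 1)
    )
    return numerator / denominator

# Counts as read from slide 21 (their interpretation is an assumption).
p_value = fisher_exact_greater(13, 2, 251, 12033)
print(f"one-sided P = {p_value:.3g}")
```

The printed value can be compared with the P value shown on the slide. Fisher's exact test is unchanged when the table is transposed, so the result does not depend on which of the two Yes/No axes is the GO annotation.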
    • 24. Making the Gene Wiki more computable. [Diagram labels: Structured annotations; Free text; Analyses.]
    • 25. Making the Gene Wiki more computable. [Diagram labels: Structured annotations; Free text; Databases.]
    • 26. Making the Gene Wiki more computable. [Diagram labels: Databases; Linked Data.]
    • 27. The Long Tail of scientists is a valuable source of information on gene function.
    • 28. From crowdsourcing to structured data. [Section labels: The Gene Wiki; Biological Games.]
    • 29. Gene databases are numerous and overlapping … and hundreds more …
    • 30. Why is there so much redundancy? [Figure labels: Users; Requests; Resources; Time; Community development.] BioGPS emphasizes community extensibility.
    • 31. Why do developers define the gene report view? BioGPS emphasizes user customizability.
    • 32. http://biogps.org: community extensibility and user customizability.
    • 33. Utility: A simple and universal plugin interface. [Diagram labels: URL template; Gene entity; Rendered URL.] KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}} STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}} PubMed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}
    • 34. Utility: A simple and universal plugin interface. [Feedback diagram: Utility, Users, Contributors.]
    • 35. Utility: A simple and universal plugin interface. [Feedback diagram: Utility, Users, Contributors.]
    • 36. Utility: A simple and universal plugin interface. [Feedback diagram: Utility, Users, Contributors.]
    • 37. Utility: A simple and universal plugin interface. [Feedback diagram: Utility, Users, Contributors.]
    • 38. Utility: A simple and universal plugin interface. [Feedback diagram: Utility, Users, Contributors.]
    • 39. Utility: A simple and universal plugin interface. Total of > 540 gene-centric online databases registered as BioGPS plugins. [Feedback diagram: Utility, Users, Contributors.]
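The plugin interface on slide 33 pairs a URL template with a gene entity to produce a rendered URL. The sketch below shows one way such a substitution could work. It is an illustration, not BioGPS's actual implementation: the function name `render_plugin_url`, the dictionary layout of the gene record, and the percent-encoding step are all assumptions. Only the KEGG template is used because it appears in full on the slide.

```python
import re
from urllib.parse import quote

# Matches placeholders such as {{EntrezGene}} or {{Symbol}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def render_plugin_url(template, gene):
    """Fill each {{Field}} placeholder with the matching gene identifier."""
    def substitute(match):
        field = match.group(1)
        if field not in gene:
            raise KeyError(f"gene record has no identifier named {field!r}")
        # Percent-encode the value so it is safe inside a URL.
        return quote(str(gene[field]), safe="")
    return PLACEHOLDER.sub(substitute, template)

# KEGG template copied from the slide; the gene record is illustrative.
kegg_template = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"
tp53 = {"Symbol": "TP53", "EntrezGene": 7157}
print(render_plugin_url(kegg_template, tp53))
# prints http://www.genome.jp/dbget-bin/www_bget?hsa:7157
```

Raising an error on an unknown placeholder, rather than leaving it in the URL, is a design choice of this sketch: it makes a template that needs an identifier the gene record lacks fail visibly.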
    • 40. Users: BioGPS has critical mass. > 6400 registered users; 14,000 unique visitors per month; 155,000 page views per month. Top 10 organizations: 1. Harvard; 2. NIH; 3. UCSD; 4. Scripps; 5. MIT; 6. Cambridge; 7. U Penn; 8. Stanford; 9. Wash U; 10. UNC. [Figure: daily pageviews; feedback diagram: Utility, Users, Contributors.]
    • 41. Contributors: Explicit and implicit knowledge. 540 plugins registered (>300 publicly shared) by over 120 users spanning 280+ domains. [Feedback diagram: Utility, Users, Contributors.]
    • 42. All resources should provide RDF…
    • 43. Mining structured content from HTML.
    • 44. Defining a data extraction template. [Example gene symbols: … TP53, TNF, APOE, IL6, VEGF … EGFR, TGFB1.]
    • 45. The BioGPS Semantic Annotator: http://54.244.135.254:8080
    • 46. The Long Tail of bioinformaticians can collaboratively build a gene portal.
    • 47. From crowdsourcing to structured data. [Section labels: The Gene Wiki; Biological Games.]
    • 48. Seven million human hours. Photo: http://www.flickr.com/photos/archana3k1/4124330493/
    • 49. Twenty million human hours. Photo: http://www.flickr.com/photos/ableman/2171326385/
    • 50. 150 billion human hours per year. Photo: http://www.flickr.com/photos/rvp-cw/6243289302/
    • 51. Using games to fold proteins. Fold.it players have successfully: outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010); solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011); designed an improved protein folding algorithm (Khatib, PNAS, 2011); improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011).
    • 52. Using games to fold RNAs: http://eterna.cmu.edu/
    • 53. Using games to align sequences: http://phylo.cs.mcgill.ca
    • 54. Using games to diagnose malaria infection: http://biogames.ee.ucla.edu/
    • 55. Using games to map neurons: http://eyewire.org
    • 56. Using games to annotate genes? http://genegames.org
    • 57. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease.
    • 58. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility.
    • 59. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases; ?????
    • 60. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Neuropsychological Tests; Cognition Disorders; Dementia; Cognition; Disease Progression; Cardiovascular Diseases; Coronary Disease; Diabetes Mellitus, Type 2; Memory Disorders; Memory; Coronary Artery Disease; Hypertension; Mental Status Schedule; Psychiatric Status Rating Scales; Hyperlipidemias; Atrophy; Dementia, Vascular; Parkinson Disease; Brain Injuries; Myocardial Infarction; … 477 diseases!
    • 61. Play Dizeez to annotate gene-disease links. 1. Read the clue (gene). 2. Click the related disease (only one is “right”). 3. If it's “right”, you get points. 4. Then on to the next question… 5. Hurry! 6. Play to win!
    • 62. Dizeez players seem pretty smart… In total (since Dec 2011): 230 unique gamers; 1045 games played; 8525 guesses.
      # Occurrences | Gene | Disease
      11 | NBPF3 | neuroblastoma
      11 | SOX8 | mental retardation
      9 | ABL1 | leukemia
      9 | SSX1 | synovial sarcoma
      8 | APC | colorectal cancer
      8 | FES | sarcoma
      8 | RBP3 | retinoblastoma
      8 | GAST | gastrinoma
      8 | DCC | colorectal cancer
      8 | MAP3K5 | cancer
      [Additional column labels: Gene Wiki; OMIM; PharmGKB; PubMed.]
    • 63. Using games to predict phenotype from genotype? http://genegames.org
    • 64. Classification problems in genome biology. [Diagram: training data of 100s of samples by 100,000s of features, labeled cancer or normal; find patterns (SVM, neural networks, naïve Bayes, KNN, …); classify new samples as cancer or normal.]
    • 65. Random forests. Sample a subset of cases and features; train a decision tree. [Data: 100s of samples by 100,000s of features, labeled cancer or normal.]
    • 66. Random forests. [Data: 100s of samples by 100,000s of features, labeled cancer or normal.]
    • 67. Random forests. Classify new samples as cancer or normal. [Data: 100s of samples by 100,000s of features, labeled cancer or normal.] How to interject biological knowledge?
    • 68. Network-guided forests. Dutkowski & Ideker (2011). PLoS Computational Biology.
    • 69. Network-guided forests. Sample features by PPI network; train a decision tree. [Data: 100s of samples by 100,000s of features, labeled cancer or normal.]
    • 70. Human-guided forests. Sample features by human intelligence; train a decision tree. [Data: 100s of samples by 100,000s of features, labeled cancer or normal.]
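Slides 65 to 70 describe forests in which each decision tree is trained on a sampled subset of features, with the sampling guided by a PPI network or by human intelligence. The sketch below illustrates only the feature-sampling step, using weighted sampling without replacement. It is an assumption about how human input could bias the draw, not the procedure used in the talk: the function name `sample_features`, the vote counts, and the add-one smoothing are all hypothetical.

```python
import random

def sample_features(weights, k, rng):
    """Draw up to k distinct features with probability proportional to weight."""
    remaining = dict(weights)
    chosen = []
    while remaining and len(chosen) < k:
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen

# Hypothetical counts of how often players selected each gene.
player_votes = {"TP53": 12, "EGFR": 7, "VEGFA": 3, "TNF": 1, "IL6": 0}
# Add-one smoothing keeps genes that nobody selected eligible.
weights = {gene: votes + 1 for gene, votes in player_votes.items()}

rng = random.Random(0)  # fixed seed so repeated runs draw the same subsets
for tree_index in range(3):
    # Each subset would be the candidate feature set for one decision tree.
    print(tree_index, sample_features(weights, k=3, rng=rng))
```

With uniform weights this reduces to the plain random feature sampling of slide 65. Weights derived from a PPI network would give the network-guided variant of slide 69.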
    • 71.
    • 72. The Cure: Genomic predictors for disease.
    • 73. The Cure: Genomic predictors for disease.
    • 74. The Cure: Genomic predictors for disease.
    • 75. The Cure: Genomic predictors for disease.
    • 76. The Cure: Genomic predictors for disease.
    • 77. The Cure: Genomic predictors for disease.
    • 78. Human-guided forests. Classify new samples as cancer or normal.
    • 79. “Critical Assessment”-style challenge.
    • 80. Results. 214 registered players: 50% declared knowledge of cancer biology; 40% self-identified as having a Ph.D. Prediction results: 70% correct on survival concordance index; best-scoring model was 76%; player registrations still increasing!
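Slide 80 reports prediction results as a survival concordance index. The sketch below shows one common definition (Harrell's C) as a reminder of what the percentage measures. The talk does not say which variant was used, so the handling of censoring and ties here is an assumption, as are the function name `concordance_index` and the toy data.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs ranked correctly by predicted risk.

    times  : observed follow-up time for each patient
    events : True if death was observed, False if the patient was censored
    risks  : predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: four patients, the third one censored at time 12.
times = [5, 8, 12, 20]
events = [True, True, False, True]
risks = [0.9, 0.4, 0.6, 0.1]
print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for reading the percentages on the slide.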
    • 81. The Long Tail of gamers can collaboratively build an accurate disease classifier.
    • 82. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Katie Fisch; Ben Good; Salvatore Loguercio; Max Nanis; Chunlei Wu. Key group alumni: Adriel Carolino; Erik Clarke; Jon Huss; Marc Leglise; Maximilian Ludvigsson; Ian MacLeod; Camilo Orozco. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820). Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su.
    • 83. Doctoral Program in Chemical and Biological Sciences, CALIFORNIA. Office of Graduate Studies, 10550 N. Torrey Pines Road, La Jolla, CA 92037. Email: gradprgrm@scripps.edu. Phone: 858.784.8469. http://education.scripps.edu
