Crowdsourcing to structure biological knowledge (USC/ISI)

Talk given at USC's Information Sciences Institute (http://www.isi.edu). The AV recording is pretty horrible, but for anyone interested: http://webcasterms1.isi.edu/mediasite/SilverlightPlayer/Default.aspx?peid=89751f8537c44f2fa241db99c793cd231d



  1. Crowdsourcing to structure biological knowledge. Andrew Su, Ph.D., Department of Molecular and Experimental Medicine, The Scripps Research Institute. ISI, USC, August 16, 2012
  2. Human genetics underlies human health. ~3 billion bases, ~23,000 genes. Molecular understanding (“gene annotation”) of: • Biological function • Genetic variation • Mutation • Deletion • Amplification • … Molecular diagnostics & therapeutics
  3. Structured gene annotations enable computation. [Figure label: Structured annotations.]
  4. Few genes are well annotated. [Plot: PubMed and Gene Ontology (GO) counts for 23,278 protein-coding genes; x-axis: genes, sorted by decreasing counts; y-axis: counts; labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE; percentages shown: 59%, 38%.] Data: NCBI gene2pubmed, August 2010
  5. Biocuration is a key annotation bottleneck. [Plot: number of PubMed-indexed articles (0 to 1,000,000) by year, 1979 to 2009.]
  6. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
  7. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  8. The Long Tail is a prolific source of content. [Plot: content produced vs. contributors (sorted), with Short Head and Long Tail regions.] Short Head vs. Long Tail examples: News: newspapers vs. blogs; Video: TV/Hollywood vs. YouTube; Product reviews: Consumer Reports vs. Amazon reviews; Food reviews: food critics vs. Yelp; Talent judging: Olympics vs. American Idol
  9. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
  10. From crowdsourcing to structured data: The Gene Wiki; Biological Games
  11. 10,000 gene “stubs” within Wikipedia. Page elements: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases. [Diagram labels: Utility, Users, Contributors.] (Huss, PLoS Biol, 2008)
  12. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month (Huss, PLoS Biol, 2008; Good, NAR, 2011)
  13. Gene Wiki has a critical mass of editors. [Plots: editor count; edit count.] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles (Good, NAR, 2011)
  14. A review article for every gene is powerful. Reelin: 68 editors, 543 edits since July 2002; Heparin: 175 editors, 320 edits since June 2003; AMPK: 44 editors, 84 edits since March 2004; RNAi: 232 editors, 708 edits since October 2002. References to the literature; hyperlinks to related concepts
  15. Filtering, extracting, and summarizing PubMed. [Diagram labels: Documents, Concepts.]
  16. Document- and concept-centric text mining. [Diagram labels: Subject, Predicate, Object.]
  17. Simple text mining for gene annotations. [Diagram labels: NCBI Entrez Gene: 334; Gene Wiki mapping; Wikilink; Candidate assertion; GO:0006897; GO exact match.] 6319 novel Gene Ontology annotations; 2147 novel Disease Ontology annotations
  18. Gene Wiki+ for integrative queries. [Diagram label: mwsync.] http://genewikiplus.org
  19. Dynamic queries across genes, diseases, SNPs
  20. [No text]
  21. TOP 100 GENES
  22. Gene Wiki+ for integrative queries. [Diagram labels: mwsync, OMIM, PharmGKB.] Query as shown (truncated on the slide): {{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: … <q>[[is_associated_with:: http://genewikiplus.org
  23. Gene Wiki+ for integrative queries. [Diagram labels: mwsync, OMIM, PharmGKB.] http://genewikiplus.org
  24. From crowdsourcing to structured data: The Gene Wiki; Biological Games
  25. Not just the biomedical literature…
  26. BioGPS aggregates gene-centric information. http://biogps.org
  27. The plugin interface is simple and universal. URL templates: PubMed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}; STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}; KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}. [Diagram labels: URL template, Gene entity, Rendered URL.]
  28. The plugin interface is simple and universal
  29. The plugin interface is simple and universal
  30. The plugin interface is simple and universal
  31. The plugin interface is simple and universal
  32. The plugin interface is simple and universal. Total of 389 gene-centric online databases registered as BioGPS plugins
  33. BioGPS has a critical mass of users. • > 4100 registered users • 4000 unique visitors per week • 40,000 page views per week. [Plot: daily pageviews.] Top 10 organizations: 1. Harvard 2. NIH 3. UCSD 4. Scripps 5. MIT 6. Cambridge 7. U Penn 8. Stanford 9. Wash U 10. UNC
  34. All resources should provide RDF…
  35. Mining structured content from HTML
  36. Defining a data extraction template. [Example values: TP53, TNF, APOE, IL6, VEGF, EGFR, TGFB1, …]
  37. The BioGPS Semantic Annotator. http://50.112.124.237
  38. All resources should provide flat files…
  39. From crowdsourcing to structured data: The Gene Wiki; Biological Games
  40. Seven million human hours. http://www.flickr.com/photos/archana3k1/4124330493/
  41. Twenty million human hours. http://www.flickr.com/photos/ableman/2171326385/
  42. 150 billion human hours per year. http://www.flickr.com/photos/rvp-cw/6243289302/
  43. Using games to fold proteins. Fold.it players have successfully: • Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010) • Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011) • Designed an improved protein folding algorithm (Khatib, PNAS, 2011) • Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
  44. Using games to fold RNAs. http://eterna.cmu.edu/
  45. Using games to align sequences. http://phylo.cs.mcgill.ca
  46. Using games to annotate gene-disease links. [Screenshot callouts: “Click the related disease”; “If it’s ‘right’, you get points”; “then on to the next question”; “hurry!”] http://genegames.org
  47. Dizeez players seem pretty smart… In total: • 207 unique gamers • 1045 games played • 8525 guesses. Table columns: # Occurrences, Gene, Disease, PubMed, OMIM, PharmGKB, Gene Wiki (the last four columns have no values in this transcript). Rows: 7, GAST, gastrinoma; 7, RBP3, retinoblastoma; 7, SSX1, synovial sarcoma; 6, TG, Graves disease; 6, CRYGC, cataract; 6, SOX8, mental retardation; 6, WRN, Werner syndrome; 6, ABL1, leukemia; 6, MLL3, leukemia; 6, SNAI2, breast carcinoma
  48. Dizeez players seem pretty smart… In total: • 207 unique gamers • 1045 games played • 8525 guesses. Table columns: # Occurrences, Gene, Disease, PubMed, OMIM, PharmGKB, Gene Wiki (the last four columns have no values in this transcript). Rows: 5, MECOM, sarcoma; 4, ATF7, cancer; 3, ABCB5, acute myeloid leukemia; 3, SART1, glioblastoma; 3, NCK1, leukemia; 3, NEK1, cancer
  49. GenESP: Two-player annotation games
  50. COMBO: Genomic predictors for disease. [Diagram labels: find patterns (cancer, normal); make predictions on new samples (cancer, normal).]
  51. COMBO: Genomic predictors for disease
  52. COMBO: Genomic predictors for disease
  53. COMBO: Genomic predictors for disease
  54. COMBO: Genomic predictors for disease
  55. COMBO: Genomic predictors for disease
  56. COMBO: Genomic predictors for disease
  57. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
  58. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Erik Clarke, Ian Macleod, Ben Good, Chunlei Wu, Salvatore Loguercio. Summer internships for students! Recruiting graduate students in quantitative biology! See http://education.scripps.edu/. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820)
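
Slide 17 outlines a simple text-mining idea: a Gene Wiki page is mapped to an NCBI Entrez Gene ID, and a wikilink on that page whose target exactly matches a Gene Ontology term name becomes a candidate annotation. The following is a rough sketch of that idea only, not the pipeline used in the talk. The function name `candidate_annotations`, the one-entry GO vocabulary, and the sample page text are all made up for illustration; only the gene ID 334 and the term ID GO:0006897 come from the slide.

```python
import re

# Wikilinks look like [[Target]], [[Target|display text]] or [[Target#Section]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_annotations(entrez_gene_id, wikitext, go_terms):
    """Return (gene id, GO id) pairs for wikilinks whose target exactly
    matches a GO term name, ignoring case. Duplicates are dropped."""
    lookup = {name.lower(): go_id for name, go_id in go_terms.items()}
    found = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().lower()
        go_id = lookup.get(target)
        if go_id is not None and (entrez_gene_id, go_id) not in found:
            found.append((entrez_gene_id, go_id))
    return found


# Hypothetical inputs: a tiny GO name-to-ID table and invented page text.
go_terms = {"endocytosis": "GO:0006897"}
wikitext = (
    "The protein may play a role in [[endocytosis]] "
    "and [[Cell signaling|signaling]]."
)

print(candidate_annotations(334, wikitext, go_terms))
```

A real system would presumably also check each candidate against existing annotations before calling it novel, and would have to handle synonyms and redirects; the sketch leaves both out.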

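
Slide 27 describes BioGPS plugins as URL templates in which placeholders such as `{{Symbol}}` or `{{EntrezGene}}` are filled in from a gene entity to give a rendered URL. The substitution step might look roughly like the sketch below. This is not the BioGPS implementation: the helper name `render_plugin_url` and the dictionary layout of the gene entity are assumptions, and only the KEGG template is used because it is the one shown in full on the slide.

```python
import re
from urllib.parse import quote

# Matches placeholders such as {{Symbol}} or {{EntrezGene}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_plugin_url(template, gene):
    """Fill each {{Key}} placeholder with the URL-quoted value from `gene`.

    Raises KeyError if the template asks for an identifier the gene lacks.
    """
    def substitute(match):
        key = match.group(1)
        return quote(str(gene[key]), safe="")

    return PLACEHOLDER.sub(substitute, template)


# Illustrative gene entity (assumed structure, not the BioGPS data model).
tp53 = {"Symbol": "TP53", "EntrezGene": 7157}

# KEGG template as shown on slide 27.
kegg_template = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"

print(render_plugin_url(kegg_template, tp53))
```

Raising an error for a missing identifier, rather than silently emitting a broken link, is one possible design choice; a plugin host could equally hide plugins whose required identifiers are unavailable for a given gene.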