Crowdsourcing to structure biological knowledge (USC/ISI)


Talk given at USC's Information Sciences Institute (http://www.isi.edu). The AV recording is pretty horrible, but for anyone interested: http://webcasterms1.isi.edu/mediasite/SilverlightPlayer/Default.aspx?peid=89751f8537c44f2fa241db99c793cd231d

  • We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% of genes have 5 or fewer references; 38% have one or no references.
  • Much knowledge is locked up in the biomedical literature; our goal is to make it computable. If you believe that more than 1.5% of articles are relevant to gene function, then this figure points to a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  • For each resource: briefly describe the unstructured resource; describe the structuring approach.
  • Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
  • MODs and portals
  • Genetics resources
  • Literature resources
  • Protein resources
  • Pathway and expression databases
  • Empire state building
  • Transcript of "Crowdsourcing to structure biological knowledge (USC/ISI)"

    1. Crowdsourcing to structure biological knowledge. Andrew Su, Ph.D., Department of Molecular and Experimental Medicine, The Scripps Research Institute. ISI, USC, August 16, 2012
    2. Human genetics underlies human health. Molecular understanding of: • Biological function • Genetic variation • Mutation • Deletion • Amplification • … (“Gene annotation”). ~3 billion bases, ~23,000 genes. Molecular diagnostics & therapeutics
    3. Structured gene annotations enable computation. Structured annotations
    4. Few genes are well annotated. [Figure: counts per gene for PubMed and Gene Ontology (GO), with the 23,278 protein-coding genes sorted by decreasing counts. Labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE. Labeled percentages: 59%, 38%.] Data: NCBI gene2pubmed, August 2010
    5. Biocuration is a key annotation bottleneck. [Figure: number of PubMed-indexed articles (y-axis, 0 to 1,000,000) by year (x-axis, 1979 to 2009).]
    6. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
    7. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
    8. The Long Tail is a prolific source of content. [Figure: content produced vs. contributors (sorted), showing a Short Head and a Long Tail.] Short Head vs. Long Tail examples: News: newspapers vs. blogs. Video: TV/Hollywood vs. YouTube. Product reviews: Consumer Reports vs. Amazon reviews. Food reviews: food critics vs. Yelp. Talent judging: Olympics vs. American Idol
    9. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
    10. From crowdsourcing to structured data: The Gene Wiki; Biological Games
    11. 10,000 gene “stubs” within Wikipedia. Utility; Users; Contributors. Protein structure; Gene summary; Symbols and identifiers; Gene Ontology annotations; Protein interactions; Tissue expression pattern; Linked references; Links to structured databases. Huss, PLoS Biol, 2008
    12. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. Huss, PLoS Biol, 2008; Good, NAR, 2011
    13. Gene Wiki has a critical mass of editors. [Figure: editor count and edit count.] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. Good, NAR, 2011
    14. A review article for every gene is powerful. Reelin: 68 editors, 543 edits since July 2002. Heparin: 175 editors, 320 edits since June 2003. AMPK: 44 editors, 84 edits since March 2004. RNAi: 232 editors, 708 edits since October 2002. References to the literature. Hyperlinks to related concepts
    15. Filtering, extracting, and summarizing PubMed. Documents; Concepts
    16. Document- and concept-centric text mining. Predicate; Subject; Object
    17. Simple text mining for gene annotations. NCBI Entrez Gene: 334; Gene Wiki mapping; Wikilink; Candidate assertion; GO:0006897; GO exact match. 6319 novel Gene Ontology annotations. 2147 novel Disease Ontology annotations
    18. Gene Wiki+ for integrative queries. mwsync. http://genewikiplus.org
    19. Dynamic queries across genes, diseases, SNPs
    20. (no text)
    21. TOP 100 GENES
    22. Gene Wiki+ for integrative queries. mwsync; OMIM; PharmGKB. {{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: … <q>[[is_associated_with:: http://genewikiplus.org
    23. Gene Wiki+ for integrative queries. mwsync; OMIM; PharmGKB. http://genewikiplus.org
    24. From crowdsourcing to structured data: The Gene Wiki; Biological Games
    25. Not just the biomedical literature…
    26. BioGPS aggregates gene-centric information. http://biogps.org
    27. The plugin interface is simple and universal. Pubmed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}} STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}} KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}} Labels: URL template; Rendered URL; Gene entity
    28. The plugin interface is simple and universal
    29. The plugin interface is simple and universal
    30. The plugin interface is simple and universal
    31. The plugin interface is simple and universal
    32. The plugin interface is simple and universal. Total of 389 gene-centric online databases registered as BioGPS plugins
    33. BioGPS has a critical mass of users. [Figure: daily pageviews.] • > 4100 registered users • 4000 unique visitors per week • 40,000 page views per week. Top 10 organizations: 1. Harvard 2. NIH 3. UCSD 4. Scripps 5. MIT 6. Cambridge 7. U Penn 8. Stanford 9. Wash U 10. UNC
    34. All resources should provide RDF…
    35. Mining structured content from HTML
    36. Defining a data extraction template. TP53, TNF, APOE, IL6, VEGF, EGFR, TGFB1, …
    37. The BioGPS Semantic Annotator. http://50.112.124.237
    38. All resources should provide flat files…
    39. From crowdsourcing to structured data: The Gene Wiki; Biological Games
    40. Seven million human hours. http://www.flickr.com/photos/archana3k1/4124330493/
    41. Twenty million human hours. http://www.flickr.com/photos/ableman/2171326385/
    42. 150 billion human hours per year. http://www.flickr.com/photos/rvp-cw/6243289302/
    43. Using games to fold proteins. Fold.it players have successfully: • Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010) • Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011) • Designed an improved protein folding algorithm (Khatib, PNAS, 2011) • Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
    44. Using games to fold RNAs. http://eterna.cmu.edu/
    45. Using games to align sequences. http://phylo.cs.mcgill.ca
    46. Using games to annotate gene-disease links. Screenshot callouts: “Click the related disease”; “If it’s ‘right’, you get points”; “then on to the next question”; “hurry!” http://genegames.org
    47. Dizeez players seem pretty smart… In total: • 207 unique gamers • 1045 games played • 8525 guesses. Table columns: # Occurrences, Gene, Disease, Pubmed, OMIM, PharmGKB, Gene Wiki (entries for the last four columns did not survive in this transcript). Rows: 7, GAST, gastrinoma; 7, RBP3, retinoblastoma; 7, SSX1, synovial sarcoma; 6, TG, Graves disease; 6, CRYGC, cataract; 6, SOX8, mental retardation; 6, WRN, Werner syndrome; 6, ABL1, leukemia; 6, MLL3, leukemia; 6, SNAI2, breast carcinoma
    48. Dizeez players seem pretty smart… In total: • 207 unique gamers • 1045 games played • 8525 guesses. Table columns: # Occurrences, Gene, Disease, Pubmed, OMIM, PharmGKB, Gene Wiki (entries for the last four columns did not survive in this transcript). Rows: 5, MECOM, sarcoma; 4, ATF7, cancer; 3, ABCB5, acute myeloid leukemia; 3, SART1, glioblastoma; 3, NCK1, leukemia; 3, NEK1, cancer
    49. GenESP: Two-player annotation games
    50. COMBO: Genomic predictors for disease. Find patterns (cancer vs. normal); make predictions on new samples (cancer vs. normal)
    51. COMBO: Genomic predictors for disease
    52. COMBO: Genomic predictors for disease
    53. COMBO: Genomic predictors for disease
    54. COMBO: Genomic predictors for disease
    55. COMBO: Genomic predictors for disease
    56. COMBO: Genomic predictors for disease
    57. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
    58. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Erik Clarke, Ian Macleod, Ben Good, Chunlei Wu, Salvatore Loguercio. Summer internships for students! Recruiting graduate students in quantitative biology! See http://education.scripps.edu/ Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820)
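
Slide 27 describes a BioGPS plugin as a URL template whose `{{...}}` placeholders are filled from a gene entity to give a rendered URL. The following is a minimal sketch of that idea in Python, not the BioGPS implementation. The function name `render_url`, the shape of the gene dictionary, and the choice to fail on a missing field are assumptions made for illustration. Only the KEGG template is used, because the Pubmed and STRING templates are elided ("...") in the slide. The TP53 identifiers are standard public ones and do not come from the talk.

```python
import re
from urllib.parse import quote

# Matches placeholders such as {{EntrezGene}} in a plugin URL template.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_url(template: str, gene: dict) -> str:
    """Fill every {{Field}} placeholder in the template from the gene entity."""

    def substitute(match: re.Match) -> str:
        field = match.group(1)
        if field not in gene:
            # Assumption: a plugin cannot be rendered for a gene that lacks
            # the identifier it needs, so raise instead of leaving the gap.
            raise KeyError(f"gene entity has no field {field!r}")
        # Percent-encode the value so it is safe inside a URL.
        return quote(str(gene[field]), safe="")

    return PLACEHOLDER.sub(substitute, template)


# KEGG template as shown on slide 27.
KEGG_TEMPLATE = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"

# Hypothetical gene entity; the keys mirror the placeholder names on the slide.
tp53 = {"Symbol": "TP53", "EntrezGene": 7157, "EnsemblGene": "ENSG00000141510"}

print(render_url(KEGG_TEMPLATE, tp53))
# prints http://www.genome.jp/dbget-bin/www_bget?hsa:7157
```

Under this scheme, registering a new resource only requires a template string, which is consistent with the slide's claim that the interface is simple and universal.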