NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The use of crowdsourcing in biology is gaining popularity as a mechanism to tackle challenges of massive scale. However, to maximize participation and lower the barriers to entry, contributions to crowdsourcing efforts are typically not well structured, which makes computing on these data difficult. The presentation will discuss strategies for translating this unstructured content into structured data. Three vignettes (in varying degrees of completion) will be described, one each from our Gene Wiki [1], BioGPS [2], and serious gaming [3] initiatives.

[1]: http://en.wikipedia.org/wiki/Portal:Gene_Wiki

[2]: http://biogps.org

[3]: http://genegames.org

  • We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
  • Much knowledge is locked up in the biomedical literature; our goal is to make it computable. If you believe that more than 1.5% of articles are relevant to gene function, then there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  • For each resource: briefly describe the unstructured resource, then describe the structuring approach.
  • Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  • We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  • Combines open editing of a wiki, with the robust community of editors at Wikipedia, with the structured data model of a database
  • MODs and portals
  • Genetics resources
  • Literature resources
  • Protein resources
  • Pathway and expression databases
  • Empire State Building
  • FES = “Feline sarcoma oncogene”; RBP3 = “Retinol binding protein 3, interstitial”
  • Question: how to interject biological knowledge in the feature selection process?

    1. 1. Translating unstructured, crowdsourced content into structured data. Andrew Su, Ph.D., The Scripps Research Institute. NCBO Webinar, February 20, 2013
    2. 2. Human genetics underlies human health. ~3 billion bases, ~20,000 genes. Molecular understanding of: • Biological function • Genetic variation • Mutation • Deletion • Amplification • … Structured gene annotations. Molecular diagnostics & therapeutics
    3. 3. Structured gene annotations enable computation
    4. 4. Few genes are well annotated. [Plot: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Marked values: 65%, 41%.] Data: NCBI, February 2013
    5. 5. Few genes are well annotated. [Plot: GO annotation counts plus electronic annotation (IEA); genes sorted by decreasing counts.] Data: NCBI, February 2013
    6. 6. Few genes are well annotated. [Plot: GO annotation counts plus electronic annotation (IEA), Biological Process only; genes sorted by decreasing counts.] Data: NCBI, February 2013
    7. 7. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
    8. 8. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
    9. 9. Crowdsourcing empowers the entire scientific community to directly participate in the gene annotation process.
    10. 10. From crowdsourcing to structured data: The Gene Wiki, GeneGames.org
    11. 11. 10,000 gene “stubs” within Wikipedia: protein structure, gene summary, symbols and identifiers, Gene Ontology annotations, protein interactions, tissue expression pattern, linked references, links to structured databases. Huss, PLoS Biol, 2008
    12. 12. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. Huss, PLoS Biol, 2008; Good, NAR, 2011
    13. 13. Gene Wiki has a critical mass of editors. [Plots: editor count, edit count.] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. Good, NAR, 2011
    14. 14. A review article for every gene is powerful. Reelin: 68 editors, 543 edits since July 2002. Heparin: 175 editors, 320 edits since June 2003. AMPK: 44 editors, 84 edits since March 2004. RNAi: 232 editors, 708 edits since October 2002. References to the literature. Hyperlinks to related concepts.
    15. 15. Filtering, extracting, and summarizing PubMed. [Diagram labels: Documents, Concepts.]
    16. 16. Document- and concept-centric text mining. [Diagram labels: Subject, Predicate, Object.]
    17. 17. Simple text mining for gene annotations. [Diagram labels: NCBI Entrez Gene: 334; Gene Wiki mapping; Wikilink; GO exact match; GO:0006897; Candidate assertion.] 6319 novel Gene Ontology annotations; 2147 novel Disease Ontology annotations. Good, BMC Genomics, 2011.
    18. 18. Gene Wiki content improves enrichment analysis. [Scatter plot: p-value (PubMed + GW) versus p-value (PubMed only); regions labeled “More significant PubMed + GW” and “More significant PubMed only”; highlighted term: Muscle contraction.] Good, BMC Genomics, 2011.
    19. 19. Gene Wiki+ for integrative queries. mwsync. Good, J Biomed Semantics, 2012. http://genewikiplus.org
    20. 20. Dynamic queries across genes, diseases, SNPs. Good, J Biomed Semantics, 2012.
    21. 21. Gene Wiki+ for integrative queries. mwsync, OMIM, PharmGKB. {{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: … <q>[[is_associated_with:: http://genewikiplus.org Good, J Biomed Semantics, 2012.
    22. 22. Gene Wiki+ for integrative queries. mwsync, OMIM, PharmGKB. Good, J Biomed Semantics, 2012. http://genewikiplus.org
    23. 23. Wikidata: “Provide a database of the world’s knowledge that anyone can edit” - Denny Vrandečić
    24. 24. Wikidata. Reelin (Q414043). Property:P31 “is a”: Protein (Q8054), Glycoprotein (Q187126). Property:P128 “regulates”: Neural development (Q1345738). Property:P129 “interacts with”: VLDL receptor (Q1979313), Amyloid precursor protein (Q423510). http://www.wikidata.org/wiki/Q414043
    25. 25. Wikidata. Q414043. Property:P31: Q8054, Q187126. Property:P128: Q1345738. Property:P129: Q1979313, Q423510. http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
    26. 26. Wikidata http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
    27. 27. Wikidata http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
    28. 28. From crowdsourcing to structured data: The Gene Wiki, GeneGames.org
    29. 29. Not just the biomedical literature…
    30. 30. BioGPS aggregates gene-centric information. http://biogps.org Wu, NAR, 2013; Wu, Genome Biology, 2009.
    31. 31. The plugin interface is simple and universal. [Diagram labels: URL template, gene entity, rendered URL.] PubMed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}} STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}} KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}} Wu, NAR, 2013; Wu, Genome Biology, 2009.
    32. 32. The plugin interface is simple and universal
    33. 33. The plugin interface is simple and universal
    34. 34. The plugin interface is simple and universal
    35. 35. The plugin interface is simple and universal
    36. 36. The plugin interface is simple and universal. Total of 389 gene-centric online databases registered as BioGPS plugins
    37. 37. BioGPS has a critical mass of users. [Plot: daily pageviews.] • > 4100 registered users • 4000 unique visitors per week • 40,000 page views per week. Top 10 organizations: 1. Harvard 2. NIH 3. UCSD 4. Scripps 5. MIT 6. Cambridge 7. U Penn 8. Stanford 9. Wash U 10. UNC. Wu, NAR, 2013; Wu, Genome Biology, 2009.
    38. 38. All resources should provide RDF…
    39. 39. Mining structured content from HTML
    40. 40. Defining a data extraction template. TP53, TNF, APOE, IL6, VEGF, EGFR, TGFB1, …
    41. 41. The BioGPS Semantic Annotator http://54.244.135.254:8000/
    42. 42. From crowdsourcing to structured data: The Gene Wiki, GeneGames.org
    43. 43. Seven million human hours http://www.flickr.com/photos/archana3k1/4124330493/
    44. 44. Twenty million human hours http://www.flickr.com/photos/ableman/2171326385/
    45. 45. - 150 billion human hours per year http://www.flickr.com/photos/rvp-cw/6243289302/
    46. 46. Using games to fold proteins. Fold.it players have successfully: • Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010) • Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011) • Designed an improved protein folding algorithm (Khatib, PNAS, 2011) • Improved enzyme activity of de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
    47. 47. Using games to fold RNAs http://eterna.cmu.edu/
    48. 48. Using games to align sequences. http://phylo.cs.mcgill.ca Kawrykow, PLOS ONE, 2012.
    49. 49. Using games to annotate genes? http://genegames.org
    50. 50. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease
    51. 51. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility
    52. 52. No good gene-disease annotation database. Query: Apolipoprotein E. Results: ? Alzheimer's disease (AD); ? Lipoprotein glomerulopathy; ? Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; ? Macular degeneration, age-related; ? Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases
    53. 53. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Memory; Coronary Artery Disease; Neuropsychological Tests; Hypertension; Cognition Disorders; Mental Status Schedule; Psychiatric Status Rating Scales; Dementia; Cognition; Hyperlipidemias; Atrophy; Disease Progression; Dementia, Vascular; Cardiovascular Diseases; Parkinson Disease; Brain Injuries; Coronary Disease; Myocardial Infarction; Diabetes Mellitus, Type 2; Memory Disorders; … 477 diseases!
    54. 54. Play Dizeez to annotate gene-disease links. 1. Read the clue (gene) 2. Click the related disease (only one is “right”) 3. If it’s ‘right’, you get points 4. Then on to the next question… 5. Hurry! 6. Play to win!
    55. 55. Dizeez players seem pretty smart… In total (since Dec 2011): • 230 unique gamers • 1045 games played • 8525 guesses. Table (# occurrences, gene, disease; the Gene Wiki, OMIM, PharmGKB, and PubMed columns are not recoverable): 11, NBPF3, neuroblastoma; 11, SOX8, mental retardation; 9, ABL1, leukemia; 9, SSX1, synovial sarcoma; 8, APC, colorectal cancer; 8, FES, sarcoma; 8, RBP3, retinoblastoma; 8, GAST, gastrinoma; 8, DCC, colorectal cancer; 8, MAP3K5, cancer
    56. 56. Using games to predict phenotype from genotype? http://genegames.org
    57. 57. Classification problems in genome biology. [Diagram: 100s of samples (cancer, normal) by 100,000s of features; find patterns with SVM, neural networks, Naïve Bayes, KNN, …; classify new samples as cancer or normal.]
    58. 58. Random forests. Sample subset of cases and features; train decision tree. [Data: 100s of samples (cancer, normal) by 100,000s of features.]
    59. 59. Random forests. [Data: 100s of samples (cancer, normal) by 100,000s of features.]
    60. 60. Random forests. Classify new samples as cancer or normal. [Data: 100s of samples by 100,000s of features.] How to interject biological knowledge?
    61. 61. Network-guided forests. Dutkowski & Ideker (2011). PLoS Computational Biology
    62. 62. Network-guided forests. Sample features by PPI network; train decision tree. [Data: 100s of samples (cancer, normal) by 100,000s of features.]
    63. 63. Human-guided forests. Sample features by human intelligence; train decision tree. [Data: 100s of samples (cancer, normal) by 100,000s of features.]
    64. 64.
    65. 65. The Cure: Genomic predictors for disease
    66. 66. The Cure: Genomic predictors for disease
    67. 67. The Cure: Genomic predictors for disease
    68. 68. The Cure: Genomic predictors for disease
    69. 69. The Cure: Genomic predictors for disease
    70. 70. The Cure: Genomic predictors for disease
    71. 71. Human-guided forests. Classify new samples: cancer, normal
    72. 72. “Critical Assessment”-style challenge
    73. 73. Results. 214 registered players: 50% declared knowledge of cancer biology; 40% self-identified as having a Ph.D. Prediction results: 70% correct on survival concordance index; best scoring model was 76%; player registrations still increasing!
    74. 74. Crowdsourcing empowers the entire scientific community to directly participate in the gene annotation process.
    75. 75. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Katie Fisch, Max Nanis, Ben Good, Chunlei Wu, Salvatore Loguercio. Key group alumni: Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820)
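Slide 17 outlines a simple text-mining route from a Gene Wiki article to a candidate annotation: take the article mapped to an Entrez Gene ID, collect its wikilinks, and keep those whose target exactly matches a Gene Ontology term name. A minimal Python sketch of that idea follows. The two-entry `GO_TERMS` lookup and the sample sentence are illustrative assumptions, not data from the talk, and the function names are hypothetical; a real run would load the full ontology and real article text.

```python
import re

# Hypothetical excerpt of a GO term-name -> GO ID lookup (assumption:
# a real pipeline would load every term from the ontology).
GO_TERMS = {
    "endocytosis": "GO:0006897",
    "cell adhesion": "GO:0007155",
}

# Matches [[Target]], [[Target|label]] and [[Target#Section]] wikilinks.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def extract_wikilinks(wikitext):
    """Return the link targets found in a piece of wikitext, in order."""
    return [m.group(1).strip() for m in WIKILINK.finditer(wikitext)]


def candidate_assertions(entrez_gene_id, wikitext, go_terms=GO_TERMS):
    """Pair the gene with every GO term whose name exactly matches a wikilink."""
    seen = set()
    assertions = []
    for target in extract_wikilinks(wikitext):
        go_id = go_terms.get(target.lower())  # case-insensitive exact match
        if go_id is not None and go_id not in seen:
            seen.add(go_id)
            assertions.append((entrez_gene_id, go_id))
    return assertions


# Made-up sentence standing in for the Gene Wiki article of Entrez Gene 334.
sample = "The protein is internalized by [[endocytosis]] and binds [[Heparin|heparin]]."
assertions = candidate_assertions(334, sample)
```

Exact matching keeps precision high at the cost of recall; synonyms and inflected forms would need a richer lookup than the one assumed here.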
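Slide 18 compares enrichment p-values computed with PubMed-derived annotations alone against those computed after adding Gene Wiki-derived annotations. The slides do not state which test was used; the sketch below assumes the common one-sided hypergeometric test, and all counts in the example are invented for illustration.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric p-value: P(X >= hits).

    population: number of genes in the background
    annotated:  background genes annotated with the term
    sample:     size of the gene list being tested
    hits:       genes in the list annotated with the term
    """
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, min(annotated, sample) + 1)
    )
    return tail / total


# Invented counts: a 50-gene list against a 1000-gene background.
p_pubmed_only = enrichment_p_value(1000, 20, 50, 3)
# Adding annotations enlarges the term's gene set and the observed overlap.
p_pubmed_plus_gw = enrichment_p_value(1000, 30, 50, 6)
```

With these invented counts the combined annotation set gives the smaller p-value, which is the direction of effect the slide reports for terms such as muscle contraction.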
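Slides 24 and 25 show the Reelin item as property-value statements and give a `wbgetentities` API URL for it. The sketch below builds that URL and flattens an entity record into (property, item) pairs. It makes no network call: `SAMPLE_RESPONSE` is hand-written to mirror the statements on the slide, not captured API output, and the live item may differ today. The `format=json` parameter is an addition not shown on the slide, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

API = "http://wikidata.org/w/api.php"  # base URL as shown on slide 25


def entity_url(entity_id, language="en"):
    """Build a wbgetentities request URL (format=json added as an assumption)."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "languages": language,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def _item_claim(prop, numeric_id):
    """Hand-built statement whose value is another Wikidata item."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "numeric-id": numeric_id},
            },
        }
    }


# Hand-written stand-in for an API response, mirroring slide 24.
SAMPLE_RESPONSE = {
    "entities": {
        "Q414043": {
            "claims": {
                "P31": [_item_claim("P31", 8054), _item_claim("P31", 187126)],
                "P128": [_item_claim("P128", 1345738)],
                "P129": [_item_claim("P129", 1979313), _item_claim("P129", 423510)],
            }
        }
    }
}


def statements(entity):
    """Flatten an entity's item-valued claims into (property, item ID) pairs."""
    pairs = []
    for prop, claims in entity.get("claims", {}).items():
        for claim in claims:
            snak = claim["mainsnak"]
            if snak.get("snaktype") != "value":
                continue  # skip "no value" / "unknown value" statements
            value = snak["datavalue"]["value"]
            if isinstance(value, dict) and "numeric-id" in value:
                pairs.append((prop, "Q%d" % value["numeric-id"]))
    return pairs


reelin = statements(SAMPLE_RESPONSE["entities"]["Q414043"])
```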
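Slide 31 describes a BioGPS plugin as a URL template whose `{{...}}` placeholders are filled from a gene entity. The sketch below shows one way such rendering could work; it is not BioGPS's actual implementation, and `render_plugin_url` is a hypothetical name. The KEGG template is copied from the slide, and the gene record uses the public identifiers for TP53.

```python
import re
from urllib.parse import quote

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_plugin_url(template, gene):
    """Fill each {{Field}} placeholder with the matching gene attribute."""
    def substitute(match):
        field = match.group(1)
        if field not in gene:
            raise KeyError("gene entity has no field %r" % field)
        # Percent-encode the value so it is safe inside a URL.
        return quote(str(gene[field]), safe="")
    return PLACEHOLDER.sub(substitute, template)


KEGG_TEMPLATE = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"

# Gene entity for TP53; the field names follow the placeholders on the slide.
tp53 = {"Symbol": "TP53", "EntrezGene": 7157, "EnsemblGene": "ENSG00000141510"}

kegg_url = render_plugin_url(KEGG_TEMPLATE, tp53)
```

Raising on a missing field, rather than leaving the placeholder in place, is a design choice in this sketch: a plugin that needs an identifier the gene lacks should fail visibly instead of producing a broken link.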
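Slide 55 ranks gene-disease pairs by how often Dizeez players chose them. A tally of that kind can be sketched with a counter over the guess log. The log below is invented for illustration (only the gene and disease names come from the slide), and the `min_occurrences` threshold is an assumption, not a value from the talk.

```python
from collections import Counter


def tally_guesses(guesses, min_occurrences=2):
    """Count (gene, disease) guesses and keep pairs seen often enough.

    Returns (gene, disease, count) tuples, most frequent first.
    """
    counts = Counter(guesses)
    return [
        (gene, disease, n)
        for (gene, disease), n in counts.most_common()
        if n >= min_occurrences
    ]


# Invented guess log; each entry is one player's click.
guess_log = [
    ("ABL1", "leukemia"),
    ("FES", "sarcoma"),
    ("ABL1", "leukemia"),
    ("RBP3", "retinoblastoma"),
    ("ABL1", "leukemia"),
    ("FES", "sarcoma"),
]

candidates = tally_guesses(guess_log)
```

Pairs that many players agree on but that are absent from existing databases are the interesting output: they are either novel candidate annotations or, as the slide notes on FES and RBP3 suggest, guesses driven by the gene's name.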
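Slides 58 to 63 contrast three ways of choosing the features each tree in a forest sees: uniformly at random (random forests), guided by a protein-protein interaction network, or guided by human players. The sketch below covers only the sampling step, not tree training. The gene list, the vote counts, and the plus-one smoothing are illustrative assumptions; the talk does not specify how player choices are weighted.

```python
import random


def bootstrap_cases(n_cases, rng):
    """Draw case indices with replacement, as in bagging."""
    return [rng.randrange(n_cases) for _ in range(n_cases)]


def sample_features(weights, k, rng):
    """Weighted sampling of k distinct features.

    weights maps feature name -> non-negative weight; at each draw the
    remaining weights must sum to more than zero.
    """
    remaining = dict(weights)
    chosen = []
    for _ in range(min(k, len(remaining))):
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # without replacement
    return chosen


genes = ["TP53", "EGFR", "VEGFA", "TNF", "IL6", "APOE"]

# Plain random forest: every feature is equally likely.
uniform_weights = {g: 1.0 for g in genes}

# Human-guided: hypothetical counts of how often players picked each gene,
# with +1 smoothing so unpicked genes can still be sampled.
player_votes = {"TP53": 12, "EGFR": 7, "VEGFA": 1}
human_weights = {g: 1.0 + player_votes.get(g, 0) for g in genes}

rng = random.Random(0)
cases_for_tree = bootstrap_cases(100, rng)
features_for_tree = sample_features(human_weights, 3, rng)
```

Swapping `human_weights` for `uniform_weights`, or for weights derived from network neighborhoods, changes only the sampling distribution; the rest of the forest construction would stay the same.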
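Slide 73 reports results as a survival concordance index. The slides do not say which variant was used; the sketch below assumes Harrell's C-index, which is the fraction of comparable patient pairs in which the patient who died earlier was also assigned the higher predicted risk. The example data are invented.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed follow-up time per patient
    events: 1 if death was observed, 0 if censored
    risks:  predicted risk score (higher = expected to die sooner)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Invented follow-up data for four patients; the third is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.4, 0.7, 0.9])
uninformative = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the 70% and 76% figures on the slide.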
