NCBO Webinar: Translating unstructured, crowdsourced content into structured data

The use of crowdsourcing in biology is gaining popularity as a mechanism to tackle challenges of massive scale. However, to maximize participation and lower the barriers to entry, contributions to crowdsourcing efforts are typically not well structured, which makes computing on these data difficult. The presentation will discuss strategies for translating this unstructured content into structured data. Three vignettes (in varying degrees of completion) will be described, one each from our Gene Wiki [1], BioGPS [2], and serious gaming [3] initiatives.

[1]: http://en.wikipedia.org/wiki/Portal:Gene_Wiki

[2]: http://biogps.org

[3]: http://genegames.org

Usage Rights

CC Attribution-ShareAlike License

  • We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
  • Much knowledge is locked up in the biomedical literature; our goal is to make it computable. If you believe that more than 1.5% of articles are relevant to gene function, then there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  • For each resource: briefly describe the unstructured resource; describe the structuring approach.
  • Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  • We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  • Combines the open editing of a wiki and the robust community of editors at Wikipedia with the structured data model of a database
  • For each resource: briefly describe the unstructured resource; describe the structuring approach.
  • MODs and portals
  • Genetics resources
  • Literature resources
  • Protein resources
  • Pathway and expression databases
  • For each resource: briefly describe the unstructured resource; describe the structuring approach.
  • Empire State Building
  • FES = “Feline sarcoma oncogene”; RBP3 = “Retinol binding protein 3, interstitial”
  • Question: how to interject biological knowledge in the feature selection process?
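
One way to read the question in the last note, in light of the network-guided and human-guided forests on slides 61 to 63, is to bias which features each tree sees. The sketch below is only an illustration under stated assumptions: the slides do not say how features are sampled, so the weighting scheme, the function name `sample_features`, and the weights themselves are hypothetical. It draws a distinct feature subset per tree, with probability favoring features that carry a higher prior weight (for example, how often players or a PPI network point to a gene).

```python
import random


def sample_features(weights, k, rng):
    """Pick up to k distinct features, favoring higher weights.

    Weighted sampling without replacement via Efraimidis-Spirakis keys:
    each feature draws u ** (1 / w) with u uniform in [0, 1), and the k
    largest keys win. Features with zero weight are never chosen.
    """
    keyed = [(rng.random() ** (1.0 / w), name)
             for name, w in weights.items() if w > 0]
    keyed.sort(reverse=True)
    return [name for _, name in keyed[:k]]


# Hypothetical prior weights (assumption): e.g. how often players picked a gene.
prior = {"TP53": 9, "EGFR": 6, "VEGFA": 4, "TNF": 2, "IL6": 1, "APOE": 0}
rng = random.Random(0)

# One feature subset per tree; each subset would then train one decision tree.
subsets = [sample_features(prior, 3, rng) for _ in range(5)]
for subset in subsets:
    print(subset)
```

With uniform weights this reduces to the plain random-forest feature sampling of slide 58; the guidance enters only through the weights.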

NCBO Webinar: Translating unstructured, crowdsourced content into structured data: Presentation Transcript

  • 1. Translating unstructured, crowdsourced content into structured data. Andrew Su, Ph.D., The Scripps Research Institute. NCBO Webinar, February 20, 2013
  • 2. Human genetics underlies human health. ~3 billion bases; ~20,000 genes. Molecular understanding of: biological function, genetic variation, mutation, deletion, amplification, … Structured gene annotations. Molecular diagnostics & therapeutics
  • 3. Structured gene annotations enable computation
  • 4. Few genes are well annotated. [Plot: GO annotation counts per gene; genes sorted by decreasing counts; 20,473 protein-coding genes; labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF; marked percentages: 65%, 41%.] Data: NCBI, February 2013
  • 5. Few genes are well annotated. [Plot: GO annotation counts plus electronic annotation (IEA); genes sorted by decreasing counts.] Data: NCBI, February 2013
  • 6. Few genes are well annotated. [Plot: GO annotation counts plus electronic annotation (IEA), Biological Process only; genes sorted by decreasing counts.] Data: NCBI, February 2013
  • 7. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
  • 8. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  • 9. Crowdsourcing empowers the entire scientific community to directly participate in the gene annotation process.
  • 10. From crowdsourcing to structured data: The Gene Wiki; GeneGames.org
  • 11. 10,000 gene “stubs” within Wikipedia: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases. Huss, PLoS Biol, 2008
  • 12. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. Huss, PLoS Biol, 2008; Good, NAR, 2011
  • 13. Gene Wiki has a critical mass of editors. [Plot: editor count and edit count.] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. Good, NAR, 2011
  • 14. A review article for every gene is powerful. Reelin: 68 editors, 543 edits since July 2002; Heparin: 175 editors, 320 edits since June 2003; AMPK: 44 editors, 84 edits since March 2004; RNAi: 232 editors, 708 edits since October 2002. References to the literature; hyperlinks to related concepts
  • 15. Filtering, extracting, and summarizing PubMed. Documents; concepts
  • 16. Document- and concept-centric text mining. Predicate; subject; object
  • 17. Simple text mining for gene annotations. NCBI Entrez Gene: 334; Gene Wiki mapping; wikilink; candidate assertion; GO:0006897; GO exact match. 6319 novel Gene Ontology annotations; 2147 novel Disease Ontology annotations. Good, BMC Genomics, 2011.
  • 18. Gene Wiki content improves enrichment analysis. [Scatter plot: p-value (PubMed + GW) versus p-value (PubMed only); regions marked “More significant PubMed + GW” and “More significant PubMed only”; highlighted term: muscle contraction.] Good, BMC Genomics, 2011.
  • 19. Gene Wiki+ for integrative queries. mwsync. http://genewikiplus.org Good, J Biomed Semantics, 2012.
  • 20. Dynamic queries across genes, diseases, SNPs. Good, J Biomed Semantics, 2012.
  • 21. Gene Wiki+ for integrative queries. mwsync; OMIM; PharmGKB. {{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: … <q>[[is_associated_with:: http://genewikiplus.org Good, J Biomed Semantics, 2012.
  • 22. Gene Wiki+ for integrative queries. mwsync; OMIM; PharmGKB. http://genewikiplus.org Good, J Biomed Semantics, 2012.
  • 23. Wikidata: “Provide a database of the world’s knowledge that anyone can edit” - Denny Vrandečić
  • 24. Wikidata. Q414043 (Reelin). Property:P31 “is a”: Q8054 (Protein), Q187126 (Glycoprotein). Property:P128 “regulates”: Q1345738 (Neural development). Property:P129 “interacts with”: Q1979313 (VLDL receptor), Q423510 (Amyloid precursor protein). http://www.wikidata.org/wiki/Q414043
  • 25. Wikidata. Q414043. Property:P31: Q8054, Q187126. Property:P128: Q1345738. Property:P129: Q1979313, Q423510. http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
  • 26. Wikidata. http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
  • 27. Wikidata. http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
  • 28. From crowdsourcing to structured data: The Gene Wiki; GeneGames.org
  • 29. Not just the biomedical literature…
  • 30. BioGPS aggregates gene-centric information. http://biogps.org Wu, NAR, 2013; Wu, Genome Biology, 2009.
  • 31. The plugin interface is simple and universal. URL template; gene entity; rendered URL. PubMed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}; STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}; KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}. Wu, NAR, 2013; Wu, Genome Biology, 2009.
  • 32. The plugin interface is simple and universal
  • 33. The plugin interface is simple and universal
  • 34. The plugin interface is simple and universal
  • 35. The plugin interface is simple and universal
  • 36. The plugin interface is simple and universal. Total of 389 gene-centric online databases registered as BioGPS plugins
  • 37. BioGPS has a critical mass of users. [Plot: daily pageviews.] > 4100 registered users; 4000 unique visitors per week; 40,000 page views per week. Top 10 organizations: 1. Harvard; 2. NIH; 3. UCSD; 4. Scripps; 5. MIT; 6. Cambridge; 7. U Penn; 8. Stanford; 9. Wash U; 10. UNC. Wu, NAR, 2013; Wu, Genome Biology, 2009.
  • 38. All resources should provide RDF…
  • 39. Mining structured content from HTML
  • 40. Defining a data extraction template. TP53, TNF, APOE, IL6, VEGF, EGFR, TGFB1, …
  • 41. The BioGPS Semantic Annotator. http://54.244.135.254:8000/
  • 42. From crowdsourcing to structured data: The Gene Wiki; GeneGames.org
  • 43. Seven million human hours. http://www.flickr.com/photos/archana3k1/4124330493/
  • 44. Twenty million human hours. http://www.flickr.com/photos/ableman/2171326385/
  • 45. 150 billion human hours per year. http://www.flickr.com/photos/rvp-cw/6243289302/
  • 46. Using games to fold proteins. Fold.it players have successfully: outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010); solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011); designed an improved protein folding algorithm (Khatib, PNAS, 2011); improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
  • 47. Using games to fold RNAs. http://eterna.cmu.edu/
  • 48. Using games to align sequences. http://phylo.cs.mcgill.ca Kawrykow, PLOS ONE, 2012.
  • 49. Using games to annotate genes? http://genegames.org
  • 50. No good gene-disease annotation database. Query: Apolipoprotein E. Alzheimer’s disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease
  • 51. No good gene-disease annotation database. Query: Apolipoprotein E. Alzheimer’s disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility
  • 52. No good gene-disease annotation database. Query: Apolipoprotein E. ? Alzheimer’s disease (AD); ? Lipoprotein glomerulopathy; ? Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; ? Macular degeneration, age-related; ? Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases
  • 53. No good gene-disease annotation database. Query: Apolipoprotein E. Alzheimer’s disease (AD); Memory; Coronary Artery Disease; Neuropsychological Tests; Hypertension; Cognition Disorders; Mental Status Schedule; Psychiatric Status Rating Scales; Dementia; Cognition; Hyperlipidemias; Atrophy; Disease Progression; Dementia, Vascular; Cardiovascular Diseases; Parkinson Disease; Brain Injuries; Coronary Disease; Myocardial Infarction; Diabetes Mellitus, Type 2; Memory Disorders; … 477 diseases!
  • 54. Play Dizeez to annotate gene-disease links. 1. Read the clue (gene). 2. Click the related disease (only one is “right”). 3. If it’s “right”, you get points. 4. Then on to the next question… 5. Hurry! 6. Play to win!
  • 55. Dizeez players seem pretty smart… In total (since Dec 2011): 230 unique gamers; 1045 games played; 8525 guesses. Table columns: # occurrences, gene, disease, Gene Wiki, OMIM, PharmGKB, PubMed. Rows (# occurrences, gene, disease): 11, NBPF3, neuroblastoma; 11, SOX8, mental retardation; 9, ABL1, leukemia; 9, SSX1, synovial sarcoma; 8, APC, colorectal cancer; 8, FES, sarcoma; 8, RBP3, retinoblastoma; 8, GAST, gastrinoma; 8, DCC, colorectal cancer; 8, MAP3K5, cancer
  • 56. Using games to predict phenotype from genotype? http://genegames.org
  • 57. Classification problems in genome biology. [Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features; find patterns with SVM, neural networks, naïve Bayes, KNN, …; classify new samples as cancer or normal.]
  • 58. Random forests. [Diagram: sample a subset of cases and features; train a decision tree. Data matrix of 100s of samples (cancer, normal) by 100,000s of features.]
  • 59. Random forests. [Diagram: data matrix of 100s of samples (cancer, normal) by 100,000s of features.]
  • 60. Random forests. [Diagram: classify new samples as cancer or normal. Data matrix of 100s of samples by 100,000s of features.] How to interject biological knowledge?
  • 61. Network-guided forests. Dutkowski & Ideker (2011). PLoS Computational Biology
  • 62. Network-guided forests. [Diagram: sample features by PPI network; train a decision tree. Data matrix of 100s of samples (cancer, normal) by 100,000s of features.]
  • 63. Human-guided forests. [Diagram: sample features by human intelligence; train a decision tree. Data matrix of 100s of samples (cancer, normal) by 100,000s of features.]
  • 64.
  • 65. The Cure: Genomic predictors for disease
  • 66. The Cure: Genomic predictors for disease
  • 67. The Cure: Genomic predictors for disease
  • 68. The Cure: Genomic predictors for disease
  • 69. The Cure: Genomic predictors for disease
  • 70. The Cure: Genomic predictors for disease
  • 71. Human-guided forests. Classify new samples: cancer, normal
  • 72. “Critical Assessment”-style challenge
  • 73. Results. 214 registered players: 50% declared knowledge of cancer biology; 40% self-identified as having Ph.D. Prediction results: 70% correct on survival concordance index; best scoring model was 76%; player registrations still increasing!
  • 74. Crowdsourcing empowers the entire scientific community to directly participate in the gene annotation process.
  • 75. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Katie Fisch, Max Nanis, Ben Good, Chunlei Wu, Salvatore Loguercio. Key group alumni: Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820)
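
The URL-template plugin idea on slide 31 (a template with `{{Symbol}}`, `{{EnsemblGene}}`, or `{{EntrezGene}}` placeholders, filled in from a gene entity to give a rendered URL) can be sketched in a few lines of Python. This is a minimal illustration, not the BioGPS implementation: the function name `render_plugin_url`, the dictionary representation of a gene entity, and the choice to percent-encode substituted values are all assumptions. Only the KEGG template is used because it is the one shown in full on the slide; the TP53 identifiers are illustrative values.

```python
import re
from urllib.parse import quote

# Matches placeholders of the form {{FieldName}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_plugin_url(template, gene):
    """Fill every {{Field}} placeholder in template from the gene entity."""
    def substitute(match):
        field = match.group(1)
        if field not in gene:
            raise KeyError(f"gene entity has no field {field!r}")
        # Percent-encode the value so it is safe inside a URL (assumption).
        return quote(str(gene[field]), safe="")
    return PLACEHOLDER.sub(substitute, template)


# KEGG template as shown on slide 31.
kegg_template = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"

# Hypothetical gene entity; the field names follow the slide's placeholders.
tp53 = {"Symbol": "TP53", "EntrezGene": 7157, "EnsemblGene": "ENSG00000141510"}

print(render_plugin_url(kegg_template, tp53))
```

Because a plugin is just a template string, registering a new resource under this scheme needs no code on the resource's side, which is consistent with the slide's claim that the interface is simple and universal.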