Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Screencast video now at: https://www.youtube.com/watch?v=oe7pjHJU-z4
Talk info at http://1.usa.gov/1kPcRxC

Published in: Science, Technology
  • We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. No IEA annotations.
  • If you believe that more than 1.5% of articles are relevant to gene function, then this points to a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  • For each resource: briefly describe the unstructured resource; describe the structuring approach.
  • Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  • Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  • Tried on 773 GO categories, significant in 356 cases (46%)
  • We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  • Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  • Developer resources do not scale with usage. Practical effects: core developers’ time is always the rate-limiting step; addition of new features and data always feels slow; eventually, new databases are created to fill the gap. 80% duplication for 20% innovation.
  • MODs and portals
  • Genetics resources
  • Literature resources
  • Protein resources
  • Pathway and expression databases
  • … but the amount of knowledge that is amenable to query and computation is tiny. We would like to have more efficient methods for information extraction.
  • Harmonic mean of the precision and recall. 593-abstract training corpus.
  • On the 100-abstract development data set
  • Phase 1: pairs of annotators work independently on computationally pre-annotated documents. Phase 2: annotators get to see each other’s annotations and then make changes. Phase 3: all remaining inconsistencies are resolved collaboratively.
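
The harmonic mean of precision and recall mentioned in the notes above is the F-measure that the slides report as "F = ...". The block below writes out that definition with one worked case. The values P = 0.9 and R = 0.6 are hypothetical numbers chosen for illustration, not results from the talk.

```latex
% Precision P and recall R from true positives (TP),
% false positives (FP) and false negatives (FN)
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}

% F-measure: harmonic mean of P and R
F = \frac{2PR}{P + R}

% Hypothetical worked case: P = 0.9, R = 0.6
F = \frac{2 \times 0.9 \times 0.6}{0.9 + 0.6} = \frac{1.08}{1.5} = 0.72
```

In this illustration the harmonic mean (0.72) falls below the arithmetic mean (0.75), so an imbalance between precision and recall lowers the score.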

    1. 1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org Andrew Su, Ph.D. @andrewsu asu@scripps.edu http://sulab.org May 14, 2014 CBIIT Slides: slideshare.net/andrewsu Citizen Science!
    2. 2. Few genes are well annotated… [Chart: GO annotation counts per gene, with the 20,473 protein-coding genes sorted by decreasing counts; labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF; callouts: 41%, 65%. Data: NCBI, February 2013]
    3. 3. … because the literature is sparsely curated? [Chart: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis 0 to 1,200,000]
    4. 4. … because the literature is sparsely curated? [Chart: average capacity of human scientist, 1983 to 2013; y-axis 0 to 40]
    5. 5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
    6. 6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
    7. 7. The Long Tail is a prolific source of content. [Chart: content produced vs. contributors (sorted), divided into Short Head and Long Tail.] Short Head vs. Long Tail examples: News: newspapers vs. blogs; Video: TV/Hollywood vs. YouTube; Product reviews: Consumer Reports vs. Amazon reviews; Food reviews: food critics vs. Yelp; Talent judging: Olympics vs. American Idol
    8. 8. Wikipedia is reasonably accurate 8
    9. 9. Wikipedia has breadth and depth. [Chart: articles and words (millions), Wikipedia vs. Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]
    10. 10. 10 We can harness the Long Tail of scientists to directly participate in the gene annotation process.
    11. 11. From crowdsourcing to structured data 11 The Gene Wiki Citizen Science
    12. 12. Filtering, extracting, and summarizing PubMed Documents Concepts Review article
    13. 13. Filtering, extracting, and summarizing PubMed Documents Concepts
    14. 14. Wiki success depends on positive feedback. [Diagram: Gene Wiki page utility, number of users, number of contributors]
    15. 15. 10,000 gene “stubs” within Wikipedia. [Page elements: protein structure; symbols and identifiers; tissue expression pattern; Gene Ontology annotations; links to structured databases; gene summary; protein interactions; linked references.] Huss, PLoS Biol, 2008
    16. 16. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. Huss, PLoS Biol, 2008; Good, NAR, 2011
    17. 17. Gene Wiki has a critical mass of editors. Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. [Chart: editor count and edit count.] Good, NAR, 2011
    18. 18. A review article for every gene is powerful 18 References to the literature Hyperlinks to related concepts Reelin: 98 editors, 703 edits since July 2002 Heparin: 358 editors, 654 edits since June 2003 AMPK: 109 editors, 203 edits since March 2004 RNAi: 394 editors, 994 edits since October 2002
    19. 19. Making the Gene Wiki more computable. [Diagram labels: Structured annotations; Free text]
    20. 20. Filling the gaps in gene annotation. [Diagram labels: Wikilink; GO exact match; Gene Wiki mapping; NCBI Entrez Gene: 334; Candidate assertion; GO:0006897.] 6319 novel GO annotations; 2147 novel DO annotations
    21. 21. Gene Wiki content improves enrichment analysis. [Scatter plot: p-value (PubMed only) vs. p-value (PubMed + GW); regions labeled “More significant PubMed + GW” and “More significant PubMed only”; highlighted term: muscle contraction.] Good BM et al., BMC Genomics, 2011
    22. 22. Making the Gene Wiki more computable. [Diagram labels: Structured annotations; Free text; Analyses]
    23. 23. Expansion through outreach and incentives 26 SP-A1 SP-A2 KIF11 LIG3 MIR155 EPHX2
    24. 24. Cardiovascular Gene Wiki Portal 27 • CAMK2D -- CaM kinase II subunit delta • CSRP3 -- Cysteine and glycine-rich protein 3 • GJA1 -- Gap junction alpha-1 protein / Connexin-43 • MAPK14 -- Mitogen-activated protein kinase 14 / p38-α • MYL7 -- Myosin regulatory light chain 2, atrial isoform • MYL2 -- Myosin regulatory light chain 2, ventricular/cardiac isoform • PECAM1 -- Platelet endothelial cell adhesion molecule/CD31 • RYR2 -- Ryanodine receptor 2 • ATP2A2 -- Sarcoplasmic/endoplasmic reticulum calcium ATPase 2 / SERCA2 • TNNI3 -- Troponin I, cardiac muscle • TNNT2 -- Troponin T, cardiac muscle Peipei Ping UCLA
    25. 25. The Long Tail of scientists is a valuable source of information on gene function 28
    26. 26. From crowdsourcing to structured data 29 The Gene Wiki Citizen Science
    27. 27. Gene databases are numerous and overlapping 30 … and hundreds more …
    28. 28. Why is there so much redundancy? [Chart: users, requests, and resources over time; community development.] BioGPS emphasizes community extensibility
    29. 29. Why do developers define the gene report view? 32 BioGPS emphasizes user customizability
    30. 30. http://biogps.org Community extensibility and user customizability 33
    31. 31. Utility: A simple and universal plugin interface
    36. 36. Utility: A simple and universal plugin interface. Total of > 540 gene-centric online databases registered as BioGPS plugins
    37. 37. Users: BioGPS has critical mass • > 6400 registered users • 14,000 unique visitors per month • 155,000 page views per month. Top 10 organizations: 1. Harvard 2. NIH 3. UCSD 4. Scripps 5. MIT 6. Cambridge 7. U Penn 8. Stanford 9. Wash U 10. UNC. [Chart: daily pageviews]
    38. 38. Contributors: Explicit and implicit knowledge. 540 plugins registered (>300 publicly shared) by over 120 users spanning 280+ domains
    39. 39. Gene Annotation Query as a Service 42 http://mygene.info • High performance • 3M hits/month • Highly scalable • 13k species • 16M genes • Weekly data updates • JSON output • REST interface • Python/R/JS libraries
    40. 40. The Long Tail of bioinformaticians can collaboratively build a gene portal. 43
    41. 41. From crowdsourcing to structured data 44 The Gene Wiki Citizen Science
    42. 42. The biomedical literature is growing fast. [Chart: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis 0 to 1,200,000]
    43. 43. Information Extraction 46 1. Find mentions of high level concepts in text 2. Map mentions to specific terms in ontologies 3. Identify relationships between concepts
    44. 44. Disease mentions in PubMed abstracts 47 NCBI Disease corpus • 793 PubMed abstracts • (100 development, 593 training, 100 test) • 12 expert annotators (2 annotate each abstract) 6,900 “disease” mentions Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
    45. 45. Four types of disease mentions 48 Specific Disease: • “Diastrophic dysplasia” Disease Class: • “Cancers” Composite Mention: • “prostatic , skin , and lung cancer” Modifier: • ..the “familial breast cancer” gene , BRCA2.. Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
    46. 46. Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts? 49
    47. 47. The Turk 50 http://en.wikipedia.org/wiki/The_Turk
    49. 49. Amazon Mechanical Turk (AMT). Parties: Requester, Amazon, Workers. Workflow: 1. Create tasks 2. Execute 3. Aggregate. Requester: for each task, specify: • a qualification test • how many workers per task • how much we will pay per task. Amazon manages: • parallel execution of jobs • worker access to tasks via qualification tests • payments • task advertising
    50. 50. Instructions to workers • Highlight all diseases and disease abbreviations • “...are associated with Huntington disease ( HD )... HD patients received...” • “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…” • Highlight the longest span of text specific to a disease • “... contains the insulin-dependent diabetes mellitus locus …” • Highlight disease conjunctions as single, long spans. • “... a significant fraction of familial breast and ovarian cancer , but undergoes…” • Highlight symptoms (physical results of having a disease) • “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
    51. 51. Qualification test 54 Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) in trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ” Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.” Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…” 26 yes / no questions
    52. 52. Qualification test results: 33/194 workers passed (17%). [Chart: workers vs. qualified workers, with the threshold for passing marked]
    53. 53. Simple annotation interface 56 Click to see instructions Highlight disease mentions
    54. 54. Experimental design • Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus – $0.06 per Human Intelligence Task (HIT) – HIT = annotate one abstract from PubMed – 5 workers annotate each abstract 57
    55. 55. Aggregation function based on simple voting. [Figure: one example passage repeated with the highlighted spans that received at least K of the 5 worker votes, for K = 1 (1 or more votes), K = 2, K = 3, and K = 4.] Example passage: “This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.”
    56. 56. Comparison to gold standard 59 F = 0.81, k = 2, N = 5 • 593 documents • 7 days • 17 workers • $192.90
    57. 57. Comparisons to text-mining algorithms 64
    58. 58. Comparisons to human annotators 65 Average level of agreement between expert annotators (stage 1) F = 0.76
    59. 59. Comparisons to human annotators 66 F = 0.76 F = 0.87 Average level of agreement between expert annotators (stage 2)
    60. 60. In aggregate, our worker ensemble is faster and cheaper than, and as accurate as, a single expert annotator for disease concept recognition.
    61. 61. Information Extraction 68 1. Find mentions of high level concepts in text 2. Map mentions to specific terms in ontologies 3. Identify relationships between concepts
    62. 62. Annotating the relationships 69 This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies. therapeutic target subject predicate object GENE DISEASE
    63. 63. Citizen Science at Mark2Cure.org 70
    64. 64. The Long Tail of citizen scientists can collaboratively annotate biomedical text. 71
    65. 65. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; Lynn Schriml, U Maryland; Paul Pavlidis, U British Columbia; Peipei Ping, UCLA; Many Wikipedia editors; WP:MCB Project. Group members: Katie Fisch, Karthik Gangavarapu, Louis Gioia, Ben Good, Salvatore Loguercio, Adam Mark, Max Nanis, Ginger Tseung, Chunlei Wu. Key group alumni: Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820, DA036134). Citizen Science logo based on http://thenounproject.com/term/team work/39543/
    66. 66. Related AMT work • [1] Zhai et al. (2013) used a similar protocol to tag medication names in clinical trial descriptions; F = 0.88 compared to gold standard • [2] Burger et al. used microtask workers to identify relationships between genes and mutations • [3] Aroyo & Welty used workers to identify relations between concepts in medical text. References: [1] Zhai H. et al. (2013) “Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing.” J Med Internet Res. [2] Burger, John, et al. (2014) “Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing.” Mitre technical report. [3] Aroyo, Lora, and Chris Welty (2013) “Harnessing disagreement in crowdsourcing a relation extraction gold standard.” Tech. Rep. RC25371 (WAT1304-058), IBM Research.
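
The simple-voting aggregation on slide 55 (keep a highlighted span if at least K of the N = 5 workers marked it) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation. It assumes each worker's highlights are stored as (start, end) character offsets and that votes are counted only on exactly matching spans; the slides do not say how partial overlaps were handled. The function name `aggregate_spans` and the sample offsets are hypothetical.

```python
from collections import Counter


def aggregate_spans(worker_annotations, k):
    """Return the spans highlighted by at least k workers.

    worker_annotations: one list of (start, end) spans per worker.
    Assumption: two highlights count as the same vote only if they
    match exactly.
    """
    votes = Counter()
    for spans in worker_annotations:
        # set() ensures a worker votes at most once per span
        votes.update(set(spans))
    return {span for span, count in votes.items() if count >= k}


# Hypothetical highlights from N = 5 workers on one abstract
workers = [
    [(0, 8), (20, 28)],
    [(0, 8)],
    [(0, 8), (20, 28), (40, 45)],
    [(0, 8), (20, 28)],
    [(20, 28)],
]

# Raising k trades recall for precision: fewer spans survive
for k in range(1, 6):
    print(k, sorted(aggregate_spans(workers, k)))
```

With these sample votes, the span marked by only one worker survives at K = 1 and is dropped from K = 2 onward, which is the trade-off between the K settings that the slide illustrates.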
