UCSD / DBMI seminar 2015-02-6

Presentation given at UCSD Division of Biomedical Informatics on Feb 6, 2015.

Published in: Science

  1. 1. Crowdsourcing and Citizen Science for Biology. Andrew Su, Ph.D. (@andrewsu, asu@scripps.edu, http://sulab.org). February 6, 2015, UCSD. Slides: slideshare.net/andrewsu
  2. 2. Few genes are well annotated… [Figure: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts; labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF; marked values: 41%, 65%. Data: NCBI, February 2013]
  3. 3. … because the literature is sparsely curated? [Figure: number of new PubMed-indexed articles by year, 1983 to 2013; y-axis 0 to 1,200,000]
  4. 4. … because the literature is sparsely curated? [Figure: average capacity of human scientist by year, 1983 to 2013; y-axis 0 to 40]
  5. 5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
  6. 6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  7. 7. The Long Tail is a prolific source of content. [Figure: content produced vs. contributors (sorted), divided into Short Head and Long Tail] Short Head vs. Long Tail examples: News: newspapers vs. blogs; Video: TV/Hollywood vs. YouTube; Product reviews: Consumer Reports vs. Amazon reviews; Food reviews: food critics vs. Yelp; Talent judging: Olympics vs. American Idol
  8. 8. Wikipedia is reasonably accurate
  9. 9. Wikipedia has breadth and depth [Figure: articles and words (millions), Wikipedia vs. Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]
  10. 10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
  11. 11. From crowdsourcing to structured data: The Gene Wiki; Mark2Cure
  12. 12. Filtering, extracting, and summarizing PubMed [Diagram labels: Documents, Concepts, Review article]
  13. 13. Filtering, extracting, and summarizing PubMed [Diagram labels: Documents, Concepts]
  14. 14. Wiki success depends on a positive feedback loop [Diagram labels: Gene wiki page utility, Number of users, Number of contributors]
  15. 15. 10,000 gene “stubs” within Wikipedia. Stub contents: protein structure; symbols and identifiers; tissue expression pattern; Gene Ontology annotations; links to structured databases; gene summary; protein interactions; linked references. (Huss, PLoS Biol, 2008) [Feedback loop: Utility, Users, Contributors]
  16. 16. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. (Huss, PLoS Biol, 2008; Good, NAR, 2011) [Feedback loop: Utility, Users, Contributors]
  17. 17. Gene Wiki has a critical mass of editors. Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. (Good, NAR, 2011) [Figure: editor count and edit count; feedback loop: Utility, Users, Contributors]
  18. 18. A review article for every gene is powerful: references to the literature; hyperlinks to related concepts. Examples: Reelin: 98 editors, 703 edits since July 2002; Heparin: 358 editors, 654 edits since June 2003; AMPK: 109 editors, 203 edits since March 2004; RNAi: 394 editors, 994 edits since October 2002
  19. 19. Making the Gene Wiki more computable [Diagram labels: Structured annotations, Free text, Analyses, Text-mining]
  20. 20. Making the Gene Wiki more computable [Diagram labels: Structured annotations, Free text, Analyses, Text-mining] http://fiehnlab.ucdavis.edu/projects/rice_metabolome/
  21. 21. Making the Gene Wiki more computable [Diagram labels: Structured annotations, Free text, Analyses, Text-mining]
  22. 22. Making the Gene Wiki more computable [Diagram labels: Structured annotations, Free text, Databases]
  23. 23. Making the Gene Wiki more computable [Diagram labels: Structured annotations, Free text]
  24. 24. Making the Gene Wiki more computable [Diagram labels: Structured annotations, Free text]
  25. 25. Wikidata: “Provide a database of the world’s knowledge that anyone can edit” - Denny Vrandečić
  26. 26. Centralizing key data storage (Source: http://commons.wikimedia.org/wiki/File:Wikidata_slides_Magnus_Manske,_Cambridge,_2014-02-27.pdf)
  27. 27. Centralizing key data storage
  28. 28. Centralizing key data storage
  29. 29. Centralizing key data storage [Diagram labels: 287 language editions of Wikipedia; Bioinformatics community]
  30. 30. Loading biological data into Wikidata [Sources: Entrez Gene, Ensembl, UniProt, UCSC, PDB, RefSeq]
  31. 31. Wikidata for biology. Reelin (Q414043, http://www.wikidata.org/wiki/Q414043): is a (Property:P31) Protein (Q8054), Glycoprotein (Q187126); regulates (Property:P128) Neural development (Q1345738); interacts with (Property:P129) VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)
  32. 32. Wikidata for biology [Same diagram, identifiers only: Property:P31, Property:P128, Property:P129; Q8054, Q187126, Q1345738, Q1979313, Q423510, Q414043] http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
  33. 33. Current progress • All human and mouse genes and proteins loaded • All diseases (Human Disease Ontology) loaded • Dataset of all drugs in preparation • Datasets for gene-disease, drug-disease, and drug-protein relationships in preparation
  34. 34. The Long Tail of scientists is a valuable source of information on gene function
  35. 35. From crowdsourcing to structured data: The Gene Wiki; Mark2Cure
  36. 36. The biomedical literature is growing fast… [Figure: number of new PubMed-indexed articles by year, 1983 to 2013; y-axis 0 to 1,200,000]
  37. 37. … but it is very hard to query and compute
  38. 38. … but it is very hard to query and compute [Example query: (Imatinib, Crizotinib, Erlotinib, Gefitinib, Sorafenib, Lapatinib, Dasatinib, …) AND (Acute myeloid leukemia, Acute lymphoblastic leukemia, Chronic myelogenous leukemia, Chronic lymphocytic leukemia, Hodgkin lymphoma, Non-Hodgkin lymphoma, Myeloma, …)]
  39. 39. Information Extraction: 1. Find mentions of high-level concepts in text 2. Map mentions to specific terms in ontologies 3. Identify relationships between concepts
  40. 40. Disease mentions in PubMed abstracts. NCBI Disease corpus: • 793 PubMed abstracts (100 development, 593 training, 100 test) • 12 expert annotators (2 annotate each abstract) • 6,900 “disease” mentions. Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
  41. 41. Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?
  42. 42. The Mechanical Turk (http://en.wikipedia.org/wiki/The_Turk)
  43. 43. The Mechanical Turk (http://en.wikipedia.org/wiki/The_Turk)
  44. 44. Amazon Mechanical Turk (AMT). Requester: for each task, specify: • a qualification test • how many workers per task • how much we will pay per task. Amazon manages: • parallel execution of jobs • worker access to tasks via qualification tests • payments • task advertising. Workers. Steps: 1. Create tasks 2. Execute 3. Aggregate
  45. 45. Instructions to workers • Highlight all diseases and disease abbreviations • “...are associated with Huntington disease ( HD )... HD patients received...” • “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…” • Highlight the longest span of text specific to a disease • “... contains the insulin-dependent diabetes mellitus locus …” • Highlight disease conjunctions as single, long spans. • “... a significant fraction of familial breast and ovarian cancer , but undergoes…” • Highlight symptoms - physical results of having a disease • “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
  46. 46. Qualification test (26 yes / no questions). Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ” Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.” Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”
  47. 47. Qualification test results: 33/194 workers passed (17%) [Figure: workers and qualified workers relative to the threshold for passing]
  48. 48. Simple annotation interface [Screenshot callouts: Click to see instructions; Highlight disease mentions]
  49. 49. Experimental design • Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus – $0.06 per Human Intelligence Task (HIT) – HIT = annotate one abstract from PubMed – 5 workers annotate each abstract
  50. 50. Aggregation function based on simple voting [Figure: the same passage annotated by 5 workers, with the highlighted mentions kept at each vote threshold: 1 or more votes (K=1), K=2, K=3, K=4] Passage: “This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.”
  51. 51. Comparison to gold standard: F = 0.81, k = 2 • 593 documents • 5 users / doc • 7 days • $192.90 [Figure: precision vs. recall]
  52. 52. Comparison to gold standard: F = 0.87, k = 6 • 593 documents • 15 users / doc • 9 days • $630.96 [Figure: precision vs. recall]
  53. 53. Comparison to gold standard [Figure: maximum F-score (0 to 1) vs. workers per document (0 to 16)]
  54. 54. Comparisons to text-mining algorithms [Figure: F-score for text-mining tools (BANNER, NCBO Annotator) and Mechanical Turk]
  55. 55. Comparisons to human annotators: average level of agreement between expert annotators (stage 1), F = 0.76
  56. 56. Comparisons to human annotators: F = 0.76, F = 0.87; average level of agreement between expert annotators (stage 2)
  57. 57. In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition.
  58. 58. Information Extraction: 1. Find mentions of high-level concepts in text 2. Map mentions to specific terms in ontologies 3. Identify relationships between concepts
  59. 59. Annotating the relationships. Passage: “This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.” [Annotation labels: subject (GENE), predicate (“therapeutic target”), object (DISEASE)]
  60. 60. Does Mechanical Turk scale? 1,000,000 articles per year × 10 annotators / article × 4 tasks / doc × $0.06 / task = $2,400,000 / year
  61. 61. http://mark2cure.org
  62. 62. Key stats • Launched Jan 19, 2015 • In 2.5 weeks – 1,984 document annotations – 80 unique users – 22% complete [Figure: document annotations]
  63. 63. The Long Tail of citizen scientists can collaboratively annotate biomedical text.
  64. 64. Gene Wiki / Wikidata: Ben Good, Andra Waagmeester, Lynn Schriml (U Maryland), Elvira Mitraka (U Maryland), Gang Fu (NCBI), Evan Bolton (NCBI), Paul Pavlidis (U British Columbia), Peter Robinson (Charite), many Wikipedia and Wikidata editors, WP:MCB Project. Other group members: Ramya Gamini, Louis Gioia, Salvatore Loguercio, Adam Mark, Erick Scott, Greg Stupp, Kevin Xin. Mark2Cure: Ben Good, Max Nanis, Ginger Tsueng, Chunlei Wu. Funding and Support: BioGPS: GM83924; Gene Wiki: GM089820; BD2K COE: GM114833. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su. Next slide!
  65. 65. Why do I Mark2Cure? I am retired, have a doctorate in medical humanities, and have two children with Gaucher disease. I am just looking for some way to put my education to use. Sounds like a perfect situation for me. My 4 year old daughter Phoebe is living with and battling rare disease. I have Ehlers Danlos Syndrome. I hope to help people learn about this painful and debilitating disorder, so that others like me can receive more effective medical care. Take part in something that helps humanity. I Mark2Cure in memory of my son Mike who had type 1 diabetes. Studied biology in college and I really miss it! In memory of my daughter who had Cystic Fibrosis Give back
  66. 66. Worker demographics: gender. First HIT was a survey
  67. 67. Age
  68. 68. Occupation
  69. 69. Education
  70. 70. Why?
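
Slide 32 points at the Wikidata `wbgetentities` API for the Reelin item (Q414043). A minimal Python sketch of building that request and reading a label out of the JSON reply could look like the following. The `format=json` parameter, the helper names `build_entity_url` and `english_label`, and the abbreviated `sample_response` are assumptions added for illustration and are not taken from the slides; the sketch makes no network request.

```python
from urllib.parse import urlencode

# Public Wikidata API endpoint (the slide shows the same path under http://wikidata.org).
API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def build_entity_url(entity_id, language="en"):
    """Build a wbgetentities request URL for one Wikidata item."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "languages": language,
        "format": "json",  # assumption: ask for JSON rather than the default HTML view
    }
    return API_ENDPOINT + "?" + urlencode(params)


def english_label(response, entity_id, language="en"):
    """Return the item's label from a parsed wbgetentities reply, or None if absent."""
    entity = response.get("entities", {}).get(entity_id, {})
    return entity.get("labels", {}).get(language, {}).get("value")


# Hypothetical, heavily abbreviated reply used only to exercise the parser.
sample_response = {
    "entities": {
        "Q414043": {"labels": {"en": {"language": "en", "value": "Reelin"}}}
    }
}

if __name__ == "__main__":
    print(build_entity_url("Q414043"))
    print(english_label(sample_response, "Q414043"))
```

Fetching the URL with any HTTP client and passing the parsed JSON to `english_label` would complete the round trip.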
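
Slides 50 to 53 describe aggregating worker highlights by simple voting at a threshold K and scoring the result against the gold standard with precision, recall, and F. The sketch below shows one way this could work, assuming each worker's annotation is a set of exact `(start, end)` character spans and that a span counts as correct only on an exact match. The slides do not state the matching rule, and the function names and toy data are hypothetical.

```python
from collections import Counter


def aggregate_by_votes(worker_annotations, k):
    """Keep every span highlighted by at least k workers."""
    votes = Counter()
    for spans in worker_annotations:
        votes.update(set(spans))  # each worker votes at most once per span
    return {span for span, count in votes.items() if count >= k}


def precision_recall_f(predicted, gold):
    """Exact-match precision, recall, and F (harmonic mean of the two)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: five workers annotating one abstract, plus a gold standard.
workers = [
    {(10, 18), (40, 62)},
    {(10, 18), (40, 62)},
    {(10, 18)},
    {(40, 62), (70, 75)},
    {(10, 18), (40, 62), (90, 99)},
]
gold = {(10, 18), (40, 62), (80, 85)}

if __name__ == "__main__":
    for k in range(1, 6):
        p, r, f = precision_recall_f(aggregate_by_votes(workers, k), gold)
        print(f"K={k}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Raising K trades recall for precision, which is why the slides report the F-score at the best-performing threshold for a given number of workers.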
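
The scaling estimate on slide 60 is a straight product of four quantities. A small sketch of that arithmetic follows, with prices held in integer cents to avoid floating-point rounding; the function name `annual_cost_dollars` and its parameter names are hypothetical, and the calculation covers only the per-task payments listed on the slide.

```python
def annual_cost_dollars(articles_per_year, annotators_per_article,
                        tasks_per_doc, cents_per_task):
    """Yearly worker payments in dollars: the product of the slide's four factors."""
    total_cents = (articles_per_year * annotators_per_article
                   * tasks_per_doc * cents_per_task)
    return total_cents / 100


if __name__ == "__main__":
    # Figures from slide 60: 1,000,000 articles, 10 annotators, 4 tasks, $0.06 per task.
    print(annual_cost_dollars(1_000_000, 10, 4, 6))
```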
