
Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)


Presentation on "Microtask crowdsourcing for annotating diseases in PubMed abstracts" at ASHG14 session on "Cloudy with a chance of big data".

Published in: Science


  1. Microtask crowdsourcing for annotating diseases in PubMed abstracts. Andrew Su, Ph.D. @andrewsu, asu@scripps.edu, http://sulab.org. October 20, 2014, ASHG. Slides: slideshare.net/andrewsu
  2. Potential conflicts of interest: • Novartis • Assay Depot • Avera Health
  3. [Diagram] Condition A, Condition B. Genome-scale profiling: RNA-seq, exome seq, whole genome seq, proteomics, genotyping, copy-number analysis, ChIP-seq, methylation, functional genomics. Candidate genes/proteins.
  4. [Diagram] Candidate genes/proteins; related diseases; related drugs; related pathways.
  5. Databases are fragmented and incomplete. Disease links for Apolipoprotein E: KEGG (4), OMIM (6), HuGE Navigator (517), PharmGKB (10). [Venn diagram; region counts as extracted: 0 2 0 2 0 0 0 0 507 0 0 x 2 1 6]
  6. [No text on slide]
  7. [Chart] Number of new PubMed-indexed articles by year, 1983 to 2013 (y-axis: 0 to 1,200,000).
  8. [No text on slide]
  9. Harnessing the crowd… (Photo: http://www.flickr.com/photos/portland_mike/6140660504/)
  10. … to organize information (Photo: http://www.flickr.com/photos/45697441@N00/6629580443)
  11. Information extraction for a Network of BioThings: (1) Find mentions of high-level concepts in text. (2) Map mentions to specific terms in ontologies. (3) Identify relationships between concepts. Concept types: diseases, genes/proteins, drugs, pathways.
  12. The NCBI Disease corpus: • 793 PubMed abstracts • 12 expert annotators (2 annotate each abstract) • 6,900 “disease” mentions. Doğan, Rezarta, and Zhiyong Lu. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
  13. Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?
  14. Experimental design. Task: Identify the disease mentions in the PubMed abstracts from the NCBI disease corpus. – 5 non-scientists annotate each abstract – The details: • Recruit workers using Amazon Mechanical Turk • Pay $0.066 per Human Intelligence Task (HIT) • HIT = annotate one abstract from PubMed
  15. Instructions to workers: • Highlight all diseases and disease abbreviations: “...are associated with Huntington disease ( HD )... HD patients received...”; “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…” • Highlight the longest span of text specific to a disease: “... contains the insulin-dependent diabetes mellitus locus …” • Highlight disease conjunctions as single, long spans: “... a significant fraction of familial breast and ovarian cancer , but undergoes…” • Highlight symptoms (physical results of having a disease): “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
  16. Aggregation function based on simple voting. Example passage: “This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.” [The passage is repeated on the slide, once per vote threshold: 1 or more votes (K=1), K=2, K=3, K=4.]
  17. Comparison to gold standard: F score = 0.81. [Plot; axes: Precision, Recall]
  18. Comparison to gold standard. [Chart; y-axis 0 to 0.9; N = 3, 6, 9, 12, 15, 18; annotated with Max F and the vote threshold k; extracted Max F values include 0.69, 0.79, 0.82 and 0.85]
  19. Comparison to gold standard. [Same chart as slide 18]
  20. Comparison to gold standard. [Same chart as slide 18]
  21. Comparison to gold standard. [Same chart as slide 18]
  22. Comparison to gold standard. [Same chart as slide 18, with two reference values added] F = 0.76: score of single Ph.D. annotator. F = 0.87: agreement between multiple Ph.D. annotators.
  23. In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition. Crowd-based biocuration: • 7 days • 17 workers • $192.90. Professional biocuration: • Many months • 12 experts • $150,000+
  24. Information extraction for a Network of BioThings: (1) Find mentions of high-level concepts in text. (2) Map mentions to specific terms in ontologies. (3) Identify relationships between concepts. Concept types: diseases, genes/proteins, drugs, pathways.
  25. Vision-based Citizen Science: • Galaxy Zoo (galaxy classification; 110M+ classifications, 300k+ volunteers) • Foldit (protein folding; 350k+ players) • Eterna (RNA folding; 80k players) • Eyewire (3D neuron structure determination; 130k volunteers) • Phylo (multiple sequence alignment; 30k+ players, 285k alignments) • …
  26. Language-based Citizen Science: http://mark2cure.org
  27. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820, DA036134). The Su Lab: Chunlei Wu, Ben Good, Salvatore Loguercio, Max Nanis, Louis Gioia, Ramya Gamini, Greg Stupp, Ginger Tsueng, Erick Scott, Vyshakh Babji, Karthik Gangavarapu, Adam Mark. Key Alumni: Katie Fisch, Tobias Meissner. Key Collaborators: Andra Waagmeester, Lynn Schriml, Peter Robinson. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su. We are recruiting programmers, postdocs, and awesome people of all kinds! bit.ly/SuLabJobs. We are hosting a hackathon Nov 7-9 for the Network of BioThings: bit.ly/hackNoB
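
The simple-voting aggregation on slide 16 and the comparison to the gold standard on slide 17 can be illustrated with a short Python sketch. It is not the presenters' implementation. It assumes that two highlights count as the same mention only when their text matches exactly, and that "F score" means the balanced F1 (harmonic mean of precision and recall); the slides state neither. The function names, the worker highlights, and the gold set are all hypothetical and exist only to show how the vote threshold K changes the result.

```python
from collections import Counter


def aggregate_by_votes(worker_mentions, k):
    """Return the mentions highlighted by at least k workers."""
    votes = Counter()
    for mentions in worker_mentions:
        # set() ensures each worker contributes at most one vote per mention.
        votes.update(set(mentions))
    return {mention for mention, n in votes.items() if n >= k}


def precision_recall_f(predicted, gold):
    """Exact-match precision, recall and balanced F score (an assumption)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical highlights from five workers on the slide 16 passage.
workers = [
    {"cancer", "leukemia", "acute myeloid leukemia", "AML"},
    {"leukemia", "acute myeloid leukemia", "AML", "CLL"},
    {"acute myeloid leukemia", "AML", "hematologic malignancies"},
    {"AML", "leukemia"},
    {"AML"},
]

# Hypothetical gold standard, invented for this example only.
gold = {"leukemia", "acute myeloid leukemia", "AML", "CLL"}

# Sweep the vote threshold and score each aggregate against the gold set.
for k in range(1, len(workers) + 1):
    p, r, f = precision_recall_f(aggregate_by_votes(workers, k), gold)
    print(f"K={k}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

With these toy inputs, raising K discards mentions that few workers agree on, which raises precision and lowers recall. Sweeping K and keeping the best F is one way to read the "Max F" annotations on slides 18 to 22, though the slides do not spell out that procedure.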
