Microtask crowdsourcing for annotating diseases in PubMed abstracts
Andrew Su, Ph.D. 
@andrewsu 
asu@scripps.edu 
http://sulab.org 
October 20, 2014 
ASHG 
Slides: slideshare.net/andrewsu 
Potential conflicts of interest 
• Novartis 
• Assay Depot 
• Avera Health 
Genome-scale profiling (Condition A, Condition B)
• RNA-seq
• Exome seq
• Whole genome seq
• Proteomics
• Genotyping
• Copy-number analysis
• ChIP-seq
• Methylation
• Functional genomics
Candidate genes/proteins
Candidate genes/proteins
• Related diseases
• Related drugs
• Related pathways
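Databases are fragmented and incomplete
[Figure: Venn diagram of disease links for Apolipoprotein E in four databases: KEGG (4), OMIM (6), HuGE Navigator, and PharmGKB. The remaining two set sizes shown are 10 and 517; the largest single region holds 507 links.]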
[Figure: Number of new PubMed-indexed articles by year, 1983 to 2013; y-axis from 0 to 1,200,000.]
Harnessing the crowd…
Image: http://www.flickr.com/photos/portland_mike/6140660504/
… to organize information
Image: http://www.flickr.com/photos/45697441@N00/6629580443
Information extraction for a Network of BioThings
1. Find mentions of high-level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
Concept types: diseases, genes/proteins, drugs, pathways
The NCBI Disease corpus
• 793 PubMed abstracts
• 12 expert annotators (2 annotate each abstract)
• 6,900 “disease” mentions
Doğan, Rezarta, and Zhiyong Lu. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?
Experimental design
Task: Identify the disease mentions in the PubMed abstracts from the NCBI disease corpus
– 5 non-scientists annotate each abstract
– The details:
• Recruit workers using Amazon Mechanical Turk
• Pay $0.066 per Human Intelligence Task (HIT)
• HIT = annotate one abstract from PubMed
Instructions to workers
• Highlight all diseases and disease abbreviations
– “...are associated with Huntington disease ( HD )... HD patients received...”
– “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
– “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans
– “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms (physical results of having a disease)
– “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
Aggregation function based on simple voting
[Figure: the abstract excerpt below repeated with highlighted disease spans, labeled by vote threshold: 1 or more votes (K=1), K=2, K=3, K=4.]
This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
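The vote threshold can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the five workers and their highlights are made up, spans are assumed to be (start, end) character offsets, and two highlights are assumed to count as the same vote only when their offsets match exactly. The names `aggregate_by_votes` and `span` are hypothetical.

```python
from collections import Counter

def aggregate_by_votes(worker_spans, k):
    """Keep every span highlighted by at least k distinct workers."""
    votes = Counter()
    for spans in worker_spans:
        # set() so a worker who marks the same span twice still casts one vote
        votes.update(set(spans))
    return {s for s, n in votes.items() if n >= k}

text = ("well as in ex vivo acute myeloid leukemia (AML) and chronic "
        "lymphocytic leukemia (CLL) patient tumor samples")

def span(phrase):
    """Character offsets (start, end) of the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase))

AML_FULL, AML = span("acute myeloid leukemia"), span("AML")
CLL_FULL, CLL = span("chronic lymphocytic leukemia"), span("CLL")

# Hypothetical highlights from five workers (assumption, not study data)
workers = [
    [AML_FULL, AML, CLL_FULL, CLL],
    [AML_FULL, AML, CLL_FULL],
    [AML_FULL, CLL_FULL, CLL],
    [span("myeloid leukemia"), AML],   # shorter span than the others chose
    [AML_FULL, span("tumor")],         # one spurious highlight
]

by_k = {k: aggregate_by_votes(workers, k) for k in range(1, 6)}
for k, kept in by_k.items():
    print(k, sorted(text[s:e] for s, e in kept))
```

Raising K drops highlights that only one or two workers made, trading recall for precision; K=1 keeps everything anyone marked.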
Comparison to gold standard
[Figure: precision and recall against the gold standard.]
F score = 0.81
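For reference, precision, recall, and F score against a gold standard can be computed as in the sketch below. It assumes exact span matching (a predicted span counts as correct only if the same offsets appear in the gold set) and the balanced F1 measure; the slides do not state the matching rule, so both are assumptions. The function name and the example spans are hypothetical.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall, and balanced F score under exact span matching."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical (start, end) spans, not taken from the corpus
gold_spans = {(0, 5), (10, 15), (20, 25), (30, 35)}
crowd_spans = {(0, 5), (10, 15), (20, 25), (40, 45), (50, 55)}

p, r, f = precision_recall_f(crowd_spans, gold_spans)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Three of the five crowd spans match the gold set, so precision is 3/5 and recall is 3/4 in this toy case.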
Comparison to gold standard
[Figure: F score against the gold standard, y-axis from 0 to 0.9, with curves labeled by vote threshold k (starting at k=1) and by N = 3, 6, 9, 12, 15, 18. Max F values shown: 0.69, 0.79, 0.82, 0.85, 0.85, 0.85.]
F = 0.76 – score of single Ph.D. annotator 
F = 0.87 – agreement between multiple Ph.D. annotators
In aggregate, our worker ensemble is faster, cheaper, and as accurate as a single expert annotator for disease concept recognition.
Crowd-based biocuration 
• 7 days 
• 17 workers 
• $192.90 
Professional biocuration 
• Many months 
• 12 experts 
• $150,000+
Information extraction for a Network of BioThings
1. Find mentions of high-level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
Concept types: diseases, genes/proteins, drugs, pathways
Vision-based Citizen Science 
• Galaxy Zoo (galaxy classification; 110M+ classifications, 300k+ volunteers)
• Foldit (protein folding; 350k+ players)
• Eterna (RNA folding; 80k players)
• Eyewire (3D neuron structure determination; 130k volunteers)
• Phylo (multiple sequence alignment; 30k+ players, 285k alignments)
• …
Language-based Citizen Science 
http://mark2cure.org
Funding and Support 
(BioGPS: GM83924, Gene Wiki: GM089820, DA036134) 
The Su Lab 
Chunlei Wu 
Ben Good 
Salvatore Loguercio 
Max Nanis 
Louis Gioia 
Ramya Gamini 
Greg Stupp 
Ginger Tsueng 
Erick Scott 
Vyshakh Babji 
Karthik Gangavarapu 
Adam Mark 
Key Alumni 
Katie Fisch 
Tobias Meissner 
Key Collaborators 
Andra Waagmeester 
Lynn Schriml 
Peter Robinson 
Contact 
http://sulab.org 
asu@scripps.edu 
@andrewsu 
+Andrew Su 
We are recruiting programmers, postdocs, and awesome people of all kinds! bit.ly/SuLabJobs
We are hosting a hackathon Nov 7-9 for the Network of BioThings: bit.ly/hackNoB
