Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses. Many biological natural language processing (BioNLP) projects attempt to address this challenge, but the state of the art still leaves much room for improvement. Progress in BioNLP research depends on large, annotated corpora for evaluating information extraction systems and training machine learning models. Traditionally, such corpora are created by small numbers of expert annotators, often working over extended periods of time. Recent studies have shown that workers on microtask crowdsourcing platforms such as Amazon’s Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text. Here, we investigated the use of AMT in capturing disease mentions in PubMed abstracts. We used the NCBI Disease corpus as a gold standard for refining and benchmarking our crowdsourcing protocol. After several iterations, we arrived at a protocol that reproduced the annotations of the 593 documents in the ‘training set’ of this gold standard with an overall F measure of 0.872 (precision 0.862, recall 0.883). The output can also be tuned to optimize for precision (max = 0.984 when recall = 0.269) or recall (max = 0.980 when precision = 0.436). Each document was completed by 15 workers, and their annotations were merged based on a simple voting method. In total, 145 workers combined to complete all 593 documents in the span of 9 days at a cost of $0.066 per abstract per worker. The quality of the annotations, as judged with the F measure, increases with the number of workers assigned to each task; however, minimal performance gains were observed beyond 8 workers per task. These results add further evidence that microtask crowdsourcing can be a valuable tool for generating well-annotated corpora in BioNLP. Data produced for this analysis are available at http://figshare.com/articles/Disease_Mention_Annotation_with_Mechanical_Turk/1126402.
4.
Kilicoglu H, Shin D, Fiszman M, Rosemblat G, Rindflesch TC. SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinformatics. 2012 Dec 1;28(23):3158-60. doi: 10.1093/bioinformatics/bts591. Epub 2012 Oct 8.
70,364,020 subject-predicate-object relations
NLM tool
24 million abstracts
5. Example
What diseases are treated with curcumin (turmeric)?
select * from PREDICATION_AGGREGATE where s_name = 'Curcumin' and predicate = 'TREATS'
478 results
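The query on this slide can be tried against any local table with the same shape. What follows is a minimal sketch, assuming a SQLite table named PREDICATION_AGGREGATE with the s_name and predicate columns used above. The o_name column, the helper name, and the sample rows are hypothetical placeholders for illustration, not SemMedDB content.

```python
import sqlite3


def objects_treated_by(conn, substance):
    """Return the object names of all TREATS predications for one subject."""
    rows = conn.execute(
        "SELECT o_name FROM PREDICATION_AGGREGATE "
        "WHERE s_name = ? AND predicate = 'TREATS'",
        (substance,),
    ).fetchall()
    return sorted(row[0] for row in rows)


# Toy in-memory table; o_name and every row below are invented placeholders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PREDICATION_AGGREGATE (s_name TEXT, predicate TEXT, o_name TEXT)"
)
conn.executemany(
    "INSERT INTO PREDICATION_AGGREGATE VALUES (?, ?, ?)",
    [
        ("Curcumin", "TREATS", "Disease A"),
        ("Curcumin", "AFFECTS", "Process B"),
        ("Capsaicin", "TREATS", "Disease C"),
    ],
)

treated = objects_treated_by(conn, "Curcumin")
print(treated)
```

Passing the subject as a bound parameter rather than pasting it into the SQL string keeps the sketch safe to reuse with other substance names.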
7. Example
Data is easy to access, but is it all in there?
Is it correct?
9.
?!?!
Effect on curcumin on cholesterol gall-stone induction.
Influence of dietary capsaicin and curcumin during experimental induction of cholesterol gallstone in mice.
Spice bioactive compounds, capsaicin and curcumin, were both individually and in combination examined for antilithogenic potential during experimental induction of cholesterol gallstones in mice.
10.
The diet that contained capsaicin, curcumin, or their combination reduced the incidence of cholesterol gallstones by 50%, 66%, and 56%, respectively.
11. Facts of life in NLP
• False positives and false negatives are always present
• Human annotators remain the gold standard
• There are not nearly enough professional human annotators to process every document published
12. Observations
• There are about 2.92 billion Internet users
• Lots of them can read English
• Most of these would not have gotten that causal relation wrong for curcumin…
http://www.statista.com/statistics/273018/number-of-internet-users-worldwide/
13. Hypothesis
• We can generate the equivalent of professional annotators by incentivizing, guiding, and aggregating the labor of large numbers of non-professionals
Zhai 2013, Aroyo 2013, Burger 2014, Mortensen 2014, Good 2015
14. Information Extraction
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
15. Microtask Crowdsourcing
• Distribute discrete units of work (aka “human intelligence tasks” or HITs) to many workers in parallel who are paid to solve them.
Reported 500,000 registered workers in 2011 [1]
[1] Paritosh P, Ipeirotis P, Cooper M, Suri S: The computer is the new sewing machine: benefits and perils of crowdsourcing. WWW '11 2011:325–326.
16. AMT, how it works
Requester
Interact directly with Amazon system
For each task, specify:
• a qualification test
• how many workers per task
• how much we will pay per task
• a Web form for completing the task
Tasks
Amazon
Manages:
• parallel execution of jobs
• worker access to tasks via qualification tests
• payments
• task advertising
Workers
17. How well can AMT workers, in aggregate, reproduce a gold standard disease mention corpus within the text of PubMed abstracts?
18. Corpus used for comparison
NCBI Disease corpus
• 793 PubMed abstracts (100 development, 593 training, 100 test)
• 12 expert annotators (2 annotate each abstract)
• 6,900 “disease” mentions
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics, 2012.
19. “Disease”
Phrase is a disease IF:
• it can be mapped to a unique UMLS Metathesaurus concept in one of these semantic types
• and it contains information helpful to physicians
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics, 2012.
21. Experiment
Identify the disease mentions in 593 abstracts from the NCBI Disease corpus
• 6 cents per HIT
• HIT = annotate one abstract from PubMed
• First HIT = survey, next 4 = training, then real tasks
• 10% of the remaining HITs are gold standard tests
• 15 workers annotate each abstract
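The task ordering described on this slide can be sketched as a per-worker sequence: one survey, four training tasks, then a stream in which roughly 10% of tasks are gold standard tests. This is only an illustration under stated assumptions. The function name, the labels, and the choice to pick gold tests independently at random are hypothetical, since the slides do not say how the tests were interleaved.

```python
import random


def task_sequence(n_remaining, gold_rate=0.10, n_training=4, seed=0):
    """Build one worker's task sequence (hypothetical labels and sampling).

    The first task is the survey, the next n_training are training tasks,
    and each remaining task is a gold standard test with probability
    gold_rate, otherwise a real annotation task.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    sequence = ["survey"] + ["training"] * n_training
    for _ in range(n_remaining):
        is_gold = rng.random() < gold_rate
        sequence.append("gold_test" if is_gold else "real")
    return sequence


seq = task_sequence(n_remaining=50)
print(seq[:8], seq.count("gold_test"))
```

Because gold tests are drawn at random here, the share in any one short sequence will only approximate 10%.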
22. Instructions
• Task: You will be presented with text from the biomedical literature which we believe may help resolve some important medical questions. The task is to highlight words and phrases in that text which are diseases, disease groups, or symptoms of diseases. This work will help advance research in cancer and many other diseases!
• Highlight all diseases and disease abbreviations
• “...are associated with Huntington disease ( HD )... HD patients received...”
• “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
• “... contains the insulin-dependent diabetes mellitus locus …”
• and not just ‘diabetes’.
• Highlight disease conjunctions as single, long spans.
• “... a significant fraction of familial breast and ovarian cancer patients…”
• Highlight symptoms - physical results of having a disease
• “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
23. Qualification task: Q1
Select all and only the terms that should be highlighted for each text segment:
1. “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”
• Myotonic
• dystrophy
• Myotonic dystrophy
• DM
• CTG
• trinucleotide repeat expansion
• kinase-encoding gene
• DMPK
24. Qualification task: Q2
2. “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.”
• Germline mutations
• BRCA1
• breast
• ovarian cancer
• inherited breast and ovarian cancer
• cancer
• tumour
• tumour suppressor
25. Qualification task: Q3
3. “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”
• age of 50 years
• severely handicapped
• short
• short stature
• restricted joint mobility
• blindness
• mentally alert
• molecular findings
• Kniest dysplasia
• dysplasia
26. Qualification task results
• Experiment ran for 9 days
• 346 workers attempted the qualification test
• 145 (42%) passed
37. AMT, how it really works
Requester
Tasks
Amazon
Aggregation function
Workers
http://www.thesheepmarket.com/
38. Increase precision with voting
Aggregation function: K is the minimum number of worker votes a highlighted span needs to be kept. The slide shows the same passage at K=1 (1 or more votes), K=2, K=3, and K=4:
This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
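The vote threshold K can be illustrated with a small aggregation function. This is a minimal sketch, assuming each worker's annotations are a set of (start, end) character offsets and that two workers agree only when their spans match exactly. The slides do not specify the matching rule, and the function name and sample spans are hypothetical.

```python
from collections import Counter


def aggregate_by_votes(worker_spans, k):
    """Keep every span that at least k workers highlighted.

    worker_spans: one collection of (start, end) spans per worker.
    Exact span matching is an assumption of this sketch.
    """
    votes = Counter()
    for spans in worker_spans:
        votes.update(set(spans))  # a worker votes at most once per span
    return {span for span, count in votes.items() if count >= k}


# Three hypothetical workers annotating the same abstract.
workers = [
    {(0, 8)},
    {(0, 8), (20, 28)},
    {(0, 8), (20, 28), (40, 45)},
]

for k in (1, 2, 3):
    print(k, sorted(aggregate_by_votes(workers, k)))
```

Raising k keeps only spans with broader agreement, which is the mechanism behind trading recall for precision.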
40. Inter-Annotator agreement among experts, NCBI Disease corpus
Average level of agreement between expert annotators (stage 1): 0.76
Agreement after experts viewed each other’s annotations: 0.87
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics, 2012.
42. In aggregate, our worker ensemble is faster, cheaper and more accurate than a single expert annotator for this task
• Experts had consistency (F) with other experts = 0.76.
• Only after viewing each other’s annotations did experts reach 0.87 consistency
• The turker ensemble had consistency with the finalized standard = 0.87 (with access to much less information)
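The consistency figures above are F measures. The sketch below assumes the balanced F measure (the harmonic mean of precision and recall), which is consistent with the reported precision 0.862, recall 0.883, and F 0.872. Exact span matching and the function names are assumptions made for illustration.

```python
def f_measure(precision, recall):
    """Balanced F measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_spans(predicted, gold):
    """Precision, recall and F for two sets of exactly matching spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Check against the precision and recall reported in the abstract.
reported_f = round(f_measure(0.862, 0.883), 3)
print(reported_f)

# Hypothetical spans: one of two predictions matches one of two gold spans.
print(score_spans({(0, 8), (20, 28)}, {(0, 8), (40, 45)}))
```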
43. We are not alone
• Mortensen et al (2014). 25 workers, 2¢/task = 1 biomedical ontology expert. “Using the wisdom of the crowds to find critical errors in biomedical ontologies: a study of SNOMED CT”. JAMIA
• Burger et al (2014). 5 workers, 7¢/task = 1 expert curator. “Hybrid curation of gene–mutation relations combining automated extraction and crowdsourcing”. Database
• Zhai et al (2013). 5 workers, 3¢/task = 1 expert curator. “Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing”. J Med Internet Res
• ... more (e.g. IBM Research “Crowd Watson” project by Aroyo and Welty)
44. To do list
• Machine learning experiment on TopCoder
• Citizen Science (volunteer) implementation of this
• New tasks
45. mturk -> machine learning
• The main purpose of building this particular corpus was to train a disease tagging algorithm.
46. Next Steps with Disease Corpus
• We have assembled a new 1,000-document corpus
• (took 6 days)
• Simply adding it to the training data didn’t help
• Execute TopCoder contest to produce a better algorithm.
47. Could we just do them all?
• We peaked at a rate of 500 abstracts processed per day (assuming 5 workers/doc)
• 284 workers contributing in a span of 6 days
• At 1 million abstracts/year we would need to get to about 2,700/day to do them all
• $0.066*5*1000000 = $330,000
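The arithmetic on this slide can be reproduced in a few lines. This is a sketch using only the figures quoted above ($0.066 per abstract per worker, 5 workers per document, 1 million abstracts per year). The function names and the 365-day year are assumptions for illustration.

```python
def annual_cost(cost_per_task, workers_per_doc, docs_per_year):
    """Total yearly payout if every document is annotated by several workers."""
    return cost_per_task * workers_per_doc * docs_per_year


def required_daily_rate(docs_per_year, days_per_year=365):
    """Documents that must be completed per day to keep up with the yearly volume."""
    return docs_per_year / days_per_year


cost = annual_cost(0.066, 5, 1_000_000)
rate = required_daily_rate(1_000_000)
print(f"cost per year: ${cost:,.0f}")
print(f"needed per day: {rate:,.0f} (observed peak: 500)")
```

The required rate is more than five times the observed peak of 500 abstracts per day, which motivates the search for many more workers.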
48. Moving towards $0/task and many more workers
• mark2cure.org
• A citizen science portal for volunteers to do the same stuff
• First experiment will recapitulate results from AMT
49. Information Extraction
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
50.
?!?!
Effect on curcumin on cholesterol gall-stone induction.
Influence of dietary capsaicin and curcumin during experimental induction of cholesterol gallstone in mice.
Spice bioactive compounds, capsaicin and curcumin, were both individually and in combination examined for antilithogenic potential during experimental induction of cholesterol gallstones in mice.
70,364,020 subject-predicate-object relations