Microtask crowdsourcing for disease mention annotation in PubMed abstracts
Benjamin M. Good, Max Nanis, Andrew I. Su
Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses that would otherwise be impossible. As a result, many biological natural language processing (BioNLP) projects attempt to address this challenge. However, the state of the art in BioNLP still leaves much room for improvement in terms of precision, recall and the complexity of knowledge structures that can be extracted automatically. Expert curators are vital to the process of knowledge extraction but are always in short supply. Recent studies have shown that workers on microtasking platforms such as Amazon’s Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text.
Here, we investigated the use of the AMT in capturing disease mentions in Pubmed abstracts. We used the recently published NCBI Disease corpus as a gold standard for refining and benchmarking the crowdsourcing protocol. After merging the responses from 5 AMT workers per abstract with a simple voting scheme, we were able to achieve a maximum f measure of 0.815 (precision 0.823, recall 0.807) over 593 abstracts as compared to the NCBI annotations on the same abstracts. Comparisons were based on exact matches to annotation spans. The results can also be tuned to optimize for precision (max = 0.98 when recall = 0.23) or recall (max = 0.89 when precision = 0.45). It took 7 days and cost $192.90 to complete all 593 abstracts considered here (at $.06/abstract with 50 additional abstracts used for spam detection).
This experiment demonstrated that microtask-based crowdsourcing can be applied to the disease mention recognition problem in the text of biomedical research articles. The f-measure of 0.815 indicates that there is room for improvement in the crowdsourcing protocol but that, overall, AMT workers are clearly capable of performing this annotation task.