R&D Focus: Amazon Mechanical Turk as a platform for curating research articles



The volume of biomedical literature is massive: over one million new research articles are published every year (roughly one every thirty seconds). To make those articles more usable in research, Scripps Research Institute is exploring ways in which Citizen Scientists can perform "biocuration". The speakers will share lessons from one experiment that set out to identify all mentions of diseases and disease concepts in the abstracts of 973 biomedical research articles. Scripps used Amazon Mechanical Turk as the platform for testing the Citizen Science concept and interface, and the talk will discuss how researchers can use crowdsourcing to solve complex R&D compute problems that require a degree of human intelligence.
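The "one every thirty seconds" figure can be checked with a short calculation. This is a minimal sketch that assumes exactly one million articles per year and a 365-day year; the variable names are illustrative only.

```python
# Assumption: exactly 1,000,000 new articles per year and a 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
articles_per_year = 1_000_000

# Average time between two newly published articles, in seconds.
seconds_per_article = SECONDS_PER_YEAR / articles_per_year
print(f"about one article every {seconds_per_article:.1f} seconds")
```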

Published in: Technology
  1. Amazon Mechanical Turk as a platform for curating research articles. Andrew Su, Ph.D. (@andrewsu). March 25, 2015. Cloud Computing & Life Sciences with AWS. Slides:
  3. The biomedical literature is growing fast… [Chart: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000]
  4. … but it is very hard to query and compute
  5. … but it is very hard to query and compute. (Imatinib, Crizotinib, Erlotinib, Gefitinib, Sorafenib, Lapatinib, Dasatinib, …) AND (Acute myeloid leukemia, Acute lymphoblastic leukemia, Chronic myelogenous leukemia, Chronic lymphocytic leukemia, Hodgkin lymphoma, Non-Hodgkin lymphoma, Myeloma, …)
  6. The Network of BioThings. 1. Identify biomedical concepts in text: "… We report a case of familial systemic mastocytosis with the rare KIT K509I germ line mutation. In vitro treatment with imatinib, dasatinib and PKC412 reduced cell viability of primary mast cells harboring KIT K509I mutation. Both patients with familial systemic mastocytosis had remarkable hematological and skin improvement after three months of imatinib treatment." (Leuk Res. 2014 Oct;38(10):1245-51. doi: 10.1016/j.leukres.) Concept types: GENES, DISEASES, DRUGS, VARIANTS
  7. The Network of BioThings. 1. Identify biomedical concepts in text. 2. Identify relationships between concepts. [Diagram: nodes imatinib, dasatinib, PKC412, Familial systemic mastocytosis, KIT, K509I; edge labels: Mutation of, Mutation causes, causes, treats, inhibits]
  8. Goal: Assemble a network of biomedical knowledge that is comprehensive, current, computable and traceable.
  9. Information Extraction. 1. Identify biomedical concepts in text. 2. Identify relationships between concepts.
  10. NCBI Disease Corpus as Gold Standard: 593 PubMed abstracts; 12 expert annotators (2 per document); 6,900 mentions of “disease concepts”. Source: Doğan and Lu. Proceedings of the 2012 Workshop on BioNLP, 2012, 91-9.
  11. Question: Can a group of non-scientists collectively replicate the expert-generated NCBI Disease Corpus?
  12. Amazon Mechanical Turk (AMT). [Diagram: Requester, Amazon, Workers] 1. Create tasks. 2. Execute. 3. Aggregate.
  13. Qualification test. [Chart: worker scores with the passing threshold marked] 145 workers (42%) passed the test.
  14. Comparison to gold standard. [Chart: precision and recall] K = 6; F score = 0.87; 15 workers / abstract.
  15. Comparison to text mining, experts. [Chart: F score by method] Text-mining: F = 0.76; AMT experiments: F = 0.87.
  16. AMT annotation summary • 593 documents • 9 days • 145 workers • $0.06 / task • Total: $630.96
  18. Comparison to gold standard. [Chart: precision and recall versus voting threshold; both axes from 0 to 1] k = 6; F score = 0.84; Total cost: $0
  19. Does Citizen Science scale? 15,828 volunteers needed; 175,000 volunteers; 300,000 volunteers; 37,000 volunteers; 1,000,000 volunteers.
  20. Mark2Cure: Ben Good, Max Nanis, Ginger Tsueng, Chunlei Wu, All Mark2Curators! Other group members: Cyrus Afrasiabi, Sebastian Burgstaller, Ramya Gamini, Louis Gioia, Toby Li, Salvatore Loguercio, Adam Mark, Erick Scott, Greg Stupp, Kevin Xin. Matt and Cristina Might, NGLY1 community. Contact: @andrewsu, +Andrew Su. Funding and Support: BioGPS: R01 GM83924; Gene Wiki: R01 GM089820; BD2K Center of Excellence: U54 GM114833; STSI: UL1 TR001114. Icon credits (Noun Project, Wikimedia Commons): Zach VanDeHey, hunotika, Viktorvoigt, Alberto Rojas, Lloyd Humphreys.
  21. Why do I Mark2Cure? "I am retired, have a doctorate in medical humanities, and have two children with Gaucher disease. I am just looking for some way to put my education to use. Sounds like a perfect situation for me." "My 4 year old daughter Phoebe is living with and battling rare disease." "I have Ehlers Danlos Syndrome. I hope to help people learn about this painful and debilitating disorder, so that others like me can receive more effective medical care." "Take part in something that helps humanity." "I Mark2Cure in memory of my son Mike who had type 1 diabetes." "Studied biology in college and I really miss it!" "In memory of my daughter who had Cystic Fibrosis" "Give back"
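
The "Comparison to gold standard" slides score crowd annotations against the expert corpus at a voting threshold (K = 6, with 15 workers per abstract on slide 14). The slides do not show how the aggregation was implemented, so the following is only a minimal Python sketch of one way such a threshold could work. It assumes that each annotation is an exact (start, end) character span, that a span is kept when at least k workers marked it, and that scoring uses exact-match precision, recall and F score. The function names and the toy data are hypothetical and are not taken from the talk.

```python
from collections import Counter


def aggregate_by_votes(worker_annotations, k):
    """Keep every span that at least k workers marked (assumed voting rule)."""
    votes = Counter()
    for spans in worker_annotations:
        votes.update(set(spans))  # each worker votes at most once per span
    return {span for span, count in votes.items() if count >= k}


def precision_recall_f(predicted, gold):
    """Exact-match precision, recall and F score of predicted vs. gold spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data (hypothetical): character spans for one abstract.
gold = {(0, 8), (20, 35), (50, 58)}
workers = [
    {(0, 8), (20, 35)},
    {(0, 8), (20, 35), (40, 45)},
    {(0, 8), (50, 58)},
    {(0, 8), (20, 35), (40, 45)},
    {(20, 35), (40, 45)},
]

for k in (1, 3, 4):
    consensus = aggregate_by_votes(workers, k)
    p, r, f = precision_recall_f(consensus, gold)
    print(f"k={k}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

On this toy data, raising k from 1 to 4 raises precision and lowers recall, which is the kind of trade-off a chart of precision and recall against voting threshold would display.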