
Mark2Cure: a crowdsourcing platform for biomedical literature annotation



Benjamin M Good, Max Nanis, Andrew I Su
The Scripps Research Institute, La Jolla, California, USA

ABSTRACT

Recent studies have shown that workers on microtasking platforms such as Amazon’s Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text. In addition, several recent volunteer-based citizen science projects have demonstrated the public’s strong desire and ability to participate in the scientific process even without any financial incentives. Based on these observations, the Mark2Cure initiative is developing a Web interface for engaging large groups of people in the process of manual literature annotation. The system will support both microtask workers and volunteers. These workers will be directed by scientific leaders from the community to help accomplish ‘quests’ associated with specific knowledge extraction problems. In particular, we are working with patient advocacy groups such as the Chordoma Foundation to identify motivated volunteers and to develop focused knowledge extraction challenges. We are currently evaluating the first prototype of the annotation interface using the AMT platform.

Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses, such as gene set enrichment evaluations, that would otherwise be impossible. As such, there is a long and fruitful history of BioNLP projects that apply natural language processing to address this challenge. However, the state of the art in BioNLP still leaves much room for improvement in terms of precision, recall and the complexity of knowledge structures that can be extracted automatically. Expert curators are still vital to the process of knowledge extraction but are in short supply.

Goal: structure all knowledge published as text on the same day it appears in PubMed with expert-human level precision and recall.

[Figure: Number of articles added to PubMed]

Idea: People are very effective processors of text, even in areas where they aren’t experts [1]. Numerous experiments have shown the public’s desire to contribute to science. Let’s give them an opportunity to help annotate the biomedical literature.

Approach: Citizen Science

Challenge

Can non-experts annotate disease occurrences in text better than machines?

To what degree can we reproduce the NCBI disease corpus [2]?

  • 6900 disease mentions in 793 PubMed abstracts
  • developed by a team of 12 annotators
  • covers all sentences in a PubMed abstract
  • Disease mentions are categorized into Specific Disease, Disease Class, Composite Mention and Modifier categories.

Proof of Concept Experiment with AMT (work in progress)

Use the AMT to test the concept before attempting to motivate a citizen science movement.

Testing on the 100 abstract “development set”, 5 workers per abstract, $.06 per completed abstract.

Objectives for Annotators (worker instructions, each followed by examples)

  • Highlight all diseases and disease abbreviations
    “...are associated with Huntington disease ( HD )... HD patients received...”
    “The Wiskott-Aldrich syndrome ( WAS ) …”
  • Highlight the longest span of text specific to a disease
    “... contains the insulin-dependent diabetes mellitus locus …” and not just ‘diabetes’.
    “...was initially detected in four of 33 colorectal cancer families…”
  • Highlight disease conjunctions as single, long spans.
    “...the life expectancy of Duchenne and Becker muscular dystrophy patients..”
    “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
  • Highlight symptoms (physical results of having a disease)
    “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
  • Highlight all occurrences of disease terms
    “Women who carry a mutation in the BRCA1 gene have an 80 % risk of breast cancer by the age of 70. Individuals who have rare alleles of the VNTR also have an increased risk of breast cancer ( 2-4 )”.

Costs

  • one week each ($30)
  • one month turk-specific developer time...

RESULTS, 2 experiments

Consistency(A, B) = 2 * 100 * (N shared annotations) / (N(A) + N(B))

[Figure: Exp. 1 results. Precision, recall and F plotted against the number of votes per annotation (1 to 5).]

Exp. 2 changes

  • Expanded instructions with more examples
  • Minor interface changes (selecting one term automatically selects all other occurrences)

Nearly identical results.

RESULTS, Comparison to concept recognition tools

[Figure: Consistency with NCBI standard, Development Corpus. Y-axis: consistency with NCBI gold standard. Series: mturk experiment 1, minimum 3 votes per annotation; mturk experiment 2, minimum 3 votes per annotation; NCBO annotator (Human Disease Ontology); NCBI conditional random field trained on the AZ corpus (only "all" reported).]

AMT workers performed better than the conditional random field trained on the AZ corpus.

Next Steps

  • Continued refinement of the annotation interface with AMT
  • Experiment to compare AMT results versus volunteers
  • Collaborations with disease groups such as the Chordoma Foundation to prime the flow of citizen scientist annotators

We are hiring! Looking for postdocs and programmers interested in crowdsourcing and bioinformatics. Contact:

REFERENCES

1. Zhai, Haijun, et al. "Web 2.0-based crowdsourcing for high-quality gold standard development in clinical natural language processing." Journal of Medical Internet Research 15.4 (2013).
2. Doğan, Rezarta Islamaj, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics, 2012.

CONTACT

Benjamin Good:
Andrew Su:

FUNDING

We acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924).
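
The minimum-votes threshold and the consistency measure reported in the results can be illustrated with a short Python sketch. It is not the authors' code. It assumes that an annotation is a (start, end) character span, that two annotations count as "shared" only when their spans match exactly, and that each worker contributes at most one vote per span. The function names `aggregate_by_votes` and `consistency` and the toy data are hypothetical.

```python
from collections import Counter


def aggregate_by_votes(worker_annotations, min_votes=3):
    """Keep only the spans highlighted by at least `min_votes` workers.

    `worker_annotations` holds one list of (start, end) spans per worker.
    """
    votes = Counter()
    for spans in worker_annotations:
        votes.update(set(spans))  # a worker votes at most once per span
    return {span for span, n in votes.items() if n >= min_votes}


def consistency(a, b):
    """Consistency(A, B) = 2 * 100 * N(shared) / (N(A) + N(B))."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # assumption: two empty annotation sets score zero
    return 2 * 100 * len(a & b) / total


# Toy example: five workers annotate one abstract (spans are made up).
workers = [
    [(10, 28), (40, 42)],
    [(10, 28)],
    [(10, 28), (40, 42), (60, 70)],
    [(40, 42)],
    [(10, 28), (50, 55)],
]
gold = {(10, 28), (40, 42), (80, 90)}

crowd = aggregate_by_votes(workers, min_votes=3)
score = consistency(crowd, gold)
print(sorted(crowd), score)
```

In this toy data, two spans reach three votes and both appear in the gold set of three, so the score is 2 * 100 * 2 / (2 + 3) = 80. Raising `min_votes` trades recall for precision, which is the relationship plotted against the number of votes per annotation.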