Mark2Cure: a crowdsourcing platform for biomedical literature annotation
Benjamin M Good, Max Nanis, Andrew I Su

The Scripps Research Institute, La Jolla, California, USA
ABSTRACT

Recent studies have shown that workers on microtasking platforms such
as Amazon’s Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text. In addition, several recent
volunteer-based citizen science projects have demonstrated the public’s
strong desire and ability to participate in the scientific process even
without any financial incentives. Based on these observations, the
Mark2Cure initiative is developing a Web interface for engaging large
groups of people in the process of manual literature annotation. The
system will support both microtask workers and volunteers. These
workers will be directed by scientific leaders from the community to
help accomplish ‘quests’ associated with specific knowledge extraction
problems. In particular, we are working with patient advocacy groups
such as the Chordoma Foundation to identify motivated volunteers and
to develop focused knowledge extraction challenges. We are currently
evaluating the first prototype of the annotation interface using the AMT
platform.

Challenge

Can non-experts annotate disease occurrences in text better than machines?
  
The NCBI disease corpus [2]:
•  6900 disease mentions in 793 PubMed abstracts
•  developed by a team of 12 annotators
•  covers all sentences in a PubMed abstract
•  disease mentions are categorized into Specific Disease, Disease Class, Composite Mention and Modifier categories
  	
  

Use the AMT to test the concept before attempting to motivate a citizen science movement.
  
Objectives for Annotators

•  Highlight all diseases and disease abbreviations
  “...are associated with Huntington disease ( HD )... HD patients received...”
  “The Wiskott-Aldrich syndrome ( WAS ) …”
•  Highlight the longest span of text specific to a disease
  “... contains the insulin-dependent diabetes mellitus locus …” and not just ‘diabetes’.
  “...was initially detected in four of 33 colorectal cancer families…”
•  Highlight disease conjunctions as single, long spans.
  “...the life expectancy of Duchenne and Becker muscular dystrophy patients..”
  “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
•  Highlight symptoms - physical results of having a disease
  “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
•  Highlight all occurrences of disease terms
  “Women who carry a mutation in the BRCA1 gene have an 80 % risk of breast cancer by the age of 70. Individuals who have rare alleles of the VNTR also have an increased risk of breast cancer ( 2-4 )”.
  	
  
	
  

[Figure: number of articles added to PubMed]

[Panel labels: Worker instructions; Examples]
  
Approach: Citizen Science

Idea: People are very effective processors of text, even in areas where they aren’t experts [1]. Numerous experiments have shown the public’s desire to contribute to science. Let’s give them an opportunity to help annotate the biomedical literature.

[Figure: precision, recall, and F as a function of the number of votes per annotation (x-axis: 0 to 5)]
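The "votes per annotation" idea can be sketched as follows: each worker's highlighted spans are collected, and a span is accepted only when at least a minimum number of workers marked it. This is a minimal illustration, not the Mark2Cure implementation. The function name `aggregate_by_votes`, the (start, end) character-offset representation, and exact-match voting are all assumptions, and the worker data is invented for the example.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hypothetical representation: a span is (start, end) character offsets.
Span = Tuple[int, int]


def aggregate_by_votes(worker_spans: Iterable[Set[Span]], min_votes: int) -> List[Span]:
    """Keep spans highlighted by at least `min_votes` workers (exact match)."""
    votes = Counter()
    for spans in worker_spans:
        # A set ensures each worker contributes at most one vote per span.
        votes.update(set(spans))
    return sorted(span for span, n in votes.items() if n >= min_votes)


# Five hypothetical workers annotating the same abstract.
workers = [
    {(10, 28), (40, 42)},
    {(10, 28)},
    {(10, 28), (40, 42), (60, 75)},
    {(40, 42)},
    {(10, 28)},
]
accepted = aggregate_by_votes(workers, min_votes=3)
print(accepted)  # [(10, 28), (40, 42)]
```

Raising `min_votes` would be expected to trade recall for precision, which is the relationship the figure plots.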
  	
  

Costs
•  one week each ($30)
•  one month turk-specific developer time...
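The $30 figure appears consistent with the experiment parameters reported on the poster (100 abstracts, 5 workers per abstract, $.06 per completed abstract). The sketch below checks that arithmetic in whole cents. It assumes every assignment is completed and paid, and it ignores any platform fees, which the poster does not mention. The function name is hypothetical.

```python
def experiment_cost_cents(n_abstracts: int, workers_per_abstract: int, cents_per_abstract: int) -> int:
    """Total worker payments in cents, assuming every assignment is completed and paid."""
    return n_abstracts * workers_per_abstract * cents_per_abstract


# Parameters as stated on the poster for the development set.
total = experiment_cost_cents(n_abstracts=100, workers_per_abstract=5, cents_per_abstract=6)
print(f"${total / 100:.2f}")  # $30.00
```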
  

[Figure: Consistency with NCBI standard, Development Corpus. Series: mturk experiment 1, minimum 3 votes per annotation; mturk experiment 2, minimum 3 votes per annotation; NCBO annotator (Human Disease Ontology); NCBI conditional random field trained on the AZ corpus (only "all" reported)]
  

Next Steps
  

To what degree can we reproduce the NCBI disease corpus [2]?

RESULTS, 2 experiments

Exp. 1 results

Testing on the 100-abstract “development set”, 5 workers per abstract, $.06 per completed abstract

Consistency(A,B) = 2*100*(N shared annotations) / (N(A) + N(B))

[Figure axis label: consistency with NCBI gold standard]
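The consistency score measures the overlap between two annotation sets on a 0 to 100 scale. The sketch below computes it and, as an assumption beyond what the poster states, also computes precision, recall, and F against a reference set under exact span matching; in that setting the consistency score equals 100 times F. The function names, the span representation, and the sample data are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, int]  # hypothetical: (document id, start offset, end offset)


def consistency(a: Set[Span], b: Set[Span]) -> float:
    """Consistency(A,B) = 2*100*(N shared annotations) / (N(A) + N(B))."""
    if not a and not b:
        return 100.0  # assumption: two empty sets agree perfectly
    return 2 * 100 * len(a & b) / (len(a) + len(b))


def precision_recall_f(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F of `predicted` against `gold`."""
    shared = len(predicted & gold)
    precision = shared / len(predicted) if predicted else 0.0
    recall = shared / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f


gold = {(1, 0, 18), (1, 30, 32), (1, 50, 63), (2, 5, 20)}
crowd = {(1, 0, 18), (1, 30, 32), (2, 5, 20), (2, 40, 47), (2, 60, 70)}

print(consistency(crowd, gold))         # 2*100*3 / (5 + 4)
print(precision_recall_f(crowd, gold))  # precision 3/5, recall 3/4
```

Exact matching is the strictest choice; a partial-overlap criterion would give higher scores and would need a different definition of "shared".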
  

Identifying concepts and relationships in biomedical text enables
knowledge to be applied in computational analyses, such as gene set
enrichment evaluations, that would otherwise be impossible. As such,
there is a long and fruitful history of BioNLP projects that apply natural
language processing to address this challenge. However, the state of the
art in BioNLP still leaves much room for improvement in terms of
precision, recall and the complexity of knowledge structures that can be
extracted automatically. Expert curators are still vital to the process of
knowledge extraction but are in short supply.

Goal: structure all knowledge published as text on the same day it appears in PubMed with expert-human level precision and recall
  

RESULTS, Comparison to concept recognition tools

Proof of Concept Experiment with AMT (work in progress)
  

Exp. 2 changes
•  Expanded instructions with more examples
•  Minor interface changes (selecting one term automatically selects all other occurrences)
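The interface change in Exp. 2 could be approximated as follows: once a worker selects a term, every other occurrence of that term in the abstract is selected too. This is a sketch only. The poster does not describe the matching rule, so case-insensitive whole-word matching and the function name `select_all_occurrences` are assumptions.

```python
import re
from typing import List, Tuple


def select_all_occurrences(text: str, term: str) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of every whole-word, case-insensitive match of `term`."""
    if not term.strip():
        return []
    # re.escape keeps punctuation in the term from being read as regex syntax;
    # the lookarounds stop matches inside longer words.
    pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
    return [m.span() for m in pattern.finditer(text)]


abstract = (
    "Women who carry a mutation in the BRCA1 gene have an 80 % risk of breast cancer "
    "by the age of 70. Individuals who have rare alleles of the VNTR also have an "
    "increased risk of breast cancer ( 2-4 )."
)
spans = select_all_occurrences(abstract, "breast cancer")
print(len(spans))  # 2
print([abstract[s:e] for s, e in spans])  # ['breast cancer', 'breast cancer']
```

Whole-word matching keeps a short abbreviation such as "HD" from being selected inside an unrelated longer token.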
  

Nearly identical results
  

•  Continued refinement of the annotation interface with AMT
•  Experiment to compare AMT results versus volunteers
•  Collaborations with disease groups such as the Chordoma Foundation to prime the flow of citizen scientist annotators
  

AMT workers performed better than the conditional random field trained on the AZ corpus.
  
We are hiring! Looking for postdocs and programmers interested in crowdsourcing and bioinformatics. Contact asu@scripps.edu
  

REFERENCES	
  
1.  Zhai, Haijun, et al. "Web 2.0-based crowdsourcing for high-quality gold standard development in clinical natural language processing." Journal of Medical Internet Research 15.4 (2013).
2.  Doğan, Rezarta Islamaj, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics, 2012.

CONTACT	
  
Benjamin Good: bgood@scripps.edu Andrew Su: asu@scripps.edu

FUNDING	
  
We acknowledge support from the National Institute of General Medical
Sciences (GM089820 and GM083924).

