Crowdsourcing for
information extraction:
(dynamic assembly of expert “humans”)
Benjamin Good
The Scripps Research Institute
bgood@scripps.edu
@bgood
High level goal: improve access
to published knowledge
2
About 1 million articles added to PubMed per year (>100/hour)
Thanks to Suzi Lewis from GO for smoothie
Example use
What diseases are treated with curcumin (turmeric)?
3
Data is in there, just hard to get
4
Kilicoglu H, Shin D, Fiszman M, Rosemblat G, Rindflesch TC. SemMedDB: a PubMed-scale
repository of biomedical semantic predications. Bioinformatics. 2012 Dec 1;28(23):
3158-60. doi: 10.1093/bioinformatics/bts591. Epub 2012 Oct 8.
70,364,020 subject-predicate-object relations
NLM tool
24 million abstracts
Example
What diseases are treated with curcumin (turmeric)?
5
478 results
select * from PREDICATION_AGGREGATE where s_name =
'Curcumin' and predicate = 'TREATS'
Turmeric, the miracle spice!
6
Example
What diseases are treated with curcumin (turmeric)?
7
478 results
select * from PREDICATION_AGGREGATE where s_name =
'Curcumin' and predicate = 'TREATS'
Data is easy to access, but is it all in there?
Is it correct?
More about Curcumin…
8
9
?!?!
Effect of curcumin on cholesterol gall-stone induction.
Influence of dietary capsaicin and curcumin during
experimental induction of cholesterol gallstone in mice.
Spice bioactive compounds, capsaicin and curcumin, were
both individually and in combination examined for antilithogenic
potential during experimental induction of cholesterol
gallstones in mice.
10
The diet that contained capsaicin, curcumin, or their
combination reduced the incidence of cholesterol
gallstones by 50%, 66%, and 56%, respectively.
Facts of life in NLP
• False positives and false negatives are always present
• Human annotators remain the gold standard
• There are not nearly enough professional human annotators to process every document published
11
Observations
• There are about 2.92 billion Internet users
• Lots of them can read English
• Most of these would not have gotten that causal
relation wrong for curcumin…
12
http://www.statista.com/statistics/273018/number-of-internet-users-worldwide/
Hypothesis
• We can generate the equivalent of professional annotators by incentivizing, guiding, and aggregating the labor of large numbers of non-professionals
13
Zhai 2013, Aroyo 2013, Burger 2014, Mortensen 2014, Good 2015
Information Extraction
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
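The three steps can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the tiny lexicon, the placeholder identifiers (EX:...), and the function names are hypothetical, and real systems use far richer methods. Note that step 3 here only proposes co-occurring chemical/disease pairs as candidates; deciding which relation actually holds between them is the hard part.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon: surface form -> (placeholder term id, concept type).
# The ids are made up for this sketch and are not real ontology identifiers.
LEXICON = {
    "curcumin": ("EX:0001", "chemical"),
    "capsaicin": ("EX:0002", "chemical"),
    "cholesterol gallstones": ("EX:0003", "disease"),
}


@dataclass(frozen=True)
class Mention:
    text: str
    start: int
    end: int
    term_id: str
    concept_type: str


def find_and_map_mentions(sentence: str) -> list[Mention]:
    """Steps 1 and 2: find lexicon matches and attach their term ids."""
    mentions = []
    for surface, (term_id, concept_type) in LEXICON.items():
        for m in re.finditer(re.escape(surface), sentence, flags=re.IGNORECASE):
            mentions.append(
                Mention(m.group(0), m.start(), m.end(), term_id, concept_type)
            )
    return sorted(mentions, key=lambda x: x.start)


def candidate_relations(mentions: list[Mention]) -> list[tuple[str, str]]:
    """Step 3 (candidates only): pair every chemical with every disease."""
    chemicals = [m for m in mentions if m.concept_type == "chemical"]
    diseases = [m for m in mentions if m.concept_type == "disease"]
    return [(c.term_id, d.term_id) for c in chemicals for d in diseases]


sentence = (
    "The diet that contained capsaicin, curcumin, or their combination "
    "reduced the incidence of cholesterol gallstones by 50%, 66%, and 56%, "
    "respectively."
)
mentions = find_and_map_mentions(sentence)
pairs = candidate_relations(mentions)
print([m.text for m in mentions])
print(pairs)
```

A dictionary lookup like this finds the concepts, but it cannot tell "reduces the incidence of" from "induces", which is exactly where human readers help.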
14
Microtask Crowdsourcing
• Distribute discrete units of work
(aka “human intelligence tasks” or
HITs) to many workers in parallel
who are paid to solve them.
15
Reported 500,000
registered workers in
2011 [1]
[1] Paritosh P, Ipeirotis P, Cooper M, Suri S: The computer is the new sewing
machine: benefits and perils of crowdsourcing. WWW '11 2011:325–326.
AMT, how it works
16
[Diagram: Requester, Tasks, Amazon, Workers]
For each task, specify:
• a qualification test
• how many workers per task
• how much we will pay per task
• a Web form for completing the task
Interact directly with Amazon system
Amazon manages:
• parallel execution of jobs
• worker access to tasks via qualification tests
• payments
• task advertising
How well can AMT workers, in aggregate,
reproduce a gold standard disease mention
corpus within the text of PubMed abstracts?
17
Corpus used for comparison
NCBI Disease corpus
• 793 PubMed abstracts (100 development, 593 training, 100 test)
• 12 expert annotators (2 annotate each abstract)
• 6,900 “disease” mentions
18
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012
Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
“Disease”
A phrase is a disease if:
• it can be mapped to a unique UMLS Metathesaurus concept in one of these semantic types
• and it contains information helpful to physicians
19
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012
Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
Disease mentions
20
• Specific Disease: “Diastrophic dysplasia”
• Disease Class: “Cancers”
• Composite Mention: “prostatic , skin , and lung cancer”
• Modifier: ..the “familial breast cancer” gene , BRCA2..
Experiment
21
Identify the disease mentions in 593
abstracts from the NCBI disease corpus
• 6 cents per HIT
• HIT = annotate one abstract from PubMed
• First HIT = survey, next 4 = training, then real tasks
• 10% of the remaining HITs are gold-standard tests
• 15 workers annotate each abstract
Instructions
• Task: You will be presented with text from the biomedical literature which we believe may help
resolve some important medical questions. The task is to highlight words and phrases in that
text which are diseases, disease groups, or symptoms of diseases. This work will help
advance research in cancer and many other diseases!
• Highlight all diseases and disease abbreviations
• “...are associated with Huntington disease ( HD )... HD patients
received...”
• “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked
immunodeficiency…”
• Highlight the longest span of text specific to a disease
• “... contains the insulin-dependent diabetes mellitus locus …”
• and not just ‘diabetes’.
• Highlight disease conjunctions as single, long spans.
• “... a significant fraction of familial breast and ovarian cancer patients…”
• Highlight symptoms - physical results of having a disease
• “XFE progeroid syndrome can cause dwarfism, cachexia, and
microcephaly. Patients often display learning disabilities, hearing loss, and
visual impairment.”
22
Qualification task: Q1
Select all and only the terms that should be
highlighted for each text segment:
23
1. “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in
the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to
chromosome 19q13 . 3 . ”
• Myotonic
• dystrophy
• Myotonic dystrophy
• DM
• CTG
• trinucleotide repeat expansion
• kinase-encoding gene
• DMPK
Qualification task: Q2
24
2. “Germline mutations in BRCA1 are responsible for most cases of inherited breast
and ovarian cancer . However , the function of the BRCA1 protein has remained
elusive . As a regulated secretory protein , BRCA1 appears to function by a
mechanism not previously described for tumour suppressor gene products.”
• Germline mutations
• BRCA1
• breast
• ovarian cancer
• inherited breast and ovarian cancer
• cancer
• tumour
• tumour suppressor
Qualification task: Q3
25
3. “We report about Dr . Kniest , who first described the condition in 1952 , and his patient ,
who , at the age of 50 years is severely handicapped with short stature , restricted joint
mobility , and blindness but is mentally alert and leads an active life . This is in accordance
with molecular findings in other patients with Kniest dysplasia and…”
• age of 50 years
• severely handicapped
• short
• short stature
• restricted joint mobility
• blindness
• mentally alert
• molecular findings
• Kniest dysplasia
• dysplasia
Qualification task results
26
• Experiment ran for 9 days
• 346 workers attempted the qualification test
• 145 (42%) passed
Passing
threshold
Worker demographics: gender
27
[Bar chart: fraction of workers by gender (female, male)]
First HIT was a survey
Age
28
[Bar chart: fraction of workers by age group (18-21, 21-35, 36-45, 46 and greater)]
Occupation
29
[Bar chart: fraction of workers by self-reported occupation. Categories: Unemployed, Student, Technical, Science, Computer, Business, Education, Programmer, Art, Retired, Labor, Finance, Legal, Attorney, Team Leader, Human Resources, stay at home mom, Biological Sciences, Bussiness, Caretaker, Administrative Assistant, microbiology graduate student, Transportation Industry, sales, Hardware, Homemaker, manufacturing, Chemical Sciences, mom, Web Assessor, Licensed Practical Nurse, customer service rep]
Education
30
[Bar chart: fraction of workers by education level. Categories: Some high school, Finished high school, Some community college, Finished community college, Some 4-year college, Finished 4-year college, Some masters program, Finished masters program, Some PhD program, Finished PhD program]
Why?
31
Tagging interface
32
Click to see instructions
Highlight
mentions
Feedback interface:
• Game-like learning signal
• Either see gold standard data or data from other workers
33
Results: quantity, cost
• 9 days
• 589 abstracts annotated by 15 different workers (8,835 tasks completed)
• Overhead cost: 4 training HITs + survey
• total cost: $630.96
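The stated total can be reproduced under two assumptions that the slide does not spell out: that each HIT cost $0.066 (the per-task figure used in the scaling estimate later in the deck, which would be consistent with a 6-cent reward plus a 10% platform fee), and that all 145 workers who passed the qualification test completed the 5 overhead HITs (1 survey + 4 training). A quick check, using integer tenths of a cent to avoid floating-point rounding:

```python
# Assumed cost per HIT in tenths of a cent: $0.066 -> 66 (an assumption, see above).
COST_PER_HIT_MILLS = 66

abstracts = 589
workers_per_abstract = 15
annotation_hits = abstracts * workers_per_abstract  # tasks completed

qualified_workers = 145     # workers who passed the qualification test
overhead_hits_each = 1 + 4  # survey + training HITs (assumed completed by all)
overhead_hits = qualified_workers * overhead_hits_each

total_mills = (annotation_hits + overhead_hits) * COST_PER_HIT_MILLS
total_dollars = total_mills / 1000

print(annotation_hits)          # 8835
print(f"${total_dollars:.2f}")  # $630.96
```

That the numbers line up supports the reading above, but the breakdown remains an inference rather than something the slide states.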
34
Worker contributions
35
Worker quality
36
AMT, how it really works
37
Requester
Tasks
Amazon
Aggregation
function
Workers
http://www.thesheepmarket.com/
Increase precision with voting
38
[Figure: the passage below shown four times, with the spans that received at least K votes highlighted, for K = 1 (1 or more votes), K = 2, K = 3, and K = 4]
This molecule inhibits the growth of a broad
panel of cancer cell lines, and is particularly
efficacious in leukemia cells, including
orthotopic leukemia preclinical models as well
as in ex vivo acute myeloid leukemia (AML)
and chronic lymphocytic leukemia (CLL)
patient tumor samples. Thus, inhibition of
CDK9 may represent an interesting approach
as a cancer therapeutic target especially in
hematologic malignancies.
Aggregation
function
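A minimal sketch of one possible aggregation function: keep every span that at least K workers highlighted. The span representation (character offsets), the exact-match voting rule, and the function name are assumptions for illustration, not necessarily what was used in the experiment.

```python
from collections import Counter

Span = tuple[int, int]  # (start, end) character offsets; an assumed representation


def aggregate_by_votes(worker_spans: list[set[Span]], k: int) -> set[Span]:
    """Return the spans highlighted by at least k workers (exact-match voting)."""
    votes = Counter(span for spans in worker_spans for span in spans)
    return {span for span, n in votes.items() if n >= k}


# Toy data: three workers annotating the same text.
worker_spans = [
    {(10, 18), (40, 62)},
    {(10, 18), (45, 62)},
    {(10, 18), (40, 62), (70, 75)},
]

print(sorted(aggregate_by_votes(worker_spans, k=1)))
print(sorted(aggregate_by_votes(worker_spans, k=2)))
print(sorted(aggregate_by_votes(worker_spans, k=3)))
```

Raising K removes spans that only one or two workers marked, which tends to increase precision at the expense of recall.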
Results 589 abstracts
compared to gold standard
39
F = 0.87, k = 6
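For reference, the F measure is the harmonic mean of precision and recall. A minimal sketch of scoring a set of aggregated spans against a gold standard, assuming exact-match comparison of spans (the matching criterion used in the experiment is not stated on the slide) and using made-up toy spans:

```python
def precision_recall_f(predicted: set, gold: set) -> tuple[float, float, float]:
    """Compute precision, recall, and F (harmonic mean) for exact-match spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Toy spans as (start, end) character offsets; not data from the experiment.
predicted = {(10, 18), (40, 62), (70, 75)}
gold = {(10, 18), (40, 62), (90, 99), (100, 110)}

precision, recall, f = precision_recall_f(predicted, gold)
print(round(precision, 3), round(recall, 3), round(f, 3))
```

In this toy case two of three predicted spans are correct and two of four gold spans are found, so F falls between the two rates, closer to the lower one.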
Inter-Annotator agreement among
experts, NCBI Disease corpus
40
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of
the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics, 2012.
0.76
0.87
Average level
of agreement
between expert
annotators
(stage 1)
Professionals achieve equivalent
agreement only after reviewing each
other’s annotations.
41
0.76
0.87
In aggregate, our worker ensemble is faster,
cheaper and more accurate than a single
expert annotator for this task
• Experts had consistency (F) with other experts of 0.76.
• Only after viewing each other’s annotations did experts
reach 0.87 consistency
• The turker ensemble had consistency with the finalized
standard of 0.87 (with access to much less information)
42
We are not alone
• Mortensen et al. (2014): 25 workers, 2¢/task = 1 biomedical ontology expert. “Using the wisdom of the crowds to find critical errors in biomedical ontologies: a study of SNOMED CT.” JAMIA.
• Burger et al. (2014): 5 workers, 7¢/task = 1 expert curator. “Hybrid curation of gene–mutation relations combining automated extraction and crowdsourcing.” Database.
• Zhai et al. (2013): 5 workers, 3¢/task = 1 expert curator. “Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing.” J Med Internet Res.
• ... more (e.g., the IBM Research “Crowd Watson” project by Aroyo and Welty)
To do list
• Machine learning experiment on TopCoder
• Citizen Science (volunteer) implementation of
this
• New tasks
44
mturk -> machine learning
• The main purpose of building this
particular corpus was to train a
disease tagging algorithm.
45
Next Steps with Disease
Corpus
46
• We have assembled a new 1,000-document corpus (took 6 days)
• Simply adding it to the training data didn’t help
• Execute TopCoder contest to produce a better algorithm
Could we just do them all?
• We peaked at a rate of 500 abstracts processed per day (assuming 5 workers/doc)
• 284 workers contributed over a span of 6 days
• At 1 million abstracts/year, we would need to get to 2,700/day to do them all
• $0.066 × 5 × 1,000,000 = $330,000
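The arithmetic behind these estimates, as a quick check. The 365-day year is an assumption; the other figures come from the slide.

```python
abstracts_per_year = 1_000_000
workers_per_abstract = 5
cost_per_task_mills = 66  # $0.066 expressed in tenths of a cent
peak_rate_per_day = 500   # abstracts/day at 5 workers per abstract

needed_per_day = abstracts_per_year / 365            # required throughput
speedup_needed = needed_per_day / peak_rate_per_day  # relative to the observed peak
annual_cost_dollars = (
    abstracts_per_year * workers_per_abstract * cost_per_task_mills // 1000
)

print(round(needed_per_day))        # 2740, which the slide rounds to 2,700
print(round(speedup_needed, 1))     # 5.5
print(f"${annual_cost_dollars:,}")  # $330,000
```

So keeping up with the literature would take roughly five to six times the peak throughput observed, sustained every day of the year.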
47
Moving towards $0/task and
many more workers
• mark2cure.org
• A citizen science portal for volunteers to do the same stuff
• First experiment will recapitulate results from AMT
48
Information Extraction
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
49
50
?!?!
Effect of curcumin on cholesterol gall-stone induction.
Influence of dietary capsaicin and curcumin during
experimental induction of cholesterol gallstone in mice.
Spice bioactive compounds, capsaicin and curcumin, were
both individually and in combination examined for antilithogenic
potential during experimental induction of cholesterol
gallstones in mice.
70,364,020 subject-predicate-object relations
Thanks
51
Max Nanis, Andrew Su, Ginger Tsueng, Chunlei Wu
Mechanical Turk Workers!
@bgood
bgood@scripps.edu
52
Could do well with far fewer workers..

Pests of mustard_Identification_Management_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 

2014 stsi research_meeting_mturk_pdf

  • 1. Crowdsourcing for information extraction: (dynamic assembly of expert “humans”) Benjamin Good The Scripps Research Institute bgood@scripps.edu @bgood
  • 2. High level goal: improve access to published knowledge. Articles added to PubMed per year (>100/hour). Thanks to Suzi Lewis from GO for smoothie.
  • 3. Example use: What diseases are treated with curcumin (turmeric)? Data is in there, just hard to get.
  • 4. Kilicoglu H, Shin D, Fiszman M, Rosemblat G, Rindflesch TC. SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinformatics. 2012 Dec 1;28(23):3158-60. doi: 10.1093/bioinformatics/bts591. Epub 2012 Oct 8. 70,364,020 subject-predicate-object relations; NLM tool; 24 million abstracts.
  • 5. Example: What diseases are treated with curcumin (turmeric)? 478 results: select * from PREDICATION_AGGREGATE where s_name = 'Curcumin' and predicate = 'TREATS'
  • 7. Example: What diseases are treated with curcumin (turmeric)? 478 results: select * from PREDICATION_AGGREGATE where s_name = 'Curcumin' and predicate = 'TREATS' Data is easy to access, but is it all in there? Is it correct?
  • 9. ?!?! Effect on curcumin on cholesterol gall-stone induction. Influence of dietary capsaicin and curcumin during experimental induction of cholesterol gallstone in mice. Spice bioactive compounds, capsaicin and curcumin, were both individually and in combination examined for antilithogenic potential during experimental induction of cholesterol gallstones in mice.
  • 10. The diet that contained capsaicin, curcumin, or their combination reduced the incidence of cholesterol gallstones by 50%, 66%, and 56%, respectively.
  • 11. Facts of life in NLP • False positives and false negatives are always present • Human annotators remain the gold standard • There are not nearly enough professional human annotators to process every document published
  • 12. Observations • There are about 2.92 billion Internet users • Lots of them can read English • Most of these would not have gotten that causal relation wrong for curcumin… http://www.statista.com/statistics/273018/number-of-internet-users-worldwide/
  • 13. Hypothesis • We can generate the equivalent of professional annotators by incentivizing, guiding, and aggregating the labor of large numbers of non-professionals. Zhai 2013, Aroyo 2013, Burger 2014, Mortensen 2014, Good 2015
  • 14. Information Extraction 1. Find mentions of high level concepts in text 2. Map mentions to specific terms in ontologies 3. Identify relationships between concepts
  • 15. Microtask Crowdsourcing • Distribute discrete units of work (aka “human intelligence tasks” or HITs) to many workers in parallel who are paid to solve them. Reported 500,000 registered workers in 2011 [1]. [1] Paritosh P, Ipeirotis P, Cooper M, Suri S: The computer is the new sewing machine: benefits and perils of crowdsourcing. WWW '11 2011:325–326.
  • 16. AMT, how it works. Diagram: Requester, Tasks, Amazon, Workers. For each task, the requester specifies: • a qualification test • how many workers per task • how much we will pay per task • a Web form for completing the task. Workers interact directly with the Amazon system, which manages: • parallel execution of jobs • worker access to tasks via qualification tests • payments • task advertising
  • 17. How well can AMT workers, in aggregate, reproduce a gold standard disease mention corpus within the text of PubMed abstracts?
  • 18. Corpus used for comparison: NCBI Disease corpus • 793 PubMed abstracts (100 development, 593 training, 100 test) • 12 expert annotators (2 annotate each abstract) • 6,900 “disease” mentions. Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics, 2012.
  • 19. “Disease”: a phrase is a disease IF: • it can be mapped to a unique UMLS Metathesaurus concept in one of these semantic types • and it contains information helpful to physicians. Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics, 2012.
  • 20. Disease mentions • Specific Disease: “Diastrophic dysplasia” • Disease Class: “Cancers” • Composite Mention: “prostatic , skin , and lung cancer” • Modifier: ..the “familial breast cancer” gene , BRCA2..
  • 21. Experiment: Identify the disease mentions in 593 abstracts from the NCBI disease corpus • 6 cents per HIT • HIT = annotate one abstract from PubMed • First HIT = survey, next 4 = training, then real • 10% of the rest of the HITs are gold standard tests • 15 workers annotate each abstract
  • 22. Instructions • Task: You will be presented with text from the biomedical literature which we believe may help resolve some important medical questions. The task is to highlight words and phrases in that text which are diseases, disease groups, or symptoms of diseases. This work will help advance research in cancer and many other diseases! • Highlight all diseases and disease abbreviations • “...are associated with Huntington disease ( HD )... HD patients received...” • “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…” • Highlight the longest span of text specific to a disease • “... contains the insulin-dependent diabetes mellitus locus …” • and not just ‘diabetes’. • Highlight disease conjunctions as single, long spans. • “... a significant fraction of familial breast and ovarian cancer patients…” • Highlight symptoms: physical results of having a disease • “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
  • 23. Qualification task: Q1. Select all and only the terms that should be highlighted for each text segment: 1. “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ” • Myotonic • dystrophy • Myotonic dystrophy • DM • CTG • trinucleotide repeat expansion • kinase-encoding gene • DMPK
  • 24. Qualification task: Q2. 2. “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.” • Germline mutations • BRCA1 • breast • ovarian cancer • inherited breast and ovarian cancer • cancer • tumour • tumour suppressor
  • 25. Qualification task: Q3. 3. “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…” • age of 50 years • severely handicapped • short • short stature • restricted joint mobility • blindness • mentally alert • molecular findings • Kniest dysplasia • dysplasia
  • 26. Qualification task results • Experiment ran for 9 days • 346 workers attempted the qualification test • 145 (42%) passed (chart label: Passing threshold)
  • 29. Occupation [bar chart: fraction of workers, axis 0 to 0.25, by self-reported occupation: Unemployed, Student, Technical, Science, Computer, Business, Education, Programmer, Art, Retired, Labor, Finance, Legal, Attorney, Team Leader, Human Resources, stay at home mom, Biological Sciences, Bussiness (sic), Caretaker, Administrative Assistant, microbiology graduate student, Transportation Industry, sales, Hardware, Homemaker, manufacturing, Chemical Sciences, mom, Web Assessor, Licensed Practical Nurse, customer service rep]
  • 30. Education [bar chart: fraction of workers, axis 0 to 0.3, by education level: Some high school, Finished high school, Some community college, Finished community college, Some 4-year college, Finished 4-year college, Some masters program, Finished masters program, Some PhD program, Finished PhD program]
  • 32. Tagging interface: Click to see instructions; Highlight mentions
  • 33. Feedback interface: • Game-like learning signal • Either see gold standard data or data from other workers
  • 34. Results: quantity, cost • 9 days • 589 abstracts annotated by 15 different workers (8,835 tasks completed) • 4 HITs for training + survey overhead cost • total cost: $630.96
  • 37. AMT, how it really works. Diagram: Requester, Tasks, Amazon, Aggregation function, Workers. http://www.thesheepmarket.com/
  • 38. Increase precision with voting. Aggregation function: the same passage is shown four times, for 1 or more votes (K=1), K=2, K=3, and K=4: “This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.”
  • 39. Results: 589 abstracts compared to gold standard. F = 0.87, k = 6
  • 40. Inter-annotator agreement among experts, NCBI Disease corpus. Average level of agreement between expert annotators (stage 1). Values shown: 0.76, 0.87. Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics, 2012.
  • 41. Professionals achieve equivalent agreement only after reviewing each other’s annotations. Values shown: 0.76, 0.87
  • 42. In aggregate, our worker ensemble is faster, cheaper and more accurate than a single expert annotator for this task • Experts had consistency (F) with other experts = 0.76 • Only after viewing each other’s annotations did experts reach 0.87 consistency • The turker ensemble had consistency with the finalized standard = 0.87 (with access to much less information)
  • 43. We are not alone • Mortensen et al. (2014), 25 workers, 2¢/task = 1 biomedical ontology expert. “Using the wisdom of the crowds to find critical errors in biomedical ontologies: a study of SNOMED CT”. JAMIA. • Burger et al. (2014), 5 workers, 7¢/task = 1 expert curator. “Hybrid curation of gene–mutation relations combining automated extraction and crowdsourcing”. Database. • Zhai et al. (2013), 5 workers, 3¢/task = 1 expert curator. “Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing”. J Med Internet Res. • ...more (e.g. IBM Research “Crowd Watson” project by Aroyo and Welty)
  • 44. To do list • Machine learning experiment on TopCoder • Citizen Science (volunteer) implementation of this • New tasks
  • 45. mturk -> machine learning • The main purpose of building this particular corpus was to train a disease tagging algorithm.
  • 46. Next Steps with Disease Corpus • We have assembled a new 1,000 document corpus (took 6 days) • Simply adding it to the training data didn’t help • Execute TopCoder contest to produce a better algorithm.
  • 47. Could we just do them all? • We peaked at a rate of 500 abstracts processed per day (assuming 5 workers/doc) • 284 workers contributing in a span of 6 days • At 1 million/year we would need to get to 2,700/day to do them all • $0.066*5*1000000 = $330,000
  • 48. Moving towards $0/task and many more workers • mark2cure.org • A citizen science portal for volunteers to do the same stuff • First experiment will recapitulate results from AMT
  • 49. Information Extraction 1. Find mentions of high level concepts in text 2. Map mentions to specific terms in ontologies 3. Identify relationships between concepts
  • 50. ?!?! Effect on curcumin on cholesterol gall-stone induction. Influence of dietary capsaicin and curcumin during experimental induction of cholesterol gallstone in mice. Spice bioactive compounds, capsaicin and curcumin, were both individually and in combination examined for antilithogenic potential during experimental induction of cholesterol gallstones in mice. 70,364,020 subject-predicate-object relations
  • 51. Thanks: Max Nanis, Andrew Su, Ginger Tsueng, Chunlei Wu, Mechanical Turk Workers! @bgood bgood@scripps.edu
  • 52. Could do well with far fewer workers...
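
Slides 38 and 39 describe keeping only the mentions that received at least K worker votes and then comparing the result to the gold standard with an F measure. The following is a minimal Python sketch of that idea, not the authors' code. It assumes that mentions are compared as exact (start, end) character spans, which the slides do not state. The function names and the toy annotations are invented for illustration.

```python
from collections import Counter


def aggregate_by_votes(worker_annotations, k):
    """Return the spans highlighted by at least k workers.

    worker_annotations: one set of (start, end) spans per worker
    (hypothetical representation; exact-span matching is an assumption).
    """
    votes = Counter()
    for spans in worker_annotations:
        votes.update(set(spans))  # each worker votes at most once per span
    return {span for span, count in votes.items() if count >= k}


def precision_recall_f(predicted, gold):
    """Exact-match precision, recall and balanced F score for two span sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data, invented for illustration only: four workers, one abstract.
workers = [
    {(10, 18), (40, 62)},
    {(10, 18), (40, 62), (70, 75)},
    {(10, 18), (90, 99)},
    {(40, 62)},
]
gold = {(10, 18), (40, 62), (100, 110)}

if __name__ == "__main__":
    # Raising k trades recall for precision, as on slide 38.
    for k in range(1, 5):
        p, r, f = precision_recall_f(aggregate_by_votes(workers, k), gold)
        print(f"K={k}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In this sketch a low K keeps spans that only one worker marked, while a high K can discard everything. Sweeping K and picking the best F is one way to arrive at a setting like the k = 6 reported on slide 39, though the slides do not say how that value was chosen.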
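
The back-of-the-envelope numbers on slide 47 can be replayed in a few lines of Python. The per-task cost, workers per document, yearly volume and peak rate are taken from the slide. The 365-day year and the variable names are assumptions made for this sketch.

```python
# Figures quoted on slide 47.
COST_PER_TASK_USD = 0.066      # cost of one worker annotating one abstract
WORKERS_PER_DOC = 5            # redundancy assumed on the slide
ABSTRACTS_PER_YEAR = 1_000_000
PEAK_ABSTRACTS_PER_DAY = 500   # best rate observed in the experiment

# Assumption for this sketch: work is spread evenly over a 365-day year.
DAYS_PER_YEAR = 365

annual_cost_usd = COST_PER_TASK_USD * WORKERS_PER_DOC * ABSTRACTS_PER_YEAR
required_per_day = ABSTRACTS_PER_YEAR / DAYS_PER_YEAR
scale_up_factor = required_per_day / PEAK_ABSTRACTS_PER_DAY

if __name__ == "__main__":
    print(f"Annual cost: ${annual_cost_usd:,.0f}")
    print(f"Required throughput: {required_per_day:,.0f} abstracts/day")
    print(f"Scale-up over observed peak: {scale_up_factor:.1f}x")
```

Under these assumptions the required throughput is slightly above the rounded 2,700/day quoted on the slide, which is roughly five to six times the observed peak.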