Crowdsourcing and
Citizen Science for
Biology
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org
February 6, 2015
UCSD
Slides: slideshare.net/andrewsu
Few genes are well annotated…
2
Data: NCBI, February 2013
41%
65%
CTNNB1
VEGFA
SIRT1
FGFR2
TGFB1
TP53
MEF2C
BMP4
LEF1
WNT5A
TNF
20,473 protein-coding genes
[Figure axes: genes, sorted by decreasing counts (x) vs. GO annotation counts (y)]
… because the literature is sparsely curated?
3
[Figure: number of new PubMed-indexed articles per year, 1983 to 2013 (y-axis 0 to 1,200,000)]
… because the literature is sparsely curated?
4
[Figure: average capacity of human scientist, 1983 to 2013 (y-axis 0 to 40)]
5
311,696 articles (1.5% of PubMed)
have been cited by GO annotations
6
Sooner or later, the
research community will
need to be involved in the
annotation effort to scale
up to the rate of data
generation.
The Long Tail is a prolific source of content
7
Short
Head
Long Tail
Content
produced
Contributors (sorted)
News :
Video:
Product reviews:
Food reviews:
Talent judging:
Newspapers
TV/Hollywood
Consumer reports
Food critics
Olympics
Blogs
YouTube
Amazon reviews
Yelp
American Idol
Wikipedia is reasonably accurate
8
Wikipedia has breadth and depth
9
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
Articles
Words
(millions)
Wikipedia Britannica
Online
10
We can harness the
Long Tail of scientists
to directly participate in
the gene annotation
process.
From crowdsourcing to structured data
11
The Gene Wiki
Mark2Cure
Filtering, extracting, and summarizing PubMed
Documents
Concepts Review article
Wiki success depends on a positive feedback loop
14
[Figure: cycle linking gene wiki page utility, number of users, and number of contributors]
1001
2002
10,000 gene “stubs” within Wikipedia
15
Protein structure
Symbols and
identifiers
Tissue expression
pattern
Gene Ontology
annotations
Links to structured
databases
Gene
summary
Protein
interactions
Linked
references
Huss, PLoS Biol, 2008
Utility
Users
Contributors
Gene Wiki has a critical mass of readers
16
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
Utility
Users
Contributors
Gene Wiki has a critical mass of editors
17
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
Utility
Users
Contributors
[Figure labels: editor count, editors, edits, edit count]
A review article for every gene is powerful
18
References to the literature
Hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
Making the Gene Wiki more computable
19-24
Structured annotations
Free text
Text-mining
Analyses
Databases
http://fiehnlab.ucdavis.edu/projects/rice_metabolome/
Wikidata
25
Provide a database of the
world’s knowledge that
anyone can edit
- Denny Vrandečić
Centralizing key data storage
26
Source: http://commons.wikimedia.org/wiki/File:Wikidata_slides_Magnus_Manske,_Cambridge,_2014-02-27.pdf
287 language editions of Wikipedia
Bioinformatics community
Loading biological data into Wikidata
30
Entrez
Gene
Ensembl
UniProt
UCSC
PDB
RefSeq
Wikidata for biology
31
is a
regulates
Interacts
with
Protein
Glycoprotein
Neural
development
VLDL receptor
Amyloid
precursor
protein
Property:P31
Property:P128
Property:P129
Q8054
Q187126
Q1345738
Q1979313
Q423510
Q414043
Reelin
http://www.wikidata.org/wiki/Q414043
Wikidata for biology
32
Property:P31
Property:P128
Property:P129
Q8054
Q187126
Q1345738
Q1979313
Q423510
Q414043
http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
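A minimal Python sketch of how the `wbgetentities` call above could be built and its result read. The helper names `build_wbgetentities_url` and `item_values` are hypothetical, and `SAMPLE_RESPONSE` is a hand-written fragment that mirrors the identifiers on the slide (the pairing of properties to target items is an assumption); it is not live Wikidata output.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def build_wbgetentities_url(ids, languages=("en",)):
    """Build a wbgetentities URL like the one on the slide, asking for JSON."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(ids),
        "languages": "|".join(languages),
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def item_values(response, entity_id, property_id):
    """Return the Q-ids that `entity_id` points to through `property_id`."""
    claims = response["entities"][entity_id].get("claims", {})
    targets = []
    for statement in claims.get(property_id, []):
        snak = statement["mainsnak"]
        if snak.get("snaktype") != "value":
            continue  # skip "novalue" / "somevalue" statements
        value = snak["datavalue"]["value"]
        if not isinstance(value, dict):
            continue  # not an item-valued statement
        # Older responses carry only "numeric-id"; newer ones also carry "id".
        targets.append(value.get("id") or "Q{}".format(value["numeric-id"]))
    return targets


def _statement(numeric_id, with_id=True):
    """Hand-built statement in the shape of the Wikibase JSON format."""
    value = {"entity-type": "item", "numeric-id": numeric_id}
    if with_id:
        value["id"] = "Q{}".format(numeric_id)
    return {"mainsnak": {"snaktype": "value",
                         "datavalue": {"value": value,
                                       "type": "wikibase-entityid"}}}


# Illustrative fragment only (assumed pairing of slide identifiers).
SAMPLE_RESPONSE = {
    "entities": {
        "Q414043": {  # Reelin
            "claims": {
                "P31": [_statement(8054), _statement(187126)],
                "P128": [_statement(1345738)],
                "P129": [_statement(1979313),
                         _statement(423510, with_id=False)],
            }
        }
    }
}

if __name__ == "__main__":
    print(build_wbgetentities_url(["Q414043"]))
    print(item_values(SAMPLE_RESPONSE, "Q414043", "P129"))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the decoded JSON to `item_values` would follow the same pattern; the network call is left out so the sketch runs offline.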
Current progress
• All human and mouse genes and
proteins loaded
• All diseases (Human Disease Ontology)
loaded
• Dataset of all drugs in preparation
• Datasets for gene-disease, drug-
disease, and drug-protein relationships
in preparation
33
The
Long Tail of scientists
is a valuable source of
information on gene
function
34
From crowdsourcing to structured data
35
The Gene Wiki
Mark2Cure
The biomedical literature is growing fast…
36
[Figure: number of new PubMed-indexed articles per year, 1983 to 2013 (y-axis 0 to 1,200,000)]
… but it is very hard to query and compute
37
… but it is very hard to query and compute
38
Imatinib
Crizotinib
Erlotinib
Gefitinib
Sorafenib
Lapatinib
Dasatinib
…
Acute myeloid leukemia
Acute lymphoblastic leukemia
Chronic myelogenous leukemia
Chronic lymphocytic leukemia
Hodgkin lymphoma
Non-Hodgkin lymphoma
Myeloma
…
AND
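The drug-list AND disease-list query on this slide can be written out as a plain boolean search string. The sketch below is only a naive keyword query (the function names are hypothetical); it matches literal names and does not handle synonyms or say how a drug and a disease are related, which is the limitation the slide points at.

```python
def or_group(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    quoted = ['"{}"'.format(term) for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(drugs, diseases):
    """Require at least one drug name AND at least one disease name."""
    return or_group(drugs) + " AND " + or_group(diseases)


if __name__ == "__main__":
    drugs = ["Imatinib", "Crizotinib", "Erlotinib"]
    diseases = ["Acute myeloid leukemia", "Hodgkin lymphoma", "Myeloma"]
    print(build_query(drugs, diseases))
```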
Information Extraction
39
1. Find mentions of high level concepts in
text
2. Map mentions to specific terms in
ontologies
3. Identify relationships between concepts
Disease mentions in PubMed abstracts
40
NCBI Disease corpus
• 793 PubMed abstracts
• (100 development, 593 training, 100 test)
• 12 expert annotators (2 annotate each abstract)
6,900 “disease” mentions
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in
PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural
Language Processing. Association for Computational Linguistics.
Question: Can a group of non-scientists
collectively perform concept recognition in
biomedical texts?
41
The Mechanical Turk
42
http://en.wikipedia.org/wiki/The_Turk
Amazon Mechanical Turk (AMT)
44
Requester
Amazon
For each task, specify:
• a qualification test
• how many workers per task
• how much we will pay per task
Manages:
• parallel execution of jobs
• worker access to tasks
via qualification tests
• payments
• task advertising
Workers
1. Create tasks
2. Execute
3. Aggregate
Instructions to workers
45
• Highlight all diseases and disease abbreviations
• “...are associated with Huntington disease ( HD )... HD patients
received...”
• “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked
immunodeficiency…”
• Highlight the longest span of text specific to a disease
• “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans.
• “... a significant fraction of familial breast and ovarian cancer , but
undergoes…”
• Highlight symptoms - physical results of having a
disease
– “XFE progeroid syndrome can cause dwarfism, cachexia, and
microcephaly. Patients often display learning disabilities, hearing loss,
and visual impairment.”
Qualification test
46
Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n
trinucleotide repeat expansion in the 3'-untranslated region of a protein
kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”
Test #2: “Germline mutations in BRCA1 are responsible for most cases of
inherited breast and ovarian cancer . However , the function of the BRCA1
protein has remained elusive . As a regulated secretory protein , BRCA1
appears to function by a mechanism not previously described for tumour
suppressor gene products.”
Test #3: “We report about Dr . Kniest , who first described the condition in
1952 , and his patient , who , at the age of 50 years is severely
handicapped with short stature , restricted joint mobility , and blindness but
is mentally alert and leads an active life . This is in accordance with
molecular findings in other patients with Kniest dysplasia and…”
26 yes / no questions
Qualification test results
47
Threshold
for passing
33/194 passed
17%
Workers
qualified
workers
Simple annotation interface
48
Click to see
instructions
Highlight
disease
mentions
Experimental design
• Task: Identify the disease mentions in
the 593 abstracts from the NCBI disease
corpus
– $0.06 per Human Intelligence Task (HIT)
– HIT = annotate one abstract from PubMed
– 5 workers annotate each abstract
49
Aggregation function based on simple voting
50
[Figure: the same abstract shown four times, highlighting the disease mentions that received at least K of the 5 worker votes: 1 or more votes (K=1), K=2, K=3, K=4]
This molecule inhibits the growth of a broad
panel of cancer cell lines, and is particularly
efficacious in leukemia cells, including
orthotopic leukemia preclinical models as
well as in ex vivo acute myeloid leukemia
(AML) and chronic lymphocytic leukemia
(CLL) patient tumor samples. Thus, inhibition
of CDK9 may represent an interesting
approach as a cancer therapeutic target
especially in hematologic malignancies.
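A minimal sketch of vote-threshold aggregation as described on this slide: a highlighted span is kept when at least K workers marked it. Representing a span as a `(start, end)` character offset pair and requiring exact matches between workers are assumptions made for illustration; the slides do not specify how spans were compared.

```python
from collections import Counter


def aggregate_by_votes(worker_spans, k):
    """Keep every span highlighted by at least `k` workers.

    `worker_spans` holds one collection of (start, end) spans per worker.
    """
    votes = Counter()
    for spans in worker_spans:
        votes.update(set(spans))  # each worker votes at most once per span
    return {span for span, count in votes.items() if count >= k}


if __name__ == "__main__":
    # Hypothetical highlights from 5 workers on one abstract.
    workers = [
        {(10, 16), (50, 72)},
        {(10, 16), (50, 72)},
        {(10, 16)},
        {(10, 16), (80, 95)},
        {(50, 72)},
    ]
    for k in range(1, 6):
        print(k, sorted(aggregate_by_votes(workers, k)))
```

Raising K trades recall for precision: K=1 keeps anything any worker marked, while higher thresholds keep only spans with broad agreement.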
Comparison to gold standard
51
F = 0.81, k = 2
• 593 documents
• 5 users / doc
• 7 days
• $192.90
Precision
Recall
Comparison to gold standard
52
F = 0.87, k = 6
• 593 documents
• 15 users / doc
• 9 days
• $630.96
Precision
Recall
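The slides report precision, recall, and F without defining them. The sketch below assumes the usual balanced F-score (the harmonic mean of precision and recall) computed over exactly matching spans; both the matching rule and the function name are assumptions, and the spans in the example are made up.

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, F) for predicted spans against gold spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Hypothetical spans: 3 predicted, 4 in the gold standard, 2 in common.
    predicted = {(10, 16), (50, 72), (80, 95)}
    gold = {(10, 16), (50, 72), (100, 110), (120, 130)}
    print(precision_recall_f(predicted, gold))
```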
Comparison to gold standard
53
[Figure: maximum F-score (y-axis, 0 to 1) vs. workers per document (x-axis, 0 to 16)]
Comparisons to text-mining algorithms
54
[Figure: F-score of text-mining tools (BANNER, NCBO Annotator) compared with Mechanical Turk]
Comparisons to human annotators
55
Average level of
agreement
between expert
annotators
(stage 1)
F = 0.76
Comparisons to human annotators
56
F = 0.76
F = 0.87
Average level of
agreement
between expert
annotators
(stage 2)
57
In aggregate, our worker
ensemble is faster, cheaper
and as accurate as a single
expert annotator for disease
concept recognition.
Information Extraction
58
1. Find mentions of high level concepts in
text
2. Map mentions to specific terms in
ontologies
3. Identify relationships between concepts
Annotating the relationships
59
This molecule inhibits the growth of a broad
panel of cancer cell lines, and is particularly
efficacious in leukemia cells, including
orthotopic leukemia preclinical models as
well as in ex vivo acute myeloid leukemia
(AML) and chronic lymphocytic leukemia
(CLL) patient tumor samples. Thus, inhibition
of CDK9 may represent an interesting
approach as a cancer therapeutic target
especially in hematologic malignancies.
therapeutic target
subject
predicate
object
GENE
DISEASE
Does Mechanical Turk scale?
60
1,000,000 articles per year
10 annotators / article
4 tasks / doc
$0.06 / task
$ 2,400,000 / year
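The estimate on this slide is the product of the four figures above it. A short sketch of the arithmetic, working in whole cents to avoid floating-point rounding (the function name is hypothetical):

```python
def annual_cost_dollars(articles_per_year, annotators_per_article,
                        tasks_per_doc, cents_per_task):
    """Multiply the four slide figures and convert cents to dollars."""
    total_cents = (articles_per_year * annotators_per_article
                   * tasks_per_doc * cents_per_task)
    return total_cents / 100


if __name__ == "__main__":
    # 1,000,000 articles x 10 annotators x 4 tasks x $0.06 per task
    print(annual_cost_dollars(1_000_000, 10, 4, 6))
```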
61
http://mark2cure.org
Key stats
• Launched Jan 19, 2015
• In 2.5 weeks
– 1984 document annotations
– 80 unique users
– 22% complete
62
Document annotations
The
Long Tail of
citizen scientists
can collaboratively
annotate biomedical
text.
63
64
Ben Good
Andra Waagmeester
Lynn Schriml, U Maryland
Elvira Mitraka, U Maryland
Gang Fu, NCBI
Evan Bolton, NCBI
Paul Pavlidis, U British Columbia
Peter Robinson, Charite
Many Wikipedia and Wikidata
editors
WP:MCB Project
Gene Wiki / Wikidata
Ramya Gamini
Louis Gioia
Salvatore Loguercio
Adam Mark
Erick Scott
Greg Stupp
Kevin Xin
Other Group members
Funding and Support
BioGPS: GM83924
Gene Wiki: GM089820
BD2K COE: GM114833
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Mark2Cure
Ben Good
Max Nanis
Ginger Tsueng
Chunlei Wu
Next slide!
Why do I Mark2Cure?
65
I am retired, have a doctorate in
medical humanities, and have two
children with Gaucher disease. I am
just looking for some way to put my
education to use. Sounds like a perfect
situation for me.
My 4 year old daughter Phoebe is
living with and battling rare
disease.
I have Ehlers Danlos Syndrome. I hope to help people
learn about this painful and debilitating disorder, so that
others like me can receive more effective medical care.
Take part in
something that
helps humanity.
I Mark2Cure in memory of
my son Mike who had type 1
diabetes.
Studied biology in
college and I really
miss it!
In memory of my daughter
who had Cystic Fibrosis
Give back
Worker demographics: gender
66
First HIT was a survey
Age
67
Occupation
68
Education
69
Why?
70

 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
AlMamun560346
 

Recently uploaded (20)

PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptx
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONSTS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 

UCSD / DBMI seminar 2015-02-6

  • 1. Crowdsourcing and Citizen Science for Biology Andrew Su, Ph.D. @andrewsu asu@scripps.edu http://sulab.org February 6, 2015 UCSD Slides: slideshare.net/andrewsu
  • 2. Few genes are well annotated… [Figure: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts; labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF; callouts: 41%, 65%] Data: NCBI, February 2013
  • 3. … because the literature is sparsely curated? [Figure: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis 0 to 1,200,000]
  • 4. … because the literature is sparsely curated? [Figure: average capacity of human scientist, 1983 to 2013; y-axis 0 to 40]
  • 5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
  • 6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  • 7. The Long Tail is a prolific source of content [Figure: content produced vs. contributors (sorted), with a short head and a long tail] News: newspapers (short head), blogs (long tail). Video: TV/Hollywood, YouTube. Product reviews: Consumer reports, Amazon reviews. Food reviews: food critics, Yelp. Talent judging: Olympics, American Idol.
  • 9. Wikipedia has breadth and depth [Figure: articles and words (millions), Wikipedia vs. Britannica Online] Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
  • 10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
  • 11. From crowdsourcing to structured data: The Gene Wiki, Mark2Cure
  • 12. Filtering, extracting, and summarizing PubMed [Diagram: documents, concepts, review article]
  • 13. Filtering, extracting, and summarizing PubMed [Diagram: documents, concepts]
  • 14. Wiki success depends on positive feedback [Diagram: Gene Wiki page utility, number of users, number of contributors]
  • 15. 10,000 gene “stubs” within Wikipedia. Page elements: protein structure, symbols and identifiers, tissue expression pattern, Gene Ontology annotations, links to structured databases, gene summary, protein interactions, linked references. (Huss, PLoS Biol, 2008)
  • 16. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month (Huss, PLoS Biol, 2008; Good, NAR, 2011)
  • 17. Gene Wiki has a critical mass of editors. Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles (Good, NAR, 2011). [Figure: editor count and edit count]
  • 18. A review article for every gene is powerful: references to the literature, hyperlinks to related concepts. Reelin: 98 editors, 703 edits since July 2002. Heparin: 358 editors, 654 edits since June 2003. AMPK: 109 editors, 203 edits since March 2004. RNAi: 394 editors, 994 edits since October 2002.
  • 19. Making the Gene Wiki more computable [Diagram: free text, structured annotations, text-mining, analyses]
  • 20. Making the Gene Wiki more computable [Diagram: free text, structured annotations, text-mining, analyses] http://fiehnlab.ucdavis.edu/projects/rice_metabolome/
  • 21. Making the Gene Wiki more computable [Diagram: free text, structured annotations, text-mining, analyses]
  • 22. Making the Gene Wiki more computable [Diagram: free text, structured annotations, databases]
  • 23. Making the Gene Wiki more computable [Diagram: free text, structured annotations]
  • 24. Making the Gene Wiki more computable [Diagram: free text, structured annotations]
  • 25. Wikidata: “Provide a database of the world’s knowledge that anyone can edit” - Denny Vrandečić
  • 26. Centralizing key data storage. Source: http://commons.wikimedia.org/wiki/File:Wikidata_slides_Magnus_Manske,_Cambridge,_2014-02-27.pdf
  • 27. Centralizing key data storage
  • 28. Centralizing key data storage
  • 29. Centralizing key data storage: 287 language editions of Wikipedia; bioinformatics community
  • 30. Loading biological data into Wikidata: Entrez Gene, Ensembl, UniProt, UCSC, PDB, RefSeq
  • 31. Wikidata for biology. Reelin (Q414043, http://www.wikidata.org/wiki/Q414043): is a (Property:P31) Protein (Q8054), Glycoprotein (Q187126); regulates (Property:P128) Neural development (Q1345738); interacts with (Property:P129) VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)
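The Reelin item on slide 31 can be read as a handful of subject-property-object statements. The following is a minimal sketch in Python, assuming the identifiers pair with the labels in the order the slide lists them. The `Statement` tuple, the `LABELS` table, and the `objects_of` helper are hypothetical names for illustration and are not part of any Wikidata client library.

```python
from collections import namedtuple

# Hypothetical container for one Wikidata-style statement (illustration only).
Statement = namedtuple("Statement", ["subject", "prop", "obj"])

# Labels as they appear on the slide; the ID-to-label pairing is an assumption
# based on the order in which the slide lists them.
LABELS = {
    "Q414043": "Reelin",
    "Q8054": "Protein",
    "Q187126": "Glycoprotein",
    "Q1345738": "Neural development",
    "Q1979313": "VLDL receptor",
    "Q423510": "Amyloid precursor protein",
    "P31": "is a",
    "P128": "regulates",
    "P129": "interacts with",
}

STATEMENTS = [
    Statement("Q414043", "P31", "Q8054"),
    Statement("Q414043", "P31", "Q187126"),
    Statement("Q414043", "P128", "Q1345738"),
    Statement("Q414043", "P129", "Q1979313"),
    Statement("Q414043", "P129", "Q423510"),
]


def objects_of(subject, prop):
    """Return the labels of every object linked to `subject` by `prop`."""
    return [LABELS[s.obj] for s in STATEMENTS
            if s.subject == subject and s.prop == prop]


if __name__ == "__main__":
    for prop in ("P31", "P128", "P129"):
        print(LABELS["Q414043"], LABELS[prop], objects_of("Q414043", prop))
```

Storing identifiers rather than labels in each statement mirrors the idea on the slide: the same item can be rendered with a different label table without touching the statements.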
  • 33. Current progress: all human and mouse genes and proteins loaded; all diseases (Human Disease Ontology) loaded; dataset of all drugs in preparation; datasets for gene-disease, drug-disease, and drug-protein relationships in preparation
  • 34. The Long Tail of scientists is a valuable source of information on gene function
  • 35. From crowdsourcing to structured data: The Gene Wiki, Mark2Cure
  • 36. The biomedical literature is growing fast… [Figure: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis 0 to 1,200,000]
  • 37. … but it is very hard to query and compute
  • 38. … but it is very hard to query and compute. Example query: (Imatinib, Crizotinib, Erlotinib, Gefitinib, Sorafenib, Lapatinib, Dasatinib, …) AND (Acute myeloid leukemia, Acute lymphoblastic leukemia, Chronic myelogenous leukemia, Chronic lymphocytic leukemia, Hodgkin lymphoma, Non-Hodgkin lymphoma, Myeloma, …)
  • 39. Information Extraction: 1. Find mentions of high level concepts in text 2. Map mentions to specific terms in ontologies 3. Identify relationships between concepts
  • 40. Disease mentions in PubMed abstracts. NCBI Disease corpus: 793 PubMed abstracts (100 development, 593 training, 100 test); 12 expert annotators (2 annotate each abstract); 6,900 “disease” mentions. Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
  • 41. Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?
  • 44. Amazon Mechanical Turk (AMT) [Diagram: Requester, Amazon, Workers; steps: 1. Create tasks 2. Execute 3. Aggregate] Requester: for each task, specify a qualification test, how many workers per task, and how much we will pay per task. Amazon manages: parallel execution of jobs, worker access to tasks via qualification tests, payments, and task advertising.
  • 45. Instructions to workers • Highlight all diseases and disease abbreviations • “...are associated with Huntington disease ( HD )... HD patients received...” • “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…” • Highlight the longest span of text specific to a disease • “... contains the insulin-dependent diabetes mellitus locus …” • Highlight disease conjunctions as single, long spans. • “... a significant fraction of familial breast and ovarian cancer , but undergoes…” • Highlight symptoms - physical results of having a disease – “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
  • 46. Qualification test (26 yes / no questions). Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) in trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ” Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.” Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”
  • 47. Qualification test results: 33/194 workers passed (17%) [Figure: worker scores with the threshold for passing; qualified workers marked]
  • 48. Simple annotation interface: click to see instructions; highlight disease mentions
  • 49. Experimental design • Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus – $0.06 per Human Intelligence Task (HIT) – HIT = annotate one abstract from PubMed – 5 workers annotate each abstract
  • 50. Aggregation function based on simple voting [Figure: one passage repeated with the highlights kept at each vote threshold: 1 or more votes (K=1), K=2, K=3, K=4] Example passage: “This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.”
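The voting scheme on slide 50 can be sketched as a vote count with a threshold K. This is a minimal illustration in Python, not the authors' implementation: it assumes each worker's highlights are stored as exact `(start, end)` character spans and that two workers agree only when their spans match exactly, which the slides do not specify. The function name `aggregate_by_votes` is hypothetical.

```python
from collections import Counter


def aggregate_by_votes(worker_spans, k):
    """Keep every span highlighted by at least `k` workers.

    `worker_spans` is a list with one collection of (start, end) spans
    per worker. Exact span matching is an assumption of this sketch.
    """
    votes = Counter()
    for spans in worker_spans:
        # A worker contributes at most one vote per span.
        votes.update(set(spans))
    return {span for span, count in votes.items() if count >= k}


if __name__ == "__main__":
    # Five hypothetical workers annotating one abstract.
    workers = [
        {(0, 8), (20, 30)},
        {(0, 8), (20, 30)},
        {(0, 8)},
        {(0, 8), (40, 45)},
        {(20, 30)},
    ]
    for k in range(1, 6):
        print(k, sorted(aggregate_by_votes(workers, k)))
```

Raising K trades recall for precision: a low threshold keeps anything a single worker marked, while a high threshold keeps only spans most workers agree on.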
  • 51. Comparison to gold standard: F = 0.81, k = 2 • 593 documents • 5 users / doc • 7 days • $192.90 [Figure: precision vs. recall]
  • 52. Comparison to gold standard: F = 0.87, k = 6 • 593 documents • 15 users / doc • 9 days • $630.96 [Figure: precision vs. recall]
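The precision, recall, and F values reported on slides 51 and 52 compare aggregated worker annotations against the expert gold standard. The sketch below shows one common way to compute such scores in Python. It assumes exact span matching and the balanced F-score, F = 2PR / (P + R); the slides do not state which matching rule or F variant was used, and `precision_recall_f` is a hypothetical helper name.

```python
def precision_recall_f(predicted, gold):
    """Score predicted spans against gold spans using exact matches.

    Returns (precision, recall, f), where f is the balanced F-score.
    Exact matching and the balanced F-score are assumptions of this sketch.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Hypothetical spans for a single abstract.
    predicted = {(0, 5), (10, 20), (30, 40)}
    gold = {(0, 5), (10, 20), (50, 60), (70, 80)}
    print(precision_recall_f(predicted, gold))
```

Running this over the output of a vote threshold for each K gives one precision-recall point per K, which is how a curve like the one on these slides could be traced.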
  • 53. Comparison to gold standard [Figure: maximum F-score (0 to 1) vs. workers per document (0 to 16)]
  • 54. Comparisons to text-mining algorithms [Figure: F-score for text-mining tools (BANNER, NCBO Annotator) and Mechanical Turk]
  • 55. Comparisons to human annotators: average level of agreement between expert annotators (stage 1), F = 0.76
  • 56. Comparisons to human annotators: F = 0.76, F = 0.87; average level of agreement between expert annotators (stage 2)
  • 57. In aggregate, our worker ensemble is faster and cheaper than, and as accurate as, a single expert annotator for disease concept recognition.
  • 58. Information Extraction: 1. Find mentions of high level concepts in text 2. Map mentions to specific terms in ontologies 3. Identify relationships between concepts
  • 59. Annotating the relationships. “This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.” [Annotation: subject (GENE), predicate (therapeutic target), object (DISEASE)]
  • 60. Does Mechanical Turk scale? 1,000,000 articles per year × 10 annotators / article × 4 tasks / doc × $0.06 / task = $2,400,000 / year
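The yearly cost on slide 60 is the product of the four quantities listed there. As a quick check, here is a small Python sketch; the function name and parameter names are hypothetical, and the price is passed in cents so the arithmetic stays in integers.

```python
def annual_cost_dollars(articles_per_year, annotators_per_article,
                        tasks_per_doc, cents_per_task):
    """Multiply out the per-year annotation cost and return it in dollars."""
    total_cents = (articles_per_year * annotators_per_article
                   * tasks_per_doc * cents_per_task)
    return total_cents / 100


if __name__ == "__main__":
    # Figures from the slide: 1,000,000 articles, 10 annotators, 4 tasks, $0.06.
    print(annual_cost_dollars(1_000_000, 10, 4, 6))
```

Because the cost is linear in every factor, halving the number of annotators per article or the number of tasks per document halves the yearly total.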
  • 62. Key stats • Launched Jan 19, 2015 • In 2.5 weeks – 1984 document annotations – 80 unique users – 22% complete [Figure: document annotations]
  • 63. The Long Tail of citizen scientists can collaboratively annotate biomedical text.
  • 64. 64 Ben Good Andra Waagmeester Lynn Schriml, U Maryland Elvira Mitraka, U Maryland Gang Fu, NCBI Evan Bolton, NCBI Paul Pavlidis, U British Columbia Peter Robinson, Charite Many Wikipedia and Wikidata editors WP:MCB Project Gene Wiki / Wikidata Ramya Gamini Louis Gioia Salvatore Loguercio Adam Mark Erick Scott Greg Stupp Kevin Xin Other Group members Funding and Support BioGPS: GM83924 Gene Wiki: GM089820 BD2K COE: GM114833 Contact http://sulab.org asu@scripps.edu @andrewsu +Andrew Su Mark2Cure Ben Good Max Nanis Ginger Tsueng Chunlei Wu Next slide!
  • 65. Why do I Mark2Cure? I am retired, have a doctorate in medical humanities, and have two children with Gaucher disease. I am just looking for some way to put my education to use. Sounds like a perfect situation for me. My 4 year old daughter Phoebe is living with and battling rare disease. I have Ehlers Danlos Syndrome. I hope to help people learn about this painful and debilitating disorder, so that others like me can receive more effective medical care. Take part in something that helps humanity. I Mark2Cure in memory of my son Mike who had type 1 diabetes. Studied biology in college and I really miss it! In memory of my daughter who had Cystic Fibrosis. Give back.