This document summarizes Andrew Su's presentation on using crowdsourcing and citizen science for biology. Some key points:
- The biomedical literature is growing rapidly but most genes are poorly annotated due to the large amount of data and limited curation by human scientists.
- Projects like the Gene Wiki and Wikidata have harnessed the "long tail" of scientists to collaboratively curate and annotate gene information, resulting in high-quality structured data.
- Experiments using Amazon Mechanical Turk showed that non-experts can accurately perform tasks like identifying disease mentions in text, matching the performance of experts. This approach could scale to annotate the vast biomedical literature.
- The presenter's
UCSD / DBMI seminar 2015-02-06
1. Crowdsourcing and Citizen Science for Biology
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org
February 6, 2015
UCSD
Slides: slideshare.net/andrewsu
2. Few genes are well annotated…
[Figure: Gene Ontology (GO) annotation counts per gene, with the 20,473 protein-coding genes sorted by decreasing counts. Callouts: 41%, 65%. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF.]
Data: NCBI, February 2013
3. … because the literature is sparsely curated?
[Figure: Number of new PubMed-indexed articles by year, 1983 to 2013; y-axis from 0 to 1,200,000.]
4. … because the literature is sparsely curated?
[Figure: Average capacity of human scientist by year, 1983 to 2013; y-axis from 0 to 40.]
6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
7. The Long Tail is a prolific source of content
[Figure: Content produced versus contributors (sorted), divided into a Short Head and a Long Tail.]
Category: Short Head / Long Tail
News: Newspapers / Blogs
Video: TV/Hollywood / YouTube
Product reviews: Consumer reports / Amazon reviews
Food reviews: Food critics / Yelp
Talent judging: Olympics / American Idol
9. Wikipedia has breadth and depth
[Figure: Articles and words (millions), Wikipedia versus Britannica Online.]
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
14. Wiki success depends on a positive feedback
[Diagram labels: Gene wiki page utility; Number of users; Number of contributors.]
15. 10,000 gene “stubs” within Wikipedia
Stub page components:
• Gene summary
• Protein structure
• Symbols and identifiers
• Tissue expression pattern
• Gene Ontology annotations
• Links to structured databases
• Protein interactions
• Linked references
Huss, PLoS Biol, 2008
16. Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
17. Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
[Figure labels: Editor count; Editors; Edits; Edit count.]
Good, NAR, 2011
18. A review article for every gene is powerful
References to the literature
Hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
19. Making the Gene Wiki more computable
[Diagram labels: Free text; Structured annotations; Text-mining; Analyses.]
20. Making the Gene Wiki more computable
[Diagram labels: Free text; Structured annotations; Text-mining; Analyses.]
http://fiehnlab.ucdavis.edu/projects/rice_metabolome/
22. Making the Gene Wiki more computable
[Diagram labels: Free text; Structured annotations; Databases.]
31. Wikidata for biology
Reelin (Q414043), http://www.wikidata.org/wiki/Q414043
• is a (Property:P31): Protein (Q8054), Glycoprotein (Q187126)
• regulates (Property:P128): Neural development (Q1345738)
• interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)
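The statements on this slide can be read as subject-property-value triples. The following is a minimal Python sketch of that reading, not the presenter's code. It assumes that labels and identifiers pair up in the order they appear on the slide; the names `Statement`, `LABELS`, `describe`, and `values_for` are illustrative and do not belong to any Wikidata library.

```python
from typing import NamedTuple, List


class Statement(NamedTuple):
    subject: str  # Wikidata item ID (Q...)
    prop: str     # Wikidata property ID (P...)
    value: str    # Wikidata item ID (Q...)


# Labels as shown on the slide; the ID-to-label pairing is assumed from slide order.
LABELS = {
    "Q414043": "Reelin",
    "Q8054": "Protein",
    "Q187126": "Glycoprotein",
    "Q1345738": "Neural development",
    "Q1979313": "VLDL receptor",
    "Q423510": "Amyloid precursor protein",
    "P31": "is a",
    "P128": "regulates",
    "P129": "interacts with",
}

STATEMENTS = [
    Statement("Q414043", "P31", "Q8054"),
    Statement("Q414043", "P31", "Q187126"),
    Statement("Q414043", "P128", "Q1345738"),
    Statement("Q414043", "P129", "Q1979313"),
    Statement("Q414043", "P129", "Q423510"),
]


def describe(s: Statement) -> str:
    """Render one statement using the human-readable labels."""
    return f"{LABELS[s.subject]} {LABELS[s.prop]} {LABELS[s.value]}"


def values_for(subject: str, prop: str) -> List[str]:
    """Return all value IDs recorded for a subject and property."""
    return [s.value for s in STATEMENTS if s.subject == subject and s.prop == prop]


if __name__ == "__main__":
    for s in STATEMENTS:
        print(describe(s))
```

Storing identifiers rather than labels is what makes such statements computable: a query like `values_for("Q414043", "P129")` does not depend on how any label is spelled.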
33. Current progress
• All human and mouse genes and proteins loaded
• All diseases (Human Disease Ontology) loaded
• Dataset of all drugs in preparation
• Datasets for gene-disease, drug-disease, and drug-protein relationships in preparation
34. The Long Tail of scientists is a valuable source of information on gene function
36. The biomedical literature is growing fast…
[Figure: Number of new PubMed-indexed articles by year, 1983 to 2013; y-axis from 0 to 1,200,000.]
38. … but it is very hard to query and compute
Imatinib, Crizotinib, Erlotinib, Gefitinib, Sorafenib, Lapatinib, Dasatinib, …
AND
Acute myeloid leukemia, Acute lymphoblastic leukemia, Chronic myelogenous leukemia, Chronic lymphocytic leukemia, Hodgkin lymphoma, Non-Hodgkin lymphoma, Myeloma, …
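To make the difficulty concrete, here is a minimal Python sketch of the kind of keyword query the slide implies. It assumes the terms within each list are alternatives (joined by OR) and that the two lists are joined by AND, as the slide shows. The helper names `or_group` and `build_query` are illustrative, and both lists are partial because the slide elides the rest with "…".

```python
def or_group(terms):
    """Join terms into one parenthesized OR clause, quoting each term."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(drugs, diseases):
    """Combine a drug clause and a disease clause with AND."""
    return or_group(drugs) + " AND " + or_group(diseases)


# Only the names visible on the slide; the full lists are elided there.
DRUGS = ["Imatinib", "Crizotinib", "Erlotinib", "Gefitinib",
         "Sorafenib", "Lapatinib", "Dasatinib"]
DISEASES = ["Acute myeloid leukemia", "Acute lymphoblastic leukemia",
            "Chronic myelogenous leukemia", "Chronic lymphocytic leukemia",
            "Hodgkin lymphoma", "Non-Hodgkin lymphoma", "Myeloma"]

if __name__ == "__main__":
    print(build_query(DRUGS, DISEASES))
    # Number of drug-disease pairs this single query stands in for.
    print(len(DRUGS) * len(DISEASES))
```

A query built this way matches only the literal strings listed, so every drug and disease of interest has to be enumerated by hand.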
39. Information Extraction
  1. Find mentions of high level concepts in text
  2. Map mentions to specific terms in ontologies
  3. Identify relationships between concepts
40. Disease mentions in PubMed abstracts
NCBI Disease corpus
• 793 PubMed abstracts (100 development, 593 training, 100 test)
• 12 expert annotators (2 annotate each abstract)
6,900 “disease” mentions
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
41. Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?
44. Amazon Mechanical Turk (AMT)
Steps: 1. Create tasks; 2. Execute; 3. Aggregate
Requester
For each task, specify:
• a qualification test
• how many workers per task
• how much we will pay per task
Amazon
Manages:
• parallel execution of jobs
• worker access to tasks via qualification tests
• payments
• task advertising
Workers
45. Instructions to workers
• Highlight all diseases and disease abbreviations
  – “...are associated with Huntington disease ( HD )... HD patients received...”
  – “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
  – “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans.
  – “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms - physical results of having a disease
  – “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
46. Qualification test
Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”
Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.”
Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”
26 yes / no questions
49. Experimental design
• Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus
  – $0.06 per Human Intelligence Task (HIT)
  – HIT = annotate one abstract from PubMed
  – 5 workers annotate each abstract
50. Aggregation function based on simple voting
[Figure: the passage below, repeated with the highlights that remain at each voting threshold: 1 or more votes (K=1), K=2, K=3, K=4.]
This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
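Simple voting with a threshold K can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the presenter's implementation: it assumes a vote counts only when two workers highlight exactly the same span, and the worker highlights in `EXAMPLE_WORKERS` are made up for the example passage. The names `aggregate_by_vote` and `EXAMPLE_WORKERS` are hypothetical.

```python
from collections import Counter


def aggregate_by_vote(worker_annotations, k):
    """Return the spans highlighted by at least k workers."""
    votes = Counter()
    for spans in worker_annotations:
        votes.update(set(spans))  # set() gives one vote per worker per span
    return {span for span, count in votes.items() if count >= k}


# Hypothetical highlights from 5 workers on the example passage (not real data).
EXAMPLE_WORKERS = [
    {"cancer", "leukemia", "acute myeloid leukemia", "AML"},
    {"leukemia", "acute myeloid leukemia", "AML", "CLL"},
    {"acute myeloid leukemia", "chronic lymphocytic leukemia", "CLL"},
    {"acute myeloid leukemia", "hematologic malignancies"},
    {"acute myeloid leukemia", "AML", "tumor"},
]

if __name__ == "__main__":
    for k in range(1, 5):
        print(k, sorted(aggregate_by_vote(EXAMPLE_WORKERS, k)))
```

Raising K keeps only the spans that more workers agree on, so a low K favors recall and a high K favors precision.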
51. Comparison to gold standard
F = 0.81, k = 2
• 593 documents
• 5 users / doc
• 7 days
• $192.90
[Figure: precision versus recall.]
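The slides report F without defining it. The sketch below assumes the usual balanced F-score (the harmonic mean of precision and recall) and exact-match scoring of spans against the gold standard; both are assumptions, and `precision_recall_f` is an illustrative name.

```python
def precision_recall_f(predicted, gold):
    """Score a set of predicted spans against a set of gold-standard spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Toy spans, not data from the study.
    predicted = {"a", "b", "c", "d"}
    gold = {"a", "b", "e"}
    print(precision_recall_f(predicted, gold))
```

In the toy case, 2 of 4 predictions are correct (precision 1/2) and 2 of 3 gold spans are found (recall 2/3), which gives F = 4/7.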
52. Comparison to gold standard
F = 0.87, k = 6
• 593 documents
• 15 users / doc
• 9 days
• $630.96
[Figure: precision versus recall.]
53. Comparison to gold standard
[Figure: Maximum F-score (0 to 1) versus workers per document (0 to 16).]
55. Comparisons to human annotators
Average level of agreement between expert annotators (stage 1): F = 0.76
56. Comparisons to human annotators
Average level of agreement between expert annotators: F = 0.76 (stage 1), F = 0.87 (stage 2)
57. In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition.
58. Information Extraction
  1. Find mentions of high level concepts in text
  2. Map mentions to specific terms in ontologies
  3. Identify relationships between concepts
59. Annotating the relationships
This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
[Diagram labels: subject; predicate; object. Predicate shown: therapeutic target. Entity types: GENE; DISEASE.]
60. Does Mechanical Turk scale?
1,000,000 articles per year
× 10 annotators / article
× 4 tasks / doc
× $0.06 / task
= $2,400,000 / year
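The estimate on this slide is a straight product of the four figures. Here is a short Python check of the arithmetic; the function name `annual_cost_dollars` is illustrative, and the price is passed in whole cents so the multiplication stays in integers.

```python
def annual_cost_dollars(articles_per_year, annotators_per_article,
                        tasks_per_doc, cents_per_task):
    """Multiply out the slide's figures and convert cents to dollars."""
    total_cents = (articles_per_year * annotators_per_article
                   * tasks_per_doc * cents_per_task)
    return total_cents / 100


if __name__ == "__main__":
    # Figures from the slide: 1,000,000 articles, 10 annotators, 4 tasks, $0.06.
    print(annual_cost_dollars(1_000_000, 10, 4, 6))
```

Because the cost is linear in every factor, halving the annotators per article or the tasks per document halves the annual total.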
64.
Gene Wiki / Wikidata: Ben Good; Andra Waagmeester; Lynn Schriml, U Maryland; Elvira Mitraka, U Maryland; Gang Fu, NCBI; Evan Bolton, NCBI; Paul Pavlidis, U British Columbia; Peter Robinson, Charite; Many Wikipedia and Wikidata editors; WP:MCB Project
Mark2Cure: Ben Good; Max Nanis; Ginger Tsueng; Chunlei Wu
Other Group members: Ramya Gamini; Louis Gioia; Salvatore Loguercio; Adam Mark; Erick Scott; Greg Stupp; Kevin Xin
Funding and Support: BioGPS: GM83924; Gene Wiki: GM089820; BD2K COE: GM114833
Contact: http://sulab.org; asu@scripps.edu; @andrewsu; +Andrew Su
Next slide!
65. Why do I Mark2Cure?
• I am retired, have a doctorate in medical humanities, and have two children with Gaucher disease. I am just looking for some way to put my education to use. Sounds like a perfect situation for me.
• My 4 year old daughter Phoebe is living with and battling rare disease.
• I have Ehlers Danlos Syndrome. I hope to help people learn about this painful and debilitating disorder, so that others like me can receive more effective medical care.
• Take part in something that helps humanity.
• I Mark2Cure in memory of my son Mike who had type 1 diabetes.
• Studied biology in college and I really miss it!
• In memory of my daughter who had Cystic Fibrosis
• Give back