Panel on Citizen Science and Crowdsourcing Games - March 27, 2015
1.
• Gene-specific review article for every human gene
• Data integration for genes, drugs, diseases
• Robust classifiers of breast cancer prognosis
• Annotation of biomedical literature
• Expert-guided classifier design
• Gene-centric web portal
• Bioinformatics algorithm optimization
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org
Slides: slideshare.net/andrewsu
2. Mark2Cure – biocuration by microtasking
• Challenge: The biomedical literature is
massive and growing exponentially, but it is
largely inaccessible
• Opportunity: Better access to existing
knowledge can make the scientific process more
efficient and productive
• Current situation
– Manual biocuration by experts
– Natural language processing
3. Mark2Cure – biocuration by microtasking
• Our approach: Use Amazon Mechanical Turk
platform for paid microtask crowdsourcing
• Results: reproduced an expert-generated gold
standard with equivalent accuracy, in less time,
at a fraction of the cost
[Figure: precision and recall; K = 6, F score = 0.87]
• 593 documents
• 9 days
• 145 workers
• $0.06 / task
• Total cost: $630.96
4. Mark2Cure – biocuration by citizen science
• Our approach: Use volunteer-based citizen
science for microtask crowdsourcing
• Results: reproduced an expert-generated gold
standard with equivalent accuracy, in less time,
at no cost
• 593 documents
• 28 days
• 212 workers
• Total cost: $0.00
[Figure: precision and recall across voting thresholds; k = 6, F score = 0.84]
http://mark2cure.org
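The voting threshold and F score reported on these slides can be illustrated with a minimal sketch. It assumes each worker submits a set of annotations for a document, keeps any annotation chosen by at least k workers, and scores the result against a gold standard. The function names and the toy data are hypothetical and are not taken from the Mark2Cure code.

```python
from collections import Counter


def aggregate_by_votes(worker_annotations, k):
    """Keep every annotation submitted by at least k workers."""
    votes = Counter()
    for annotations in worker_annotations:
        votes.update(set(annotations))  # one vote per worker per annotation
    return {annotation for annotation, n in votes.items() if n >= k}


def precision_recall_f(predicted, gold):
    """Return (precision, recall, F score) of predicted against gold."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data (hypothetical): disease mentions marked by four workers.
gold = {"diabetes", "asthma", "influenza"}
workers = [
    {"diabetes", "asthma"},
    {"diabetes", "asthma", "fever"},
    {"diabetes", "influenza"},
    {"diabetes", "fever"},
]

for k in (1, 2, 3):
    p, r, f = precision_recall_f(aggregate_by_votes(workers, k), gold)
    print(f"k={k}  precision={p:.2f}  recall={r:.2f}  F={f:.2f}")
```

Raising k trades recall for precision, which is why a single threshold (k = 6 on the slides) is reported alongside the F score.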
5. Collaborative knowledge management
• Challenge: Biomedical research allows for
genome-scale profiling, but few of the genes
profiled are already familiar to the researcher
• Opportunity: Better access to existing
knowledge can make the scientific process more
efficient and productive
• Current situation
– Review articles (but sparse coverage)
– Lots of reading of primary literature
6. Collaborative knowledge management
• Our approach: Create
a gene-specific review
article for every human
gene that is
collaboratively written,
continuously updated,
and community
reviewed
• Results: 5M page
views and >1000 edits
per month
7. Collaborative knowledge management
• Our approach: Create
a gene-specific Wikidata
database entry for every
human gene that is
collaboratively
integrated, continuously
updated, and
community reviewed
• Results: all human
genes and diseases
loaded in Wikidata, soon
to have drugs and
relationships
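As an illustration of what gene entries in Wikidata make possible, the sketch below builds a SPARQL query for human genes. It assumes the present-day public Wikidata Query Service and the identifiers Q7187 (gene), P703 (found in taxon), Q15978631 (Homo sapiens) and P353 (HGNC gene symbol); none of these appear on the slides, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata SPARQL endpoint.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def human_gene_query(limit=10):
    """Build a SPARQL query for items typed as genes found in Homo sapiens."""
    return f"""
SELECT ?gene ?symbol WHERE {{
  ?gene wdt:P31 wd:Q7187 ;        # instance of: gene
        wdt:P703 wd:Q15978631 ;   # found in taxon: Homo sapiens
        wdt:P353 ?symbol .        # HGNC gene symbol
}}
LIMIT {int(limit)}
""".strip()


def run_query(query):
    """Send the query to the endpoint and return the result rows (needs network)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        WIKIDATA_SPARQL_ENDPOINT + "?" + params,
        headers={"User-Agent": "gene-query-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


query = human_gene_query(limit=5)
print(query)
```

Calling `run_query(query)` would return one row per gene, each with the item URI and its symbol. It is left out of the example so the sketch runs without network access.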
8. Bioinformatics algorithm optimization
• Challenge: Antibody sequence clustering is
computationally expensive (CPU and memory)
• Opportunity: Large-scale clustering of
antibody sequences can aid vaccine
development
• Current situation: Research-grade code can
cluster ~100k sequences in 1.7 hours on a
high-memory (150 GB) machine.
9. Bioinformatics algorithm optimization
• Our approach: Ran a TopCoder contest for 10
days, offering $7,500 in prize money
• Results: The best solution can cluster 2.3M
sequences in 30 seconds on a typical desktop
computer (1.1 GB memory)
[Figure: Benchmarks; log(execution time) vs. log(# sequences processed)]
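The figures quoted for the research-grade code and the best contest solution can be compared with a short calculation. This is a rough sketch: it treats "~100k" as exactly 100,000 sequences and compares sequences per second, although clustering time need not scale linearly with input size. The variable names are illustrative only.

```python
# Figures as stated on the slides ("~100k" taken as exactly 100,000).
baseline_sequences = 100_000
baseline_seconds = 1.7 * 3600      # 1.7 hours
baseline_memory_gb = 150

best_sequences = 2_300_000
best_seconds = 30
best_memory_gb = 1.1

# Throughput in sequences per second for each implementation.
baseline_throughput = baseline_sequences / baseline_seconds
best_throughput = best_sequences / best_seconds

speedup = best_throughput / baseline_throughput
memory_ratio = baseline_memory_gb / best_memory_gb

print(f"Baseline throughput: {baseline_throughput:.1f} sequences/s")
print(f"Best solution throughput: {best_throughput:.1f} sequences/s")
print(f"Throughput ratio: about {speedup:.0f}x")
print(f"Memory ratio: about {memory_ratio:.0f}x less")
```

Under these assumptions the throughput ratio comes out in the thousands and the memory footprint is roughly two orders of magnitude smaller.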
10.
Other group members
Cyrus Afrasiabi
Ramya Gamini
Louis Gioia
Salvatore Loguercio
Adam Mark
Erick Scott
Greg Stupp
Kevin Xin
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Mark2Cure
Ben Good
Max Nanis
Ginger Tsueng
Chunlei Wu
All Mark2Curators!
Funding and Support
BioGPS: GM83924
Gene Wiki: GM089820
BD2K Center of Excellence: GM114833
Gene Wiki
Ben Good
Sebastian Burgstaller
Andra Waagmeester
Elvira Mitraka, UMB
Lynn Schriml, UMB
Paul Pavlidis, UBC
Gang Fu, NCBI
Contests
Chunlei Wu
Ben Good
Brian Briney, TSRI
Dennis Burton, TSRI
Rinat Sergeev, HBS
Jin Paik, HBS
Karim Laklani, HBS
Jingbo Shang
Rashid Sial, Appirio
Join the team! bit.ly/sulabawesome
11. Game for breast cancer prognosis
• Challenge: Genomic classifiers of disease are
difficult to train in a way that consistently
validates on secondary datasets
• Opportunity: Better classifiers of disease
diagnosis and/or prognosis have many clinical
applications
• Current situation: Most attempts to train
classifiers rely on machine learning methods
that utilize little or no biological knowledge
12. Game for breast cancer prognosis
• Our approach: Enlist a crowd of expert game
players with diverse perspectives to identify
the most biologically relevant genes
• Results: Gene sets derived from game player
data showed comparable performance to
expert-generated gene sets
• 1077 registered players
• 15,669 games played
• Demographics
– 59% male, 41% female
– 21-29 is the most frequent age group
– 35% had a graduate degree, 32%
were biologists