MetaCrowd: Crowdsourcing Gene Expression Metadata Quality Assessment

Amrapali Zaveri and Michel Dumontier
@AmrapaliZ
amrapali.zaveri@maastrichtuniversity.nl
Bio-ontologies 2017, July 24-25, 2017

BIOMEDICAL METADATA ON THE WEB — SIGNIFICANCE
➤ For (re-)using this data, we need to understand the
structure of datasets and the experimental conditions under
which they were produced
➤ We require an accurate, structured and complete description of
the data, defined as metadata
➤ Good quality metadata is essential for finding, interpreting, and
reusing existing data beyond what the original investigators
envisioned
➤ Good metadata facilitates a data-driven approach: combining and
analyzing similar data to uncover novel insights or more subtle
trends in the data

BIOMEDICAL METADATA ON THE WEB - CHALLENGES
➤ Size and complexity
➤ Quality measures
➤ Time consuming
➤ Costly, requires experts

HYPOTHESIS
Crowdsourcing, i.e. non-expert workers, can
be used to curate large-scale digital
biomedical metadata on the Web.

CROWDSOURCING - WHAT & WHY?
TIME MONEY
➤ Highly parallelizable tasks
➤ Work is broken down into
smaller — ‘micro’ — pieces
that can be solved
independently
➤ Tasks based on human skills
not easily replicable by machines
➤ Non-expert workers can perform
the tasks for a minimal
payment
Consolidated answers solve scientific problems!

RELATED WORK - CROWDSOURCING BIOMEDICAL RESEARCH
➤ Improve automated mining of biomedical text for annotating
diseases [1]
➤ Curation of gene-mutation relations [2]
➤ Identifying relationships between drugs and side-effects [3],
drugs and their indications [4]
➤ Annotation of microRNA functions [5]

GENE EXPRESSION OMNIBUS
➤ Unstructured
➤ Spreadsheet submission
➤ No controlled vocabulary
➤ Heterogeneity of terms
➤ Size complexity
➤ ~Billion records

META-ANALYSIS FROM GEO DATA
A common rejection module (CRM) for acute rejection across multiple
organs identifies novel therapeutics for organ transplantation
Khatri et al. JEM. 210 (11): 2205; DOI: 10.1084/jem.20122709
Metadata issues:
• Missing
• Incomplete
• Inaccurate

GEO METADATA - EXAMPLE
44,000,000 key: value pairs

GEO METADATA - QUALITY PROBLEMS FOR KEYS
➤ Minor spelling discrepancies
  • genotype/varaiation, genotype/varat,
  genotype/varation, genotype/variaion,
  genotype/variataion, genotype/variation
➤ Different syntactic representations
  • age (years), age(yrs) and age_year
➤ Different terms to denote one concept
  • disease, illness, healthy control
➤ Two different key categories in one key name
  • disease/cell type, tissue/cell line,
  treatment age
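The first two problem types (spelling discrepancies and syntactic variants) can often be caught with plain string similarity. The following is a minimal sketch, not the method used in MetaCrowd: the helper names `normalize` and `group_similar_keys`, the greedy grouping strategy, and the 0.8 threshold are all assumptions chosen for illustration. It relies only on `difflib` from the Python standard library.

```python
from difflib import SequenceMatcher


def normalize(key: str) -> str:
    """Lowercase a key and replace punctuation with single spaces,
    so that 'age(yrs)' and 'age_year' become 'age yrs' and 'age year'."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in key.lower())
    return " ".join(cleaned.split())


def group_similar_keys(keys, threshold=0.8):
    """Greedy grouping (an illustrative assumption): each key joins the
    first group whose first member is at least `threshold` similar,
    otherwise it starts a new group."""
    groups = []
    for key in keys:
        norm = normalize(key)
        for group in groups:
            representative = normalize(group[0])
            if SequenceMatcher(None, norm, representative).ratio() >= threshold:
                group.append(key)
                break
        else:
            groups.append([key])
    return groups


if __name__ == "__main__":
    # Example keys taken from the slide above.
    keys = [
        "genotype/variation", "genotype/varaiation", "genotype/varat",
        "age (years)", "age(yrs)", "age_year",
        "disease", "illness",
    ]
    for group in group_similar_keys(keys):
        print(group)
```

Note that "disease" and "illness" stay in separate groups: different terms for one concept share no spelling, which is where ontologies or human judgment are needed.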

METACROWD METHODOLOGY
➤ Source: GEO metadata
➤ 8 GEO keys, 5 values each
  • cell line
  • disease
  • gender/sex
  • genotype
  • strain
  • time
  • tissue
  • treatment
➤ Key definitions: Semanticscience Integrated Ontology (SIO)
MICRO TASKS — CROWDFLOWER

MICRO TASKS — SETTINGS
• 3 workers per task
• ‘Dynamic Judgment’ to 7 workers, with 0.8 confidence
• No. of gold standard questions — 60
• Min. accuracy — 80%
• 5 cents per judgment
• 10 tasks per page
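The worker settings above (3 workers per task, extended to at most 7 until a confidence of 0.8 is reached) can be sketched as a small aggregation loop. This is an illustration under stated assumptions, not CrowdFlower's implementation: confidence is taken here as the plain share of judgments agreeing with the majority answer, whereas the platform's own score may weight workers differently. The function names and the "yes"/"no" answer labels are hypothetical.

```python
from collections import Counter


def aggregate(judgments):
    """Return the majority answer and the share of judgments agreeing
    with it (a simplified, unweighted stand-in for confidence)."""
    answer, votes = Counter(judgments).most_common(1)[0]
    return answer, votes / len(judgments)


def collect_dynamic(judgment_stream, min_workers=3, max_workers=7,
                    min_confidence=0.8):
    """Consume judgments one at a time. Stop once at least `min_workers`
    have answered and confidence reaches `min_confidence`, or once
    `max_workers` judgments have been collected."""
    collected = []
    for judgment in judgment_stream:
        collected.append(judgment)
        if len(collected) < min_workers:
            continue
        answer, confidence = aggregate(collected)
        if confidence >= min_confidence or len(collected) >= max_workers:
            return answer, confidence, len(collected)
    if not collected:
        raise ValueError("no judgments were provided")
    # Stream ended early: report whatever was collected.
    answer, confidence = aggregate(collected)
    return answer, confidence, len(collected)


if __name__ == "__main__":
    # Three workers agree immediately: no extra judgments are needed.
    print(collect_dynamic(["yes", "yes", "yes"]))
    # One early disagreement: more workers are asked until 0.8 is reached.
    print(collect_dynamic(["yes", "no", "yes", "yes", "yes", "yes", "yes"]))
```

In the second call the loop stops after the fifth judgment, when four of five workers agree.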

RESULTS OVERVIEW
No. of microtasks (keys)               1643
Total no. of workers                   145
Total no. of judgments                 7835
Overall accuracy                       0.934
No. of gold standard questions         60
Accuracy on gold standard questions    0.930
Total cost                             $451
Total time                             1 hour

RESULTS FOR EACH KEY CATEGORY
Key category   No. of keys   True positive, false positive   Accuracy
Cell line      109           711, 21                         0.955
Disease        85            412, 10                         0.937
Gender         72            645, 23                         0.902
Genotype       112           566, 10                         0.984
Strain         181           788, 4                          0.966
Time           698           2489, 120                       0.908
Tissue         145           567, 6                          0.947
Treatment      242           846, 49                         0.944

RESULTS FOR EACH KEY CATEGORY — EXAMPLES (1)
Workers classified incorrectly for:
• Cell line
  • cell line initiation date, cell line source age
• Disease
  • diseasestatus
• Gender
  • cell sex
• Strain
  • strain ID
• Tissue
  • tissue & age, tissue/development stage

CONCLUSIONS & LIMITATIONS
• Crowdsourcing, i.e. non-expert workers, can be used to curate
large-scale digital gene expression metadata on the Web.
• Several keys did not achieve consensus among the workers,
due to:
  • lack of semantically annotated values
  • ambiguous nomenclature of keys as well as values
  • values indicating that keys belong to more than one
  category
  • inconsistent usage of the particular metadata key

CROWDSOURCING GEO METADATA QUALITY — FUTURE WORK
• Perform crowdsourcing on values and key: value pairs
• Implement a semi-automated approach to identify similar keys
using ontologies
• Design a pipeline that combines a semi-automated method,
crowdsourcing, and experts

REFERENCES
[1] Good, B. M., Nanis, M., Wu, C. & Su, A. I. in
Biocomputing 2015 282–293 (World Scientific, 2014).
[2] Burger, J. D. et al. Hybrid curation of gene–mutation relations
combining automated extraction and crowdsourcing. Database
2014, bau094 (2014).
[3] Gottlieb, A., Hoehndorf, R., Dumontier, M. & Altman, R. B.
Ranking adverse drug reactions with crowdsourcing. J. Med.
Internet Res. 17, e80 (2015).
[4] Khare, R. et al. Scaling drug indication curation through
crowdsourcing. Database 2015, bav016 (2015).
[5] Vergoulis, T. et al. mirPub: a database for searching microRNA
publications. Bioinformatics 31, 1502–1504 (2015).

THANK YOU!
QUESTIONS?
@AmrapaliZ
amrapali.zaveri@maastrichtuniversity.nl
