MetaCrowd: Crowdsourcing
Gene Expression Metadata
Quality Assessment
Amrapali Zaveri and Michel Dumontier
@AmrapaliZ
amrapali.zaveri@maastrichtuniversity.nl
Bio-ontologies 2017, July 24-25, 2017
BIOMEDICAL DATA ON THE WEB
BIOMEDICAL METADATA ON THE WEB — SIGNIFICANCE
➤ To (re-)use this data, we need to understand the
structure of datasets and the experimental conditions under
which they were produced
➤ We require an accurate, structured and complete description of
the data, defined as metadata
➤ Good quality metadata is essential in finding, interpreting, and
reusing existing data beyond what the original investigators
envisioned
➤ Facilitates a data-driven approach by combining and analyzing
similar data to uncover novel insights or even more subtle
trends in the data
BIOMEDICAL METADATA ON THE WEB - CHALLENGES
➤ SIZE complexity
➤ QUALITY measures
➤ TIME consuming
➤ COSTLY, requires experts
HYPOTHESIS
Crowdsourcing, i.e. the use of non-expert workers, can
be used to curate large-scale digital
biomedical metadata on the Web.
CROWDSOURCING - WHAT & WHY?
TIME MONEY
➤ Highly parallelizable tasks
➤ Work is broken down into
smaller — ‘micro’ — pieces
that can be solved
independently
➤ Tasks based on human skills
not easily replicable by machines
➤ Non-expert workers can perform
the tasks with a minimal
payment
Consolidated answers solve scientific problems!
RELATED WORK - CROWDSOURCING BIOMEDICAL RESEARCH
➤ Improve automated mining of biomedical text for annotating
diseases [1]
➤ Curation of gene-mutation relations [2]
➤ Identifying relationships between drugs and side-effects [3],
drugs and their indications [4]
➤ Annotation of microRNA functions [5].
GENE EXPRESSION OMNIBUS
➤ Unstructured
➤ Spreadsheet submission
➤ No controlled vocabulary
➤ Heterogeneity of terms
➤ Size complexity
➤ ~Billion records
Meta-analysis from GEO data
A common rejection module (CRM) for acute rejection across multiple
organs identifies novel therapeutics for organ transplantation
Khatri et al. JEM. 210 (11): 2205; DOI: 10.1084/jem.20122709
Metadata issues:
• Missing
• Incomplete
• Inaccurate
GEO METADATA - EXAMPLE
44,000,000 key: value pairs
GEO METADATA - QUALITY PROBLEMS FOR KEYS
➤ Minor spelling discrepancies
➤ genotype/varaiation, genotype/varat,
genotype/varation, genotype/variaion,
genotype/variataion, genotype/variation
➤ Different syntactic representations
➤ age (years), age(yrs) and age_year
➤ Different terms to denote one concept
➤ disease, illness, healthy control
➤ Two different key categories in one key name
➤ disease/cell type, tissue/cell line,
treatment age
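A minimal sketch of how the spelling variants above could be grouped automatically, assuming a small hand-picked list of canonical keys and a similarity threshold of 0.85 (both are illustrative choices, not taken from the slides). It uses Python's standard difflib. The function names are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_variants(keys, canonical, threshold=0.85):
    """Assign each key to its most similar canonical key, if similar enough.

    The canonical list and the threshold are assumptions for illustration.
    """
    groups = {c: [] for c in canonical}
    unmatched = []
    for key in keys:
        best = max(canonical, key=lambda c: similarity(key, c))
        if similarity(key, best) >= threshold:
            groups[best].append(key)
        else:
            unmatched.append(key)
    return groups, unmatched

# Example keys copied from the slide above
keys = [
    "genotype/varaiation", "genotype/varat", "genotype/varation",
    "genotype/variaion", "genotype/variataion", "genotype/variation",
    "age (years)", "age(yrs)", "age_year", "illness",
]
canonical = ["genotype/variation", "age", "disease"]

groups, unmatched = group_variants(keys, canonical)
print(groups["genotype/variation"])  # the six spelling variants
print(unmatched)                     # keys that string similarity cannot place
```

String similarity catches the minor spelling discrepancies, but the syntactic variants of "age" and the synonym "illness" fall below the threshold. Cases like these need human judgment or an ontology lookup.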
METACROWD METHODOLOGY
GEO Metadata
8 GEO Keys, 5 Values (each)
• cell line
• disease
• gender/sex
• genotype
• strain
• time
• tissue
• treatment
Key Definitions
Semanticscience Integrated Ontology (SIO)
MICRO TASKS — CROWDFLOWER
MICRO TASKS — SETTINGS
• 3 workers per task
• ‘Dynamic Judgment’: up to 7 workers, with a confidence threshold of 0.8
• No. of gold standard questions — 60
• Min. accuracy — 80%
• 5 cents per judgment
• 10 tasks per page
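A rough sketch of how the judgment settings above could interact. It assumes confidence is the trust-weighted share of the leading answer; that formula, the function names, and the example trust scores are assumptions for illustration, not details given on the slide or a description of the platform's internals.

```python
from collections import defaultdict

def confidence(judgments):
    """judgments: list of (answer, worker_trust) pairs with trust > 0.

    Returns the leading answer and its trust-weighted share of all judgments.
    """
    weight = defaultdict(float)
    for answer, trust in judgments:
        weight[answer] += trust
    top = max(weight, key=weight.get)
    return top, weight[top] / sum(weight.values())

def collect(stream, initial=3, maximum=7, threshold=0.8):
    """Take judgments until confidence reaches the threshold or the cap is hit."""
    judgments = []
    for judgment in stream:
        judgments.append(judgment)
        if len(judgments) >= initial:
            answer, conf = confidence(judgments)
            if conf >= threshold or len(judgments) == maximum:
                return answer, conf, len(judgments)
    raise ValueError("stream ended before a decision could be made")

# Hypothetical judgments for one key: (chosen category, worker trust)
stream = [
    ("tissue", 0.90), ("tissue", 0.85), ("cell line", 0.80),
    ("tissue", 0.90), ("tissue", 0.95),
]
answer, conf, used = collect(stream)
print(answer, round(conf, 3), used)
```

In this example the first three judgments disagree, so two more are requested before the leading answer passes the 0.8 threshold.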
RESULTS OVERVIEW
No. of microtasks (keys): 1643
Total no. of workers: 145
Total no. of judgments: 7835
Overall accuracy: 0.934
No. of gold standard questions: 60
Accuracy on gold standard questions: 0.930
Total cost: $451
Total time: 1 hour
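A quick arithmetic check of the figures above against the settings slide (5 cents per judgment, 3 to 7 workers per task). The split between worker payments and the remainder is only a derived figure; the slides do not itemize the total cost, so what the remainder covers is left open.

```python
# Figures copied from the slides; amounts are kept in integer cents
judgments = 7835
microtasks = 1643
cents_per_judgment = 5
total_cost_cents = 451 * 100

worker_payment_cents = judgments * cents_per_judgment
remainder_cents = total_cost_cents - worker_payment_cents
judgments_per_task = judgments / microtasks

print(f"worker payments: ${worker_payment_cents / 100:.2f}")
print(f"not itemized on the slide: ${remainder_cents / 100:.2f}")
print(f"average judgments per microtask: {judgments_per_task:.2f}")
```

The average number of judgments per microtask falls between the configured minimum of 3 and maximum of 7 workers.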
RESULTS FOR EACH KEY CATEGORY
Key category   No. of keys   True positive, false positive   Accuracy
Cell line      109           711, 21                         0.955
Disease        85            412, 10                         0.937
Gender         72            645, 23                         0.902
Genotype       112           566, 10                         0.984
Strain         181           788, 4                          0.966
Time           698           2489, 120                       0.908
Tissue         145           567, 6                          0.947
Treatment      242           846, 49                         0.944
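A small consistency check on the per-category figures above. It computes a key-weighted mean of the reported accuracies. Weighting by number of keys is an assumption: the slides do not state how the overall accuracy of 0.934 was derived, so this is a sanity check rather than the authors' calculation.

```python
# (category, number of keys, reported accuracy), copied from the table above
rows = [
    ("Cell line", 109, 0.955),
    ("Disease", 85, 0.937),
    ("Gender", 72, 0.902),
    ("Genotype", 112, 0.984),
    ("Strain", 181, 0.966),
    ("Time", 698, 0.908),
    ("Tissue", 145, 0.947),
    ("Treatment", 242, 0.944),
]

total_keys = sum(n for _, n, _ in rows)
weighted_accuracy = sum(n * acc for _, n, acc in rows) / total_keys
lowest = min(rows, key=lambda row: row[2])

print(f"total keys: {total_keys}")
print(f"key-weighted mean accuracy: {weighted_accuracy:.3f}")
print(f"lowest accuracy: {lowest[0]} ({lowest[2]})")
```

The weighted mean lands close to, but not exactly on, the reported overall accuracy. The per-category key counts also sum to 1644, one more than the 1643 microtasks in the overview; the slides do not explain the difference.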
RESULTS FOR EACH KEY CATEGORY — EXAMPLES (1)
Workers classified the following keys incorrectly:
• Cell line
• cell line initiation date, cell line source age
• Disease
• diseasestatus
• Gender
• cell sex
• Strain
• strain ID
• Tissue
• tissue & age, tissue/development stage
CONCLUSIONS & LIMITATIONS
• Crowdsourcing, i.e. the use of non-expert workers, can be used to curate
large-scale digital gene expression metadata on the Web.
• Several keys did not achieve consensus amongst the
workers, due to:
• lack of semantically annotated values
• ambiguous nomenclature of keys as well as the values
• values indicating that keys belong to more than one
category
• inconsistent usage of the particular metadata key
CROWDSOURCING GEO METADATA QUALITY — FUTURE WORK
• Perform crowdsourcing on values and key: value pairs
• Implement a semi-automated approach to identify similar keys
using ontologies
• Design a pipeline that combines a semi-automated method,
crowdsourcing, and experts
REFERENCES
[1] Good, B. M., Nanis, M., Wu, C. & Su, A. I. in
Biocomputing 2015, 282–293. World Scientific (2014).
[2] Burger, J. D. et al. Hybrid curation of gene–mutation relations
combining automated extraction and crowdsourcing. Database
2014, bau094 (2014).
[3] Gottlieb, A., Hoehndorf, R., Dumontier, M. & Altman, R. B.
Ranking adverse drug reactions with crowdsourcing. J. Med.
Internet Res. 17, e80 (2015).
[4] Khare, R. et al. Scaling drug indication curation through
crowdsourcing. Database 2015, bav016 (2015).
[5] Vergoulis, T. et al. mirPub: a database for searching microRNA
publications. Bioinformatics 31, 1502–1504 (2015).
THANK YOU!
QUESTIONS?
How libraries can support authors with open access requirements for UKRI fund...
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 

MetaCrowd: Crowdsourcing Gene Expression Metadata Quality Assessment

  • 1. MetaCrowd: Crowdsourcing Gene Expression Metadata Quality Assessment
    Amrapali Zaveri and Michel Dumontier
    @AmrapaliZ, amrapali.zaveri@maastrichtuniversity.nl
    Bio-ontologies 2017, July 24-25, 2017
  • 2. BIOMEDICAL DATA ON THE WEB
  • 3. BIOMEDICAL METADATA ON THE WEB - SIGNIFICANCE
    ➤ To (re-)use this data, we need to understand the structure of the datasets and the experimental conditions under which they were produced
    ➤ We require an accurate, structured, and complete description of the data, defined as metadata
    ➤ Good-quality metadata is essential for finding, interpreting, and reusing existing data beyond what the original investigators envisioned
    ➤ It facilitates a data-driven approach: combining and analyzing similar data to uncover novel insights or more subtle trends in the data
  • 4. BIOMEDICAL METADATA ON THE WEB - CHALLENGES
    SIZE complexity; QUALITY measures; TIME consuming; COSTLY, requires experts
  • 5. HYPOTHESIS
    Crowdsourcing, i.e. the use of non-expert workers, can be used to curate large-scale digital biomedical metadata on the Web.
  • 6. CROWDSOURCING - WHAT & WHY?
    TIME
    ➤ Highly parallelizable tasks
    ➤ Work is broken down into smaller ('micro') pieces that can be solved independently
    MONEY
    ➤ Tasks based on human skills not easily replicable by machines
    ➤ Non-expert workers can perform the tasks for a minimal payment
    Consolidated answers solve scientific problems!
  • 7. RELATED WORK - CROWDSOURCING BIOMEDICAL RESEARCH
    ➤ Improving automated mining of biomedical text for annotating diseases [1]
    ➤ Curation of gene-mutation relations [2]
    ➤ Identifying relationships between drugs and side-effects [3], and between drugs and their indications [4]
    ➤ Annotation of microRNA functions [5]
  • 8. GENE EXPRESSION OMNIBUS
    ➤ Unstructured
    ➤ Spreadsheet submission
    ➤ No controlled vocabulary
    ➤ Heterogeneity of terms
    ➤ Size complexity
    ➤ ~Billion records
  • 9. Meta-analysis from GEO data
    "A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantation"
    Khatri et al. JEM. 210 (11): 2205; DOI: 10.1084/jem.20122709
    Metadata issues:
    • Missing
    • Incomplete
    • Inaccurate
  • 10. GEO METADATA - EXAMPLE
    44,000,000 key: value pairs
  • 11. GEO METADATA - QUALITY PROBLEMS FOR KEYS
    ➤ Minor spelling discrepancies
        genotype/varaiation, genotype/varat, genotype/varation, genotype/variaion, genotype/variataion, genotype/variation
    ➤ Different syntactic representations
        age (years), age(yrs) and age_year
    ➤ Different terms to denote one concept
        disease, illness, healthy control
    ➤ Two different key categories in one key name
        disease/cell type, tissue/cell line, treatment age
  • 12. METACROWD METHODOLOGY
    GEO metadata: 8 GEO keys, 5 values each
    • cell line
    • disease
    • gender/sex
    • genotype
    • strain
    • time
    • tissue
    • treatment
    Key definitions: Semanticscience Integrated Ontology
  • 13. MICRO TASKS - CROWDFLOWER
  • 14. MICRO TASKS - SETTINGS
    • 3 workers per task
    • 'Dynamic Judgment' up to 7 workers, with 0.8 confidence
    • No. of gold standard questions: 60
    • Min. accuracy: 80%
    • 5 cents per judgment
    • 10 tasks per page
  • 15. RESULTS OVERVIEW
    No. of microtasks (keys)               1643
    Total no. of workers                   145
    Total no. of judgments                 7835
    Overall accuracy                       0.934
    No. of gold standard questions         60
    Accuracy on gold standard questions    0.930
    Total cost                             $451
    Total time                             1 hour
  • 16. RESULTS FOR EACH KEY CATEGORY
    Key category   No. of keys   True positive, false positive   Accuracy
    Cell line      109           711, 21                         0.955
    Disease        85            412, 10                         0.937
    Gender         72            645, 23                         0.902
    Genotype       112           566, 10                         0.984
    Strain         181           788, 4                          0.966
    Time           698           2489, 120                       0.908
    Tissue         145           567, 6                          0.947
    Treatment      242           846, 49                         0.944
  • 17. RESULTS FOR EACH KEY CATEGORY - EXAMPLES (1)
    Keys that workers classified incorrectly:
    • Cell line: cell line initiation date, cell line source age
    • Disease: diseasestatus
    • Gender: cell sex
    • Strain: strain ID
    • Tissue: tissue & age, tissue/development stage
  • 18. CONCLUSIONS & LIMITATIONS
    • Crowdsourcing, i.e. the use of non-expert workers, can be used to curate large-scale digital gene expression metadata on the Web.
    • Several keys did not achieve consensus among the workers, due to one of the following:
        • lack of semantically annotated values
        • ambiguous nomenclature of the keys as well as the values
        • values indicating that a key belongs to more than one category
        • inconsistent usage of the particular metadata key
  • 19. CROWDSOURCING GEO METADATA QUALITY - FUTURE WORK
    • Perform crowdsourcing on values and on key: value pairs
    • Implement a semi-automated approach to identify similar keys using ontologies
    • Design a pipeline that combines semi-automated methods, crowdsourcing, and experts
  • 20. REFERENCES
    [1] Benjamin, M. G., Max, N., Chunlei, W. U. & Andrew, I. S. in Biocomputing 2015, 282–293 (World Scientific, 2014).
    [2] Burger, J. D. et al. Hybrid curation of gene–mutation relations combining automated extraction and crowdsourcing. Database 2014, bau094 (2014).
    [3] Gottlieb, A., Hoehndorf, R., Dumontier, M. & Altman, R. B. Ranking adverse drug reactions with crowdsourcing. J. Med. Internet Res. 17, e80 (2015).
    [4] Khare, R. et al. Scaling drug indication curation through crowdsourcing. Database 2015, bav016 (2015).
    [5] Vergoulis, T. et al. mirPub: a database for searching microRNA publications. Bioinformatics 31, 1502–1504 (2015).
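
The spelling variants of keys listed on slide 11 (for example genotype/varaiation versus genotype/variation) could plausibly be pre-grouped by a simple string-similarity pass before any crowdsourcing. The sketch below is only an illustration and is not the authors' method: the function names `normalize`, `similarity`, and `cluster_keys`, the 0.85 threshold, and the greedy grouping strategy are all assumptions, built on the Python standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalize(key: str) -> str:
    """Lowercase a key and turn every non-alphanumeric character into a space."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in key.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two keys after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_keys(keys, threshold=0.85):
    """Greedy grouping (an assumed strategy): each key joins the first
    cluster whose first member is at least `threshold` similar to it,
    otherwise it starts a new cluster."""
    clusters = []
    for key in keys:
        for cluster in clusters:
            if similarity(key, cluster[0]) >= threshold:
                cluster.append(key)
                break
        else:
            clusters.append([key])
    return clusters


if __name__ == "__main__":
    example_keys = [
        "genotype/variation",
        "genotype/varaiation",
        "genotype/varat",
        "genotype/varation",
        "genotype/variaion",
        "genotype/variataion",
        "tissue",
        "disease",
    ]
    for group in cluster_keys(example_keys):
        print(group)
```

Because each key is compared only with the first member of a cluster, the result depends on input order and on the chosen threshold. A purely string-based pass also cannot merge different terms for one concept (disease versus illness); that case is what the ontology-based approach on slide 19 is meant to address.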
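
The settings on slide 14 (3 workers per task, extended up to 7 workers until a confidence of 0.8 is reached) can be read as a simple stopping rule. The slides do not say how the platform computes confidence, so the sketch below assumes a trust-weighted share of the leading answer. The names `aggregate` and `collect_judgments` and the `(answer, trust)` tuple format are hypothetical and do not correspond to any CrowdFlower API.

```python
from collections import defaultdict


def aggregate(judgments):
    """Return (top_answer, confidence) for a list of (answer, trust) pairs.

    Confidence is ASSUMED here to be the trust-weighted share of the
    leading answer; the slides do not specify the actual formula.
    """
    weights = defaultdict(float)
    for answer, trust in judgments:
        weights[answer] += trust
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no judgments with positive trust")
    best = max(weights, key=weights.get)
    return best, weights[best] / total


def collect_judgments(stream, min_judgments=3, max_judgments=7,
                      min_confidence=0.8):
    """Consume judgments until the leading answer is confident enough
    or the cap on judgments is reached.

    Returns (answer, confidence, number_of_judgments_used).
    """
    judgments = []
    for judgment in stream:
        judgments.append(judgment)
        if len(judgments) < min_judgments:
            continue
        answer, confidence = aggregate(judgments)
        if confidence >= min_confidence or len(judgments) >= max_judgments:
            return answer, confidence, len(judgments)
    if not judgments:
        raise ValueError("no judgments available")
    answer, confidence = aggregate(judgments)
    return answer, confidence, len(judgments)


if __name__ == "__main__":
    votes = [("tissue", 1.0), ("cell line", 1.0), ("tissue", 1.0),
             ("tissue", 1.0), ("tissue", 1.0), ("tissue", 1.0)]
    print(collect_judgments(iter(votes)))
```

With equal trust for every worker, this rule stops after three judgments when all workers agree, and keeps requesting judgments (up to seven) when they do not.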