
MetaCrowd: Crowdsourcing Gene Expression Metadata Quality Assessment

Flash update talk at Bio-ontologies 2017 (https://www.iscb.org/cms_addon/conferences/ismbeccb2017/bio-ontologies.php, http://www.bio-ontologies.org.uk/)


  1. MetaCrowd: Crowdsourcing Gene Expression Metadata Quality Assessment
     Amrapali Zaveri and Michel Dumontier
     @AmrapaliZ | amrapali.zaveri@maastrichtuniversity.nl
     Bio-ontologies 2017, July 24-25, 2017
  2. BIOMEDICAL DATA ON THE WEB
  3. BIOMEDICAL METADATA ON THE WEB — SIGNIFICANCE
     ➤ To (re-)use this data, we need to understand the structure of datasets and the experimental conditions under which they were produced
     ➤ This requires accurate, structured, and complete descriptions of the data, i.e. metadata
     ➤ Good-quality metadata is essential for finding, interpreting, and reusing existing data beyond what the original investigators envisioned
     ➤ It also enables a data-driven approach: combining and analyzing similar data to uncover novel insights or even subtler trends in the data
  4. BIOMEDICAL METADATA ON THE WEB — CHALLENGES
     ➤ Size and complexity
     ➤ Quality measures
     ➤ Time-consuming
     ➤ Costly: requires experts
  5. HYPOTHESIS: Crowdsourcing, i.e. non-expert workers, can be used to curate large-scale digital biomedical metadata on the Web.
  6. CROWDSOURCING — WHAT & WHY? (saves TIME and MONEY)
     ➤ Highly parallelizable tasks
     ➤ Work is broken down into smaller, 'micro' pieces that can be solved independently
     ➤ Tasks rely on human skills not easily replicated by machines
     ➤ Non-expert workers can perform the tasks for minimal payment
     Consolidated answers help solve scientific problems!
  7. RELATED WORK — CROWDSOURCING BIOMEDICAL RESEARCH
     ➤ Improving automated mining of biomedical text for annotating diseases [1]
     ➤ Curation of gene-mutation relations [2]
     ➤ Identifying relationships between drugs and side effects [3], and between drugs and their indications [4]
     ➤ Annotation of microRNA functions [5]
  8. GENE EXPRESSION OMNIBUS
     ➤ Unstructured
     ➤ Spreadsheet submission
     ➤ No controlled vocabulary
     ➤ Heterogeneity of terms
     ➤ Size and complexity: ~1 billion records
  9. Meta-analysis from GEO data
     A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantation. Khatri et al., JEM 210(11): 2205; DOI: 10.1084/jem.20122709
     Metadata issues:
     • Missing
     • Incomplete
     • Inaccurate
  10. GEO METADATA — EXAMPLE: ~44,000,000 key:value pairs
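A minimal sketch of where such key:value pairs come from, using the third-party GEOparse library to read sample characteristics from a GEO series. The accession used here is only an illustrative example, not necessarily one analyzed in the talk:

```python
# Illustrative sketch: extracting key:value metadata pairs from a GEO series
# with GEOparse. The accession GSE1563 is an arbitrary example.
import GEOparse

gse = GEOparse.get_GEO(geo="GSE1563", destdir="./geo_cache")

pairs = []
for gsm_name, gsm in gse.gsms.items():
    # Each 'characteristics_ch1' entry is a free-text "key: value" string.
    for line in gsm.metadata.get("characteristics_ch1", []):
        if ":" in line:
            key, value = line.split(":", 1)
            pairs.append((gsm_name, key.strip().lower(), value.strip()))

print(pairs[:5])
```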
  11. GEO METADATA — QUALITY PROBLEMS FOR KEYS
     ➤ Minor spelling discrepancies, e.g. genotype/varaiation, genotype/varat, genotype/varation, genotype/variaion, genotype/variataion, genotype/variation
     ➤ Different syntactic representations, e.g. age (years), age(yrs), and age_year
     ➤ Different terms to denote one concept, e.g. disease, illness, healthy control
     ➤ Two different key categories in one key name, e.g. disease/cell type, tissue/cell line, treatment age
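To make the first two problem classes concrete, here is an illustrative sketch (not the authors' code) that groups near-duplicate keys by normalizing punctuation and clustering on edit-distance similarity; the 0.8 threshold is an assumption:

```python
# Grouping near-duplicate GEO keys such as "genotype/variation" vs.
# "genotype/varaiation" via normalization + edit-distance similarity.
from difflib import SequenceMatcher
import re

keys = [
    "genotype/variation", "genotype/varaiation", "genotype/variaion",
    "age (years)", "age(yrs)", "age_year", "disease", "tissue",
]

def normalize(key):
    # Lowercase and strip punctuation so purely syntactic variants
    # ("age (years)" vs "age_year") collapse toward one form.
    return re.sub(r"[^a-z0-9]+", " ", key.lower()).strip()

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

clusters = []
for key in keys:
    for cluster in clusters:
        if similar(key, cluster[0]):
            cluster.append(key)
            break
    else:
        clusters.append([key])

for cluster in clusters:
    print(cluster)
```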
  12. METACROWD METHODOLOGY
     • GEO metadata → 8 GEO key categories, with 5 values each:
       cell line, disease, gender/sex, genotype, strain, time, tissue, treatment
     • Key definitions drawn from the Semanticscience Integrated Ontology (SIO)
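A sketch of how the microtask units described on this slide could be assembled: one task per GEO key, with up to five of its observed values sampled as context for the worker. The field names, the sample key and values, and the "none of the above" option are all hypothetical:

```python
# Hypothetical microtask builder: one task per GEO key, with sampled values
# shown as context. Field names and options are illustrative assumptions.
import random

CATEGORIES = ["cell line", "disease", "gender/sex", "genotype",
              "strain", "time", "tissue", "treatment"]

def build_microtask(key, observed_values, n_values=5, seed=42):
    random.seed(seed)
    sample = random.sample(observed_values, min(n_values, len(observed_values)))
    return {
        "key": key,
        "example_values": sample,
        "question": f"Which category best describes the key '{key}'?",
        "options": CATEGORIES + ["none of the above"],
    }

task = build_microtask("diseasestatus",
                       ["healthy", "AML", "control", "CML", "sepsis", "T2D"])
print(task)
```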
  13. MICROTASKS — CROWDFLOWER
  14. MICROTASKS — SETTINGS
     • 3 workers per task
     • 'Dynamic judgment' up to 7 workers, with 0.8 confidence
     • No. of gold standard questions: 60
     • Min. accuracy: 80%
     • 5 cents per judgment
     • 10 tasks per page
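The same settings, encoded as a plain configuration dict for reference. This is only an illustrative representation; CrowdFlower's (later Figure Eight / Appen) actual job API uses its own field names:

```python
# Slide settings as a config sketch; field names are illustrative, not the
# CrowdFlower API's real parameter names.
JOB_SETTINGS = {
    "judgments_per_task": 3,           # initial judgments collected per key
    "dynamic_judgments_max": 7,        # escalate up to 7 workers...
    "dynamic_confidence_target": 0.8,  # ...until 0.8 agreement confidence
    "gold_questions": 60,              # pre-labeled test questions
    "min_worker_accuracy": 0.80,       # workers below this are excluded
    "payment_per_judgment_usd": 0.05,
    "tasks_per_page": 10,
}
```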
  15. RESULTS OVERVIEW
     No. of microtasks (keys):             1643
     Total no. of workers:                 145
     Total no. of judgments:               7835
     Overall accuracy:                     0.934
     No. of gold standard questions:       60
     Accuracy on gold standard questions:  0.930
     Total cost:                           $451
     Total time:                           1 hour
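For intuition about how per-key answers behind these numbers can be consolidated, here is a hedged sketch of majority-vote aggregation with a simple agreement score standing in for CrowdFlower's confidence measure. It is a simplification, not the authors' exact method (the platform weights judgments by worker trust):

```python
# Simplified answer aggregation: majority vote over the judgments collected
# for one key, with raw agreement as a stand-in for platform confidence.
from collections import Counter

def aggregate(judgments):
    """judgments: list of category labels from different workers for one key."""
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    confidence = votes / len(judgments)
    return label, confidence

label, conf = aggregate(["disease", "disease", "treatment"])
print(label, round(conf, 2))  # disease 0.67 -> below 0.8, triggers more judgments
```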
  16. RESULTS FOR EACH KEY CATEGORY
     Key Category   No. of Keys   True Pos., False Pos.   Accuracy
     Cell line      109           711, 21                 0.955
     Disease        85            412, 10                 0.937
     Gender         72            645, 23                 0.902
     Genotype       112           566, 10                 0.984
     Strain         181           788, 4                  0.966
     Time           698           2489, 120               0.908
     Tissue         145           567, 6                  0.947
     Treatment      242           846, 49                 0.944
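From the raw counts in this table one can derive per-category precision, TP / (TP + FP). Note this is not identical to the slide's accuracy column, which comes from the crowdsourcing platform's own (trust-weighted) scoring, though the two track each other closely:

```python
# Per-category precision from the slide's TP/FP counts. This differs from
# the slide's "Accuracy" column, which is the platform's own measure.
RESULTS = {
    "cell line": (711, 21), "disease": (412, 10), "gender": (645, 23),
    "genotype": (566, 10), "strain": (788, 4), "time": (2489, 120),
    "tissue": (567, 6), "treatment": (846, 49),
}

for category, (tp, fp) in RESULTS.items():
    print(f"{category:10s} precision = {tp / (tp + fp):.3f}")
```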
  17. RESULTS FOR EACH KEY CATEGORY — EXAMPLES (1)
     Workers classified incorrectly for:
     • Cell line: cell line initiation date, cell line source age
     • Disease: diseasestatus
     • Gender: cell sex
     • Strain: strain ID
     • Tissue: tissue & age, tissue/development stage
  18. CONCLUSIONS & LIMITATIONS
     • Crowdsourcing, i.e. non-expert workers, can be used to curate large-scale digital gene expression metadata on the Web.
     • Several keys did not achieve consensus among the workers, due to:
       • lack of semantically annotated values
       • ambiguous nomenclature of keys as well as values
       • values indicating that a key belongs to more than one category
       • inconsistent usage of the particular metadata key
  19. CROWDSOURCING GEO METADATA QUALITY — FUTURE WORK
     • Perform crowdsourcing on values and key:value pairs
     • Implement a semi-automated approach to identify similar keys using ontologies
     • Design a pipeline combining the semi-automated method, crowdsourcing, and experts
  20. REFERENCES
     [1] Good, B. M., Nanis, M., Wu, C. & Su, A. I. Microtask crowdsourcing for disease mention annotation in PubMed abstracts. In Biocomputing 2015, 282–293. World Scientific (2014).
     [2] Burger, J. D. et al. Hybrid curation of gene–mutation relations combining automated extraction and crowdsourcing. Database 2014, bau094 (2014).
     [3] Gottlieb, A., Hoehndorf, R., Dumontier, M. & Altman, R. B. Ranking adverse drug reactions with crowdsourcing. J. Med. Internet Res. 17, e80 (2015).
     [4] Khare, R. et al. Scaling drug indication curation through crowdsourcing. Database 2015, bav016 (2015).
     [5] Vergoulis, T. et al. mirPub: a database for searching microRNA publications. Bioinformatics 31, 1502–1504 (2015).
  21. THANK YOU! QUESTIONS? @AmrapaliZ | amrapali.zaveri@maastrichtuniversity.nl
