FAIRification experience: Clarifying the semantics of data matrices

This webinar presents the Statistics Ontology (STATO), a semantic framework that supports the creation of standardized analysis reports and helps with the review of results reported as data matrices. STATO includes a hierarchy of classes and a vocabulary for annotating the statistical methods used in life, natural and biomedical science investigations, text mining and statistical analyses.

Published in: Healthcare

  1. 1. FAIRification experience: Clarifying the semantics of data matrices. Thurs, 28th May 2020 at 15:00 BST. Hosted by: Ian Harrow, Pistoia Alliance. Speaker: Philippe Rocca-Serra, Oxford e-Research Centre, University of Oxford. FAIR/OM projects Community of Interest webinar series
  2. 2. This webinar is being recorded
  3. 3. Audience Q&A Please use the questions box
  4. 4. ©PistoiaAlliance Philippe Rocca-Serra ● Associate Member of Faculty at the Oxford e-Research Centre, Department of Engineering Science, University of Oxford, since 2010, where he focuses on activities supporting open data and open science. ● Before that, based at the EMBL European Bioinformatics Institute (EBI), he established the ISA framework, which supports the reporting of multi-omics data and underlies MetaboLights, the European repository for metabolomics data. ● As part of the OBO Foundry, he also works with the community to develop semantic resources to improve data reporting. ● More recently, as part of the IMI-funded FAIRplus project, he has been coordinating the creation of guidance on FAIR data and FAIRification processes.
  5. 5. ©PistoiaAlliance FAIRification experience: Clarifying the semantics of data matrices Philippe Rocca-Serra, PhD [0000-0001-9853-5668] @Phil_at_OeRC philippe.rocca-serra@oerc.ox.ac.uk Data Readiness Group, Oxford e-Research Centre Department of Engineering Science, University of Oxford Pistoia Alliance Webinar, May 28th, 2020
  6. 6. ©PistoiaAlliance An Exercise in FAIRification And Lessons learned
  7. 7. ©PistoiaAlliance GC-MS targeted metabolite profiling data: available only as supplementary material An ideal test case to perform a FAIRification process “A Rose of the Garden Fair” An Exercise in FAIRification
  8. 8. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification Key constraints and rules of the game: ● Use existing open tools and use out-of-box solutions ● Use existing open efforts ● Demonstrate the value added
  9. 9. ©PistoiaAlliance Step 1: address Findability and Accessibility. We made the initial spreadsheet table discoverable and citable by uploading it to Zenodo, assigning an open license (CC-BY 4.0), and obtaining a persistent unique identifier: https://doi.org/10.5281/zenodo.2598799 “A Rose of the Garden Fair” An Exercise in FAIRification
  10. 10. ©PistoiaAlliance ● “What’s wrong with Excel files?” ○ Ad-hoc structuring ○ Absence of annotation ○ Absence of provenance ○ Risk of auto-correction ○ … ● What are the alternatives? ○ Domain specific specifications: Metabolights Assignment File ○ Generic open standards (domain independent): JSON / Tabular data package “A Rose of the Garden Fair” An Exercise in FAIRification
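To make the "Tabular data package" alternative on slide 10 concrete, here is a minimal sketch of a descriptor built with the Python standard library. The package name, file path and field names are illustrative assumptions, not the contents of the descriptor actually published with this work; only the general layout (`resources`, `schema`, `fields`) follows the Frictionless Data specification.

```python
import json

# Minimal Tabular Data Package descriptor.
# All names and paths below are hypothetical placeholders.
descriptor = {
    "name": "rose-aroma-example",
    "licenses": [{"name": "CC-BY-4.0"}],
    "resources": [
        {
            "name": "metabolite-measurements",
            "path": "measurements.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "chemical_name", "type": "string"},
                    {"name": "treatment", "type": "string"},
                    {"name": "sample_mean", "type": "number"},
                    {"name": "sem", "type": "number"},
                ]
            },
        }
    ],
}

# Serialise the descriptor; this text would be saved as datapackage.json.
descriptor_json = json.dumps(descriptor, indent=2)
print(descriptor_json)
```

Unlike an ad-hoc spreadsheet, the descriptor states the license and the type of every column explicitly, which addresses the "absence of annotation" point on the slide.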
  11. 11. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification Offsets Protocol information Table of actual data
  12. 12. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification This is a Data Cube. Step 2: address Interoperability. We regularized the three dimensions of the matrix (data cube), which represent: i) the metabolites (molecular entities), ii) the treatments (experimental conditions and corresponding biomaterials and bioassays), and iii) the quantitation type (measurements).
  13. 13. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification 1. Molecular entities: the easiest to model and identify 2. Average & SE Measurement About what? A population / study group 3. Decomposing the `metaheader` “R. chinensis ‘Old Blush’ sepals” into its parts, namely: a. Plant Part (anatomical information) b. Genotype (taxonomic information) The most important information is the fact that 3a and 3b are ‘Independent Variables’ / Predictors. Essential information capturing the intent of experimentalists. Central information and organizing principle for structuring information => Specific Markup Required
  14. 14. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification 1. Molecular entities: the easiest to model and identify 2. Average & SE Measurement About what? A population / study group 3. Decomposing the `metaheader` “R. chinensis ‘Old Blush’ sepals” into its parts, namely: a. Plant Part (anatomical information) b. Genotype (taxonomic information) Key Step: Clarifying the semantics of data matrices. Main Guide: Design of Experiment Information
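The decomposition of a compound column header into its independent variables, as described on slides 13 and 14, can be sketched as follows. The header shape (a genotype ending in a quoted cultivar name, followed by a single word for the plant part) and the function name `decompose_header` are assumptions made for illustration; real headers may need a lookup table rather than a pattern.

```python
import re

# Assumed header shape: "<species> '<cultivar>' <plant part>".
# Both straight and curly closing quotes are accepted.
HEADER_PATTERN = re.compile(r"^(?P<genotype>.+['’])\s+(?P<plant_part>\w+)$")


def decompose_header(header: str) -> dict:
    """Split a compound column header into two independent variables."""
    match = HEADER_PATTERN.match(header.strip())
    if match is None:
        raise ValueError(f"Unrecognised header layout: {header!r}")
    return {
        "genotype": match.group("genotype"),      # taxonomic information
        "plant_part": match.group("plant_part"),  # anatomical information
    }


parts = decompose_header("R. chinensis 'Old Blush' sepals")
print(parts)
```

Keeping the two factors in separate fields is what later allows each one to be annotated with its own ontology term and used as a predictor in its own right.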
  15. 15. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification https://github.com/proccaserra/rose2018ng-notebook/blob/master/data/processed/rose-data/rose-aroma-data-integration-datapackage.json
  16. 16. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification https://github.com/proccaserra/rose2018ng-notebook/blob/master/data/processed/rose-data/rose-aroma-data-integration-datapackage.json https://github.com/proccaserra/rose2018ng-notebook/blob/master/data/processed/rose-data/rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-table-example.csv
  17. 17. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification
  18. 18. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification
  19. 19. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification ETL Process: from an ad-hoc MS Excel table published as supplementary material to a Tabular Data Package, a ‘long table’ compatible with grammar-of-graphics plotting libraries (Python plotnine, R ggplot2, tidyr/tidyverse). Key Step: Choosing a syntax to maximize reuse. Main Guide: What is the preferred input to popular plotting libraries? Pandas DataFrame operations: pd.read_excel(), pd.wide_to_long()
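Slide 19 names `pd.wide_to_long()` as the reshaping step. The sketch below shows that call on a tiny made-up table; the column names and all values are invented for illustration and are not taken from the rose dataset. In the real workflow the wide table would come from `pd.read_excel()` on the supplementary file rather than being typed in.

```python
import pandas as pd

# Made-up wide table: one row per compound,
# one column per (measure, plant part) combination.
wide = pd.DataFrame(
    {
        "chemical": ["compound_A", "compound_B"],
        "mean_sepals": [1.0, 2.0],
        "mean_petals": [3.0, 4.0],
        "sem_sepals": [0.1, 0.2],
        "sem_petals": [0.3, 0.4],
    }
)

# Reshape to a long table: one row per (compound, plant part),
# with the two measures as ordinary columns.
long_table = pd.wide_to_long(
    wide,
    stubnames=["mean", "sem"],  # prefixes of the measure columns
    i="chemical",               # identifier column
    j="plant_part",             # new column holding the suffix
    sep="_",
    suffix=r"\w+",              # suffixes are words, not digits
).reset_index()

print(long_table)
```

The long layout gives each variable its own column, which is the input shape that plotnine and ggplot2 expect when mapping columns to aesthetics.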
  20. 20. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification Key Step: Clarifying the semantics of data matrices. Annotating each side of the Data Cube. Step 2: Semantic Annotation and conversion to RDF. Plant Ontology, NCBI Taxonomy, ChEBI. 1. Molecular entities 2. Treatment and Independent Variables 3. Quantitation Types/Measures https://github.com/ISA-tools/stato
  21. 21. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification 1. Data Integration Challenge: Is it possible to apply the same process to another dataset? a. Test with another dataset by the same group, published in 2015 b. All steps are the same, except that the conversion starts from a PDF document instead of an Excel document c. Comparison of metabolite profiles between datasets
  22. 22. ©PistoiaAlliance Step 3: Visual Exploration “A Rose of the Garden Fair” An Exercise in FAIRification
  23. 23. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification
  24. 24. ©PistoiaAlliance Relevance for FAIRplus: ReSOLUTE case study NGS based dataset: RNA-Seq data matrix a. transcriptTPM_parental_celllines.csv b. geneTPM_parental_celllines.csv c. mean_transcriptTPM_parental_celllines.cs d. mean_geneTPM_parental_celllines.csv
  25. 25. ©PistoiaAlliance Relevance for FAIRplus: ReSOLUTE case study Observations: ● The matrix is missing an explicit dimension ● The Quantitation Type is implicitly reported in the file name ● TPM is absent from ontologies but FPKM is present in STATO => term submission ● Provenance information is missing ○ Which tool or workflow generated the matrix? ○ Which genome reference was used? ○ Which version of Ensembl was used? ○ Where should all this information be captured? ■ Capability gap
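Slide 25 observes that the quantitation type is only implied by the file name. As a rough sketch of how that implicit information could be made explicit, the function below parses the file names listed on slide 24. The naming pattern is inferred from those names alone, and the function name `describe_matrix` and the output keys are hypothetical; provenance such as tool, genome reference and Ensembl version cannot be recovered this way and would still have to be recorded separately.

```python
import re

# Pattern inferred from the file names on slide 24, e.g.
# "mean_geneTPM_parental_celllines.csv"; an assumption, not a documented convention.
FILENAME_PATTERN = re.compile(
    r"^(?P<aggregation>mean_)?"
    r"(?P<feature>gene|transcript)"
    r"(?P<unit>[A-Z]+)_"
    r"(?P<sample_set>.+)\.csv$"
)


def describe_matrix(filename: str) -> dict:
    """Turn the information hidden in a matrix file name into explicit metadata."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"File name does not follow the assumed pattern: {filename!r}")
    return {
        "feature_type": match.group("feature"),
        "quantitation_type": match.group("unit"),
        # None means the name carries no aggregation prefix.
        "aggregation": "mean" if match.group("aggregation") else None,
        "sample_set": match.group("sample_set"),
    }


metadata = describe_matrix("mean_geneTPM_parental_celllines.csv")
print(metadata)
```

The resulting dictionary could be stored alongside the matrix, for instance in a data package descriptor, so that the unit no longer depends on the file keeping its original name.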
  26. 26. ©PistoiaAlliance “A Rose of the Garden Fair” An Exercise in FAIRification Key Step: FAIRify software
  27. 27. ©PistoiaAlliance https://doi.org/10.5281/zenodo.3274257 Here is where to find all the material presented
  28. 28. ©PistoiaAlliance ● Process planning : Integration Challenge & competency questions ● Licensing ● Modeling ● Terminology Services: ○ Terminology hosting ○ Terminology lookup service ○ Terminology extension (term submission) ● Process planning ○ Transformation: ■ Format conversion ■ Term extraction ■ Term tagging ● Missing data handling ● Quality control / Assessment ● Identifier minting ● Publishing Capability Identification during this FAIRification process
  29. 29. ©PistoiaAlliance ● Process planning : Integration Challenge & competency questions ● Licensing ● Modeling ● Terminology Services: ○ Terminology hosting ○ Terminology lookup service ○ Terminology extension (term submission) ● Process planning ○ Transformation: ■ Format conversion ■ Term extraction ■ Term tagging ● Missing data handling ● Quality control / Assessment ● Identifier minting ● Publishing Capability Identification during this FAIRification process https://fairsharing.github.io/FAIR-Evaluator-FrontEnd/#!/evaluations/170 https://doi.org/10.1101/649202
  30. 30. ©PistoiaAlliance ● Extension of STATO Ontology: https://github.com/ISA-tools/stato Ongoing Development and Future Work
  31. 31. ©PistoiaAlliance Ongoing Development and Future Work Bringing it all together in a Wellcome Trust Funded Project: DataScriptor
  32. 32. ©PistoiaAlliance ● Collaboration with the Open Knowledge Foundation & Frictionless Data Group ● Integration of work with the ISA framework (ISA-API) ● EMBL-EBI Metabolights, NASA Gene Lab or effort such as FAIRDOM ● Collaboration with the (via FAIRplus ?) , integrative work with UDM and Allotrope Foundation work on LCMS for instance Ongoing Development and Future Work
  33. 33. ©PistoiaAlliance ● Prof Susanna Sansone, UOXF (Rose Dataset FAIRification) ● Dominique Batista, UOXF (FAIR evaluator UI) ● Dr Mark Wilkinson, University of Madrid (FAIR evaluator engine) ● IMI FAIRplus project ● Lilly Winfree, Evgueni Korev, Frictionless Data ● H2020 PhenoMeNal project (when the work started) Acknowledgments
  34. 34. ©PistoiaAlliance Thank you for your attention! Question Time
  35. 35. The Cellosaurus, a FAIR repository to help researchers navigate the confusing universe of cell lines Join us for the next FAIR/OM CoI webinar: Speaker: Amos Bairoch, Swiss Institute of Bioinformatics Thurs 25th June at 15:00 BST
  36. 36. Get Involved! Join the FAIR Implementation project Ian Harrow Ian.harrow@pistoiaalliance.org Membership: membership@pistoiaalliance.org General Enquiries: Zahid Tharia – zahid.tharia@pistoiaalliance.org www.pistoiaalliance.org
  37. 37. Next Webinar Quantum Computing & Pharma: Building Partnerships to Accelerate Research and Innovation Weds, June 10th, 2020, 16:00 – 17:30 BST www.pistoiaalliance.org/webinars-2020/
  38. 38. info@pistoiaalliance.org @pistoiaalliance www.pistoiaalliance.org Thanks for your attention
