Odam: Open Data, Access and Mining

ODAM is an Experiment Data Table Management System (EDTMS) that gives open access to your data and makes them ready to be mined, with a data explorer as a bonus.

Published in: Data & Analytics

Odam: Open Data, Access and Mining

  1. Open Data for Access and Mining (ODAM), an Experiment Data Table Management System (EDTMS): give open access to your data and make them ready to be mined, with a data explorer as a bonus. Daniel Jacob, UMR 1332 BFP – Metabolism Group, Bordeaux Metabolomics Facility, May 2016.
  2. Daniel Jacob – INRA UMR 1332 – May 2016. The experimental context: needs and wishes. An experiment runs from seeding and harvesting through sample preparation to sample analysis, with sample identifiers carried along at every stage. Open Data for Access and Mining, the core idea in one shot: the experiment data tables and the experiment design are exposed through a web API, so that (1) identifiers are centrally managed, (2) data are shared and available, and (3) the subsequent data mining is facilitated because both metadata and data are available. Lightweight tools can be developed if needed: R scripts (Galaxy), lightweight GUI (R Shiny).
  3. Open Data for Access and Mining, the core idea in one shot: implementation of an Experiment Data Tables Management System (EDTMS). Data capture takes minimal effort (PUT): merely dropping data files into a data repository (e.g. a local NAS or a remote storage space) mounted on the server (here myhost.org) should allow users to access them through a web API (GET http://myhost.org/). Data can then be downloaded, explored and mined. No database schema, no programming code and no additional configuration are needed on the server side.
  4. Preparation and cleaning of the data subset files (in the example: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv).
     • Whatever the kind of experiment, this assumes a design of experiment (DoE) involving individuals, samples or other things as the main objects of study (e.g. plants, tissues, bacteria, …).
     • This also assumes the observation of dependent variables resulting from the effects of some controlled experimental factors.
     • Moreover, each object of study usually has an identifier, and the variables can be quantitative or qualitative.
     • There can be one type of object of study or several; in the latter case, a relationship of the “obtainedFrom” type must exist between the object types.
  5. Classification of each column within its right category. Every column of the data subset files (plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv) is classified as an identifier, a factor, a quantitative variable or a qualitative variable; a column may also act as a link to another subset. Data subset files and their associated metadata files must comply with the TSV (Tab-Separated Values) format.
     • You have to organize your data subsets so that links can be established between them.
     • In practice, this means adding a column containing the identifiers of the entity to which you want to connect the subset, which implies an ‘obtainedFrom’ relation.
     • This duplication of identifiers must be the only redundant information across all data subsets.
  6. Connections between the dataset files based on identifiers. Each data subset file corresponds to an entity (concept): plants.tsv to Plants, harvests.tsv to Harvests, samples.tsv to Samples, compounds.tsv to Compounds and enzymes.tsv to Enzymes. Each subset carries the identifier of its central entity, and a link between two subsets is established through identifiers (which implies an ‘obtainedFrom’ relation).
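The linking rule described on slides 5 and 6 can be illustrated with a small sketch. It assumes two tiny in-memory subsets: `PlanteID` and `Lot` are the identifier names shown on the slides, while the `Treatment` and `Stage` columns, the row values and the helper names `read_tsv` and `join_obtained_from` are invented for illustration and are not part of ODAM.

```python
import csv
import io

# Hypothetical miniature subsets. PlanteID and Lot are identifier names taken
# from the slides; Treatment, Stage and all values are invented.
PLANTS_TSV = "PlanteID\tTreatment\nP1\tControl\nP2\tWaterStress\n"
HARVESTS_TSV = "Lot\tPlanteID\tStage\nL1\tP1\tgreen\nL2\tP2\tgreen\nL3\tP1\tred\n"


def read_tsv(text):
    """Parse TSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_obtained_from(child_rows, parent_rows, link_column):
    """Attach to each child row the columns of the parent it was obtained from."""
    parents = {row[link_column]: row for row in parent_rows}
    merged = []
    for child in child_rows:
        parent = parents[child[link_column]]  # a KeyError means a dangling link
        merged.append({**parent, **child})
    return merged


harvests_with_plants = join_obtained_from(
    read_tsv(HARVESTS_TSV), read_tsv(PLANTS_TSV), "PlanteID"
)
for row in harvests_with_plants:
    print(row["Lot"], row["PlanteID"], row["Treatment"], row["Stage"])
```

Because the only duplicated information is the `PlanteID` column, the factor `Treatment` is stored once in the plants subset and still reaches every harvest row through the join.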
  7. Creation of the metadata files. To allow data to be explored and mined, some minimal but relevant metadata must be added. Two metadata files are required:
     • s_subsets.tsv: associates each data subset with a key concept corresponding to the main entity of the subset, and records the "obtainedFrom" relations between these concepts.
     • a_attributes.tsv: annotates each attribute (concept/variable) with some minimal but relevant metadata.
     Data subset files and their associated metadata files must comply with the TSV (Tab-Separated Values) format. Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas.
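The note on TSV versus CSV can be checked with Python's standard `csv` module. This is only a sketch on a made-up row (the `Comment` column and its value are hypothetical): with a tab delimiter, a value containing commas is written and read back without any quoting or escaping.

```python
import csv
import io

# Hypothetical row: the Comment value contains commas.
rows = [["SampleID", "Comment"], ["365", "pericarp, stage 3, replicate A"]]

# Write the rows as TSV into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = buffer.getvalue()

# Read the TSV text back; the commas survive untouched.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(tsv_text)
print(parsed[1])
```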
  8. Creation of the metadata files: s_subsets.tsv. This metadata file associates a key concept with each data subset file. For each subset it records:
     • the unique rank number of the data subset (1 to 5 in the example);
     • the key concept (i.e. the main entity) associated with the subset, in the form of a short name (e.g. Plants, rank 1);
     • the identifier of the central entity of the subset (e.g. PlanteID);
     • the link between two subsets (which implies an ‘obtainedFrom’ relation);
     • the data file name (e.g. plants.tsv).
     In the example, the identifiers are PlanteID for plants.tsv, Lot for harvests.tsv, and SampleID for samples.tsv, compounds.tsv and enzymes.tsv.
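As a rough illustration of what s_subsets.tsv encodes, the sketch below loads a hypothetical version of the file and follows the "obtainedFrom" links from one subset back to the root. The column names (`rank`, `obtained_from`, `subset`, `identifier`, `file`), the use of `0` for "no parent" and the ranks given to compounds and enzymes are assumptions made for this example; the slides show the kind of information stored, not the exact layout.

```python
import csv
import io

# Hypothetical s_subsets.tsv content. Column names and the rank order of
# compounds/enzymes are assumptions; file names and identifiers follow the slide.
S_SUBSETS = (
    "rank\tobtained_from\tsubset\tidentifier\tfile\n"
    "1\t0\tplants\tPlanteID\tplants.tsv\n"
    "2\t1\tharvests\tLot\tharvests.tsv\n"
    "3\t2\tsamples\tSampleID\tsamples.tsv\n"
    "4\t3\tcompounds\tSampleID\tcompounds.tsv\n"
    "5\t3\tenzymes\tSampleID\tenzymes.tsv\n"
)


def load_subsets(text):
    """Return a dict mapping each subset short name to its metadata row."""
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["subset"]: row for row in rows}


def lineage(subsets, name):
    """List the subsets from the root down to `name`, following the links."""
    by_rank = {row["rank"]: row for row in subsets.values()}
    chain = []
    row = subsets[name]
    while row is not None:
        chain.append(row["subset"])
        row = by_rank.get(row["obtained_from"])  # None once the root is reached
    return chain[::-1]


subsets = load_subsets(S_SUBSETS)
print(lineage(subsets, "samples"))
print(lineage(subsets, "enzymes"))
```

Under these assumptions, the lineage of `samples` is plants, harvests, samples, which matches the merged set written `(samples)` in the web API slides.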
  9. Creation of the metadata files: a_attributes.tsv. This metadata file allows each attribute (variable) to be annotated with some minimal but relevant metadata, including its category (identifier, factor, quantitative or qualitative). The slide shows example rows for the Plants, Harvests, Samples and Compounds subsets.
  10. Updating the metadata files (s_subsets.tsv and a_attributes.tsv). Additional subsets and attributes can be added step by step, as soon as data are produced.
  11. Uploading your datasets into the data repository. Copy your data subset files into your dataset entry (named ‘frim1’ in the example) within the data repository, shown here as a storage volume mounted as drive Z:. Merely dropping data files into the data repository (e.g. a NAS) should allow users to access them through the web API: data capture takes minimal effort (PUT), and the repository mounted on myhost.org is then served through GET requests. No database schema, no programming code and no additional configuration are needed on the server side. Data subset files and their associated metadata files must comply with the TSV (Tab-Separated Values) format.
  12. Checking online whether your data subset files are consistent: http://myhost.org/check/frim1 (myhost.org reads the data repository stored on the NAS). Many checks can be done automatically for you.
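The slide does not say which checks the online service performs. Purely as an assumption, two checks that follow from the linking rules described earlier are sketched below: identifiers that appear more than once within a subset, and link values that point to no existing parent. The function names and the sample rows are hypothetical.

```python
from collections import Counter


def duplicated_identifiers(rows, id_column):
    """Return the identifier values that occur more than once in a subset."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def dangling_links(child_rows, parent_rows, link_column):
    """Return link values in the child subset that match no parent row."""
    known = {row[link_column] for row in parent_rows}
    return sorted({row[link_column] for row in child_rows} - known)


# Invented rows with one deliberate error of each kind.
plants = [{"PlanteID": "P1"}, {"PlanteID": "P2"}, {"PlanteID": "P2"}]
harvests = [{"Lot": "L1", "PlanteID": "P1"}, {"Lot": "L2", "PlanteID": "P9"}]

print(duplicated_identifiers(plants, "PlanteID"))
print(dangling_links(harvests, plants, "PlanteID"))
```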
  13. Open Data, Access and Mining: web API. The workflow goes from data capture with minimal effort (PUT: preparation and cleaning of the data subset files, TSV format, data linking, check) to data access with maximal efficiency (GET), across all stages of the experiment (seeding, harvesting, sample preparation, sample analysis). After depositing your complete dataset as described previously:
     • Open access is given to your data through the web API.
     • They are ready to be mined.
     • No specific code or additional configuration is needed.
     The example dataset is FRIM1 (https://www.erasysbio.net/index.php?index=266).
  14. Open Data, Access and Mining: web API. REST services use a hierarchical tree of resource naming (URL): GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >. Below <data format> (xml, tsv or json) and <dataset name> (e.g. frim1), the tree branches into <subset> or (<subset>), and then into the <entry>, <category> and <value> levels, for retrieving data or metadata. Data must be in TSV format and linked as described previously. The FRIM1 dataset is available at https://doi.org/10.5281/zenodo.154041.
  15. Open Data, Access and Mining: web API. REST services use a hierarchical tree of resource naming (URL): GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >. The fields are:
     • <data format>: format of the retrieved data; possible values are 'xml', 'tsv' or 'json'. Example: xml
     • <dataset name>: short name (tag) of your dataset. Example: frim1
     • <subset>: short name of a data subset. Example: samples
     • <entry>: name of an attribute entry, defined by the user in the a_attributes.tsv file (column ‘entry’). Example: sampleid
     • <category>: name of the attribute category, assigned by the user in the a_attributes.tsv file (column ‘category’); possible values are ‘identifier’, ‘factor’, ‘qualitative’, ‘quantitative’. Example: quantitative
     • (<subset>): set of data subsets obtained by merging all the subsets with a lower rank than the specified subset, following the pathway defined by the "obtainedFrom" links. Example: (samples) → plants + harvests + samples
     • <value>: exact value of the desired entry or category. Examples: 1, factor
  16. Open Data, Access and Mining: web API. REST services use a hierarchical tree of resource naming (URL): GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >. The available requests are:
     • Get the subset list of a dataset: http://myhost.org/getdata/<data format>/<dataset name>
     • Get all values within a data subset: http://myhost.org/getdata/<data format>/<dataset name>/<subset>
     • Get values within a data subset for a specific value of an entry: http://myhost.org/getdata/<data format>/<dataset name>/<subset>/<entry>/<value>
     • Get all values within a set of data subsets: http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)
     • Get values within a set of data subsets for a specific value of an entry: http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<entry>/<value>
     • Get the attribute list within a set of data subsets for a specific category: http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<category>
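The URL patterns on this slide are regular enough to be assembled by a small helper. The function below is a hypothetical convenience written for this transcript, not part of ODAM or Rodam; it only concatenates path segments in the order shown on the slide and keeps the parentheses of a merged subset literal, as in the slide examples.

```python
def getdata_url(base, data_format, dataset, subset=None, merged=False,
                entry=None, value=None, category=None):
    """Build a getdata URL following the patterns listed on the slide.

    merged=True wraps the subset in parentheses, i.e. the set of subsets
    obtained by following the "obtainedFrom" links up to the root.
    """
    parts = [base.rstrip("/"), "getdata", data_format, dataset]
    if subset is not None:
        parts.append(f"({subset})" if merged else subset)
        if category is not None:
            parts.append(category)
        elif entry is not None:
            if value is None:
                raise ValueError("an entry requires a value")
            parts.extend([entry, str(value)])
    return "/".join(parts)


host = "http://myhost.org"
print(getdata_url(host, "xml", "frim1"))
print(getdata_url(host, "xml", "frim1", "plants"))
print(getdata_url(host, "xml", "frim1", "harvests", entry="lot", value=1))
print(getdata_url(host, "xml", "frim1", "compounds", merged=True,
                  category="quantitative"))
```

The four calls reproduce the example URLs given on slide 17.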
  17. Open Data Access via web API: examples based on FRIM1.
     • Metadata: http://myhost.org/getdata/xml/frim1
     • Data: http://myhost.org/getdata/xml/frim1/plants
     • Data: http://myhost.org/getdata/xml/frim1/harvests/lot/1
     • Metadata: http://myhost.org/getdata/xml/frim1/(compounds)/quantitative
  18. Open Data Access via web API: examples based on FRIM1. http://myhost.org/getdata/xml/frim1/(samples)/treatment/Control returns data from a set of data subsets, obtained by merging all the subsets with a lower rank than the specified subset and following the pathway defined by the “obtainedFrom” links: (samples) → plants + harvests + samples.
  19. Open Data Access via web API: application layer. Minimal effort, maximal efficiency: once the data are in TSV format and linked, use existing tools on top of the web API (spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …).
  20. Open Data Access via web API: application layer. Retrieving data within R: the R package Rodam.
  21. Open Data Access via web API: Rodam package. Example mapping of the URL fields: <data format> = tsv, <dataset name> = frim1, (<subset>) = (samples), <entry> = sample, <value> = 365, giving GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365
  22. Open Data Access via web API: Rodam package. Read the metadata, i.e. the category types within the data, and get the data subset ‘activome’ along with its metadata. URL fields: <data format> = tsv, <dataset name> = frim1, (<subset>) = (activome), <category> = factor, giving GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)/factor
  23. Open Data Access via web API: Rodam package.
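The R code shown on the Rodam slides is not preserved in this transcript, and Rodam's own functions are not reproduced here. As a rough stand-in, the sketch below shows how the same GET request could be issued from Python with the standard library only. It assumes that the service answers with UTF-8 TSV text whose first line is a header row; the helper names and the sample response body are hypothetical.

```python
import csv
import io
from urllib.request import urlopen


def parse_tsv(text):
    """Turn TSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_rows(url, timeout=30):
    """GET a getdata URL and parse the TSV body (assumed UTF-8 encoded)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_tsv(response.read().decode("utf-8"))


# Offline demonstration on an invented response body.
sample_body = "SampleID\tTreatment\n365\tControl\n"
rows = parse_tsv(sample_body)
print(rows[0]["Treatment"])

# Online use (not run here), with the URL given on slide 21:
# rows = fetch_rows(
#     "http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365"
# )
```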
  24. Open Data Access via web API: Rodam package. From data and metadata to data mining: making both metadata and data available facilitates the subsequent data mining. The slide shows analyses (MFA, rCCA, pLDA, …) run on the activome and qNMR_metabo subsets (log10 transformed), comparing Control and Water Stress across all development stages and all treatments.
  25. Open Data Access via web API: application layer. Minimal effort, maximal efficiency: once the data are in TSV format and linked, use existing tools (spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …) and, if needed, develop lightweight tools: R scripts (Galaxy), lightweight GUI (R Shiny).
  26. FRIM - Fruit Integrative Modelling: data explorer at http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
  27. FRIM - Fruit Integrative Modelling: data explorer at http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
  28. FRIM - Fruit Integrative Modelling (data explorer screenshot).
  29. FRIM - Fruit Integrative Modelling. To remove an item from the selection: i) click on it, and then ii) press the ‘Suppr’ (Delete) key.
  30. FRIM - Fruit Integrative Modelling (data explorer screenshot).
  31. FRIM - Fruit Integrative Modelling. Explore several possibilities by interacting with the graph.
  32. To summarize:
     1. Preparation and cleaning of the data subset files
     2. Classification of each column within its right category
     3. Connections between the dataset files based on identifiers
     4. Creation of the definition files, namely s_subsets.tsv and a_attributes.tsv
     5. Deposit of the dataset files in the data repository
     6. Checking online whether your data subset files are consistent
     7. Testing the web services online on your dataset
     8. Use of the web API through an application layer (R scripts, data explorer, ...)
     Data subset files and their associated metadata files must comply with the TSV (Tab-Separated Values) format. Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas (see https://en.wikipedia.org/wiki/Tab-separated_values).
  33. Advantages of this approach, across all stages (seeding, harvesting, sample preparation, sample analysis).
     Data sharing and data availability:
     • The "plants" table may be created even before planting the seeds.
     • Similarly, the "harvests" table can be created as soon as the harvests are done, before any analysis.
     • Thus, these tables are generated only once in the project, and sharing can be set up as soon as the seeds are planted. Each analysis then complements the dataset as soon as it produces its own data subset.
     • Data are accessible to everyone as soon as they are produced.
     Identifiers centrally managed:
     • Data are archived and compiled, so no laborious investigation is needed to find out who holds the right identifiers, etc.
  34. Advantages of this approach: facilitating the subsequent publication of data. Data capture takes minimal effort (PUT: TSV format, data linking), while data analysis and mining gain maximum efficiency (GET).
     • Data are already readily available online through the web API.
     • Nothing prevents you from using these data to fill in existing databases, adding more elaborate annotations.
     • Neither administrator privileges nor any programming skills are required.
  35. Advantages of this approach: minimal effort, maximum efficiency.
     • Format the data, based on TSV: this keeps the good old habit scientists have of using worksheets, so that i) the same tool is used for both data files and metadata definition files, and ii) no programming skills are required.
     • Give access through a web services layer, based on current standards (REST).
     • Use existing tools: spreadsheets, RStudio, BioStatFlow (biostatflow.org), Galaxy, Cytoscape, …
     • Develop lightweight tools if needed: R scripts, lightweight GUI (R Shiny).
  36. Have fun! Open Data for Access and Mining. Daniel Jacob, UMR 1332 BFP – Metabolism Group, Bordeaux Metabolomics Facility, May 2016. Resources, including an online example:
     • https://hub.docker.com/r/odam/getdata/
     • http://www.bordeaux.inra.fr/pmb/dataexplorer/
     • https://github.com/INRA/ODAM
     • https://cran.r-project.org/package=Rodam
     • https://zenodo.org/record/154041
