GARNet workshop on Integrating Large Data into Plant Science

Workshop on Integrating Large Data into Plant Science organised by GARNet and egenis, at Dartington Hall, Devon, 21st April 2016.

Published in: Data & Analytics

  1. 1. Data Sharing Infrastructures to Foster Data Reuse David Johnson david.johnson@oerc.ox.ac.uk @NuDataScientist Integrating Large Data into Plant Science workshop 21st April 2016
  2. 2. Philippe Rocca-Serra, PhD, Senior Research Lecturer; Alejandra Gonzalez-Beltran, PhD, Research Lecturer; Milo Thurston, PhD, Research Software Engineer; Massimiliano Izzo, PhD, Research Software Engineer; Peter McQuilton, PhD, Knowledge Engineer; Allyson Lister, PhD, Knowledge Engineer; Eamonn Maguire, DPhil, Software Engineer contractor; David Johnson, PhD, Research Software Engineer; Susanna-Assunta Sansone, PhD, Principal Investigator, Associate Director (consultant for Nature Publishing Group)
       Our main areas of research and activity:
       • Data collection, curation, representation etc.
       • Data publication
       • Data provenance
       • Development of software, infrastructure
       • Open, community ontologies and standards
       • Semantic web
       • Training
       Communities we work with/for:
  3. 3. Enabling reproducible research and open science, driving science and discoveries
       • Notes and narrative: notes in lab books (information for humans)
       • Spreadsheets and tables: the compromise
       • Linked data and data publication: facts as RDF statements (information for machines)
       Increase the level of annotation at the source, tracking provenance and using community standards. Maximize data discoverability and reuse.
       Applied research approach:
       • Two well-established products with large user base, embedded in many funded projects
       • Several community-driven ontologies and other standards, embedded in many funded projects
  4. 4. In the life sciences there are > 600 content standards (counts shown on the slide: 86, 349, 200)
       • Guidelines, e.g. MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT, …
       • Formats, e.g. MAGE-Tab, GCDML, SRA XML, SOFT, FASTA, DICOM, MzML, SBRML, SED-ML, GELML, ISAtab, CML, MITAB, nmrML, ISA-JSON, …
       • Terminologies, e.g. AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO, CO, …
       Databases and tools implementing standards; also training material on and around standards
  5. 5. Community-developed content standards (formats, terminologies, guidelines)
       • To structure, enrich and report the description of the datasets and the experimental context under which they were produced
       • Sources: standard organizations (de jure) and grass-roots groups (de facto); the slide names the Nanotechnology Working Group
  6. 6. Mapping the landscape of ‘standards’ in the life sciences
       A web-based, curated and searchable registry ensuring that standards and databases are registered, informative and discoverable; monitoring development and evolution of standards, their use in databases and adoption of both in data policies
       1,400 records and growing
  7. 7. Mapping the landscape of ‘standards’ in the life sciences
       1,400 records and growing
       Also operating as a WG in …; run at …; also a contribution to …
  8. 8. But how do we help users to make informed decisions?
       • Is there a database, implementing standards, where I can deposit my metagenomics dataset?
       • My funder’s data sharing policy recommends the use of established standards, but which ones are widely endorsed and applicable to my toxicological and clinical data?
       • Am I using the most up-to-date version of this terminology to annotate cell-based assays?
       • I understand this format has been deprecated; what has it been replaced by and who is leading the work?
       • Are there databases implementing this exchange format, whose development we have funded?
       • What are the mature standards and standards-compliant databases we should recommend to our authors?
  9. 9. Search and filter to find what is relevant to your type of data (slide footer: The International Conference on Systems Biology (ICSB), 22-28 August, 2008; Susanna-Assunta Sansone; www.ebi.ac.uk/net-project)
  10. 10. From simple and advanced search interfaces to the recommender system, powered by curated descriptions of each standard and database record, and their relations
  11. 11. Tracking evolution, e.g. deprecations and substitutions
  12. 12. Cross-linking standards to standards and databases
       • Model/format formalizing reporting guideline -->
       • <-- Reporting guideline used by model/format
       We link (descriptions of) standards to related standards and to the databases implementing them
  13. 13. Standards and databases cross-linked
  14. 14. model and related formats These tools and formats will help you to:
  15. 15. ISA powers data collection, curation resources and repositories, e.g.: Initiated 2003, continues to work with/for many domains model and related formats
  16. 16. ISA in a nutshell
  17. 17. Why ISA format and Tools?
       ISA metadata specifications:
       • workflow and process orientated
       • compatible with checklist enforcement
       • compatible with external vocabulary resources
       • compatible by design with existing schemas
  18. 18. 1. Essentials about ISA tab syntax
       ● Investigation File: cardinality: 1..1
         – purpose: think “executive summary”
         – layout: rows of key value pairs organized in blocks
         – content:
           • Why? general study description
           • How? methods / protocol declaration
           • How? variable declarations (predictor and response variables)
           • Who? contact and affiliation information
       ● Study File: cardinality: 1..n
         – layout: true header/row of record table (think “sorting, filtering of samples”)
         – content:
           • What? Listing all biological materials collected over the study course and their treatments.
       ● Assay File: cardinality: 1..n
         – layout: true header/row of record table (think “sorting, filtering of datafiles”)
         – content:
           • What? Listing all data acquisition events and data files collected by a given assay and subsequent data transformations
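
The cardinalities on this slide (one Investigation file, 1..n Study files, 1..n Assay files) can be sketched as a small data model. This is only an illustration, not the ISA tools API: the class names, the `cardinality_errors` helper, the example file names, and the reading that every study needs at least one assay are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    filename: str  # hypothetical name, e.g. "a_visible_light.txt"


@dataclass
class Study:
    filename: str  # hypothetical name, e.g. "s_study.txt"
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    filename: str  # exactly one investigation file (cardinality 1..1)
    studies: List[Study] = field(default_factory=list)


def cardinality_errors(inv: Investigation) -> List[str]:
    """Report violations of the 1..n cardinalities (assumed to apply per study)."""
    errors = []
    if not inv.studies:
        errors.append("investigation needs at least one study file (1..n)")
    for study in inv.studies:
        if not study.assays:
            errors.append(f"{study.filename}: needs at least one assay file (1..n)")
    return errors


complete = Investigation(
    "i_investigation.txt",
    [Study("s_study.txt", [Assay("a_visible_light.txt")])],
)
incomplete = Investigation("i_investigation.txt", [Study("s_study.txt")])

print(cardinality_errors(complete))
print(cardinality_errors(incomplete))
```
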
  19. 19. 1. Essentials about ISA syntax
       Protocols act on Material or Data, defining workflows:
       – Inputs and outputs of Protocols are Material Nodes (Source Name, Sample Name, Extract Name, Labeled Extract Name) or Data Nodes (Raw Data File or Derived Data File)
       – Material Node annotations: Characteristics[…], Factor Value[…] (independent variables), Material Type, Comment[…]
       – Protocol Application annotations: Date (day effect), Performer (operator effect), Parameter Value[…]
       – Diagram labels: Material Node, Data Node, Protocol Application, Material Transformation, Sample, Extract, Raw Data File, Derived Data File
  20. 20. 2. basic coding patterns with ISA syntax The task: rendering a graph in a table
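
One way to picture "rendering a graph in a table" is to write every root-to-leaf path of the material graph as one table row. The sketch below assumes a plain dictionary of edges; the function name `paths_to_rows` and the node names are hypothetical and are not part of the ISA specification or tools.

```python
from typing import Dict, List


def paths_to_rows(edges: Dict[str, List[str]], roots: List[str]) -> List[List[str]]:
    """Flatten a directed acyclic graph into rows, one row per root-to-leaf path."""
    rows: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        path = path + [node]
        children = edges.get(node, [])
        if not children:  # leaf reached: the path becomes a table row
            rows.append(path)
            return
        for child in children:
            walk(child, path)

    for root in roots:
        walk(root, [])
    return rows


# Hypothetical graph: one source branching into two samples, two sources pooled into one.
graph = {
    "source A": ["sample A1", "sample A2"],
    "source B": ["pool"],
    "source C": ["pool"],
}
rows = paths_to_rows(graph, ["source A", "source B", "source C"])
for row in rows:
    print("\t".join(row))
```

A node shared by several paths is simply repeated across rows, which is why branching and pooling show up as repeated values in a column.
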
  21. 21. 2. basic coding patterns with ISA syntax
       – Branching events: one Source Material (A thaliana 1) yields several Sample Materials (flower, mature leaf, root)

       | Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ] |
       |---|---|---|---|---|---|
       | AT1 | A Thaliana | sample collection | liquid nitrogen | AT1 - sample1 | flower |
       | AT1 | A Thaliana | sample collection | liquid nitrogen | AT1 - sample2 | mature leaf |
       | AT1 | A Thaliana | sample collection | liquid nitrogen | AT1 - sample3 | root |
  22. 22. 2. basic coding patterns with ISA syntax
       – Pooling events: several Source Materials (plant 1, plant 2, plant 3) contribute to one Sample Material (fruit)

       | Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ] |
       |---|---|---|---|---|---|
       | plant 1 | Fragaria ananassa | sample collection | liquid nitrogen | pool1 | fruit |
       | plant 2 | Fragaria ananassa | sample collection | liquid nitrogen | pool1 | fruit |
       | plant 3 | Fragaria ananassa | sample collection | liquid nitrogen | pool1 | fruit |
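
Read in the other direction, the branching and pooling tables can be recognised from repeated values: a Source Name that maps to several Sample Names is a branching event, and a Sample Name fed by several Source Names is a pooling event. The sketch below uses the Source Name and Sample Name values from the two slides; the `classify` function is a hypothetical helper, not an ISA tools call.

```python
from collections import defaultdict

# (Source Name, Sample Name) pairs taken from the branching and pooling slides.
rows = [
    ("AT1", "AT1 - sample1"),
    ("AT1", "AT1 - sample2"),
    ("AT1", "AT1 - sample3"),
    ("plant 1", "pool1"),
    ("plant 2", "pool1"),
    ("plant 3", "pool1"),
]


def classify(rows):
    """Return (branching sources, pooled samples) found in the table."""
    samples_by_source = defaultdict(set)
    sources_by_sample = defaultdict(set)
    for source, sample in rows:
        samples_by_source[source].add(sample)
        sources_by_sample[sample].add(source)
    branching = sorted(s for s, out in samples_by_source.items() if len(out) > 1)
    pooling = sorted(s for s, inp in sources_by_sample.items() if len(inp) > 1)
    return branching, pooling


branching, pooling = classify(rows)
print("branching sources:", branching)
print("pooled samples:", pooling)
```
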
  23. 23. 2. basic coding patterns with ISA syntax
       – Representing interventions and treatments
         • expressing treatments as sets of factor levels
         • example: exposure to different doses of systemic herbicide
         • Factors will be ‘compound’, ‘dose’ and ‘duration’ (what?, how much?, how long for?)
         • Implicit column order matters but this is independent from the ISA syntax specification:

       | Source Name | Characteristics[organism] | Protocol REF | Factor Value[compound] | Factor Value[dose] | Factor Value[duration] |
       |---|---|---|---|---|---|
       | Plant 1 | Zea mays | treatment | glyphosate | 250 mg/day | 12 weeks |
       | Plant 2 | Zea mays | treatment | glyphosate | 250 mg/day | 12 weeks |
       | Plant 3 | Zea mays | treatment | glyphosate | 20 mg/day | 12 weeks |
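
Treating a treatment as a set of factor levels means that rows sharing the same tuple of Factor Value cells belong to the same treatment group. The following sketch groups the three plants from the slide that way; `treatment_groups` is a hypothetical helper, and it relies on Python dictionaries keeping the column order in which the factors were written.

```python
from collections import defaultdict

rows = [
    {"Source Name": "Plant 1", "Factor Value[compound]": "glyphosate",
     "Factor Value[dose]": "250 mg/day", "Factor Value[duration]": "12 weeks"},
    {"Source Name": "Plant 2", "Factor Value[compound]": "glyphosate",
     "Factor Value[dose]": "250 mg/day", "Factor Value[duration]": "12 weeks"},
    {"Source Name": "Plant 3", "Factor Value[compound]": "glyphosate",
     "Factor Value[dose]": "20 mg/day", "Factor Value[duration]": "12 weeks"},
]


def treatment_groups(rows):
    """Group source names by their tuple of factor levels (column order preserved)."""
    groups = defaultdict(list)
    for row in rows:
        levels = tuple(v for k, v in row.items() if k.startswith("Factor Value["))
        groups[levels].append(row["Source Name"])
    return dict(groups)


groups = treatment_groups(rows)
for levels, plants in groups.items():
    print(levels, "->", plants)
```
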
  24. 24. 2. basic coding patterns with ISA syntax
       – Tagging with Terminologies
         • ISA tools (ISAcreator - ISAconfigurator) provide Ontology term selection and term tagging facilities to help users.

       Without term tagging:

       | Source Name | Characteristics[ORGANISM] | Characteristics[AGE] | Factor Value[COMPOUND] |
       |---|---|---|---|
       | individual1 | human | 12 weeks | aspirin |

       With term tagging:

       | Source Name | Characteristics[ORGANISM] | Term Source REF | Term Accession Number | Characteristics[AGE] | Unit | Term Source REF | Term Accession Number | Factor Value[COMPOUND (htppt://purl] | Term Source REF | Term Accession Number |
       |---|---|---|---|---|---|---|---|---|---|---|
       | individual1 | Homo sapiens | NCBITax | 9606 | 12 | week | UO | UO:wwer wta | aspirin | CHEBI | 1231354 |
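
The tagging pattern on this slide amounts to following an annotated column with a Term Source REF column and a Term Accession Number column. A minimal sketch of that expansion is shown below. The `tag_column` function and the in-memory `lookup` table are assumptions for illustration (ISAcreator does this through its ontology lookup interface); the organism values are the ones shown on the slide.

```python
def tag_column(cells, column, lookup):
    """Insert Term Source REF / Term Accession Number cells after `column`.

    `cells` is a list of (header, value) pairs, because ISA-Tab repeats
    the two term headers after every tagged column.
    """
    tagged = []
    for header, value in cells:
        tagged.append((header, value))
        if header == column:
            source_ref, accession = lookup.get(value, ("", ""))
            tagged.append(("Term Source REF", source_ref))
            tagged.append(("Term Accession Number", accession))
    return tagged


cells = [
    ("Source Name", "individual1"),
    ("Characteristics[ORGANISM]", "Homo sapiens"),
]
# Hypothetical lookup table holding the term shown on the slide.
lookup = {"Homo sapiens": ("NCBITax", "9606")}

tagged = tag_column(cells, "Characteristics[ORGANISM]", lookup)
for header, value in tagged:
    print(f"{header}\t{value}")
```

Values missing from the lookup get empty term cells, so untagged free text stays valid in the table.
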
  25. 25. ISA syntax boundaries
       ● Any model is a compromise between granularity and simplicity
       ● Some cases are hard to represent
         – crossover design with dissimilar arms
         – representing mixtures of chemicals
         – representing loops (with donors and recipients)
       ● Reaching the limits of how graphs can be efficiently represented in tables
  26. 26. Plant H-T Phenotyping worked example
       – A case of simple non-destructive HTP:
       – 60 genotypes x 5 replicates: 12 trays of 25 pots each
       – 1 seed per pot gives us 300 individual plants
       – experiment duration: 35 days
       – single daily data acquisition:
         • visible light: 3 angles + top view = 4 images
         • near infrared: 3 angles + top view = 4 images
         • fluorescence: 1 angle = 1 image
         • TOTAL: 9 images per plant per day
       – Grand Total: 94,500 files to store and track
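
The grand total on this slide follows directly from the design figures. The short check below recomputes it; the variable names are chosen for the sketch only.

```python
genotypes = 60
replicates = 5
trays = 12
pots_per_tray = 25
days = 35
images_per_plant_per_day = {
    "visible light": 4,   # 3 angles + top view
    "near infrared": 4,   # 3 angles + top view
    "fluorescence": 1,    # 1 angle
}

plants = genotypes * replicates                   # one seed per pot
assert plants == trays * pots_per_tray            # 12 trays of 25 pots hold every plant

images_per_day = sum(images_per_plant_per_day.values())
total_files = plants * days * images_per_day

print(plants, images_per_day, total_files)
```
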
  27. 27. Plant H-T Phenotyping worked example
       – Decomposing the experiment in terms of ISA elements
       – Identifying key experimental variables:
         • independent variables => used to define ISA Factors and/or Characteristics
           – Factor = {genotype}, Factor Values[G1..G60] = 60 distinct values
           – Factor = {day}, Factor Values[day1..day35] = 35 distinct values
         • response variables => used to define 3 distinct ISA Assays
           – morphology using visible light imaging
             » ISA parameters to track ‘camera position’ {top,left,right,centre}
           – water content using near infrared imaging
             » ISA parameters to track ‘camera position’ {top,left,right,centre}
           – photosynthetic pigment concentration using fluorescence imaging
             » ISA parameters to track ‘camera position’ {top}
  28. 28. Plant H-T Phenotyping worked example
       – Decomposing the experiment in terms of ISA elements
       – Identifying key experimental variables:
         • independent variables => used to define ISA Factors and/or Characteristics
           – Factor = {genotype}, Factor Values[ ] = 60 distinct values
           – Factor = {day}, Factor Values[ ] = 35 distinct values
         • Automatic creation and filling of ISA Study Sample files
           – 60 x 35 = 2100 factor combinations
           – 5 replicates per factor combination => 10500 samples (the 300 pots, 1 seed per pot, each observed on 35 days)
           – Translated into:
             » 1 ISA study file with 10500 rows on the following pattern
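
The automatic filling of the study file can be sketched by taking the Cartesian product of genotype, replicate and day. Everything about naming here is an assumption for illustration: the `G1_rep1` source names, the `_dayN` sample suffix, and the choice to reuse one source per plant across all days (which matches the 300 plants of the non-destructive design).

```python
from itertools import product

genotypes = [f"G{i}" for i in range(1, 61)]   # G1..G60
days = [f"day{i}" for i in range(1, 36)]      # day1..day35
replicates = range(1, 6)                      # 5 replicates


def study_rows():
    """Build one study-file row per plant per day (hypothetical naming scheme)."""
    rows = []
    for genotype, replicate, day in product(genotypes, replicates, days):
        source = f"{genotype}_rep{replicate}"
        rows.append({
            "Source Name": source,
            "Sample Name": f"{source}_{day}",
            "Factor Value[genotype]": genotype,
            "Factor Value[day]": day,
        })
    return rows


rows = study_rows()
factor_combinations = {(r["Factor Value[genotype]"], r["Factor Value[day]"]) for r in rows}
sources = {r["Source Name"] for r in rows}
print(len(rows), len(factor_combinations), len(sources))
```
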
  29. 29. Plant H-T Phenotyping worked example
       • Declaring and annotating an ISA Source Node
       • ISA Protocol Application with sets of Parameter Values resulting in an ISA Sample Node
       • Reporting of independent variables as ISA Factor Values
  30. 30. Plant H-T Phenotyping worked examples
       – Decomposing the experiment in terms of ISA elements
       – Identifying key experimental variables:
         • response variables => used to define 3 distinct ISA Assays
           – morphology using visible light imaging
             » ISA parameters to track ‘camera position’ {top,left,right,centre}
           – water content using near infrared imaging
             » ISA parameters to track ‘camera position’ {top,left,right,centre}
           – photosynthetic pigment concentration using fluorescence imaging
             » ISA parameters to track ‘camera position’ {top}
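
Each of the three assays can be pictured as its own table in which the camera position is recorded as a Parameter Value and every acquisition yields one raw data file. The sketch below builds those rows for a single sample on a single day; the sample name, the file naming scheme and the `.png` extension are hypothetical.

```python
camera_positions = {
    "visible light": ["top", "left", "right", "centre"],
    "near infrared": ["top", "left", "right", "centre"],
    "fluorescence": ["top"],
}


def assay_tables(sample_name):
    """Return one list of rows per assay, one row per camera position."""
    tables = {}
    for assay, positions in camera_positions.items():
        tag = assay.replace(" ", "-")
        tables[assay] = [
            {
                "Sample Name": sample_name,
                "Parameter Value[camera position]": position,
                "Raw Data File": f"{sample_name}_{tag}_{position}.png",  # hypothetical name
            }
            for position in positions
        ]
    return tables


tables = assay_tables("G1_rep1_day1")
files_per_plant_per_day = sum(len(rows) for rows in tables.values())
print(files_per_plant_per_day)
```
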
  31. 31. Plant H-T Phenotyping worked examples
       • Describing a data acquisition event
       • ISA Protocol Application of type Data Transformation with sets of Parameter Values resulting in an ISA Derived Data File
       • Reporting of independent variables as ISA Factor Values
  32. 32. Collaborative Open Plant Omics
  33. 33. ISA tools in the Cloud
  34. 34. You can email us: isatools@googlegroups.com
       • View our blog: http://isatools.org/blog
       • Follow us on Twitter: @isatools, @biosharing
       • View our websites: http://www.isa-tools.org, http://www.biosharing.org
       • View our Git repo & contribute: http://github.com/ISA-tools
