GARNet workshop on Integrating Large Data into Plant Science
1. Data Sharing Infrastructures to Foster Data Reuse
David Johnson
david.johnson@oerc.ox.ac.uk
@NuDataScientist
Integrating Large Data into Plant Science workshop
21st April 2016
2. Philippe Rocca-Serra, PhD (Senior Research Lecturer)
Alejandra Gonzalez-Beltran, PhD (Research Lecturer)
Milo Thurston, PhD (Research Software Engineer)
Massimiliano Izzo, PhD (Research Software Engineer)
Peter McQuilton, PhD (Knowledge Engineer)
Our main areas of research and activity:
• Data collection, curation, representation etc.
• Data publication
• Data provenance
• Development of software, infrastructure
• Open, community ontologies and standards
• Semantic web
• Training
Communities we work with/for:
Allyson Lister, PhD (Knowledge Engineer)
Eamonn Maguire, DPhil (Software Engineer contractor)
David Johnson, PhD (Research Software Engineer)
Susanna-Assunta Sansone, PhD (Principal Investigator, Associate Director; consultant for Nature Publishing Group)
3. Notes in Lab Books (information for humans)
Spreadsheets and Tables (the compromise)
Facts as RDF statements (information for machines)
Notes and narrative; spreadsheets and tables; linked data and data publication
Enabling reproducible research and open science, driving science and discoveries
Increase the level of annotation at the source, tracking provenance and using community standards
Maximize data discoverability and reuse
Applied research approach
Two well-established products with a large user base, embedded in many funded projects
Several community-driven ontologies and other standards, embedded in many funded projects
5. Community-developed content standards
• To structure, enrich and report the description of the datasets and the experimental context under which they were produced
• Formats, terminologies, guidelines
De jure: standard organizations
De facto: grass-roots groups
Nanotechnology Working Group
6. Mapping the landscape of ‘standards’ in the life sciences
A web-based, curated and searchable registry ensuring that standards and databases are registered, informative and discoverable; monitoring development and evolution of standards, their use in databases and adoption of both in data policies
1,400 records and growing
7. Mapping the landscape of ‘standards’ in the life sciences
1,400 records and growing
Also operating as a WG in […]; run at […]; also a contribution to […]
8. • Is there a database, implementing standards, where I can deposit my metagenomics dataset?
• My funder’s data sharing policy recommends the use of established standards, but which ones are widely endorsed and applicable to my toxicological and clinical data?
• Am I using the most up-to-date version of this terminology to annotate cell-based assays?
• I understand this format has been deprecated; what has it been replaced by and who is leading the work?
• Are there databases implementing this exchange format, whose development we have funded?
• What are the mature standards and standards-compliant databases we should recommend to our authors?
But how do we help users to make informed decisions?
9. The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta
Sansone www.ebi.ac.uk/net-project
Search and filter to find what is relevant to your type of data
10. From simple and advanced search interfaces to… the recommender system
Powered by curated descriptions of each standard and database record, and their relations
11. Tracking evolution, e.g. deprecations and substitutions
12. Cross-linking standards to standards and databases
Model/format formalizing reporting guideline -->
<-- Reporting guideline used by model/format
We link (descriptions of) standards to related standards and to the databases implementing them
16. ISA powers data collection, curation resources and repositories, e.g.:
Initiated 2003, continues to work with/for many domains
model and related formats
18. Why ISA format and Tools?
ISA metadata specifications:
• workflow and process orientated
• compatible with checklist enforcement
• compatible with external vocabulary resources
• compatible by design with existing schemas
19. 1. Essentials about ISA tab syntax
● Investigation File: cardinality: 1..1
– purpose: think “executive summary”
– layout: rows of key value pairs organized in blocks
– content:
• Why? general study description
• How? methods / protocol declaration
• How? variable declarations (predictor and response variables)
• Who? contact and affiliation information
● Study File: cardinality: 1..n
– layout: true header/row of record table (think “sorting, filtering of samples”)
– content:
• What? Listing all biological materials collected over the study course and their treatments.
● Assay File: cardinality: 1..n
– layout: true header/row of record table (think “sorting, filtering of datafiles”)
– content:
• What? Listing all data acquisition events and data files collected by a given assay and subsequent data transformations
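The three file types and their cardinalities can be sketched as a small data model. This is only an illustration, not part of the ISA tools: the class names, the example file names and the `cardinality_errors` helper are hypothetical, and attaching assay files to a study (rather than directly to the investigation) is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    # Record table: one row per data acquisition event / data file.
    filename: str  # illustrative name, e.g. "a_imaging.txt"


@dataclass
class Study:
    # Record table: one row per biological material and its treatment.
    filename: str  # illustrative name, e.g. "s_study.txt"
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    # "Executive summary": blocks of key/value rows. Exactly one per
    # submission (1..1), which is why it is the root object here.
    filename: str  # illustrative name, e.g. "i_investigation.txt"
    studies: List[Study] = field(default_factory=list)


def cardinality_errors(inv: Investigation) -> List[str]:
    """Check the 1..n cardinalities stated on the slide."""
    errors = []
    if not inv.studies:
        errors.append("investigation needs at least one study file (1..n)")
    for study in inv.studies:
        if not study.assays:  # assumption: assays are counted per study
            errors.append(f"{study.filename}: needs at least one assay file (1..n)")
    return errors


complete = Investigation(
    "i_investigation.txt",
    [Study("s_study.txt", [Assay("a_imaging.txt")])],
)
incomplete = Investigation("i_investigation.txt", [Study("s_study.txt")])
print(cardinality_errors(complete))
print(cardinality_errors(incomplete))
```

Keeping the investigation as the single root mirrors the 1..1 cardinality: there is nothing to validate for it beyond its existence.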
20. 1. Essentials about ISA syntax
Protocols act on Material or Data, defining Workflows:
– Inputs and Outputs of Protocols are Material Nodes (Source Name, Sample Name, Extract Name, Labeled Extract Name) or Data Nodes (Raw Data File or Derived Data File)
– Node annotations (Material Node, Data Node): Characteristics[…], Factor Value[…] (independent variables), Material Type, Comment[…]
– Protocol Application annotations: Date (day effect), Performer (operator effect), Parameter Value[…]
[Figure: Material Transformation chain from Sample and Extract to Raw Data File and Derived Data File]
21. 2. Basic coding patterns with ISA syntax
The task: rendering a graph in a table
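One way to picture this task is to write every root-to-leaf path of the experimental graph as one table row. The sketch below is a generic illustration under that assumption, not the ISA tools' own algorithm; `graph_to_rows` and the node names are hypothetical.

```python
from typing import Dict, List


def graph_to_rows(edges: Dict[str, List[str]], roots: List[str]) -> List[List[str]]:
    """Return one row per root-to-leaf path of a directed acyclic graph."""
    rows: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        path = path + [node]          # copy, so sibling branches do not share state
        children = edges.get(node, [])
        if not children:              # leaf reached: the path becomes a row
            rows.append(path)
            return
        for child in children:
            walk(child, path)

    for root in roots:
        walk(root, [])
    return rows


# Hypothetical graph: one source yields two samples, one of which is extracted.
edges = {
    "source1": ["sample1", "sample2"],
    "sample1": ["extract1"],
}
rows = graph_to_rows(edges, ["source1"])
for row in rows:
    print("\t".join(row))
```

The shared prefix `source1` is repeated on every row, which is the price of flattening a graph into a table.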
22. 2. Basic coding patterns with ISA syntax
– Branching events:
Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
AT1 | A. thaliana | sample collection | liquid nitrogen | AT1-sample1 | flower
AT1 | A. thaliana | sample collection | liquid nitrogen | AT1-sample2 | mature leaf
AT1 | A. thaliana | sample collection | liquid nitrogen | AT1-sample3 | root
[Figure: Source Material "A thaliana 1" branching into Sample Materials: flower, mature leaf, root]
23. 2. Basic coding patterns with ISA syntax
– Pooling events:
Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
plant 1 | Fragaria ananassa | sample collection | liquid nitrogen | pool1 | fruit
plant 2 | Fragaria ananassa | sample collection | liquid nitrogen | pool1 | fruit
plant 3 | Fragaria ananassa | sample collection | liquid nitrogen | pool1 | fruit
[Figure: Source Materials plant 1, plant 2, plant 3 pooled into Sample Material: fruit]
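In the pooling pattern several rows share one Sample Name, so a pool can be recovered by grouping rows on that column. A minimal sketch, assuming rows are held as dictionaries keyed by column header; `pooled_sources` is a hypothetical helper, not an ISA tools function.

```python
from collections import defaultdict
from typing import Dict, List

# Rows of the pooling table, reduced to the two columns that matter here.
rows = [
    {"Source Name": "plant 1", "Sample Name": "pool1"},
    {"Source Name": "plant 2", "Sample Name": "pool1"},
    {"Source Name": "plant 3", "Sample Name": "pool1"},
]


def pooled_sources(rows: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each pooled sample to the distinct sources that feed it."""
    sources_by_sample: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        sources = sources_by_sample[row["Sample Name"]]
        if row["Source Name"] not in sources:   # ignore repeated rows
            sources.append(row["Source Name"])
    # A sample is a pool only if more than one source contributes to it.
    return {sample: srcs for sample, srcs in sources_by_sample.items() if len(srcs) > 1}


print(pooled_sources(rows))
```

Branching is the mirror image: grouping on Source Name instead of Sample Name lists the samples derived from each source.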
24. 2. Basic coding patterns with ISA syntax
– Representing interventions and treatments
• expressing treatments as sets of factor levels
• example: exposure to different doses of systemic herbicide
• Factors will be ‘compound’, ‘dose’ and ‘duration’
• (what?, how much?, how long for?)
• Implicit column order matters but this is independent from the ISA syntax specification:
Source Name | Characteristics[organism] | Protocol REF | Factor Value[compound] | Factor Value[dose] | Factor Value[duration]
Plant 1 | Zea mays | treatment | glyphosate | 250 mg/day | 12 weeks
Plant 2 | Zea mays | treatment | glyphosate | 250 mg/day | 12 weeks
Plant 3 | Zea mays | treatment | glyphosate | 20 mg/day | 12 weeks
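Treating a treatment as a set of factor levels means that rows with identical Factor Value cells belong to the same treatment group. The sketch below groups the three plants from the table on that basis; the `treatment_groups` helper and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

FACTORS = ["Factor Value[compound]", "Factor Value[dose]", "Factor Value[duration]"]

rows = [
    {"Source Name": "Plant 1", "Factor Value[compound]": "glyphosate",
     "Factor Value[dose]": "250 mg/day", "Factor Value[duration]": "12 weeks"},
    {"Source Name": "Plant 2", "Factor Value[compound]": "glyphosate",
     "Factor Value[dose]": "250 mg/day", "Factor Value[duration]": "12 weeks"},
    {"Source Name": "Plant 3", "Factor Value[compound]": "glyphosate",
     "Factor Value[dose]": "20 mg/day", "Factor Value[duration]": "12 weeks"},
]


def treatment_groups(rows: List[Dict[str, str]]) -> Dict[Tuple[str, ...], List[str]]:
    """Group source names by their combination of factor levels."""
    groups: Dict[Tuple[str, ...], List[str]] = {}
    for row in rows:
        levels = tuple(row[factor] for factor in FACTORS)  # what, how much, how long
        groups.setdefault(levels, []).append(row["Source Name"])
    return groups


for levels, members in treatment_groups(rows).items():
    print(levels, members)
```

The tuple is built in the order compound, dose, duration, which is the implicit column order the slide refers to.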
25. 2. Basic coding patterns with ISA syntax
– Tagging with Terminologies
• ISA tools (ISAcreator - ISAconfigurator) provide Ontology term selection and term tagging facilities to help users.
With term tagging:
Source Name | Characteristics[ORGANISM] | Term Source REF | Term Accession Number | Characteristics[AGE] | Unit | Term Source REF | Term Accession Number | Factor Value[COMPOUND (htppt://purl] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | 12 | week | UO | UO:wwerwta | aspirin | CHEBI | 1231354
Without term tagging:
Source Name | Characteristics[ORGANISM] | Characteristics[AGE] | Factor Value[COMPOUND]
individual1 | human | 12 weeks | aspirin
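Term tagging turns one free-text cell into a value followed by its Term Source REF and Term Accession Number. The sketch below shows that expansion for the organism column of the example; the `Term` type and `tagged_columns` function are hypothetical names, and this is not how ISAcreator itself is implemented.

```python
from typing import List, NamedTuple, Tuple


class Term(NamedTuple):
    label: str      # human-readable value shown in the cell
    source: str     # ontology or terminology the term comes from
    accession: str  # identifier of the term within that source


def tagged_columns(header: str, term: Term) -> List[Tuple[str, str]]:
    """Expand one annotated value into its three (header, cell) pairs."""
    return [
        (header, term.label),
        ("Term Source REF", term.source),
        ("Term Accession Number", term.accession),
    ]


organism = Term("Homo sapiens", "NCBITax", "9606")
columns = tagged_columns("Characteristics[ORGANISM]", organism)
print("\t".join(header for header, _ in columns))
print("\t".join(cell for _, cell in columns))
```

A list of pairs is used instead of a dictionary because the headers Term Source REF and Term Accession Number repeat once per tagged column in the same row.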
26. ISA syntax boundaries
● Any model is a compromise between granularity and simplicity
● Some cases are hard to represent
– crossover design with dissimilar arms
– representing mixtures of chemicals
– representing loops (with donors and recipients)
● Reaching the limits of how graphs can be efficiently represented in tables
27. Plant H-T Phenotyping worked example
– A case of simple non-destructive HTP:
– 60 genotypes x 5 replicates: 12 trays of 25 pots each
– 1 seed per pot gives us 300 individual plants
– experiment duration: 35 days
– single daily data acquisition:
• visible light: 3 angles + top view = 4 images
• near infrared: 3 angles + top view = 4 images
• fluorescence: 1 angle = 1 image
• TOTAL: 9 images per plant per day
– Grand Total: 94,500 files to store and track
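The grand total follows directly from the figures on the slide; the short calculation below just reproduces the arithmetic (variable names are illustrative).

```python
genotypes = 60
replicates = 5
trays, pots_per_tray = 12, 25
days = 35

# Images taken per plant at each daily acquisition, by imaging modality.
images_per_plant_per_day = {
    "visible light": 4,   # 3 angles + top view
    "near infrared": 4,   # 3 angles + top view
    "fluorescence": 1,    # 1 angle
}

plants = genotypes * replicates                    # 300
daily_images = sum(images_per_plant_per_day.values())  # 9
total_files = plants * days * daily_images         # 94500

# The tray layout gives the same number of pots as genotypes x replicates.
assert plants == trays * pots_per_tray

print(plants, daily_images, total_files)
```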
28. Plant H-T Phenotyping worked example
– Decomposing the experiment in terms of ISA elements
– Identifying key experimental variables:
• independent variables => used to define ISA Factors and/or Characteristics
– Factor = {genotype}, Factor Values[G1..G60] = 60 distinct values
– Factor = {day}, Factor Values[day1..day35] = 35 distinct values
• response variables => used to define 3 distinct ISA Assays
– morphology using visible light imaging
» ISA parameters to track ‘camera position’ {top,left,right,centre}
– water content using near infrared imaging
» ISA parameters to track ‘camera position’ {top,left,right,centre}
– photosynthetic pigment concentration using fluorescence imaging
» ISA parameters to track ‘camera position’ {top}
29. Plant H-T Phenotyping worked example
– Decomposing the experiment in terms of ISA elements
– Identifying key experimental variables:
• independent variables => used to define ISA Factors and/or Characteristics
– Factor = {genotype}, Factor Values[ ] = 60 distinct values
– Factor = {day}, Factor Values[ ] = 35 distinct values
• Automatic creation and filling of ISA Study Sample files
– 60 x 35 = 2100 factor combinations
– 5 replicates per factor combination => 10500 samples (the 300 pots, 1 seed per pot, each observed on 35 days)
– Translated into:
» 1 ISA study file with 10500 rows on the following pattern
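Automatic creation of the study sample rows can be sketched with a Cartesian product over genotypes, replicates and days. The identifier scheme (`G1-rep1`, `G1-rep1-day1`) and the reduced set of columns are assumptions made for illustration; a real study file would carry further columns such as Characteristics and Protocol REF.

```python
import csv
import io
from itertools import product

GENOTYPES = [f"G{i}" for i in range(1, 61)]   # 60 factor levels
DAYS = [f"day{i}" for i in range(1, 36)]      # 35 factor levels
REPLICATES = range(1, 6)                      # 5 replicates per genotype


def study_rows():
    """Yield one row per genotype x replicate x day combination."""
    for genotype, replicate, day in product(GENOTYPES, REPLICATES, DAYS):
        source = f"{genotype}-rep{replicate}"  # hypothetical naming scheme
        yield {
            "Source Name": source,
            "Sample Name": f"{source}-{day}",
            "Factor Value[genotype]": genotype,
            "Factor Value[day]": day,
        }


rows = list(study_rows())

# Serialise as tab-delimited text, header first.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), delimiter="\t",
                        lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

print(len(rows))
print(buffer.getvalue().splitlines()[1])
```

Generating the rows rather than typing them keeps the 60 x 35 x 5 design and the table in step if any of the three numbers changes.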
30. Plant H-T Phenotyping worked example
• Declaring and annotating an ISA Source Node
• ISA Protocol Application with sets of Parameter Values resulting in an ISA Sample Node
• Reporting of independent variables as ISA Factor Values
31. Plant H-T Phenotyping worked examples
– Decomposing the experiment in terms of ISA elements
– Identifying key experimental variables:
• response variables => used to define 3 distinct ISA Assays
– morphology using visible light imaging
» ISA parameters to track ‘camera position’ {top,left,right,centre}
– water content using near infrared imaging
» ISA parameters to track ‘camera position’ {top,left,right,centre}
– photosynthetic pigment concentration using fluorescence imaging
» ISA parameters to track ‘camera position’ {top}
32. Plant H-T Phenotyping worked examples
• Describing a data acquisition event
• ISA Protocol Application of type Data Transformation with sets of Parameter Values resulting in an ISA Derived Data File
• Reporting of independent variables as ISA Factor Values