The document discusses the Investigation/Study/Assay (ISA) metadata framework for making bioscience research more reproducible and reusable. It describes the need for contextual metadata standards across different experimental domains and technologies. It also addresses challenges of multiple standards and communities leading to issues with interoperability. The framework aims to provide well-annotated, structured data to facilitate integration and reuse of research studies and experiments.
This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
1. The Investigation/Study/Assay (ISA) metadata framework for reproducible and reusable bioscience research
Alejandra González-Beltrán, PhD, on behalf of the ISA Team
Oxford e-Research Centre, University of Oxford
Faculty of Technology, Environment and Engineering, Birmingham City University
12th March 2013
2. Ioannidis et al., Repeatability of published microarray gene expression analyses. Nature Genetics 41(2), 149-55 (2009) doi:10.1038/ng.295
3. Ioannidis et al., Repeatability of published microarray gene expression analyses. Nature Genetics 41(2), 149-55 (2009) doi:10.1038/ng.295
8. Need for a generic representation, applied to:
• microarray based experiments (MAGE)
• sequencing based experiments (SRA)
• flow cytometry based experiments (FuGE-Flow Cyt)
• mass spectrometry and NMR spectroscopy experiments (Metabolights and PRIDE)
9. Roadmap
Reproducible & Reusable Bioscience Research
10. Roadmap
reasoning, visualization, analysis, browsing, integration, exchange, retrieval
Well-annotated & Structured Data
Reproducible & Reusable Bioscience Research
11. Roadmap
reasoning, visualization, analysis, browsing, integration, exchange, retrieval
Well-annotated & Structured Data
Reproducible & Reusable Bioscience Research
User community
12. Roadmap
reasoning, visualization, analysis, browsing, integration, exchange, retrieval
Community Standards
Software Tools
Well-annotated & Structured Data
Reproducible & Reusable Bioscience Research
User community
15. Bioscience is multi-domain… (health, env, agro, tox/pharma)
Interdisciplinary and integrative in character
• need to deal with new and existing datasets
• deal with a variety of data types
Source of the figure: EBI website
16. Multiple communities, multiple norms and standards, e.g.:
• use the same term to refer to the same ‘thing’
• allow data to flow from one system to another
• report the same core, essential information
Challenges: lack of interaction and coordination, duplication of effort, fragmentation and uneven coverage… hinders interoperability
18. But… what do we know about them and how they are related?
MAGE-Tab, AAO, MIAME, GCDML, MIAPA, CHEBI, GIATE, SRAxml, OBI, MIRIAM, VO, SOFT, MIQAS, FASTA, PATO, MIX, CML, ENVO, REMARK, DICOM, MIGEN, GELML, MOD, SBRML, MIAPE, MIQE, TEDDY, MITAB, MzML, XAO, CIMR, CONSORT, BTO, ISA-Tab, SEDML…, DO, PRO, IDO…, MIASE, MISFISHIE…
19. But… what do we know about them and how they are related?
• I use high throughput sequencing technologies, which ones are relevant to me?
• Which tools and databases implement which standards?
• How can I get involved to propose extensions or modifications?
• What are the criteria to evaluate their status and value?
• Which ones are mature enough for me to use or recommend?
• Which formats support specific minimum information guidelines?
• I work on plants, are these standards just for biomedical applications?
20. A coherent, curated and searchable catalogue of data sharing resources
• Bioscience standards and associated data-sharing policies, publications, tools and databases
• Assessment criteria for usability and popularity of standards
• Relationships among standards
• Encouragement for communication & interaction among groups
• Promoting interoperability & informed decisions about standards
21. infrastructure
22. ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level
Rocca-Serra et al, 2010 Bioinformatics
• Assist in the annotation and management of experimental metadata at source, supporting data provenance tracking
• Deal with high-throughput studies using one or a combination of omics and other technologies
• Empower users to take up community-defined checklists and ontologies
• Facilitate data sharing, re-use, comparison and reproducibility of experiments, submission to international public repositories
23.
24.
25.
26.
27. faahKO dataset
• Available in Bioconductor
• Subset of the original data on global metabolite profiling (Saghatelian et al. Biochemistry. 2004)
• LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH (fatty acid amide hydrolase) knockout mice
28. faahKO investigation
- Define key entities (e.g. factors, protocols, parameters)
- Grouping of studies
- Relate studies and assays
29. faahKO study
- Subjects studied: source(s), sampling methodology, characteristics
- treatments/manipulations performed to prepare the specimens
NEWT UniProt Taxonomy Database; Mouse Genome Informatics
30. faahKO study
- Subjects studied: source(s), sampling methodology, characteristics
- treatments/manipulations performed to prepare the specimens
Mouse Adult Gross Anatomy
31. faahKO assay
- measurement type, e.g. metabolite profiling
- technology, e.g. mass spectrometry
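The investigation, study and assay levels described on slides 28 to 31 nest inside one another. A minimal sketch of that nesting, assuming hypothetical class and field names (this is not the ISA software's own data model); only the measurement type and technology values come from the slides:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    measurement_type: str  # e.g. "metabolite profiling"
    technology_type: str   # e.g. "mass spectrometry"


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)


# Titles are illustrative placeholders, not taken from the faahKO files.
faahko = Investigation(
    title="faahKO investigation",
    studies=[
        Study(
            title="faahKO study",
            assays=[Assay("metabolite profiling", "mass spectrometry")],
        )
    ],
)

# Walk the hierarchy from the investigation down to each assay.
for study in faahko.studies:
    for assay in study.assays:
        print(study.title, "->", assay.measurement_type, "/", assay.technology_type)
```

An investigation groups studies, and each study relates to one or more assays, so any assay can be traced back to its project context.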
32.
33.
34. Create template(s) to fit the type of experiments to be described
Create templates detailing the steps to be reported for different investigations, complying with community standards, e.g. configuring the value(s) allowed for each field to be
• text (with/without regular expression testing),
• ontology terms,
• numbers etc.
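The idea of a template that restricts each field to text (optionally tested against a regular expression), ontology terms or numbers can be sketched as follows. This is an illustration only: the function names, the template layout and the example field rules are assumptions, not the configuration format used by the ISA tools.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

Check = Callable[[str], bool]


def text_field(pattern: Optional[str] = None) -> Check:
    """Accept any text, or only text that fully matches the given regex."""
    compiled = re.compile(pattern) if pattern else None
    return lambda value: compiled is None or compiled.fullmatch(value) is not None


def ontology_field(allowed_terms: Iterable[str]) -> Check:
    """Accept only values drawn from a fixed set of term labels."""
    allowed = set(allowed_terms)
    return lambda value: value in allowed


def number_field() -> Check:
    """Accept values that parse as a number."""
    def check(value: str) -> bool:
        try:
            float(value)
            return True
        except ValueError:
            return False
    return check


# Hypothetical template: one rule per field name.
TEMPLATE: Dict[str, Check] = {
    "Sample Name": text_field(r"sample\d+"),
    "Characteristics[organism]": ontology_field({"Homo sapiens", "Mus musculus"}),
    "Factor Value[duration]": number_field(),
}


def validate(row: Dict[str, str], template: Dict[str, Check]) -> List[str]:
    """Return the names of the fields whose values the template rejects."""
    return [name for name, check in template.items() if not check(row.get(name, ""))]


good = {"Sample Name": "sample1", "Characteristics[organism]": "Homo sapiens",
        "Factor Value[duration]": "12"}
bad = {"Sample Name": "s-1", "Characteristics[organism]": "Homo sapiens",
       "Factor Value[duration]": "twelve"}
print(validate(good, TEMPLATE))
print(validate(bad, TEMPLATE))
```

Keeping one small checker per field type means a new kind of restriction can be added without touching the validation loop.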
35. Describe, curate your experiment using a desktop-based tool
Report and edit the description using this tool (also customized using the templates), with a spreadsheet-like look and feel, packed with functionalities such as
• ontology search (access via )
• term-tagging features
• import from spreadsheets etc…
36. OntoMaton: a Bioportal powered Ontology widget for Google Spreadsheets
Maguire et al, 2013 Bioinformatics
• Ontology search and automated tagging (relying on NCBO Bioportal services) on Google Spreadsheets
• Collaborative annotation; support for distributed users
• Version control & history
37.
38.
39.
40.
41. • R package available in BioConductor 2.11
http://bioconductor.org/packages/release/bioc/html/Risa.html
• ISAtab class
• Read ISAtab files into ISAtab objects and write ISAtab files back to disk
• Increment metadata with definition of factors/treatments/groups
• Build xcmsSet (xcms package) objects from mass spectrometry assays
• Augment the ISAtab dataset after analysis
• source & issues tracking https://github.com/ISA-tools/Risa
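42. • faahKO package v. 2.12 contains ISAtab files describing the experiment
library(Risa)  # provides readISAtab, processAssayXcmsSet and updateAssayMetadata
# read the ISAtab files shipped with the faahKO package
faahkoISA = readISAtab(find.package("faahKO"))
# pick the first assay file listed in the ISAtab object
assay.filename <- faahkoISA["assay.filenames"][[1]]
# build an xcmsSet object from that mass spectrometry assay
xset = processAssayXcmsSet(faahkoISA, assay.filename)
…
# record the derived data file produced by the analysis in the assay metadata
updateAssayMetadata(faahkoISA, assay.filename, "Derived Spectral Data File", "faahkoDSDF.txt")
• MTBLS2 processing and analysis using Risa, xcms and CAMERA BioConductor packages
Metabolights – an open access general-purpose repository for metabolomics studies and associated meta-data
Haug et al, 2012 Nucleic Acids Research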
44. Sample Name | Material Type | Hybridization Assay Name | Array Design REF | Array Data File | Protocol REF | Derived Array Data File
sample1 | genomic DNA | assay1 | A-AFFY-107 | assay1.cel | data normalization | assay1.txt
sample2 | genomic DNA | assay2 | A-AFFY-107 | assay2.cel | data normalization | assay2.txt
sample3 | genomic DNA | assay3 | A-AFFY-107 | assay3.cel | data normalization | assay3.txt
[Diagram: material transformations from Material Node to Data File Node. Labels: Material, Characteristics[…], Factor Value[…] (independent variables), Material Type, Comment[…]; Protocol, Process, Parameter Value[…], Performer (operator effect), Date (day effect); DATA, Raw Data File, Derived Data File]
Tagging:
from
free
text
to
ontology-‐based
• single
interven)on
representa)on,
free
text
annota)on
Factor
Characteris)cs[organism]
Factor
Factor
Source
Name
Value[perturba)on
Value[dose]
Value[dura)on]
agent]
individual1
human
aspirin
high
dose
12
weeks
• single
interven)on,
ontology-‐based
annota)on
Factor
Characteris)cs[organism
Term
Source
Term
Accession
Value[chemical
Term
Source
Term
Accession
Source
Name
obi:0100026)])
REF
Number
compound
REF
Number
CHEBI_37577)]
individual1
Homo
sapiens
NCBITax
9606
aspirin
CHEBI
1231354
Factor
Term
Source
Term
Accession
Factor
Value[)me
Term
Source
Term
Accession
Unit
Value[dose(OBI_0000984)
REF
Number
(PATO_0000165)]
REF
Number
low
dose
LNC
LP30872-‐3
12
week
UO
0000034
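The move from free text to ontology-based annotation on slide 45 amounts to expanding one cell into a value plus its Term Source REF and Term Accession Number. A minimal sketch, assuming a hypothetical in-memory lookup table; a real tool would query an ontology service rather than a hard-coded dictionary. The two entries reuse the organism and unit accessions shown on the slide:

```python
from typing import Dict, Tuple

# Hypothetical lookup: free text -> (label, term source, accession).
TERM_LOOKUP: Dict[str, Tuple[str, str, str]] = {
    "human": ("Homo sapiens", "NCBITax", "9606"),
    "weeks": ("week", "UO", "0000034"),
}


def tag(header: str, free_text: str) -> Dict[str, str]:
    """Expand one free-text cell into the three annotation columns.

    Unknown values are kept as-is with empty source and accession.
    """
    label, source, accession = TERM_LOOKUP.get(free_text.lower(), (free_text, "", ""))
    return {
        header: label,
        "Term Source REF": source,
        "Term Accession Number": accession,
    }


print(tag("Characteristics[organism]", "human"))
print(tag("Unit", "fortnight"))
```

Leaving unmatched values untouched keeps the free text visible, so a curator can see which cells still need a term.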
46. ToxBank effort developed by Nina Jeliazkova
Kohonen et al. The ToxBank Data Warehouse: a research cluster of 7 EU FP7 Health systems toxicology and toxicogenomics projects.
Health Care & Life Sciences Interest Group
47. • Make the semantics of ISAtab explicit, including materials & data entities & processes & their relationships
• Provide incentives for provision of ontology-based annotations in ISA-TAB datasets; exploit those annotations
• Augment ISA syntax with new elements (e.g. groups), facilitating the understanding & querying of experimental design
• Facilitate data integration & knowledge discovery/reasoning
49. vocabularies: Chemical domain, Biomolecular domain, Information domain, Experimental domain
Source Name | Characteristics[organism (obi:0100026)] | Term Source REF | Term Accession Number | Factor Value[chemical compound (CHEBI_37577)] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | aspirin | CHEBI | 1231354
50. Open Biological and Biomedical Ontologies (OBO) Foundry: BFO, ChEBI, GO, IAO, OBI
Source Name | Characteristics[organism (obi:0100026)] | Term Source REF | Term Accession Number | Factor Value[chemical compound (CHEBI_37577)] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | aspirin | CHEBI | 1231354
53. faahKO dataset
Available in Bioconductor (with ISA-TAB metadata)
Global metabolite profiling
Data subset: LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH (fatty acid amide hydrolase) knockout mice
54.
55. • support different conversion modes (different levels of granularity)
• querying for ISA-TAB datasets, across multiple experiment types
• reasoning exploiting ontology annotations
  – semantic validation of ISA-TAB datasets
• augmented annotation over native ISA syntax
  – identification of gaps in ontological representations
  – feedback of findings to community ontologies
56. Increasing level of structure for experimental metadata
Notes in Lab books
Spreadsheets & Tables
Facts as RDF statements
(ISAtab metadata)
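The last step on slide 56, from table cells to facts as RDF statements, can be illustrated by writing one annotated row as N-Triples lines. This is a sketch under stated assumptions: the `http://example.org/isa/` namespace and the `name` and `organism` predicates are hypothetical and are not the vocabulary used by the ISA team's conversion; only the NCBITaxon IRI for Homo sapiens is a real identifier.

```python
BASE = "http://example.org/isa/"  # hypothetical namespace for illustration


def triple(subject: str, predicate: str, obj: str, literal: bool = False) -> str:
    """Format one N-Triples statement; obj is an IRI unless literal is True."""
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        target = f'"{escaped}"'
    else:
        target = f"<{obj}>"
    return f"<{subject}> <{predicate}> {target} ."


# One row of an ontology-annotated table, reduced to two cells.
row = {
    "Source Name": "individual1",
    "organism_iri": "http://purl.obolibrary.org/obo/NCBITaxon_9606",
}

subject = BASE + "source/" + row["Source Name"]
statements = [
    triple(subject, BASE + "name", row["Source Name"], literal=True),
    triple(subject, BASE + "organism", row["organism_iri"]),
]
print("\n".join(statements))
```

Once each cell is a statement with an ontology IRI as its object, datasets from different experiment types can be queried together on those shared identifiers.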
57.
58. Towards interoperable bioscience data
Sansone et al, 2012 Nature Genetics
A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains.