The document discusses knowledge management of experimental data through the ISA ecosystem. It describes the ISA-tab format and software suite that allows annotation and curation of experimental metadata. As a use case, it analyzes a dataset on metabolite profiling from a study of fatty acid amide hydrolase knockout mice. The ISA tools can represent investigations and assays, convert data to standardized formats, and facilitate sharing and analysis of experimental data.
Metagenomic Data Provenance and Management using the ISA infrastructure - overview, implementation patterns & software tools
Alejandra Gonzalez-Beltran
Slides presented at EBI Metagenomics Bioinformatics course: http://www.ebi.ac.uk/training/course/metagenomics2014
This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
Ontomaton: NCBO BioPortal Ontology lookups in Google Spreadsheets produced by ISATeam at University of Oxford e-Research Centre (Eamonn Maguire, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra and Susanna Sansone) and NCBO (Trish Whetzel).
The work was presented during ICBO 2013 in Montreal by Trish Whetzel (Thanks Trish!)
Being Reproducible: SSBSS Summer School 2017 - Carole Goble
Lecture 2:
Being Reproducible: Models, Research Objects and R* Brouhaha
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transferring between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns about credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in more depth using the FAIRDOM Platform and its support for reproducible modelling. The talk will cover initiatives and technical issues, and raise social and cultural challenges.
Being FAIR: FAIR data and model management SSBSS 2017 Summer School - Carole Goble
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the Life Science community, improve access to them and related tools, support training and create a platform for dataset interoperability. As the Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform I will show how this work relates to your projects.
[1] Wilkinson et al, The FAIR Guiding Principles for scientific data management and stewardship Scientific Data 3, doi:10.1038/sdata.2016.18
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks - Carole Goble
Keynote presentation at the iConference 2015, Newport Beach, California, 26 March 2015.
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
http://ischools.org/the-iconference/
BEWARE: presentation includes hidden slides AND in situ build animations - best viewed by downloading.
A keynote given on experiences in curating workflows and web services.
3rd International Digital Curation Conference: "Curating our Digital Scientific Heritage: a Global Collaborative Challenge"
11-13 December 2007
Renaissance Hotel
Washington DC, USA
RARE and FAIR Science: Reproducibility and Research Objects - Carole Goble
Keynote at JISC Digifest 2015 on Reproducibility and Research Objects in Scholarly Communication
Includes hidden slides
All material except maybe the IT Crowd screengrab reusable
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe... - Carole Goble
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure reproducibility of results. Funding agencies expect data (and increasingly software) management retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
The FAIR Guiding Principles for scientific data management and stewardship (http://www.nature.com/articles/sdata201618) have been an effective rallying-cry for EU and USA Research Infrastructures. The FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has 8 years of experience of asset sharing and data infrastructure ranging across European programmes (SysMO and EraSysAPP ERANets), national initiatives (de.NBI, German Virtual Liver Network, UK SynBio centres) and PI's labs. It aims to support Systems and Synthetic Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Sys Bio, highlighting the challenges of and approaches to sharing, credit, citation and asset infrastructures in practice. I'll also highlight recent experiments in influencing sharing using behavioural interventions.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Presented at COMBINE 2016, Newcastle, 19 September.
http://co.mbine.org/events/COMBINE_2016
The Seven Deadly Sins of Bioinformatics - Duncan Hull
Keynote talk at Bioinformatics Open Source Conference (BOSC) Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007 by Carole Goble, University of Manchester.
Keynote: SemSci 2017: Enabling Open Semantic Science
1st International Workshop co-located with ISWC 2017, October 2017, Vienna, Austria,
https://semsci.github.io/semSci2017/
Abstract
We have all grown up with the research article and article collections (let’s call them libraries) as the prime means of scientific discourse. But research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
We can think of “Research Objects” as coming in different types and as packages of all the components of an investigation. If we stop thinking of publishing papers and start thinking of releasing Research Objects (like software), then scholarly exchange is a new game: ROs and their content evolve; they are multi-authored and their authorship evolves; they are a mix of virtual and embedded resources, and so on.
But first, some baby steps before we get carried away with a new vision of scholarly communication. Many journals (e.g. eLife, F1000, Elsevier) are just figuring out how to package together the supplementary materials of a paper. Data catalogues are figuring out how to virtually package multiple datasets scattered across many repositories to keep the integrated experimental context.
Research Objects [1] (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described. The brave new world of containerisation provides the containers and Linked Data provides the metadata framework for the container manifest construction and profiles. It’s not just theory, but also in practice with examples in Systems Biology modelling, Bioinformatics computational workflows, and Health Informatics data exchange. I’ll talk about why and how we got here, the framework and examples, and what we need to do.
[1] Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Philip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, Carole Goble, Why linked data is not enough for scientists, In Future Generation Computer Systems, Volume 29, Issue 2, 2013, Pages 599-611, ISSN 0167-739X, https://doi.org/10.1016/j.future.2011.08.004
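As a loose illustration of the packaging idea, the sketch below builds a minimal manifest in the spirit of the Research Object approach. The context URL and property names follow the RO-Crate style, and the file paths and titles are invented for illustration; treat all of it as an assumption, not the normative model.

```python
import json

# Illustrative sketch only: a minimal RO-Crate-flavoured manifest for a
# "Research Object" bundling data and a workflow. Paths and names are
# hypothetical placeholders, not a real study.
def build_manifest(title, parts):
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            # The root entity describes the bundle and points at its parts.
            {"@id": "./", "@type": "Dataset", "name": title,
             "hasPart": [{"@id": p["id"]} for p in parts]},
        ] + [
            # One entry per bundled component, each typed and named.
            {"@id": p["id"], "@type": p["type"], "name": p["name"]}
            for p in parts
        ],
    }

parts = [
    {"id": "data/results.csv", "type": "File", "name": "Experimental results"},
    {"id": "workflow/analysis.cwl", "type": "SoftwareSourceCode",
     "name": "Analysis workflow"},
]
manifest = build_manifest("Metabolite profiling study", parts)
print(json.dumps(manifest, indent=2))
```

The point of the structure is that provenance and relationships live in the manifest itself, so the bundle can be exchanged and interpreted without the original narrative.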
Citing data in research articles: principles, implementation, challenges - an... - FAIRDOM
Prepared and presented by Jo McEntyre (EMBL-EBI) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany, 14-16 September 2015.
FAIR Data, Operations and Model management for Systems Biology and Systems Me... - Carole Goble
FAIR Data, Operations and Model management for Systems Biology and Systems Medicine Projects, given at the 1st Conference of the European Association of Systems Medicine, 26-28 October 2016, Berlin. The FAIRDOM project is described.
Model organisms such as budding yeast provide a common platform to interrogate and understand cellular and physiological processes. Knowledge about model organisms, whether generated during the course of scientific investigation or extracted from published articles, is made available by model organism databases (MODs) such as the Saccharomyces Genome Database (SGD) for powerful, data-driven bioinformatic analyses. Integrative platforms such as InterMine offer a standard platform for MOD data exploration and data mining. Yet today’s bioinformatic analyses also require access to a significantly broader set of structured biomedical data, such as that found in the emerging network of Linked Open Data (LOD). If MOD data could be provisioned as FAIR (Findable, Accessible, Interoperable, and Reusable), then scientists could leverage a greater amount of interoperable data in knowledge discovery.
The goal of this proposal is to increase the utility of MOD data by implementing standards-compliant data access interfaces that interoperate with Linked Data. We will focus our efforts on developing interfaces for data access, data retrieval, and query answering for SGD. Our software will publish InterMine data as LOD that are semantically annotated with ontologies and be retrieved using standardized formats (e.g. JSON-LD, Turtle). We will facilitate the exploration of MOD data for hypothesis testing, by implementing efficient query answering using Linked Data Fragments, and by developing a set of graphical user interfaces to search for data of interest, explore connections, and answer questions that leverage the wider LOD network. Finally, we will develop a locally and cloud-deployable image to enable the rapid deployment of the proposed infrastructure. Our efforts to increase interoperability and ease of deployment for biomedical data repositories will increase research productivity and reduce costs associated with data integration and warehouse maintenance.
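As a stdlib-only sketch of what "publishing MOD data as Linked Data" can look like, the snippet below serialises a single gene record as both Turtle and JSON-LD. The example.org namespace, the gene identifier and the record fields are invented placeholders, not real SGD data or the InterMine API.

```python
import json

# Hypothetical namespace standing in for a real MOD vocabulary.
NS = "http://example.org/sgd/"

# One invented gene record, as a plain dict.
record = {
    "id": NS + "S000000001",
    "type": NS + "Gene",
    "label": "GAL4",
    "organism": "Saccharomyces cerevisiae",
}

def to_turtle(rec):
    # Emit the record as a Turtle document: prefixes, then one
    # subject with a predicate-object list.
    return (
        "@prefix ex: <%s> .\n" % NS
        + "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n\n"
        + "<%s> a <%s> ;\n" % (rec["id"], rec["type"])
        + '    rdfs:label "%s" ;\n' % rec["label"]
        + '    ex:organism "%s" .\n' % rec["organism"]
    )

def to_jsonld(rec):
    # The same triples as JSON-LD: a @context mapping prefixes to IRIs,
    # plus the node's @id, @type and properties.
    return json.dumps({
        "@context": {"ex": NS,
                     "rdfs": "http://www.w3.org/2000/01/rdf-schema#"},
        "@id": rec["id"],
        "@type": rec["type"],
        "rdfs:label": rec["label"],
        "ex:organism": rec["organism"],
    }, indent=2)

print(to_turtle(record))
print(to_jsonld(record))
```

Both serialisations express the same triples, which is what makes them interchangeable formats for LOD consumers; in practice an RDF library would handle this, but the shape of the output is the same.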
BioVeL (Biodiversity Virtual e-Laboratory) is an e-laboratory that supports research on biodiversity using large amounts of data from cross-disciplinary sources.
Bioinformatics databases: Current Trends and Future Perspectives - University of Malaya
Data is the most powerful resource in any field of study. In biology, data comes from scientists and their experiments, and any institution that can make sense of the data it collects will be at the forefront of its research field. At the start of any data collection endeavour, it is critical to adopt proper management techniques to store the data and maximise its use. This presentation reflects on current trends and techniques in data modeling and architecture, with a highlight on the uses of databases, focusing on bioinformatics examples and case studies. Finally, the future of bioinformatics databases is outlined to give an overview of the modeling techniques needed to accommodate the escalation of biological data in coming years.
Talk given at the Data Visualisation and the Future of Academic Publishing event. https://www.eventbrite.com/e/data-visualisation-and-the-future-of-academic-publishing-tickets-25372801733?password=dataviz
From peer-reviewed to peer-reproduced: a role for research objects in scholar... - Alejandra Gonzalez-Beltran
The reproducibility of science in the digital age is attracting a lot of attention and concern from the scientific community, where studies have shown an inability to reproduce results for a variety of reasons, ranging from unavailability of the data to the lack of proper descriptions of the experimental steps.
Multiple research object models have been proposed to describe different aspects of the research process. Investigation/Study/Assay (ISA) is a widely used general-purpose metadata tracking framework with an associated suite of open-source software, which offers a rich description of the experiment’s hypotheses and design, the investigators involved, experimental factors, and the protocols applied. The information is organised in a three-level hierarchy where an ’Investigation’ provides the project context for a ’Study’ (a research question), which itself contains one or more ’Assays’ (the analytical measurements and key data processing and analysis steps). Nanopublication (NP) is a research object model which enables specific scientific assertions, such as the conclusions of an experiment, to be annotated with supporting evidence, published and cited. Lastly, the Research Object (RO) is a model that enables the aggregation of the digital resources contributing to the findings of computational research, including results, data and software, as citable compound digital objects.
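The three-level Investigation/Study/Assay hierarchy described above can be sketched with plain Python dataclasses. These classes are illustrative stand-ins for the structure only, not the actual isatools API, and the study titles and factor names are invented.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative stand-ins for ISA's three-level hierarchy:
# an Investigation contains Studies, each Study contains Assays.

@dataclass
class Assay:
    measurement_type: str   # e.g. "metabolite profiling"
    technology: str         # e.g. "mass spectrometry"

@dataclass
class Study:
    title: str                                   # the research question
    factors: List[str] = field(default_factory=list)   # experimental factors
    assays: List[Assay] = field(default_factory=list)

@dataclass
class Investigation:
    title: str                                   # overall project context
    studies: List[Study] = field(default_factory=list)

# Hypothetical contents echoing the paper's use case.
inv = Investigation(title="FAAH knockout metabolite profiling")
study = Study(title="Effect of FAAH knockout on metabolite levels",
              factors=["genotype"])
study.assays.append(Assay("metabolite profiling", "mass spectrometry"))
inv.studies.append(study)

print(f"{inv.title}: {len(inv.studies)} study, "
      f"{len(inv.studies[0].assays)} assay")
```

The containment relationships, rather than any particular field names, are the point: each level adds context that the level below inherits when the metadata is interpreted.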
For computational reproducibility, platforms such as Taverna and Galaxy are popular and efficient ways to represent the data analysis steps in the form of reusable workflows, where the data transformations can be specified and executed in an automatic way.
In this presentation, we will address the question of whether such research object models and workflow representation frameworks can be used to assist in the peer review process, by facilitating evaluation of the accuracy of the information provided by scientific articles with respect to their repeatability.
Our case study is based on an article on a genome assembler algorithm published in GigaScience, but given the proven use of these research object models in their respective communities, we argue that the combination of models and a workflow system will improve the scholarly publishing process, making science peer-reproduced.
Increased access to the data generated is fuelling increased consumption and accelerating the cycle of discovery. But the successful integration and re-use of heterogeneous data from multiple providers and scientific domains is a major challenge within academia and industry, often due to incomplete description of the study details or metadata about the study. Using the BioSharing, ISA Commons and the STATistics Ontology (STATO) projects as exemplar community efforts, in this breakout session we will discuss the evolving portfolio of community-based standards and methods for structuring and curating datasets, from experimental descriptions to the results of analysis.
http://www.methodsinecologyandevolution.org/view/0/events.html#Data_workshop
Seminario en CIFASIS, Rosario, Argentina - Seminar at CIFASIS, Rosario, Argen... - Alejandra Gonzalez-Beltran
Experimental biology has become a data-intensive science, thanks to advances in digital signal acquisition technologies and biosensors. Data availability is fundamental to the transparency of the scientific process: to be able to reproduce results, and also to reuse the data in future studies. This talk will explore different software tools that facilitate the process of generating metadata to improve data quality, reporting, publication and review, with an emphasis on biomedical applications.
Slides introducing the session on 'Big data in healthcare' at the Brazil-UK Frontiers of Engineering symposium held at Jarinu, Sao Paulo, Brazil - 6-8 November 2014.
1. The open source ISA software suite and its international user community: Knowledge management of experimental data
Alejandra González-Beltrán, Senior Software Engineer, ISATeam
Oxford e-Research Centre, University of Oxford, Oxford, UK
NETTAB 2012 – Integrated Bio-Search, Como, Italy, November 14-16
2. Outline
• Knowledge management of experimental data
– Setting the scene
– The ecosystem: ISA-tab, tools, community
– Use case
• Latest additions
• Related projects & main points
3. Setting the scene
health | agro | env | tox/pharma
Bioscience is multi-domain…
Source of the figure: EBI website
4. Setting the scene
health | agro | env | tox/pharma
Bioscience is multi-domain… Petabytes of data
Source of the figure: EBI website
5. Setting the scene
health | agro | env | tox/pharma
Bioscience is multi-domain… Petabytes of data… Experimental metadata in lab books
Source of the figure: EBI website
6. investigation | study | assay
• Assist in the annotation and management of experimental data at source
• Deal with data from high-throughput studies using one or a combination of omics and other technologies
• Empower users to take up community-defined checklists and ontologies
• Facilitate data sharing, reuse, comparison and reproducibility of experiments, and submission to international public repositories
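The investigation/study/assay hierarchy described above can be sketched as a minimal object model. This is a hypothetical illustration of the structure only, not the actual ISA software API; all class and field names are assumptions:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Assay:
    measurement_type: str   # e.g. "metabolite profiling"
    technology: str         # e.g. "mass spectrometry"

@dataclass
class Study:
    identifier: str
    assays: List[Assay] = field(default_factory=list)

@dataclass
class Investigation:
    identifier: str
    studies: List[Study] = field(default_factory=list)

# One investigation groups one or more studies; each study has one or more assays.
inv = Investigation("faahKO-inv")
study = Study("faahKO-study")
study.assays.append(Assay("metabolite profiling", "mass spectrometry"))
inv.studies.append(study)
print(len(inv.studies), inv.studies[0].assays[0].technology)  # → 1 mass spectrometry
```

The nesting mirrors the ISA model: the investigation is the container that relates studies, and assays hang off a study with their measurement type and technology.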
8. The ecosystem
ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level — Rocca-Serra et al., 2010, Bioinformatics
Towards interoperable bioscience data — Sansone et al., 2012, Nature Genetics
9. General purpose & flexible format
Domain agnostic
Captures metadata in omics experiments and traditional experiments (e.g. clinical chemistry and histology)
10. faahKO dataset
• Available in BioConductor
• Subset of the original data on global metabolite profiling — Saghatelian et al., Biochemistry, 2004
• LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH (fatty acid amide hydrolase) knockout mice
11. faahKO investigation
- Define key entities (e.g. factors, protocols, parameters)
- Grouping of studies
- Relate studies and assays
12. faahKO study
- Subjects studied: source(s), sampling methodology, characteristics
- Treatments/manipulations performed to prepare the specimens
Ontologies: NEWT UniProt Taxonomy Database; Mouse Genome Informatics
13. faahKO study
- Subjects studied: source(s), sampling methodology, characteristics
- Treatments/manipulations performed to prepare the specimens
Ontology: Mouse Adult Gross Anatomy
14. faahKO assay
- Measurement type, e.g. metabolite profiling
- Technology, e.g. mass spectrometry
18. Report and edit the description of the investigation using Google Spreadsheets. Use Google Spreadsheets in combination with ISA-Tab templates (created by importing the Excel file from the ISAconfigurator) and OntoMaton (for ontology search and tagging support) to report an investigation.
19. Ontology Search and Tagging in Google Spreadsheets
- Collaborative annotation
- Distributed groups of users
- Version control & history
20. Create templates detailing the steps to be reported for different investigations, complying with community standards (listed at ), e.g. configuring fields to be (i) ontology terms, (ii) text (with/without regular expression testing), (iii) numbers, etc.
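The three field types mentioned (ontology terms, regex-tested text, numbers) amount to a simple per-field validation rule. The sketch below is a hypothetical illustration of that idea, not the ISAconfigurator's actual implementation; the function name and parameters are assumptions:

```python
import re

def validate(value, field_type, pattern=None, allowed_terms=None):
    """Check a metadata value against a template field declaration."""
    if field_type == "ontology":
        # Value must be one of the terms drawn from the configured ontology.
        return value in (allowed_terms or set())
    if field_type == "text":
        # Free text, optionally constrained by a regular expression.
        return pattern is None or re.fullmatch(pattern, value) is not None
    if field_type == "number":
        try:
            float(value)
            return True
        except ValueError:
            return False
    return False

print(validate("Mus musculus", "ontology", allowed_terms={"Mus musculus"}))  # → True
print(validate("KO1", "text", pattern=r"(KO|WT)\d+"))                        # → True
print(validate("6.5", "number"))                                             # → True
```

Declaring field types in a template this way is what lets a curation tool flag malformed entries at data-entry time rather than after submission.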
21. From the ISA-Tab we can perform analysis, and convert to RDF/OWL and other formats for submission/sharing to local/remote repositories.
22. From the ISA-Tab we can perform analysis, and convert to RDF/OWL and other formats for submission/sharing to local/remote repositories + visualisation methods.
23. faahKO Groups | faahKO Workflow
Maguire E, Rocca-Serra P, Sansone SA, Davies J and Chen M. Taxonomy-based Glyph Design -- with a Case Study on Visualizing Workflows of Biological Experiments, IEEE Transactions on Visualization and Computer Graphics, volume 18, 2012 (in press)
24. Risa
• R package available in BioConductor 2.11: http://bioconductor.org/packages/release/bioc/html/Risa.html
• ISAtab class
• Read ISAtab files into ISAtab objects and save ISAtab files
• Build xcmsSet (xcms package) objects from mass spectrometry assays
• Augment the ISAtab dataset after analysis
• Source & issue tracking: https://github.com/ISA-tools/Risa
25. • faahKO package v. 2.12 contains ISAtab files describing the experiment

faahkoISA = readISAtab(find.package("faahKO"))
assay.filename <- faahkoISA["assay.filenames"][[1]]
xset = processAssayXcmsSet(faahkoISA, assay.filename)
…
updateAssayMetadata(faahkoISA, assay.filename, "Derived Spectral Data File", "faahkoDSDF.txt")

• MTBLS2 processing and analysis using Risa, xcms and CAMERA BioConductor packages
MetaboLights – an open access general-purpose repository for metabolomics studies and associated meta-data — Haug et al., 2012, Nucleic Acids Research
26. ISA syntax & underlying material/data workflows
(Input) Material or Data Node → Protocol REF → (Output) Material or Data Node
Material/Data nodes carry Characteristics[…] and Factor Value[…] columns; Protocol REF steps carry Parameter Value[…] columns.
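The column pattern above (node columns qualified by Characteristics[…]/Factor Value[…], joined by Protocol REF steps with Parameter Value[…]) can be illustrated with a minimal parser over a made-up tab-separated study table. The table contents are hypothetical data invented for illustration, not a real ISA-Tab file:

```python
import csv, io

# Two-row study table following the ISA-Tab column pattern:
# source node -> Protocol REF -> sample node, with qualifier columns in between.
table = (
    "Source Name\tCharacteristics[genotype]\tProtocol REF\t"
    "Parameter Value[extraction method]\tSample Name\tFactor Value[genotype]\n"
    "mouse1\tFAAH knockout\tsample collection\tsolvent A\tKO1\tFAAH knockout\n"
    "mouse2\twild type\tsample collection\tsolvent A\tWT1\twild type\n"
)

rows = list(csv.DictReader(io.StringIO(table), delimiter="\t"))
for r in rows:
    # Each row encodes one path through the experimental workflow graph.
    print(r["Source Name"], "->", r["Protocol REF"], "->", r["Sample Name"])
# → mouse1 -> sample collection -> KO1
# → mouse2 -> sample collection -> WT1
```

Reading each row left to right recovers one path through the experimental workflow, which is exactly the graph structure the RDF conversion on the following slides makes explicit.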
27. • Make the semantics of ISAtab explicit, including materials & data entities & processes
• Exploit the semantic annotations available in ISAtab datasets
• Augment ISA syntax with new elements (e.g. groups), facilitating the understanding & querying of experimental design
• Facilitate data integration & knowledge discovery/reasoning
28. ISAtab datasets as linked data
• Connect to the growing Linked Data universe (RDF = Resource Description Framework, OWL = Web Ontology Language)
• Collaborations with Toxbank ( ) & W3C Health Care & Life Sciences Interest Group (HCLSIG)
<subject, predicate, object>
<lipoprotein> <participates_in> <inflammatory response>
<PRO:212342352> <BFO_0000056> <GO:0006954>
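The <subject, predicate, object> pattern on the slide can be written out as an N-Triples line with plain string formatting. The base URI below is an illustrative assumption (OBO-style PURLs with colons rewritten to underscores), not a serialisation taken from the ISA tools:

```python
# Serialise the slide's example triple (lipoprotein participates_in
# inflammatory response) as a single N-Triples statement.
# BASE is an assumed OBO PURL prefix, used here for illustration only.
BASE = "http://purl.obolibrary.org/obo/"

def ntriple(subject, predicate, obj):
    """Format three term identifiers as one N-Triples line."""
    return f"<{BASE}{subject}> <{BASE}{predicate}> <{BASE}{obj}> ."

line = ntriple("PRO_212342352", "BFO_0000056", "GO_0006954")
print(line)
```

Once statements like this are emitted for every material, data file and process in an ISAtab dataset, they can be loaded into any triple store and queried alongside other Linked Data sources.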
29. ISAtab dataset → ISAtab Graph Parser / ISA Mapping Parser → Analysis
31. (RDF graph of the faahKO workflow)
Saghantelian_1 sample collection: has specified output → KO1 (type: material entity)
extraction (material processing): has specified input → KO1; has specified output → KO1_extract (type: processed material; derives from KO1)
mass spectrometry: has specified input → KO1_extract; has specified output → ./cdf/KO/ko15.CDF (type: information content entity; derives from KO1_extract)
32. Increasing level of structure… …different target audiences
Notes in lab books (information for humans) → Spreadsheets & tables (ISAtab metadata) → Facts as RDF statements (information for machines)
35. Implementation at the EBI
http://www.ebi.ac.uk/metabolights
MetaboLights – an open access general-purpose repository for metabolomics studies and associated meta-data — Haug et al., 2012, Nucleic Acids Research