Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Carole Goble
Keynote presentation at iConference 2015, Newport Beach, California, 26 March 2015.
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
http://ischools.org/the-iconference/
BEWARE: presentation includes hidden slides AND in situ build animations - best viewed by downloading.
Findable Accessible Interoperable Reusable <data | models | SOPs | samples | articles | *>. FAIR is a mantra; a meme; a myth; a mystery; a moan. For the past 15 years I have been working on FAIR in a range of Life Science projects and initiatives. Some are top-down, like the Life Science European Research Infrastructures ELIXIR and ISBE, and some are bottom-up, supporting research projects in Systems and Synthetic Biology (FAIRDOM), Biodiversity (BioVeL), and Pharmacology (Open PHACTS), for example. Some have become movements, like Bioschemas, the Common Workflow Language and Research Objects. Others focus on cross-cutting approaches to reproducibility, computational workflows, metadata representation and scholarly sharing & publication. In this talk I will relate a series of FAIRy tales. Some of them are Grimm. Some have happy endings. Who are the villains and who are the heroes? What morals can we draw from these stories?
RARE and FAIR Science: Reproducibility and Research Objects
Carole Goble
Keynote at JISC Digifest 2015 on Reproducibility and Research Objects in Scholarly Communication
Includes hidden slides
All material, except maybe the IT Crowd screengrab, is reusable.
Research Objects: more than the sum of the parts
Carole Goble
Workshop on Managing Digital Research Objects in an Expanding Science Ecosystem, 15 Nov 2017, Bethesda, USA
https://www.rd-alliance.org/managing-digital-research-objects-expanding-science-ecosystem
Research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
A first step is to think of Digital Research Objects as a broadening out to embrace these artefacts or assets of research. The next is to recognise that investigations use multiple, interlinked, evolving artefacts. Multiple datasets and multiple models support a study; each model is associated with datasets for construction, validation and prediction; an analytic pipeline has multiple codes and may be made up of nested sub-pipelines, and so on. Research Objects (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described.
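To make this concrete, the sketch below builds a minimal Research Object manifest in Python: it aggregates a dataset, a model and a workflow and records simple relationships between them. The identifiers, file names and the @context URL are hypothetical, and the structure only loosely follows the researchobject.org manifest vocabulary rather than reproducing any exact specification.

```python
import json
from datetime import date

# A minimal, hypothetical Research Object manifest: it aggregates the
# artefacts of a study and annotates the relationships between them.
# Identifiers and the @context URL are illustrative only.
manifest = {
    "@context": "https://w3id.org/ro/context",         # placeholder Linked Data context
    "id": "ro:example-study",                           # hypothetical identifier
    "createdOn": date.today().isoformat(),
    "createdBy": {"name": "Example Researcher"},
    "aggregates": [
        {"uri": "data/measurements.csv",   "mediatype": "text/csv"},
        {"uri": "model/kinetic-model.xml", "mediatype": "application/sbml+xml"},
        {"uri": "workflow/analysis.cwl",   "mediatype": "text/x-cwl"},
    ],
    "annotations": [
        # Context, provenance and relationships between the parts.
        {"about": "model/kinetic-model.xml",
         "content": {"wasDerivedFrom": "data/measurements.csv"}},
        {"about": "workflow/analysis.cwl",
         "content": {"usesInput": "data/measurements.csv"}},
    ],
}

print(json.dumps(manifest, indent=2))
```

The point is not the particular keys but that the package carries machine-readable links between its parts, so an investigation can be exchanged and interpreted as a whole.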
Reproducibility Using Semantics: An Overview
dgarijo
An overview of the different approaches for addressing reproducibility (using semantics) in laboratory protocols, workflow description and publication, and workflow infrastructure. Research Objects are also introduced as a means to capture the context and annotations of scientific experiments, together with the privacy and IPR concerns that may arise. This presentation was given at Dagstuhl Seminar 16041: http://www.dagstuhl.de/16041
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
Carole Goble
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is, the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure reproducibility of results. Funding agencies expect data (and increasingly software) management, retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes for pre- and post-publication. The multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
The FAIR Guiding Principles for scientific data management and stewardship (http://www.nature.com/articles/sdata201618) have been an effective rallying-cry for EU and USA Research Infrastructures. The FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has 8 years of experience of asset sharing and data infrastructure, ranging across European programmes (the SysMO and ERASysAPP ERANets), national initiatives (de.NBI, the German Virtual Liver Network, UK SynBio centres) and PIs' labs. It aims to support Systems and Synthetic Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs and models for Systems Biology, highlighting the challenges of, and approaches to, sharing, credit, citation and asset infrastructures in practice. I'll also highlight recent experiments in encouraging sharing using behavioural interventions.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Presented at COMBINE 2016, Newcastle, 19 September.
http://co.mbine.org/events/COMBINE_2016
Keynote: SemSci 2017: Enabling Open Semantic Science
1st International Workshop co-located with ISWC 2017, October 2017, Vienna, Austria
https://semsci.github.io/semSci2017/
Abstract
We have all grown up with the research article and article collections (let’s call them libraries) as the prime means of scientific discourse. But research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
We can think of “Research Objects” as coming in different types and as packages of all the components of an investigation. If we stop thinking of publishing papers and start thinking of releasing Research Objects (as we release software), then scholarly exchange is a new game: ROs and their content evolve; they are multi-authored and their authorship evolves; they are a mix of virtual and embedded, and so on.
But first, some baby steps before we get carried away with a new vision of scholarly communication. Many journals (e.g. eLife, F1000, Elsevier) are just figuring out how to package together the supplementary materials of a paper. Data catalogues are figuring out how to virtually package multiple datasets scattered across many repositories to keep the integrated experimental context.
Research Objects [1] (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described. The brave new world of containerisation provides the containers and Linked Data provides the metadata framework for the container manifest construction and profiles. It’s not just theory, but also in practice with examples in Systems Biology modelling, Bioinformatics computational workflows, and Health Informatics data exchange. I’ll talk about why and how we got here, the framework and examples, and what we need to do.
[1] Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Philip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, Carole Goble, Why linked data is not enough for scientists, In Future Generation Computer Systems, Volume 29, Issue 2, 2013, Pages 599-611, ISSN 0167-739X, https://doi.org/10.1016/j.future.2011.08.004
Being Reproducible: SSBSS Summer School 2017
Carole Goble
Lecture 2:
Being Reproducible: Models, Research Objects and R* Brouhaha
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transfer between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns about credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in more depth using the FAIRDOM Platform and its support for reproducible modelling. The talk will cover initiatives and technical issues, and raise social and cultural challenges.
Being FAIR: Enabling Reproducible Data Science
Carole Goble
Talk presented in the Data Science session at the Early Detection of Cancer Conference, OHSU, Portland, Oregon, USA, 2-4 Oct 2018, http://earlydetectionresearch.com/
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Carole Goble
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is, the “assets” of data, models, codes, SOPs and workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management, retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post-publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements, that coordinates and sustains key data repositories and archives for the Life Science community, improves access to them and related tools, supports training and creates a platform for dataset interoperability. As the Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform I will show how this work relates to your projects.
[1] Wilkinson et al, The FAIR Guiding Principles for scientific data management and stewardship Scientific Data 3, doi:10.1038/sdata.2016.18
This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
Scott Edmunds talk at AIST: Overcoming the Reproducibility Crisis: and why I ...
GigaScience, BGI Hong Kong
Scott Edmunds talk at the AIST Computational Biology Research Center in Tokyo: Overcoming the Reproducibility Crisis: and why I stopped worrying and learned to love open data (& methods), July 1st 2014
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
Alejandra Gonzalez-Beltran
Metagenomic Data Provenance and Management using the ISA infrastructure - overview, implementation patterns & software tools
Slides presented at EBI Metagenomics Bioinformatics course: http://www.ebi.ac.uk/training/course/metagenomics2014
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
Carole Goble
Keynote given by Carole Goble on 23rd July 2013 at ISMB/ECCB 2013
http://www.iscb.org/ismbeccb2013
How could we evaluate research and researchers? Reproducibility underpins the scientific method: at least in principle, if not in practice. The willing exchange of results and the transparent conduct of research can only be expected up to a point in a competitive environment. Contributions to science are acknowledged, but not if the credit is for data curation or software. From a bioinformatics viewpoint, how far could our results be reproducible before the pain is just too high? Is open science a dangerous, utopian vision or a legitimate, feasible expectation? How do we move bioinformatics from results that are post-hoc "made reproducible" to results that are pre-hoc "born reproducible"? And why, in our computational information age, do we communicate results through fragmented, fixed documents rather than cohesive, versioned releases? I will explore these questions drawing on 20 years of experience in both the development of technical infrastructure for Life Science and the social infrastructure in which Life Science operates.
Scott Edmunds talk on Big Data Publishing at the "What Bioinformaticians need to know about digital publishing beyond the PDF" workshop at ISMB 2013, July 22nd 2013
From peer-reviewed to peer-reproduced: a role for research objects in scholar...
Alejandra Gonzalez-Beltran
The reproducibility of science in the digital age is attracting considerable attention and concern from the scientific community: studies have shown an inability to reproduce results for a variety of reasons, ranging from unavailability of the data to the lack of proper descriptions of the experimental steps.
Multiple research object models have been proposed to describe different aspects of the research process. Investigation/Study/Assay (ISA) is a widely used general-purpose metadata tracking framework with an associated suite of open-source software, which offers a rich description of the experiment's hypotheses and design, the investigators involved, the experimental factors and the protocols applied. The information is organised in a three-level hierarchy where an ‘Investigation’ provides the project context for a ‘Study’ (a research question), which itself contains one or more ‘Assays’ (the analytical measurements taken and the key data processing and analysis steps). Nanopublication (NP) is a research object model which enables specific scientific assertions, such as the conclusions of an experiment, to be annotated with supporting evidence, published and cited. Lastly, the Research Object (RO) is a model that enables the aggregation of the digital resources contributing to the findings of computational research, including results, data and software, as citable compound digital objects.
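As a rough illustration of the three-level ISA hierarchy described above (a simplified sketch, not the ISA model or the isatools software itself), the nesting can be mirrored with plain Python dataclasses:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, much-simplified mirror of Investigation/Study/Assay;
# the real ISA framework records far richer metadata at each level.

@dataclass
class Assay:
    measurement_type: str                      # e.g. "metagenome sequencing"
    technology: str                            # e.g. "nucleotide sequencing"
    data_files: List[str] = field(default_factory=list)

@dataclass
class Study:
    title: str                                 # the research question
    protocols: List[str] = field(default_factory=list)
    assays: List[Assay] = field(default_factory=list)

@dataclass
class Investigation:
    title: str                                 # the project context
    contacts: List[str] = field(default_factory=list)
    studies: List[Study] = field(default_factory=list)

inv = Investigation(
    title="Example investigation",
    contacts=["Example Investigator"],
    studies=[Study(
        title="Does treatment X change community composition?",
        protocols=["DNA extraction", "library preparation"],
        assays=[Assay("metagenome sequencing", "nucleotide sequencing",
                      ["sample1.fastq.gz"])],
    )],
)
print(inv.studies[0].assays[0].measurement_type)
```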
For computational reproducibility, platforms such as Taverna and Galaxy are popular and efficient ways to represent the data analysis steps in the form of reusable workflows, where the data transformations can be specified and executed in an automatic way.
In this presentation, we will address the question of whether such research object models and workflow representation frameworks can be used to assist in the peer review process, by facilitating evaluation of the accuracy of the information provided by scientific articles with respect to their repeatability.
Our case study is based on an article on a genome assembler algorithm published in GigaScience but, given the proven use of these research object models in their respective communities, we argue that the combination of models and workflow systems will improve the scholarly publishing process, making science peer-reproduced.
An overview of Text and Data Mining (ContentMining) including live demonstrations. The fundamentals – discover, scrape, normalize, facet/index, analyze, publish – are exemplified using the recent Zika outbreak. Mining covers textual and non-textual content, and examples from chemistry and phylogenetic trees are given.
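As a toy illustration of the discover → scrape → normalize → facet/index → analyse steps named above (operating on two invented sentences rather than real scraped articles), a few lines of Python convey the shape of the pipeline:

```python
import re
from collections import defaultdict

# Invented example "documents" standing in for scraped article text.
documents = {
    "doc1": "Zika virus spread rapidly; Aedes aegypti is the main vector.",
    "doc2": "The Zika outbreak prompted new phylogenetic analyses of the virus.",
}

def normalize(text):
    """Lower-case and tokenise, stripping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

# Facet/index step: build a simple inverted index from token to documents.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in normalize(text):
        index[token].add(doc_id)

# Analyse: which documents mention "zika"?
print(sorted(index["zika"]))   # ['doc1', 'doc2']
```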
Metadata and Semantics Research Conference, Manchester, UK 2015
Research Objects: why, what and how
In practice the exchange, reuse and reproduction of scientific experiments is hard, dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: codes fork, data is updated, algorithms are revised, workflows break, service updates are released. Neither should they be viewed merely as second-class artifacts tethered to publications, but as the focus of research outcomes in their own right: articles clustered around datasets, methods with citation profiles. Many funders and publishers have come to acknowledge this, moving to data sharing policies and provisioning e-infrastructure platforms. Many researchers recognise the importance of working with Research Objects, and the term has become widespread. However: what is a Research Object? How do you mint one, exchange one, build a platform to support one, curate one? How do we introduce them in a lightweight way that platform developers can migrate to? What is the practical impact of a Research Object Commons on training, stewardship, scholarship and sharing? How do we address the scholarly and technological debt of making and maintaining Research Objects? Are there any examples?
I’ll present our practical experiences of the why, what and how of Research Objects.
Keynote speech - Carole Goble - Jisc Digital Festival 2015
Jisc
Carole Goble is a professor in the School of Computer Science at the University of Manchester.
In this keynote, Carole offered her insights into research data management and data centres.
Reproducibility, Research Objects and Reality, Leiden 2016
Carole Goble
Presented at the Leiden Bioscience Lecture, 24 November 2016, Reproducibility, Research Objects and Reality
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. It all sounds very laudable and straightforward. BUT…..
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transfer between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns about credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in data-driven computational life sciences through examples and stories from initiatives that I am involved in, and that Leiden is involved in too, including:
· FAIRDOM, which has built a Commons for Systems and Synthetic Biology projects, with an emphasis on standards smuggled in by stealth and efforts to affect sharing practices using behavioural interventions
· ELIXIR, the EU Research Data Infrastructure, and its efforts to exchange workflows
· Bioschemas.org, an ELIXIR-NIH-Google effort to support the finding of assets.
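Purely by way of illustration (a generic schema.org Dataset description with made-up values, not a specific Bioschemas profile), the kind of machine-readable markup Bioschemas builds on looks like this when expressed from Python:

```python
import json

# A generic schema.org "Dataset" description; Bioschemas profiles refine
# schema.org types like this for life-science assets. All values are invented.
dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example kinetics dataset",
    "description": "Time-course measurements for an example enzyme assay.",
    "identifier": "https://doi.org/10.0000/example",   # hypothetical DOI
    "keywords": ["systems biology", "enzyme kinetics"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Such JSON-LD is typically embedded in the asset's landing page inside a
# <script type="application/ld+json"> element so that search engines can find it.
print(json.dumps(dataset_markup, indent=2))
```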
Results may vary: Collaborations Workshop, Oxford 2014
Carole Goble
Thoughts on computational science reproducibility with a focus on software. Given at the Software Sustainability Institute's 2014 Collaborations Workshop
Scott Edmunds slides for class 8 from the HKU Data Curation (module MLIM7350 from the Faculty of Education) course covering science data, medical data and ethics, and the FAIR data principles.
Jean-Claude Bradley presents on "Peer Review and Science2.0: blogs, wikis and social networking sites" as a guest lecturer for the “Peer Review Culture in Scholarly Publication and Grantmaking” course at Drexel University. The main thrust of the presentation is that peer review alone is not capable of coping with the increasing flood of scientific information being generated and shared. Arguments are made to show that providing sufficient proof for scientific findings does scale and weakens the tragedy of the trusted source cascade.
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article:
Carole Goble, Better Software, Better Research, IEEE Internet Computing, vol. 18, no. 5, pp. 4-8, Sept.-Oct. 2014 (IEEE Computer Society).
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
Recommendations for infrastructure and incentives for open science, presented to the Research Data Alliance 6th Plenary. Presenter: William Gunn, Director of Scholarly Communications for Mendeley.
The ELIXIR FAIR Knowledge Ecosystem for practical know-how: RDMkit and FAIRCo...
Carole Goble
Presented at the FAIR Data in Practice Symposium, 16 May 2023, at BioITWorld Boston. https://www.bio-itworldexpo.com/fair-data. ELIXIR, the European Research Infrastructure for life science data, is an inter-governmental organisation coordinating, integrating and sustaining FAIR data and software resources across its 23 nations. To help advise users, data stewards, project managers and service providers, ELIXIR has developed complementary community-driven, open knowledge resources for guiding FAIR Research Data Management (RDMkit) and providing FAIRification recipes (FAIRCookbook). 150+ people have contributed content so far, including representatives of the pharmaceutical industry.
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
GigaScience, BGI Hong Kong
Scott Edmunds talk at the HUPO congress in Geneva, September 6th 2011 on GigaScience - a journal or a database? Lessons learned from the Genomics Tsunami.
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
dgarijo
Scientists publish computational experiments in ways that do not facilitate reproducibility or reuse. Significant domain expertise, time and effort are required to understand scientific experiments and their research outputs. In order to improve this situation, mechanisms are needed to capture the exact details and the context of computational experiments. Only then will intelligent systems be able to help researchers understand, discover, link and reuse the products of existing research.
In this presentation I will introduce my work and vision towards enabling scientists to share, link, curate and reuse their computational experiments and results. In the first part of the talk, I will present my work on capturing and sharing the context of scientific experiments by using scientific workflows and machine-readable representations. Thanks to this approach, experiment results are described in an unambiguous manner, have a clear trace of their creation process and include a pointer to the sources used for their generation. In the second part of the talk, I will describe examples of how the context of scientific experiments may be exploited to browse, explore and inspect research results. I will end the talk by presenting new ideas for improving and benefiting from the capture of the context of scientific experiments and how to involve scientists in the process of curating and creating abstractions on available research metadata.
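As a rough sketch of the kind of machine-readable experiment trace discussed here (loosely borrowing W3C PROV terms; the identifiers and file names are invented, and this is not the specific representation used in the talk), one might record which activity generated which result from which inputs:

```python
import json

# A tiny, illustrative provenance trace using PROV-style relations.
trace = {
    "entities": {
        "data/raw_reads.fastq":   {"type": "input data"},
        "results/assembly.fasta": {"type": "result"},
    },
    "activities": {
        "run_assembly": {
            "used": ["data/raw_reads.fastq"],          # inputs consumed
            "wasAssociatedWith": "workflow/assembly.cwl",
            "parameters": {"k_mer": 31},
        },
    },
    "generation": {
        "results/assembly.fasta": {"wasGeneratedBy": "run_assembly"},
    },
}

# A reader (human or machine) can follow the trace from result back to inputs.
print(json.dumps(trace, indent=2))
```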
Scott Edmunds talk at G3 (Great GigaScience & Galaxy) workshop: Open Data: th...
GigaScience, BGI Hong Kong
Scott Edmunds talk at G3 (Great GigaScience & Galaxy) workshop: Open Data: the reproducibility crisis, and the need for transparency. Melbourne University 19th September 2014
This is a keynote that I gave at the polyweb workshop on the state of the art of data science reproducibility. In the first part I review tools that have been developed over the last few years. In the second part, I focus on proposals that I have been involved in to facilitate workflow reproducibility and preservation.
2014-09-01 Taverna tutorial in Bonn: Advanced Taverna features.
List handling, cross product and dot product (see the sketch after this list).
Looping asynchronous services.
Control links.
Retries.
Parallel service invocation.
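For readers unfamiliar with the terminology, the cross product and dot product iteration strategies listed above correspond roughly to Cartesian pairing and index-wise pairing of list inputs. A plain Python sketch of the difference (not Taverna code, and with invented input lists):

```python
from itertools import product

genes = ["BRCA1", "TP53"]
databases = ["uniprot", "ensembl"]

# Dot product: pair items index-wise, like zip() -> 2 service invocations.
dot_pairs = list(zip(genes, databases))
print(dot_pairs)    # [('BRCA1', 'uniprot'), ('TP53', 'ensembl')]

# Cross product: every combination, like a Cartesian product -> 4 invocations.
cross_pairs = list(product(genes, databases))
print(cross_pairs)  # all four gene/database combinations
```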
2014-09-01 Taverna tutorial in Bonn: Using RESTful Web Services from Taverna. Building on the "REST and Biocatalogue" tutorial, this tutorial expands on the various REST configuration options and different content types that can be retrieved.
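The different content types mentioned here come down to ordinary HTTP content negotiation. A hedged Python sketch of the underlying idea, using the requests library against a hypothetical endpoint (this is not Taverna's own REST activity configuration):

```python
import requests

# Hypothetical service URL; a real tutorial would use a registered service.
url = "https://example.org/api/protein/P12345"

# Ask the service for JSON ...
r_json = requests.get(url, headers={"Accept": "application/json"})
print(r_json.status_code, r_json.headers.get("Content-Type"))

# ... or ask the same resource for XML; the server returns what it supports.
r_xml = requests.get(url, headers={"Accept": "application/xml"})
print(r_xml.status_code, r_xml.headers.get("Content-Type"))
```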
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
What are greenhouse gases and how many gases affect the Earth?
moosaasad1975
What are greenhouse gases, how do they affect the Earth and its environment, what is the future of the environment and the Earth, and how do they affect weather and climate?
Seminar on U.V. Spectroscopy
SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that measures the amount of light absorbed by the analyte.
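The quantitative basis for such measurements is the Beer-Lambert law (a standard textbook relation, added here for context rather than taken from the slides), which relates the measured absorbance to the analyte concentration:

```latex
A \;=\; \log_{10}\!\frac{I_{0}}{I} \;=\; \varepsilon \, c \, \ell
```

where I0 is the incident intensity, I the transmitted intensity, ε the molar absorptivity, c the concentration and ℓ the path length.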
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing by which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) has been reported in a wide range of eukaryotes, including worms, insects, mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993 Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing of another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22nt and 61nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that are causing the silencing by RNA-RNA interactions.
Types of RNAi (non-coding RNA)
miRNA
Length: 23-25 nt
Trans-acting
Binds its target mRNA with mismatches
Translation inhibition
siRNA
Length: 21 nt
Cis-acting
Binds its target mRNA with a perfectly complementary sequence
piRNA (Piwi-interacting RNA)
Length: 25 to 36 nt
Expressed in germ cells
Regulates transposon activity
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
THE RISC COMPLEX:
RISC is a large (>500 kDa) RNA multi-protein binding complex which triggers mRNA degradation in response to siRNA.
Unwinding of the double-stranded siRNA by an ATP-independent helicase.
The active component of RISC is the Argonaute (Ago) protein, an endonuclease which cleaves the target mRNA.
DICER: an endonuclease of the RNase III family
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN:
1. PAZ (PIWI/Argonaute/Zwille) domain: recognition of the target mRNA.
2. PIWI (P-element induced wimpy testis) domain: breaks the phosphodiester bond of the mRNA (RNase H-like activity).
miRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
They monitor common gases, weather parameters and particulates.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
Richard's entangled adventures in wonderland
Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4–0.9 µm) and novel JWST images with 14 filters spanning 0.8–5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at >2.3 µm to construct an ultradeep image, reaching as deep as ≈31.4 AB mag in the stack and 30.3–31.0 AB mag (5σ, r = 0.1″ circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5–15. These objects show compact half-light radii of R_1/2 ~ 50–200 pc, stellar masses of M⋆ ~ 10^7–10^8 M⊙, and star-formation rates of SFR ~ 0.1–1 M⊙ yr^-1. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward-modeling approach to infer the properties of the evolving luminosity function without binning in redshift or luminosity that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ~2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for evolution of the dark matter halo mass function.
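As a side note for readers less familiar with the quantities being inferred, the sketch below shows how a UV luminosity density follows from a Schechter-form luminosity function, the parameterisation commonly used in studies like this one; the parameter values are placeholders, not the paper's fitted values, and the printed "decline factor" is purely illustrative (it simply reflects the assumed drop in normalisation).

    import numpy as np

    def schechter_mag(M, phi_star, M_star, alpha):
        # Schechter luminosity function expressed in absolute UV magnitude.
        x = 10 ** (-0.4 * (M - M_star))
        return 0.4 * np.log(10) * phi_star * x ** (alpha + 1) * np.exp(-x)

    def uv_luminosity_density(phi_star, M_star, alpha, M_bright=-24.0, M_faint=-17.0):
        # Integrate phi(M) * L_nu(M) over magnitude (erg s^-1 Hz^-1 Mpc^-3).
        M = np.linspace(M_bright, M_faint, 2001)
        L_nu = 10 ** (0.4 * (51.6 - M))   # AB absolute magnitude -> L_nu in erg/s/Hz
        return np.sum(schechter_mag(M, phi_star, M_star, alpha) * L_nu) * (M[1] - M[0])

    # Placeholder parameters: only the normalisation differs between the two epochs,
    # so the luminosity density scales by the same factor.
    rho_z12 = uv_luminosity_density(phi_star=2.0e-4, M_star=-20.5, alpha=-2.0)
    rho_z14 = uv_luminosity_density(phi_star=0.8e-4, M_star=-20.5, alpha=-2.0)
    print(f"illustrative decline factor: {rho_z12 / rho_z14:.1f}")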
The beauty of workflows and models
1. The beauty of
workflows and models
Workflows for research.
Reproducible research.
Professor Carole Goble
The University of Manchester, UK
The Software Sustainability Institute
carole.goble@manchester.ac.uk
@caroleannegoble
RDMF Meeting, Westminster, 20 June 2014
2. Scientific publications have at least
two goals:
(i) to announce a result and
(ii) to convince readers that the
result is correct
…..
papers in experimental science
should describe the results and
provide a clear enough protocol to
allow successful repetition and
extension
Jill Mesirov
Accessible Reproducible Research
Science 22 Jan 2010: 327(5964): 415-416
DOI: 10.1126/science.1179653
Virtual Witnessing*
*Leviathan and the Air-Pump: Hobbes, Boyle, and the
Experimental Life (1985) Shapin and Schaffer.
4. “An article about computational
science in a scientific publication
is not the scholarship itself, it is
merely advertising of the
scholarship. The actual
scholarship is the complete
software development
environment, [the complete
data] and the complete set of
instructions which generated the
figures.”
David Donoho, “Wavelab and Reproducible
Research,” 1995
datasets
data collections
standard operating
procedures
software
algorithms
configurations
tools and apps
codes
workflows
scripts
code libraries
services,
system software
infrastructure,
compilers
hardware
Morin et al Shining Light into Black Boxes
Science 13 April 2012: 336(6078) 159-160
Ince et al The case for open computer programs
Nature 482, 2012
6. Biodiversity
marine monitoring and health assessment
ecological niche modelling
Data Intensive Science
Collaborative Science
Pilumnus hirtellus, enclosed sea problem (Ready et al., 2010)
Sarah Bourlat http://www.biovel.eu
7.
8. Analytical cycle
Data discovery
Data assembly, cleaning, and refinement
Ecological Niche Modeling
Statistical analysis
Data collection
Insights
Scholarly Communication & Reporting
9. BioSTIF
method
instruments and laboratory
Workflows: capture the steps
assembly & interoperability
shielding & optimising
flexible variant reuse
pipelines & exploration
repetition & comparison
record & set-up
provenance collection
report & embed
multi-code and multi-
resource experiments
in-house and external
workflow mgment systems
materials
http://www.taverna.org.uk
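To make "record & set-up" and "provenance collection" concrete, here is a toy sketch in plain Python of a three-step pipeline whose runner records each step's inputs, outputs and timing; workflow management systems such as Taverna do this systematically and in far richer detail, so treat this only as an illustration of the idea.

    import datetime
    import json

    provenance = []   # a simple run log standing in for a provenance trace

    def step(name, func, *inputs):
        # Run one workflow step and record what went in, what came out, and when.
        started = datetime.datetime.now().isoformat()
        output = func(*inputs)
        provenance.append({"step": name, "inputs": inputs, "output": output, "started": started})
        return output

    # Toy three-step pipeline: fetch -> clean -> analyse.
    raw = step("fetch", lambda: [1, 2, None, 4])
    clean = step("clean", lambda xs: [x for x in xs if x is not None], raw)
    mean = step("analyse", lambda xs: sum(xs) / len(xs), clean)

    print(mean)
    print(json.dumps(provenance, indent=2, default=str))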
15. Preservation Planning &Watch
continuous preservation management
Environment and
users
Repository
access
ingest
harvest
Monitored environment
and usersWatch
Planning Operations
create/
reevaluate
plans
deploy plan
monitored
actions
Monitored content and events
execute action plan
policies
SCOUT
c3po
PLATO
Taverna
workflows
RODA
Long term preservation of digital data. Maintaining scans of newspapers, books,
records of data; Metadata maintenance; large and automated.
Preservation Policy: Collection level Control Policy: Low level actions & constraints
http://www.scape-project.eu/
17. Workflows (and Scripts and Models) are….
…provenance of data
…general technique for describing and enacting a process
…precise, unambiguous, transparent protocols and records.
…often complex, so they need explaining.
…often challenging and expensive to develop.
…know-how and best practice.
…collaborations.
…first class citizens of research
…support the process of research
19. Reproducibility = Hard Work
Code in SourceForge under GPLv3: http://soapdenovo2.sourceforge.net/ (>5000 downloads)
http://homolog.us/wiki/index.php?title=SOAPdenovo2
Data sets and analyses linked to DOIs
Open-Paper, Open-Review: DOI:10.1186/2047-217X-1-18, >11000 accesses
Open-Code: 8 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-Pipelines, Open-Workflows
DOI:10.5524/100038
Open-Data: 78GB CC0 data
Enabled code to be picked apart by bloggers in a wiki
[Scott Edmunds]
22. Data Operations in Workflows in the Wild
Analysis of 260 publicly available workflows in Taverna, WINGS, Galaxy and Vistrails
Garijo et al Common Motifs in Scientific Workflows: An Empirical Analysis, FGCS, 36, July 2014, 338–351
23. Research Method Stewardship
Management, Publishing, Preservation
Workflows &
Scripts
Services &
Codes
Standard
Operating
Procedures
Descriptions
Standards
Portal
Different systems
Formats
Web Services
Code Libraries
Executables
Models &
Algorithms
Mark-up Languages,
Mathematical
descriptions
Standards
BioModels
26. Victoria Stodden, AMP 2011 http://www.stodden.net/AMP2011/,
Special Issue Reproducible Research Computing in Science and Engineering July/August 2012, 14(4)
Howison and Herbsleb (2013) "Incentives and Integration In Scientific Software Production" CSCW 2013.
27. Data Stewardship
Making practices
Sustainability
Management planning
Deposition
Long term access
Credit
Journals
Licensing
Open source / access
Best Practices for Scientific Computing http://arxiv.org/abs/1210.0530
1st Workshop on Maintainable Software Practices in e-Science – e-Science 2012
Stodden, Reproducible Research Standard, Intl J Comm Law & Policy, 13 2009
Software, Models, Workflows, Services
30. Jennifer Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012
Software release paradigm
Some of your data isn’t data
Not a static document paradigm
• Release research
• Methods in motion
• Versioning
• Forks & merges
• F1000, PeerJ, GitHub…
31. Pivot around method / software / data
rather than paper
Citation semantics: software as was? software as is?
The multi-dimensional paper
33. Replication Gap
1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
Out of 18 microarray papers, results from 10 could not be reproduced
35. repeat / replicate / reproduce / reuse
Drummond C, Replicability is not Reproducibility: Nor is it Good Science, online
Peng RD, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.
Experiment: Methods (techniques, algorithms, spec. of the steps); Materials (datasets, parameters, algorithm seeds)
Setup: Instruments (codes, services, scripts, underlying libraries); Laboratory (sw and hw infrastructure, systems software, integrative platforms)
36. repeat: same experiment, same set up, same lab
replicate: same experiment, same set up, different lab
reproduce: same experiment, different set up
reuse: different experiment, some of same
validate
Drummond C, Replicability is not Reproducibility: Nor is it Good Science, online
Peng RD, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.
37. Design, Modelling, Execution, Collection, Cleaning, Result Analysis, Prediction, Monitoring, Publish / Report, Peer Review, Peer Reuse
Can I repeat & defend my method?
Can I review / reproduce and compare my results / method with your results / method?
Can I review / replicate and certify your method?
Can I transfer your results into my research and reuse this method?
Research Report
* Adapted from Mesirov, J. Accessible Reproducible Research Science 327(5964), 415-416 (2010)
41. RDataTracker and DDG Explorer
Build into the workflows of research….
[Barbara S. Lerner and Emery R. Boose]
42. Components
Dependencies
Change
• 35 kinds of annotations
• 5 Main Workflows
• 14 Nested Workflows
• 25 Scripts
• 11 Configuration files
• 10 Software dependencies
• 1 Web Service
• Dataset: 90 galaxies
observed in 3 bands
• Multiple platforms
• Multiple systems
José Enrique Ruiz (IAA-CSIC)
Galaxy Luminosity Profiling
43. specialist codes
libraries, platforms, tools
services
(cloud)
hosted
services
commodity
platforms
data collections
catalogues software
repositories
my data
my process
my codes
integrative
frameworks
gateways
45. Instrument Entropy
all experiments become less reproducible
Zhao, Gomez-Perez, Belhajjame, Klyne, Garcia-
Cuesta, Garrido, Hettne, Roos, De Roure and
Goble. Why workflows break - Understanding
and combating decay in Taverna workflows, 8th
Intl Conf e-Science 2012
Mitigate
Detect, Repair
Preserve
Partial replication
Approx reproduce
Verification
Benchmarks
46. Environmental Ecosystem
Joppa et al SCIENCE 340 May 2013; Morin et al Science 336 2012
Black boxes
Mixed systems, mixed stewardship
Distributed,
hosted systems
47. Workflow Planning &
Watch of Workflows
Watch
Operations
Planning
Env & Users
Repository
plan
deploy
monitor
access
ingest,
harvest
Decay, Service Deprecation,
Data source monitoring,
Checklists,
Minimal Models
Workflows, myExperiment
Workflows for managing workflows
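The "Watch" side of this picture boils down to monitoring the external services and data sources a workflow depends on and flagging decay before a run fails. The sketch below is a minimal, generic traffic-light checker with placeholder URLs; it is not the Wf4Ever/SCAPE tooling, just the shape of the idea.

    import requests

    # Placeholder service endpoints a workflow might call; swap in the real ones.
    SERVICES = {
        "occurrence data": "http://api.gbif.org/v1/occurrence/search?limit=1",
        "name resolver": "http://example.org/taxon-resolver/ping",
    }

    def traffic_light(url):
        # Classify a dependency as GREEN / AMBER / RED from a single probe request.
        try:
            status = requests.get(url, timeout=10, allow_redirects=False).status_code
        except requests.RequestException:
            return "RED (unreachable)"
        if status == 200:
            return "GREEN"
        if status in (301, 302, 404, 410):
            return "AMBER/RED (moved or deprecated?)"
        return f"AMBER (HTTP {status})"

    for name, url in SERVICES.items():
        print(f"{name:20s} {traffic_light(url)}")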
48. portability
variability tolerance
[Adapted Freire, 2013]
preservation
packaging
provenance
gather dependencies
capture steps
track & keep results
versioning
host
service
Open Source/Store
Sci as a Service
Integrative fws
Virtual Machines
Recompute, limited
installation, Black Box
Byte execution, copies
Descriptive read,
White Box
Archived record
Read & Run, Co-location
No installation
Portable Package
White Box, Installation
Archived record
50. Levels of Reproducibility
Coverage: how much of an experiment is reproducible (Original Experiment, Similar Experiment, Different Experiment)
Portability
Depth: how much of an experiment is available (Figures + Data; Binaries + Data; Binaries + Data + Dependencies; Source Code / Workflow + Data; Source Code / Workflow + Data + Dependencies; Virtual Machine with Binaries + Data + Dependencies; Virtual Machine with Source Code / Workflow + Data + Dependencies)
[Freire, 2014]
52. (Dynamic) Research Objects
• Bundle and relate multi-hosted digital resources of a scientific experiment or
investigation using standard mechanisms, Currency of exchange
• Exchange, Releasing paradigm for publishing
http://www.researchobject.org/
54. (Dynamic) Research Objects
OAI-ORE, W3C OAM
H. Van de Sompel et al. Persistent Identifiers for Scholarly Assets and the Web: The Need for an Unambiguous Mapping, 9th International Digital Curation Conference; Trusty URLs
57. Sys Bio Research Object
Research Object Bundle: Adobe UCF, ORE, PROV, ODF
• Aggregation
• Annotations/provenance
• Ad-hoc domain-specific specification
OMEX archive
Systems Biology: needed a common archive format for reuse across tools
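For orientation, this is roughly what such a bundle looks like on disk: a ZIP that aggregates the experiment's files alongside a manifest that describes and annotates them. The sketch below uses only the Python standard library and follows the Research Object Bundle layout as I understand it (a mimetype entry plus .ro/manifest.json); treat the exact paths, media type and manifest fields as illustrative rather than normative.

    import json
    import zipfile

    # Illustrative manifest: identity of the bundle, what it aggregates (local files
    # and an external resource by reference), and an annotation on one of them.
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [
            {"uri": "data/results.csv", "mediatype": "text/csv"},
            {"uri": "workflow/pipeline.t2flow"},
            {"uri": "http://example.org/external-dataset"},
        ],
        "annotations": [
            {"about": "workflow/pipeline.t2flow", "content": "annotations/provenance.prov.ttl"}
        ],
    }

    with zipfile.ZipFile("experiment.bundle.zip", "w") as bundle:
        # The media type is conventionally stored uncompressed as the first entry.
        bundle.writestr("mimetype", "application/vnd.wf4ever.robundle+zip",
                        compress_type=zipfile.ZIP_STORED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        bundle.writestr("data/results.csv", "id,value\n1,0.42\n")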
58. The research lifecycle
IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION
Authoring
Tools
Lab
Notebooks
Data
Capture
Software
Repositories
Analysis
Tools
Visualization
Scholarly
Communication
Commercial &
Public Tools
Git-like
Resources
By Discipline
Data Journals
Discipline-
Based Metadata
Standards
Community Portals
Institutional Repositories
New Reward
Systems
Commercial Repositories
Training
[Phil Bourne]
63. Acknowledgements
• David De Roure
• Tim Clark
• Sean Bechhofer
• Robert Stevens
• Christine Borgman
• Victoria Stodden
• Marco Roos
• Jose Enrique Ruiz del Mazo
• Oscar Corcho
• Ian Cottam
• Steve Pettifer
• Magnus Rattray
• Chris Evelo
• Katy Wolstencroft
• Robin Williams
• Pinar Alper
• C. Titus Brown
• Greg Wilson
• Kristian Garza
• Donal Fellows
• Wf4ever, SysMO, BioVel, UTOPIA and myGrid teams
• Juliana Freire
• Jill Mesirov
• Simon Cockell
• Paolo Missier
• Paul Watson
• Gerhard Klimeck
• Matthias Obst
• Jun Zhao
• Pinar Alper
• Daniel Garijo
• Yolanda Gil
• James Taylor
• Alex Pico
• Sean Eddy
• Cameron Neylon
• Barend Mons
• Kristina Hettne
• Stian Soiland-Reyes
• Rebecca Lawrence
• Alan Williams
Editor's Notes
how, why and what matters
benchmarks for codes
plan to preserve
repair on demand
description persists
use frameworks
partial replication
approximate reproduction
verification
/ Minute Taking
It examines the debate between Robert Boyle and Thomas Hobbes over Boyle's air-pump experiments in the 1660s. In 2005, Shapin and Schaffer were awarded the Erasmus Prize for this work.
Ballast water in shipping lanes
invasive species risk assessments
When the tools are integrated on the Lifewatch portal, we will have access to the environmental data through the workflow.
Global marine layers used in our study came from Bio-Oracle (http://www.biooracle.ugent.be/) with a resolution of 5 arc-minutes (9.2 km) (Tyberghein et al. 2012),
and were used to study abiotic factors such as mean sea surface temperature (SST) (°C), mean surface salinity (SSS) (PSU) and mean photosynthetically available
radiation (PAR) (Einstein/m2/day). Other marine layers used in our analyses came from AquaMaps (http://www.aquamaps.org/download/main.php) with a resolution of 30 arc-minutes, i.e. 1 km2 (Kaschner et al. 2010).
If layers are needed from SMHI, they can be uploaded via the workflow.
Not all things are batch
VPH-Share opens a VNC connection spawned instance.
Taverna Interaction Service
Users interact with a workflow (wherever it is running) in a web browser.
Interaction Service Workbench Plug-in
Collecting, processing and management of big data
Metagenomics, genotyping, genome sequencing, phylogenetics, gene expression analysis, proteomics, metabolomics, auto sampling
Analytics and management of broad data from many different disciplines
Coupling analytical metagenomics with meaningful ecological interpretations
Continuous development of novel methods and technologies
Functional trait-based ecology approach proposed by Barberán et al. 2012.
Example of an extreme of the software issue
Multi-code experiments
platform
libraries, plugins
Infrastructure
components, services
infrastructure
Variety:
common metadata models
rich metadata collection
ecosystem
Validity:
auto record of experiment set-up, citable and shareable descriptions
curation, publication,
mixed stewardship
third-party availability
model executability
citability, QC/QA, trust.
Social issues of understanding the culture of risk, reward, sharing and reporting.
C3PO content profiling tool
What kind of platform are my users running on
Purpose: show how the system components work together and support the overall conceptual framework we are building on.
The example story that we use is a real-world case from the British Library where a plan was created previously with Plato3
This kind of scenario is very common, of course - it could have been one of many other organisations that have similar, in particular digitized, collections. In parallel to the Plato case study we are relying on, we had conducted several others that led to different conclusions; and the same problem structure applies to the case study we just conducted in SCAPE with the SB-DK. Hence, the tools demonstrated here apply equally across many scenarios - the concepts used are completely scenario-agnostic, and decision support applies equally across scenarios, content sets, and organisations.
Running migration tools
Designing for reuse is hard
No closing the feedback loop
Off site collaboration (behind closed doors and invisible)
Poor reciprocity – find, grab, piss off
Not much reuse (too specific)
Components
Enclaves and groups are better
File and forget
Most workflows decay.
Group enclaves
Personal archives rather than sharing
Credit…..
Designing for reuse
Curators Credit?
Funded?
Blue-Collar.
Personal and institutional visibility
Scholarly (citation metrics)
Federate workloads
Unpopular with the big data providers.
Hell is other people’s metadata – Jean-Paul Sartre
Invisible vs nonintrusive (discreet but controlled)
credit
Takes time, fears for description
Non-intrusiveness vs control and invisible
Off site collaboration
Not much reuse (too specific)
Components
Enclaves and groups are better
File and forget
Most workflows decay.
Designing for stewardship
“As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software”
An aside
TOLERANCE
The explicit documentation of designed-in and anticipated variation
methods matter, results may vary
Added afterwards.
1. Required as condition of publication, certain exceptions permitted (e.g. preserving confidentiality of human subjects)
2. Required but may not affect editorial/publication decisions
3. Explicitly encouraged/addressed; may be reviewed and/or hosted
4. Implied
5. No mention
59% of papers in the 50 highest-IF journals comply with (often weak) data sharing rules.
Alsheikh-Ali et al Public Availability of Published Research Data in High-Impact Journals. PLoS ONE 6(9) 2011
conceptual replication “show A is true by doing B rather than doing A again”
verify but not falsify
[Yong, Nature 485, 2012]
The letter or the spirit of the experiment
indirect and direct reproducibility
Reproduce the same affect? Or same result?
Concept drift towards bottom.
As an old ontologist I wanted an ontology or a framework or some sort of property based classification.
Reproducible Research Environment
Integrated infrastructure for producing and working with reproducible research.
make reproducible, born reproducible
Reproducible Research Publication Environment Distributing and reviewing; credit; licensing etc
FAIRport* Reproducibility: Find, Access, Interoperate, Reuse, Port. Preservation - Lots of copies keeps stuff safe
Stability dimension
Add two more dimensions to our classification of themes
A virtual machine (VM) is a software implementation of a machine (i.e. a computer) that executes programs like a physical machine. Virtual machines are separated into two major classifications, based on their use and degree of correspondence to any real machine: system virtual machines and process virtual machines.
Overlap of course
Static vs dynamic.
GRANULARITY
This model for audit and target of your systems
overcoming data type silos
public integrative data sets
transparency matters
cloud
Recomputation.org
Reproducibility by Execution: Run It
Reproducibility by Inspection: Read It
Availability – coverage
Gathered: scattered across resources, across the paper and supplementary materials
Availability of dependencies: Know and have all necessary elements
Change management: Data? Services? Methods? Prevent, Detect, Repair.
Execution and Making Environments: Skills/Infrastructure to run it: Portability and the Execution Platform (which can be people…), Skills/Infrastructure for authoring and reading
Description: Explicit: How, Why, What, Where, Who, When, Comprehensive: Just Enough, Comprehensible: Independent understanding
Documentation vs Bits (VMs) reproducibility
Learn/understand (reproduce and validate, reproduce using different codes) vs Run (reuse, validate, repeat, reproduce under different configs/settings)
Electronic lab notebooks
Issues: non-secure html using http inside secure https iframe in ipython doesn’t work – need to update interaction service to deliver on https.
I’ll come back to iPython notebook later
Client package (currently under development, will be available via Python Package Index (PyPI) for installation for all major platforms (Linux, Mac, Windows)
Allows for calling Taverna Workflows available via Taverna Player
List of available workflows can be retrieved from the BioVel Portal (Taverna Player)
Users can enter the input values using Ipython Notebook (these values can be then results of the code previously run in the Notebook
The outputs from running the workflow (the results) are returned to the Notebook and processed further
The full workflow run and the overall process (provenance) can be saved in the Ipython Notebook format
For an example (static for now), see
http://nbviewer.ipython.org/urls/raw.githubusercontent.com/myGrid/DataHackLeiden/alan/Player_example.ipynb?create=1
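A sketch of what that notebook-side client call could look like is below. The base URL, endpoint paths, payload shapes and credentials are assumptions made for illustration; the real Taverna Player REST API and the BioVeL portal should be consulted for the actual interface.

    import time
    import requests

    PLAYER = "https://portal.biovel.eu"           # assumed base URL
    AUTH = ("user@example.org", "app-password")   # assumed credentials

    # 1. List the workflows the player exposes (assumed endpoint).
    workflows = requests.get(f"{PLAYER}/workflows.json", auth=AUTH).json()

    # 2. Start a run of the first workflow with inputs supplied from the notebook
    #    (assumed payload shape).
    run = requests.post(
        f"{PLAYER}/runs.json", auth=AUTH,
        json={"run": {"workflow_id": workflows[0]["id"],
                      "inputs": {"species_name": "Pilumnus hirtellus"}}},
    ).json()

    # 3. Poll until the run finishes, then fetch the outputs for further processing.
    while requests.get(f"{PLAYER}/runs/{run['id']}.json", auth=AUTH).json()["state"] != "finished":
        time.sleep(10)
    outputs = requests.get(f"{PLAYER}/runs/{run['id']}/outputs.json", auth=AUTH).json()
    print(outputs)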
the goldilocks paradox
“the description needed to make an experiment reproducible is too much for the author and too little for the reader”
just enough just in time
No just ONE system. Many systems.
Bewildering range of standards for formats, terminologies and checklists
The Economics of Curation
Curate Incrementally
Early and When Worthwhile
Ramps: Automation & Integrated Tools
Copy editing Methods
the reproducibility ecosystem
For peer and author
complicated and scattered - super fragmentation – supplementary materials, multi-hosted, multi-stewarded.
we must use the right platforms for the right tools
The trials and tribulations of review
Its Complicated
www.biostars.org/
Apache
Service based Science / Science as a Service
Shapin 1984
Boyle’s instruments
Check numbers of workflows
The how, why and what
plan to preserve
prepare to repair
description persists
common frameworks
partial replication
approximate reproduction
verification
benchmarks for codes
“lets copy the box that the internet is in”
black boxes
closed codes & services, proprietary licences, magic cloud services, manual manipulations, poor provenance/version reporting, unknown peer review, mis-use, platform calculation dependencies
Mixed systems, mixed stewardship, mixed provenance, different resources
Decay monitoring, traffic lights contribute to Watch/Monitoring
Customise to User characteristics, e.g. their version of workflow engine
myExp and RODL provide both Operations and Repository
Next steps:
Planning component
Explicit roles of archival and curation organisations
Scientific users play different roles but it tends to be the same user
Considering SCAPE brings up some questions about wf4ever
What are the high level policies for workflow preservation?
Who’s playing the role of, e.g. Curator or Archival Organisation?
So far, use cases tend to have the same individual playing multiple roles rather than different individuals.
Who’s our Designated Community?
Meta-Question: How do we answer these questions?
Opportunity to develop workflow preservation policies.
ENCODE threads
exchange between tools and researchers
bundles and relates digital resources of a scientific experiment or investigation using standard mechanisms
Three principles underlie the approach:
Identity
Referring to resources (and the aggregation itself)
Aggregation
Describing the aggregation structureand its constituent parts
Annotation
Associating information with aggregated resources.
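Expressed concretely, those three principles map onto a handful of RDF statements. The sketch below uses rdflib with the ORE and Open Annotation vocabularies; the URIs are invented and the property choices (for instance oa:bodyValue for a textual note) are illustrative assumptions rather than the normative RO model.

    from rdflib import Graph, Literal, Namespace, URIRef

    ORE = Namespace("http://www.openarchives.org/ore/terms/")
    OA = Namespace("http://www.w3.org/ns/oa#")

    g = Graph()

    # Identity: the aggregation itself, and its parts, are referred to by URIs.
    ro = URIRef("http://example.org/ro/1")
    data = URIRef("http://example.org/ro/1/data.csv")
    workflow = URIRef("http://example.org/ro/1/workflow.t2flow")

    # Aggregation: the Research Object aggregates its constituent resources.
    g.add((ro, ORE.aggregates, data))
    g.add((ro, ORE.aggregates, workflow))

    # Annotation: associate information with an aggregated resource.
    annotation = URIRef("http://example.org/ro/1/ann/1")
    g.add((annotation, OA.hasTarget, workflow))
    g.add((annotation, OA.bodyValue, Literal("Workflow validated against the 2014 dataset")))

    print(g.serialize(format="turtle"))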
ROs as Information Packages in OAIS
DOIs – the resolution of a DOI URL is a landing page, not RDF metadata for a client. (designed for people)
HTTP URIs provide both access and identification
PIDs: Persistent Identifiers (e.g.DOIs) tend to resolve to human-readable landing pages
With embedded links to further (possibly machine-readable) resources
ROs seen as non-information resources with descriptive (RDF) metadata
Redirection/negotiation
Standard patterns for Linked Data resources
Bidirectional mappings between URIs and PIDs
Versioning through, e.g. Memento
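A minimal sketch of the redirection/negotiation pattern follows: the same DOI resolves to a human-readable landing page by default and to structured metadata when the client asks for it via an Accept header. DOI content negotiation is a real service, but the media type used here and the behaviour for this particular DOI (one taken from the slides above) should be treated as assumptions.

    import requests

    doi = "https://doi.org/10.1186/2047-217X-1-18"   # example DOI from the slides

    # Default resolution: follow redirects to the human-readable landing page.
    landing = requests.get(doi)
    print(landing.url)

    # Negotiated resolution: ask for machine-readable metadata instead.
    metadata = requests.get(doi, headers={"Accept": "application/vnd.citationstyles.csl+json"})
    print(metadata.json().get("title"))   # structured metadata, if negotiation is supported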
http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5
Our collaboration with ISA/GigaScience/nanopublication is finally being written up and will be submitted to ECCB this Friday. We will upload a copy to Arxiv after the deadline. - We will continue our workshop at ISMB, with BioMED Central. And Kaitlin will also join us on the Panel. You can find more details about agenda and panel planning in other emails.
Posted on December 11, 2013 by Kaitlin Thaney
Part of the Science Lab’s mission is to work with other community members to build technical prototypes that move science on the web forward. In particular, we want to show that many problems can be solved by making existing tools and technology work together, rather than by starting from scratch. The reason behind that is two-fold: (1) most of the stuff needed to change behaviors already exists in some form and (2) the smartest minds are usually outside of your organization.
Our newest project extends our existing work around “code as a research object”, exploring how we can better integrate code and scientific software into the scholarly workflow. The project will test building a bridge that will allow users to push code from their GitHub repository to figshare, providing a Digital Object Identifier for the code (a gold standard of sorts in science, allowing persistent reference linking). We will also be working on a “best practice” standard (think a MIAME standard for code), so that each research object has sufficient documentation to make it possible to meaningfully use.
The project will be a collaboration of the Science Lab with Arfon Smith (Github; co-founder Zooniverse) and Mark Hahnel and his team at figshare.
Why code?
Scientific research is becoming increasingly reliant on software. But despite there being an ever-increasing amount of the academic process described in code, research communities do not yet treat these products as a fundamental component or “first-class research object” (see our background post here for more). Up until recent years, the sole “research object” in discussion was the published paper, the main means of packaging together the data, methods and research to communicate findings. The web is changing that, making it easier to unpack the components such as data and code for the community to remix, reuse, and build upon.
A number of scientists are pushing the envelope, testing out new ways of bundling their code, data and methods together. But outside of copy and pasting lines of code into a paper or, if we’re lucky, having it included in a supplementary information file alongside a paper, the code is still often separated from the documentation needed for others to meaningfully use it to validate and reproduce experiments. And that’s if it’s shared openly at all.
Code can go a long way in helping academia move toward the holy grail that is reproducibility. Unfortunately, academics whose main research output is the code they produce, often cannot get the recognition they deserve for creating it. There is also a problem with versioning: citing a paper written about software (as is common practice), gives no indication of which version, or release in GitHub terms, was used to generate the results.
What we’re testing
figshare and GitHub are two of the leading repositories for data and code (figshare for data; GitHub for code). Open data repositories like figshare have led the way in recent years in changing our practices in relation to data, championing the idea of data as a first-class research object. figshare and others such as Harvard’s Dataverse and Dryad have helped change how we think of data as part of the research process, providing citable endpoints for the data itself that the community trusts (DOIs), as well as clear licensing and making it easy to download, remix, and reuse information. One of the main objectives here is that the exact code used in particular investigations, can be accessed by anyone and persists in the form it was in when cited.
This project will test whether having a means of linking code repositories to those commonly used for data will allow for software and code to be better incorporated into existing credit systems (by having persistent identifiers for code snapshots) and how seamless we can make these workflows for academics using tools they are already familiar with. We’ve seen this tested with data over recent years, with sharing of detailed research data associated with increased citation rates (Piwowar, 2007). This culture shift of publishing more of the product of research is an ongoing process and we’re keen to see software and code elevated to the same status as the academic manuscript.
We believe that having the code closely associated with the data it executes on (or generates) will help reduce the barriers when trying to reproduce and build upon the work of others. This is already being tested in the reverse, with computational scientists nesting their data with the code in GitHub (Carl and Ethan, for example). We want to find out whether formally linking the two to help ease that pain will change behavior.
We are also looking to foster a culture of reuse with academic code. While we know there are lots of variables in this space, we are actively soliciting feedback from the community to help determine best practices for licensing and workflows.
How to get involved
(UPDATE: Want to help us test? Instead of sending us an email, how about adding yourself to this issue in our GitHub repository? More about that here.)
Mark and Arfon will be joining us for our next Mozilla Science Lab community call on December 12, 2013. Join us to hear more about the project. Have a question you’d like to ask? Add it to the etherpad!
We’re also looking for computational researchers and publishers to help us test out the implementation. Shoot us an email if you’d like to participate.
APIs – DOIs on landing pages for people.
Scientist works with local folder structure.
Version management via github.
Local tooling produces metadata description
Metadata about the aggregation (and its resources) provided by “hidden folder”
Zenodo/figshare pull snapshot from github
Providing DOIs for the aggregations
Additional release cycles can prompt new DOIs
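Pulling those notes together, a minimal sketch of the flow might look like the following: metadata describing the aggregation lives in a hidden folder inside the scientist's working directory, and a tagged git release is what the repository integration (Zenodo/figshare) snapshots and mints a DOI for. The ".ro" folder name, the metadata fields and the release commands are illustrative assumptions, not a prescribed layout, and the folder is assumed to already be a configured git repository.

    import json
    import pathlib
    import subprocess

    project = pathlib.Path("my-analysis")            # the scientist's local folder
    (project / ".ro").mkdir(parents=True, exist_ok=True)

    # Hidden-folder metadata describing the aggregation (illustrative fields).
    metadata = {
        "title": "Galaxy luminosity profiling analysis",
        "creators": ["A. Researcher"],
        "aggregates": ["workflow.t2flow", "data/galaxies.csv", "README.md"],
        "version": "1.1.0",
    }
    (project / ".ro" / "manifest.json").write_text(json.dumps(metadata, indent=2))

    # Commit and tag the release; a GitHub-to-Zenodo/figshare integration would then
    # snapshot this tag and issue a new DOI for it.
    subprocess.run(["git", "-C", str(project), "add", "-A"], check=True)
    subprocess.run(["git", "-C", str(project), "commit", "-m", "Release 1.1.0"], check=True)
    subprocess.run(["git", "-C", str(project), "tag", "v1.1.0"], check=True)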