Tim Babbitt discusses the changing context of research and scholarship due to digitization and the internet. The inputs and outputs of research are increasingly digital and complex, including data, code, presentations, and more. ProQuest has a history of preserving scholarship through microfilming and is exploring how to preserve the full range of digital scholarly outputs and their linkages in a sustainable way. Key questions include balancing new and old preservation methods and moving beyond preserving individual objects to also preserving networks and linkages between scholarly works.
ISMB/ECCB 2013 Keynote: Results may vary: what is reproducible? why do o... (Carole Goble)
Keynote given by Carole Goble on 23rd July 2013 at ISMB/ECCB 2013
http://www.iscb.org/ismbeccb2013
How could we evaluate research and researchers? Reproducibility underpins the scientific method: at least in principle, if not in practice. The willing exchange of results and the transparent conduct of research can only be expected up to a point in a competitive environment. Contributions to science are acknowledged, but not if the credit is for data curation or software. From a bioinformatics viewpoint, how far could our results be reproducible before the pain is just too high? Is open science a dangerous, utopian vision or a legitimate, feasible expectation? How do we move bioinformatics from a field where results are post-hoc "made reproducible" to one where they are pre-hoc "born reproducible"? And why, in our computational information age, do we communicate results through fragmented, fixed documents rather than cohesive, versioned releases? I will explore these questions drawing on 20 years of experience in both the development of technical infrastructure for Life Science and the social infrastructure in which Life Science operates.
The document discusses the ISA infrastructure, which provides a framework for tracking metadata in bioscience experiments from data collection to sharing in linked data clouds. The infrastructure includes a metadata syntax, open source software tools, and a user community. It allows annotation of experimental metadata, materials, and processes using ontologies to make semantics explicit and enable integration and knowledge discovery. The infrastructure is growing with over 30 public and private resources adopting it to facilitate standards-compliant sharing of investigations across life science domains.
Research Objects: more than the sum of the parts (Carole Goble)
Workshop on Managing Digital Research Objects in an Expanding Science Ecosystem, 15 Nov 2017, Bethesda, USA
https://www.rd-alliance.org/managing-digital-research-objects-expanding-science-ecosystem
Research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
A first step is to think of Digital Research Objects as a broadening out to embrace these artefacts or assets of research. The next is to recognise that investigations use multiple, interlinked, evolving artefacts. Multiple datasets and multiple models support a study; each model is associated with datasets for construction, validation and prediction; an analytic pipeline has multiple codes and may be made up of nested sub-pipelines, and so on. Research Objects (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described.
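The packaging idea described above can be sketched as a minimal manifest builder. This is a toy illustration only: the field names, context URI and resource roles below are assumptions for the sake of the example, not the Research Object specification.

```python
# A minimal sketch of bundling the components of an investigation into a
# single Research Object manifest, loosely inspired by researchobject.org.
# The dict layout, "@context" URI and role names are illustrative
# assumptions, not the official model.
import json

def make_research_object(title, creator, resources):
    """Bundle resources (uri, role, optional provenance) into one manifest."""
    return {
        "@context": "https://w3id.org/ro/context",  # illustrative context URI
        "title": title,
        "creator": creator,
        "aggregates": [
            {"uri": r["uri"], "role": r["role"],
             # provenance link; None when a resource has no stated origin
             "derivedFrom": r.get("derivedFrom")}
            for r in resources
        ],
    }

ro = make_research_object(
    "Volcano deformation study",
    "A. Researcher",
    [
        {"uri": "data/insar_stack.tif", "role": "dataset"},
        {"uri": "workflows/deformation.cwl", "role": "workflow"},
        {"uri": "results/velocity_map.tif", "role": "result",
         "derivedFrom": "data/insar_stack.tif"},
    ],
)
print(json.dumps(ro, indent=2))
```

The point of the sketch is the shape, not the schema: the manifest aggregates nested, heterogeneous artefacts and records the relationships (here, a single `derivedFrom` link) alongside them.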
Being Reproducible: SSBSS Summer School 2017 (Carole Goble)
Lecture 2:
Being Reproducible: Models, Research Objects and R* Brouhaha
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transfer between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns about credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in more depth using the FAIRDOM Platform and its support for reproducible modelling. The talk will cover initiatives and technical issues, and raise social and cultural challenges.
Being FAIR: Enabling Reproducible Data Science (Carole Goble)
Talk presented at Early Detection of Cancer Conference, OHSU, Portland, Oregon USA, 2-4 Oct 2018, http://earlydetectionresearch.com/ in the Data Science session
Findable, Accessible, Interoperable, Reusable < data | models | SOPs | samples | articles | * >. FAIR is a mantra; a meme; a myth; a mystery; a moan. For the past 15 years I have been working on FAIR in a range of Life Science projects and initiatives. Some are top-down, like the Life Science European Research Infrastructures ELIXIR and ISBE, and some are bottom-up, supporting research projects in Systems and Synthetic Biology (FAIRDOM), Biodiversity (BioVel), and Pharmacology (Open PHACTS), for example. Some have become movements, like Bioschemas, the Common Workflow Language and Research Objects. Others focus on cross-cutting approaches in reproducibility, computational workflows, metadata representation and scholarly sharing & publication. In this talk I will relate a series of FAIRy tales. Some of them are Grimm. Some have happy endings. Who are the villains and who are the heroes? What are the morals we can draw from these stories?
This document summarizes key aspects of computational research methods and the myExperiment platform. It discusses how myExperiment allows researchers to automate, share, and reuse workflows and other methods. It also addresses challenges around reproducibility, provenance, collaboration, and incentives for sharing methods. MyExperiment provides social features and aims to build a community around openly exchanging and improving computational research techniques.
Functional and Architectural Requirements for Metadata: Supporting Discovery... (Jian Qin)
The tremendous growth in digital data has led to an increase in metadata initiatives for different types of scientific data, as evident in Ball's survey (2009). Although individual communities have specific needs, there are shared goals that need to be recognized if systems are to effectively support data sharing within and across all domains. This paper considers this need, and explores systems requirements that are essential for metadata supporting the discovery and management of scientific data. The paper begins with an introduction and a review of selected research specific to metadata modeling in the sciences. Next, the paper's goals are stated, followed by the presentation of key systems requirements. The results include a base model with three chief principles: the principle of least effort, infrastructure service, and portability. The principles are intended to support "data user" tasks. Results also include a set of defined user tasks and functions, and application scenarios.
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe... (Carole Goble)
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure reproducibility of results. Funding agencies expect data (and increasingly software) management retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
The FAIR Guiding Principles for scientific data management and stewardship (http://www.nature.com/articles/sdata201618) have been an effective rallying-cry for EU and USA Research Infrastructures. The FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has 8 years of experience of asset sharing and data infrastructure, ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (de.NBI, the German Virtual Liver Network, UK SynBio centres) and PIs' labs. It aims to support Systems and Synthetic Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Systems Biology, highlighting the challenges of and approaches to sharing, credit, citation and asset infrastructures in practice. I'll also highlight recent experiments in influencing sharing behaviour using behavioural interventions.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Presented at COMBINE 2016, Newcastle, 19 September.
http://co.mbine.org/events/COMBINE_2016
A keynote given on experiences in curating workflows and web services.
3rd International Digital Curation Conference: "Curating our Digital Scientific Heritage: a Global Collaborative Challenge"
11-13 December 2007
Renaissance Hotel
Washington DC, USA
Being FAIR: FAIR data and model management SSBSS 2017 Summer School (Carole Goble)
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the Life Science community, improve access to them and related tools, support training, and create a platform for dataset interoperability. As Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform, I will show how this work relates to your projects.
[1] Wilkinson et al, The FAIR Guiding Principles for scientific data management and stewardship Scientific Data 3, doi:10.1038/sdata.2016.18
- Scientific names for species can change over time as taxonomy knowledge evolves
- An event-centric ontology model represents names and changes through time using different URIs for taxon concepts at different times
- Transition and snapshot models can then simplify the descriptions by linking concepts over time or just showing current names
- This approach allows integrated representation of taxonomy knowledge and its revisions in a computable way
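The event-centric model in the points above can be sketched in a few lines: each taxon concept gets its own URI per revision, transition events link concepts across time, and a snapshot view resolves any historical URI to the current name. The URIs, names and the single rename event below are invented examples.

```python
# Toy sketch of an event-centric taxonomy model: one URI per taxon
# concept per revision, with transition events linking them over time.
# All identifiers and names here are invented for illustration.

concepts = {
    "tx:Aus_bus_1990": {"name": "Aus bus", "valid_from": 1990},
    "tx:Aus_cus_2005": {"name": "Aus cus", "valid_from": 2005},
}

# Transition model: events that link a concept to its successor.
transitions = [
    {"event": "rename", "year": 2005,
     "from": "tx:Aus_bus_1990", "to": "tx:Aus_cus_2005"},
]

def current_concept(uri):
    """Snapshot view: follow transition events forward to the latest URI."""
    moved = True
    while moved:
        moved = False
        for t in transitions:
            if t["from"] == uri:
                uri = t["to"]
                moved = True
    return uri

latest = current_concept("tx:Aus_bus_1990")
print(latest, "->", concepts[latest]["name"])
```

Because every revision keeps its own URI, data annotated against an old name stays resolvable: the transition model preserves the full history, while the snapshot view gives integrators just the current name.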
Aspects of Reproducibility in Earth Science (Raul Palma)
The document discusses aspects of reproducibility in earth science research within the European Virtual Environment for Research - Earth Science Themes (EVEREST) project. The key objectives of EVEREST are to establish an e-infrastructure to facilitate collaborative earth science research through shared data, models, and workflows. Research Objects (ROs) will be used to capture and share workflows, processes, and results to help ensure reproducibility and preservation of earth science research. An example RO is described for mapping volcano deformation using satellite imagery and other data sources. Issues around reproducibility related to data access, software dependencies, and manual intervention in workflows are also discussed.
The Seven Deadly Sins of Bioinformatics (Duncan Hull)
Keynote talk at Bioinformatics Open Source Conference (BOSC) Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007 by Carole Goble, University of Manchester.
RARE and FAIR Science: Reproducibility and Research Objects (Carole Goble)
Keynote at JISC Digifest 2015 on Reproducibility and Research Objects in Scholarly Communication
Includes hidden slides
All material except maybe the IT Crowd screengrab is reusable.
How Portable Are the Metadata Standards for Scientific Data? (Jian Qin)
The one-covers-all approach in current metadata standards for scientific data has serious limitations in keeping up with the ever-growing data. This paper reports the findings from a survey of metadata standards in the scientific data domain and argues for the need for a metadata infrastructure. The survey collected 4400+ unique elements from 16 standards and categorized these elements into 9 categories. The highest counts of elements occurred in the descriptive category, and many of them overlapped with Dublin Core (DC) elements; this pattern was repeated in the elements that co-occurred across different standards. A small number of semantically general elements appeared across the largest numbers of standards, while the rest of the element co-occurrences formed a long tail with a wide range of specific semantics. The paper discusses the implications of these findings for metadata portability and infrastructure, and points out that large, complex standards and widely varied naming practices are the major hurdles for building a metadata infrastructure.
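The head-and-long-tail pattern the survey reports can be illustrated with a toy co-occurrence count. The standards named and their element lists below are invented miniature examples, not the survey's actual data.

```python
# Toy reconstruction of the survey's co-occurrence analysis: count how
# many standards each metadata element appears in. A few general
# elements form the head; domain-specific ones form the long tail.
# The element sets below are invented examples.
from collections import Counter

standards = {
    "DDI":      {"title", "creator", "date", "sampling_procedure"},
    "EML":      {"title", "creator", "date", "taxonomic_coverage"},
    "DataCite": {"title", "creator", "date", "funder"},
    "CIF":      {"title", "cell_parameters"},
}

occurrence = Counter()
for elements in standards.values():
    occurrence.update(elements)

# Elements present in most standards: the semantically general head...
head = [e for e, n in occurrence.items() if n >= 3]
# ...and elements unique to one standard: the long, specific tail.
tail = [e for e, n in occurrence.items() if n == 1]
print(sorted(head), sorted(tail))
```

Even in this four-standard toy, only the Dublin-Core-like descriptive elements (`title`, `creator`, `date`) cross standards, which is the portability pattern the paper describes at scale.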
A presentation given by Manjula Patel (UKOLN) at the Repository Curation Environments (RECURSE) Workshop held at the 4th International Digital Curation Conference, Edinburgh, 1st December 2008,
http://www.dcc.ac.uk/events/dcc-2008/programme/
Curation and Preservation of Crystallography Data (Manjula Patel)
A presentation given by Manjula Patel (UKOLN) at "Chemistry in the Digital Age: A Workshop connecting research and education", June 11-12th 2009, Penn State University,
http://www.chem.psu.edu/cyberworkshop09
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks (Carole Goble)
Keynote presentation at the iConference 2015, Newport Beach, Los Angeles, 26 March 2015.
http://ischools.org/the-iconference/
BEWARE: presentation includes hidden slides AND in situ build animations - best viewed by downloading.
Lei Zheng has over 15 years of experience in areas such as machine learning, data mining, and software development. He currently works as a Senior Software Engineer at Yahoo, where he develops algorithms for spam filtering and detection of abusive behavior. Previously he held research positions at the University of Pittsburgh and JustSystems Evans Research, where he implemented algorithms and systems for information retrieval, natural language processing, and data mining.
This document discusses the need for digital curation specialists in library settings to manage the growing volume of scholarly data and output. It recognizes that libraries have the skills and infrastructure to curate digital resources but will need new roles like digital curators, archivists, and data scientists. These roles require new training programs and concentrations in areas like data curation to develop specialists that can preserve, organize, and provide access to digital collections over the long term.
Integrated research data management in the Structural Sciences (Manjula Patel)
A presentation given by Manjula Patel (UKOLN, University of Bath) at the I2S2 workshop "Scaling Up to Integrated Research Data Management", IDCC 2010, 6th December 2010, Chicago.
http://www.ukoln.ac.uk/projects/I2S2/events/IDCC-2010-ScalingUp-Wksp/
The document discusses the ISA (Investigation/Study/Assay) framework for enabling data reuse and reproducibility in bioscience research. The ISA framework provides a generic format for rich experimental descriptions and an infrastructure of open source software tools. It aims to minimize the burden of reporting, curating, sharing data and metadata from bioscience experiments to enable comprehension, reuse of data, and reproducibility. The framework promotes community engagement to develop community standards and document use cases.
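The Investigation/Study/Assay nesting that gives the ISA framework its name can be sketched as plain data structures. The dataclass fields below are a deliberate simplification for illustration, not the actual ISA-Tab/ISA-JSON schema.

```python
# Minimal sketch of the ISA hierarchy: an Investigation holds Studies,
# each Study holds Assays describing a measurement and its technology.
# Field names are simplified assumptions, not the real ISA schema.
from dataclasses import dataclass, field

@dataclass
class Assay:
    measurement_type: str   # e.g. "transcription profiling"
    technology: str         # e.g. "RNA-Seq"

@dataclass
class Study:
    title: str
    assays: list = field(default_factory=list)

@dataclass
class Investigation:
    identifier: str
    studies: list = field(default_factory=list)

inv = Investigation("I-0001")
study = Study("Stress response in yeast")
study.assays.append(Assay("transcription profiling", "RNA-Seq"))
inv.studies.append(study)

print(inv.identifier, len(inv.studies), len(inv.studies[0].assays))
```

The value of the hierarchy is that every assay is reported in the context of its study and investigation, so the metadata needed to interpret a dataset travels with it.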
Building the FAIR Research Commons: A Data Driven Society of Scientists (Carole Goble)
Science is knowledge work. The scientific method and scholarly communication are about facilitating “knowledge turns” – that is, the turning of observation and hypothesis through experimentation, comparison, and analysis into new, pooled knowledge. Turns depend on the FAIR flow and availability of data, methods for automated processing, reproducible results and on a society of scientists coordinating and collaborating. We need to build a new form of Research Commons and I will present my steps towards this.
Presented at the Symposium "The Future of a Data-Driven Society", Maastricht University, 25 Jan 2018, which accompanied the 42nd Dies Natalis, where I was awarded an honorary doctorate.
Personal video:
https://www.youtube.com/watch?v=k5WN6KDDatU&index=4&list=PLzi-FBaZlOOagma5dCW7WSA5lv22tmNMD
Video of the symposium:
https://www.youtube.com/watch?v=JN9eMMtCHf8&t=19s&index=6&list=PLzi-FBaZlOOagma5dCW7WSA5lv22tmNMD
Acting as Advocate? Seven steps for libraries in the data decade (Liz Lyon)
UKOLN advocates that libraries take seven steps to support data management and open science in the data decade:
1) Provide briefings on cloud data services in partnership with IT services.
2) Build usable data management tools in partnership with researchers.
3) Develop data sustainability strategies and articulate the costs and benefits.
4) Publish case studies on open science to show benefits of universal data sharing.
5) Present at university ethics committees to highlight open data issues.
6) Raise awareness of citizen science opportunities and guidelines for good practice.
7) Promote data citation and attribution to embed in publication practice.
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving... (Sarah Anna Stewart)
Presentation given at the M25 Consortium of Academic Libraries, CPD25 Event on 'The Role of the Library in Supporting Research'. Provides an introduction to data, software and PIDs and a brief look at how libraries can enable researchers to gain impact and credit for their research data and software.
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...Carole Goble
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure reproducibility of results. Funding agencies expect data (and increasingly software) management retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording
of metadata for their interpretation.
The FAIR Guiding Principles for scientific data management and stewardship (http://www.nature.com/articles/sdata201618) has been an effective rallying-cry for EU and USA Research Infrastructures. FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has 8 years of experience of asset sharing and data infrastructure ranging across European programmes (SysMO and EraSysAPP ERANets), national initiatives (de.NBI, German Virtual Liver Network, UK SynBio centres) and PI's labs. It aims to support Systems and Synthetic Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Sys Bio, highlighting the challenges of and approaches to sharing, credit, citation and asset infrastructures in practice. I'll also highlight recent experiments in affecting sharing using behavioural interventions.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Presented at COMBINE 2016, Newcastle, 19 September.
http://co.mbine.org/events/COMBINE_2016
A keynote given on experiences in curating workflows and web services.
3rd International Digital Curation Conference: "Curating our Digital Scientific Heritage: a Global Collaborative Challenge"
11-13 December 2007
Renaissance Hotel
Washington DC, USA
Being FAIR: FAIR data and model management SSBSS 2017 Summer SchoolCarole Goble
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will show explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the Life Science community, improve access to them and related tools, support training, and create a platform for dataset interoperability. As the Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform, I will show how this work relates to your projects.
[1] Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016), doi:10.1038/sdata.2016.18
- Scientific names for species can change over time as taxonomy knowledge evolves
- An event-centric ontology model represents names and changes through time using different URIs for taxon concepts at different times
- Transition and snapshot models can then simplify the descriptions by linking concepts over time or just showing current names
- This approach allows integrated representation of taxonomy knowledge and its revisions in a computable way
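The event-centric model above can be sketched in code. This is a minimal, hypothetical illustration (class names, URIs, and the example species renaming are mine, not from the source): each taxon concept at a given time gets its own URI, transition links connect concepts across taxonomic revisions, and a snapshot view follows those links to the current name.

```python
# Hypothetical sketch of an event-centric taxonomy model: a taxon concept
# at a given time has its own URI; transition links record revisions.
# URIs and attribute names are illustrative, not from a real vocabulary.

class TaxonConcept:
    def __init__(self, uri, name, valid_from):
        self.uri = uri
        self.name = name
        self.valid_from = valid_from
        self.replaced_by = None  # transition link to a later concept

    def link_transition(self, successor):
        """Record that this concept was revised into `successor`."""
        self.replaced_by = successor

def current_name(concept):
    """Snapshot view: follow transition links to the latest name."""
    while concept.replaced_by is not None:
        concept = concept.replaced_by
    return concept.name

# Example: a genus reassignment, represented as two time-scoped concepts.
old = TaxonConcept("http://example.org/taxon/123/2001",
                   "Aster ericoides", 2001)
new = TaxonConcept("http://example.org/taxon/123/2010",
                   "Symphyotrichum ericoides", 2010)
old.link_transition(new)

print(current_name(old))  # Symphyotrichum ericoides
```

The transition model keeps both URIs and the link between them; the snapshot model is just the result of `current_name`, hiding the history.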
Aspects of Reproducibility in Earth Science by Raul Palma
The document discusses aspects of reproducibility in earth science research within the European Virtual Environment for Research - Earth Science Themes (EVEREST) project. The key objectives of EVEREST are to establish an e-infrastructure to facilitate collaborative earth science research through shared data, models, and workflows. Research Objects (ROs) will be used to capture and share workflows, processes, and results to help ensure reproducibility and preservation of earth science research. An example RO is described for mapping volcano deformation using satellite imagery and other data sources. Issues around reproducibility related to data access, software dependencies, and manual intervention in workflows are also discussed.
The Seven Deadly Sins of Bioinformatics by Duncan Hull
Keynote talk at Bioinformatics Open Source Conference (BOSC) Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007 by Carole Goble, University of Manchester.
RARE and FAIR Science: Reproducibility and Research Objects by Carole Goble
Keynote at JISC Digifest 2015 on Reproducibility and Research Objects in Scholarly Communication
Includes hidden slides
All material except maybe the IT Crowd screengrab reusable
How Portable Are the Metadata Standards for Scientific Data? by Jian Qin
The one-covers-all approach in current metadata standards for scientific data has serious limitations in keeping up with ever-growing data. This paper reports the findings of a survey of metadata standards in the scientific data domain and argues for the need for a metadata infrastructure. The survey collected 4400+ unique elements from 16 standards and categorized them into 9 categories. The findings included that the highest counts of elements occurred in the descriptive category, and that many of them overlapped with DC elements. This pattern was also repeated in the elements that co-occurred across different standards. A small number of semantically general elements appeared across the largest number of standards, while the rest of the element co-occurrences formed a long tail with a wide range of specific semantics. The paper discusses the implications of these findings for metadata portability and infrastructure, and points out that large, complex standards and widely varied naming practices are the major hurdles to building a metadata infrastructure.
A presentation given by Manjula Patel (UKOLN) at the Repository Curation Environments (RECURSE) Workshop held at the 4th International Digital Curation Conference, Edinburgh, 1st December 2008.
http://www.dcc.ac.uk/events/dcc-2008/programme/
Curation and Preservation of Crystallography Data by Manjula Patel
A presentation given by Manjula Patel (UKOLN) at "Chemistry in the Digital Age: A Workshop connecting research and education", June 11-12th 2009, Penn State University,
http://www.chem.psu.edu/cyberworkshop09
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks by Carole Goble
Keynote presentation at the iConference 2015, Newport Beach, California, 26 March 2015.
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
http://ischools.org/the-iconference/
BEWARE: presentation includes hidden slides AND in situ build animations - best viewed by downloading.
Lei Zheng has over 15 years of experience in areas such as machine learning, data mining, and software development. He currently works as a Senior Software Engineer at Yahoo, where he develops algorithms for spam filtering and detection of abusive behavior. Previously he held research positions at the University of Pittsburgh and JustSystems Evans Research, where he implemented algorithms and systems for information retrieval, natural language processing, and data mining.
This document discusses the need for digital curation specialists in library settings to manage the growing volume of scholarly data and output. It recognizes that libraries have the skills and infrastructure to curate digital resources but will need new roles like digital curators, archivists, and data scientists. These roles require new training programs and concentrations in areas like data curation to develop specialists that can preserve, organize, and provide access to digital collections over the long term.
Integrated research data management in the Structural Sciences by Manjula Patel
A presentation given by Manjula Patel (UKOLN, University of Bath) at the I2S2 workshop "Scaling Up to Integrated Research Data Management", IDCC 2010, 6th December 2010, Chicago.
http://www.ukoln.ac.uk/projects/I2S2/events/IDCC-2010-ScalingUp-Wksp/
The document discusses the ISA (Investigation/Study/Assay) framework for enabling data reuse and reproducibility in bioscience research. The ISA framework provides a generic format for rich experimental descriptions and an infrastructure of open source software tools. It aims to minimize the burden of reporting, curating, sharing data and metadata from bioscience experiments to enable comprehension, reuse of data, and reproducibility. The framework promotes community engagement to develop community standards and document use cases.
Building the FAIR Research Commons: A Data Driven Society of Scientists by Carole Goble
Science is knowledge work. The scientific method and scholarly communication are about facilitating “knowledge turns” – that is, the turning of observation and hypothesis through experimentation, comparison, and analysis into new, pooled knowledge. Turns depend on the FAIR flow and availability of data, methods for automated processing, reproducible results and on a society of scientists coordinating and collaborating. We need to build a new form of Research Commons and I will present my steps towards this.
Presented at Symposium: The Future of a Data-Driven Society, Maastricht University, 25 Jan 2018 that accompanied the 42nd Dies Natalis where I was awarded an honorary doctorate
Personal video:
https://www.youtube.com/watch?v=k5WN6KDDatU&index=4&list=PLzi-FBaZlOOagma5dCW7WSA5lv22tmNMD
Video of the symposium:
https://www.youtube.com/watch?v=JN9eMMtCHf8&t=19s&index=6&list=PLzi-FBaZlOOagma5dCW7WSA5lv22tmNMD
Acting as Advocate? Seven steps for libraries in the data decade by Liz Lyon
UKOLN advocates that libraries take seven steps to support data management and open science in the data decade:
1) Provide briefings on cloud data services in partnership with IT services.
2) Build usable data management tools in partnership with researchers.
3) Develop data sustainability strategies and articulate the costs and benefits.
4) Publish case studies on open science to show benefits of universal data sharing.
5) Present at university ethics committees to highlight open data issues.
6) Raise awareness of citizen science opportunities and guidelines for good practice.
7) Promote data citation and attribution to embed in publication practice.
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving... by Sarah Anna Stewart
Presentation given at the M25 Consortium of Academic Libraries, CPD25 Event on 'The Role of the Library in Supporting Research'. Provides an introduction to data, software and PIDs and a brief look at how libraries can enable researchers to gain impact and credit for their research data and software.
In the last decade, several Scientific Knowledge Graphs (SKGs) were released, representing scientific knowledge in a structured, interlinked, and semantically rich manner. But what kind of information do they describe? How have they been built? What can we do with them? In this lecture, I will first provide an overview of well-known SKGs, like Microsoft Academic Graph, Dimensions, and others. Then, I will present the Academia/Industry DynAmics (AIDA) Knowledge Graph, which describes 21M publications and 8M patents according to i) the research topics drawn from the Computer Science Ontology, ii) the type of the author's affiliations (e.g., academia, industry), and iii) 66 industrial sectors (e.g., automotive, financial, energy, electronics) from the Industrial Sectors Ontology (INDUSO). Finally, I will showcase a number of tools and approaches that use such SKGs to support researchers, companies, and policymakers in making sense of research dynamics.
This document discusses several studies on user engagement in research data curation. It finds that institutional repositories for data were developed without input from researchers, leading to systems that did not meet researchers' needs. Barriers to open data sharing included concerns over commercial use and maintaining ownership. Successful data curation requires understanding disciplinary differences and developing trusted relationships with researchers through dialogue early in projects.
Metadata and Semantics Research Conference, Manchester, UK 2015
Research Objects: why, what and how,
In practice the exchange, reuse and reproduction of scientific experiments is hard, dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not "finished": codes fork, data is updated, algorithms are revised, workflows break, service updates are released. Neither should they be viewed as just second-class artifacts tethered to publications, but as the focus of research outcomes in their own right: articles clustered around datasets, methods with citation profiles. Many funders and publishers have come to acknowledge this, moving to data sharing policies and provisioning e-infrastructure platforms. Many researchers recognise the importance of working with Research Objects, and the term has become widespread. However: what is a Research Object? How do you mint one, exchange one, build a platform to support one, curate one? How do we introduce them in a lightweight way that platform developers can migrate to? What is the practical impact of a Research Object Commons on training, stewardship, scholarship, sharing? How do we address the scholarly and technological debt of making and maintaining Research Objects? And are there any examples?
I’ll present our practical experiences of the why, what and how of Research Objects.
This document summarizes Professor Carole Goble's presentation on making research more reproducible and FAIR (Findable, Accessible, Interoperable, Reusable) through the use of research objects and related standards and infrastructure. It discusses challenges to reproducibility in computational research and proposes bundling datasets, workflows, software and other research products into standardized research objects that can be cited and shared to help address these challenges.
A Big Picture in Research Data Management by Carole Goble
A personal view of the big picture in Research Data Management, given at GFBio - de.NBI Summer School 2018 Riding the Data Life Cycle! Braunschweig Integrated Centre of Systems Biology (BRICS), 03 - 07 September 2018
This document discusses engaging researchers in research data management (RDM) through data reference interviews. It provides an overview of EDINA and the University of Edinburgh Data Library and their roles in assisting researchers. It then describes the data reference interview process, highlighting the importance of understanding the researcher's field and data. Recommendations are provided for interviewing researchers and tools for assessing data are introduced. The document concludes by discussing the University's RDM strategy and engagement tools.
Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.
This document discusses change management for libraries in the digital age. It notes that digital technologies are blurring traditional lines between types of resources, institutions, and access to information. Users now expect online access and searching across all information formats and locations. The management of digital information requires investment in people, technology, and resources. Libraries must develop new skills and roles to integrate physical and digital collections and provide one-stop searching. Repositories are important for managing and preserving the growing amount of digital research output and data. Metadata standards help link resources across repositories at multiple levels from institutional to international.
Scott Edmunds slides for class 8 from the HKU Data Curation (module MLIM7350 from the Faculty of Education) course covering science data, medical data and ethics, and the FAIR data principles.
Enabling better science - Results and vision of the OpenAIRE infrastructure a... by Paolo Manghi
The document discusses enabling better science through open access to research outputs. It describes the OpenAIRE infrastructure and the Research Data Alliance (RDA) Data Publishing Working Group. OpenAIRE provides services to link publications, research data, projects and initiatives. The RDA group aims to create an open service for linking datasets to publications. OpenAIRE and PANGAEA are developing a beta data-literature linking service to increase discovery and reuse of research outputs.
Keynote: SemSci 2017: Enabling Open Semantic Science
1st International Workshop co-located with ISWC 2017, October 2017, Vienna, Austria,
https://semsci.github.io/semSci2017/
Abstract
We have all grown up with the research article and article collections (let’s call them libraries) as the prime means of scientific discourse. But research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
We can think of "Research Objects" as being of different types and as packages of all the components of an investigation. If we stop thinking of publishing papers and start thinking of releasing Research Objects (like software), then scholarly exchange is a new game: ROs and their content evolve; they are multi-authored and their authorship evolves; they are a mix of virtual and embedded, and so on.
But first, some baby steps before we get carried away with a new vision of scholarly communication. Many journals (e.g. eLife, F1000, Elsevier) are just figuring out how to package together the supplementary materials of a paper. Data catalogues are figuring out how to virtually package multiple datasets scattered across many repositories to keep the integrated experimental context.
Research Objects [1] (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described. The brave new world of containerisation provides the containers and Linked Data provides the metadata framework for the container manifest construction and profiles. It’s not just theory, but also in practice with examples in Systems Biology modelling, Bioinformatics computational workflows, and Health Informatics data exchange. I’ll talk about why and how we got here, the framework and examples, and what we need to do.
[1] Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Philip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, Carole Goble, Why linked data is not enough for scientists, In Future Generation Computer Systems, Volume 29, Issue 2, 2013, Pages 599-611, ISSN 0167-739X, https://doi.org/10.1016/j.future.2011.08.004
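The packaging idea described above can be sketched concretely. The following is an illustrative, simplified manifest for a Research Object: a package that aggregates the components of an investigation and records their relationships and annotations. The field names loosely follow the researchobject.org manifest conventions, but the exact structure, identifiers, and file names here are assumptions for illustration, not the normative specification.

```python
import json

# A minimal, hypothetical Research Object manifest. It aggregates data,
# a workflow, and a narrative, and attaches a provenance annotation to
# the workflow. Field names are simplified for illustration.
manifest = {
    "@context": "https://w3id.org/bundle/context",
    "id": "urn:uuid:example-research-object",
    "aggregates": [
        {"uri": "data/measurements.csv", "mediatype": "text/csv"},
        {"uri": "workflow/analysis.cwl", "mediatype": "text/x-cwl"},
        {"uri": "paper/preprint.pdf", "mediatype": "application/pdf"},
    ],
    "annotations": [
        # Linked Data (here, a Turtle file) carries the rich context.
        {"about": "workflow/analysis.cwl",
         "content": "annotations/provenance.ttl"},
    ],
}

print(json.dumps(manifest, indent=2))
```

The design point is that the container (a zip, a BagIt bag, a Docker image) only holds the bits; it is the manifest's Linked Data descriptions that make the bundle's context, provenance and relationships machine-readable.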
Publishing of Scientific Data - Science Foundation Ireland Summit 2010 by jodischneider
This document discusses trends in publishing scientific data, including requirements to deposit data, citing data through identifiers like DOIs, considering data itself as a publication in data journals or databases, and including interactive data within publications. It also outlines new roles for working with scientific data, such as data scientists and curators who extract facts from literature to populate databases and ensure data quality.
4. A Revolution in Research
"What is at stake is nothing less than the ways in which astronomy will be done in the era of information abundance."
(Astronomer George Djorgovski)
5. Drivers of context change
Growth of the internet
Low cost, rapid digitization of print materials
Open Source movement
Rise of social software, Web 2.0 tools, mobile
Publishing and scholarship ecosystem
Changing policies
Internationalization of scholarship
Growth in primary source datasets
6. Key characteristics of the current research landscape
The products of research and the starting point of new research are increasingly digital, and increasingly "born-digital"
Exploding volumes of data and rising demand for data use, driven by the rapid pace of digital technology innovations
The rapid expansion of the inputs and outputs of scholarship
7. Linking the Scholarly Lifecycle
[Diagram: a network of interlinked scholarly outputs, including vitae, grants, related articles, comments, notebooks & reviews, models, codes, presentations, algorithms, preprints, podcasts, methods, video, plans, data, ontologies, and intermediate results]
10. Examples of text as data
Changes in word sense (e.g. consumption (TB), moot, oratio¹) and spelling (e.g. 18th-century long s (ſ) to s; -re to -er)
Bibliometrics and other usage analyses
Citation patterns
Institution vs. discipline
Author demographics
Pharma: drug / symptom correlation
Biology: species / date / location observations
Social Sci: work/life habits of undergrads based on access patterns at different institutions [usage-data based]
…
11. Text Mining
Unstructured text to queryable data structures
WHY? Too much text to analyze by hand.
Improved discovery (better 'metadata')
Business intelligence, e.g. content stats -> content acquisitions
Saleable datasets, e.g. distribution of authors vs. disciplines vs. grants
End-user research agendas
High-end: custom (user-specified) mining as a service
Simple: visualization of results (frequency / co-occurrence …)
12. Datasets: Factoids & point data
ca. 1.4M faculty (50% full-time) in US HE; ~75M people enrolled in US HE
ca. 100k faculty in UK HE
44% of researchers use online (other people's) datasets for their research
48% of researchers use datasets > 1GB
10.8% store their data outside their institution (50% store it in their "lab")
1–5% of datasets are formally moved into the curation process
66% of faculty have requested other people's data (and 49% of those got it)
26.5% have the expertise to analyze their own data
80.3% do not have sufficient expertise to manage their own data
Institutional storage costs ~ $600 / TB / year
58% is the annual increase in the amount of data being generated
20–40% is the annual growth in the amount of storage deployed (est.)
< 1% of ecological data is accessible after publication
> 85% of all information is in text form
2.7 times more citations accrue to papers with accessible data
3 to 6 times more papers emerge if the data is accessible
13. Curation OF scholar data
Tools to ingest, add & validate schemas, publish, migrate and preserve (DMP¹ provision)
Tools to analyze²
Tools to discover datasets
"Summon" for IR datasets, gov't datasets …
Tools to merge (create composite datasets)³
Citation management & attribution for datasets
Generic capabilities (domain-specific later)
14. Dataset provision TO scholars
Content procurement and dissemination
What we do already (intermediary)
Needs discovery tools
Easy to focus on selected domains that are publicly available
Most research does not use publicly available data
15. Towards reproducible research
Reproducible research means context, quality, trust; it means easy access to the sources.
Science depends entirely on the knowledge and data gained in the past to further advance.
16. Preserving Research Data
Growing trend of journals and publishers linking to open-access data repositories
Elsevier and PANGAEA – Publishing Network for Geoscientific & Environmental Data
Reciprocal linking of articles and the data behind the research
Journals and funding agencies setting policy to preserve and associate data supporting research results
e.g. American Naturalist new policy:
"This journal requires, as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive, such as GenBank, TreeBASE, Dryad, or the Knowledge Network for Biocomplexity. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species."
20. ProQuest Microfilm
PQ business original objectives: preservation and access
New technology: microfilming
1938 British Library – 120,000 first printed books in English
1939 established dissertations filming, printing program
1940s began microfilming newspapers
1948 began microfilming serials
Added 700+ Research Collections for the academic market, still actively filming several
2.5M Dissertations and Theses, actively filming
Newspaper Archive contains 10,700 titles, 900 titles actively filming
21. Microfilm Commitment
With the ongoing research and archival need for microfilmed content, ProQuest invested significantly to build a new filming operation in Ypsilanti, MI.
Opened May 2010
Employing 65 staff
Utilizing eBeam cameras: digital images to film masters
Scanning operation
Utilizing 2 archive locations: Iron Mountain and Ypsilanti
29. Microform and Digital Interface
Microforms are the source materials for numerous historical digital products:
Historical Newspapers
Periodicals Archive Online, Periodicals Index Online
Early English Books Online
Parliamentary Papers
Sanborn Maps, Geo-edition Sanborn Maps
Gerritsen Collection of Women's History
700+ Research Collections……
30. Digital Microfilm
[Screenshot: Adobe controls for zooming, rotating, printing, saving, emailing PDFs or links; an area for further date selection]
33. Dissertations
ProQuest "UMI" Dissertation Publishing
Over 50 years
Official repository of dissertations and theses for the national libraries of Canada and the United States
Archive
Use of microform
Multi-location digital copies
Tape
35. Preservation of inputs and outputs of scholarship
Publication is part of a wider network of scholarly information: original data, shared databases, multimedia expressions, social media.
[Diagram: a network of interlinked outputs, including vitae, grants, related articles, comments, notebooks & reviews, models, codes, presentations, algorithms, preprints, podcasts, methods, video, plans, data, ontologies, and intermediate results]
Preservation should encompass all of this.
36. Our concern for scholarship
Secondary source publications are much better protected than inputs to research
Research data-explosion
Primary sources
Datasets
Text as data
Focus on objects rather than linkages
We need to continue to support the preservation of scholarship inputs and outputs as they evolve
37. Our questions for us…
Can practices of preservation and sustainability become commonplace?
What is the right balance of new digital technology and analog methods of preservation?
Film industry: research and practice on preserving born-digital films
How should we approach going beyond the current atomic level of preservation, the object? How should we deal with:
Links
Text-as-data mining
38. Towards increasing the sustainability of research output
Persistent identifiers: linkages to the underlying outputs of scholarship, e.g. DOI, ISBN, ISNI
Establishing a network of safe/trusted repositories for all outputs of scholars
Link/citation practices for outputs, not just official publications; focus on reliability
39. Preservation of born digital outputs
The capability to preserve objects in digital formats – addressing storage capacity; accessibility; and the frequent churn in digital formats, media, and tools that turn bits into humanly-recognizable artifacts – is a core requirement of digital scholarship.
Leverage microfilm as a superior vehicle for "born digital" preservation
Driver for the movement from print to digital in library collections. See, for example, the 2009 Ithaka paper, "What to Withdraw: Print Collections Management in the Wake of Digitization"
40. Preservation as a practice
We have a history in the preservation of scholarship that continues today
Build preservation practices into our everyday management of scholarly inputs and outputs
Work with the community of scholars, libraries, and publishers to evolve our thinking of needs and practices
Working with CRL towards a TRAC criteria audit of our digital data and content
Partner with repositories for sustainability
Whilst content can be obfuscated or reduced, there are thorny issues with usage data. Early policy decisions need to be taken with respect to exposing usage data, even indirectly (triangulation is always possible).
¹ Oratio has shifted from 'speech' to 'prayer' and back again in the Latin literature. See Greg Crane et al.
Figures on faculty demographics are from http://nces.ed.gov/programs/digest/d09. Sources for the datasets are given in an earlier paper.
¹ DMP: JISC / NSF mandated Data Management Plan.
² Both 'canned', such as histograms, and user-scriptable.
³ E.g. combining observational data over time and space to turn point measurements into a time series of distribution maps.
A reminder: Digital Microfilm acts like an extension of microfilm – there is no searching. It does provide basic amounts of metadata – for newspapers: title, year, month, day, and page – that make it easy to skip through the reels. Another reminder: it is web-based, so researchers can access the film content from their kitchens or their dorm rooms.