Research data management (RDM) and the FAIR principles (Findable, Accessible, Interoperable, Reusable) are widely
promoted as the basis for a shared research data infrastructure. Nevertheless, researchers working with next-generation
sequencing (NGS) still lack adequate RDM solutions. NGS metadata is generally not stored together with the raw
NGS data, but kept by individual researchers in separate files. This situation complicates RDM practice. Moreover,
the (meta)data often does not meet the FAIR principles [6]. Consequently, a central FAIR-compliant repository
is highly desirable to support NGS-related research. We have selected iRODS (the integrated Rule-Oriented Data
System) [3] as the basis for implementing a sequencing data repository because it allows data and metadata to be stored
together. iRODS serves as scalable middleware that provides centralized, virtualized access to different storage facilities
and supports different types of clients. This repository will be part of an ecosystem of RDM solutions that
cover complementary phases of the research data life cycle in our organization (Academic Medical Center of the
University of Amsterdam). We selected Virtuoso [5] to enrich the metadata from iRODS and to manage
a triplestore for linked data. The metadata in the iCAT (the iRODS metadata catalogue) and the ontology in Virtuoso
are kept synchronized by enforcing strict data manipulation policies. We have implemented a prototype to
preserve raw sequencing data for one research group. Three iRODS client interfaces are used for different purposes:
Davrods [4] for data and metadata ingestion and data retrieval; Metalnx-web [7] for administration, data curation, and
repository browsing; and iCommands [2] for all tasks by advanced users. Different user profiles (principal
investigator, data curator, repository administrator) are defined, each with different access rights. New data is ingested by copying raw
sequence files and the corresponding metadata file (a sample sheet) to a landing collection on iRODS. The sample
sheet triggers an iRODS rule, which extracts the metadata and registers it in the iCAT as AVUs (Attribute,
Value, Unit triples). Ontology files are registered in Virtuoso. The sequence files are copied to a persistent collection
and made uniquely identifiable based on their metadata. All steps are recorded in a report file that enables
monitoring and tracking of progress and faults. Here we describe the design and implementation of the prototype,
and discuss the first assessment results. Initial results indicate that the proposed solution is acceptable and fits the
researchers' workflows well.
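The metadata-extraction step of the ingestion rule can be illustrated with a minimal sketch: parsing a CSV sample sheet into (Attribute, Value, Unit) triples per sample, the shape in which an iRODS rule would register them to the iCAT. This is plain Python for illustration only, not the actual rule code; the `Sample_ID` column name and the `samplesheet` unit label are assumptions, not taken from the prototype.

```python
import csv
import io

def samplesheet_to_avus(samplesheet_text, unit="samplesheet"):
    """Parse a CSV sample sheet and emit one (Attribute, Value, Unit)
    triple per non-empty cell, keyed by sample ID, mirroring how an
    iRODS rule might register sample metadata as AVUs in the iCAT."""
    reader = csv.DictReader(io.StringIO(samplesheet_text))
    avus = {}
    for row in reader:
        sample_id = row.pop("Sample_ID")  # hypothetical key column
        avus[sample_id] = [(attr, value, unit)
                           for attr, value in row.items() if value]
    return avus

# Toy sample sheet with invented column names and values
sheet = """Sample_ID,Organism,Library_Prep
S001,Homo sapiens,TruSeq
S002,Mus musculus,Nextera
"""
print(samplesheet_to_avus(sheet))
```

In the prototype the resulting triples would be attached to the corresponding data objects (e.g. via `imeta` or the rule engine's metadata API) rather than returned as a dictionary.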
Data integration in a Hadoop-based data lake: A bioinformatics caseIJDKP
When we work in a data lake, data integration is not easy, mainly because the data is usually
stored in raw format. Manually performing data integration is a time-consuming task that requires the
supervision of a specialist, which can make mistakes or not be able to see the optimal point for data integration among two or more datasets. This paper presents a model to perform heterogeneous in-memory
data integration in a Hadoop-based data lake based on a top-k set similarity approach. Our main contribution is the process of ingesting, storing, processing, integrating, and visualizing the data integration
points. The algorithm for data integration is based on the Overlap coefficient since it presented better
results when compared with the set similarity metrics Jaccard, Sørensen-Dice, and the Tversky index. We
tested our model applying it on eight bioinformatics-domain datasets. Our model presents better results
when compared to an analysis of a specialist, and we expect our model can be reused for other domains of
datasets.
This presentation was delivered at the Elsevier Library Connect Seminar on 6 October 2014 in Johannesburg, 7 October 2014 in Durban and 9 October 2014 in Cape Town and gives an overview of the potential role that librarians can play in research data management
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : A C...IJNSA Journal
In health research, one of the major tasks is to retrieve, and analyze heterogeneous databases containing one single patient’s information gathered from a large volume of data over a long period of time. The main objective of this paper is to represent our ontology-based information retrieval approach for clinical Information System. We have performed a Case Study in the real life hospital settings. The results obtained illustrate the feasibility of the proposed approach which significantly improved the information retrieval process on a large volume of data over a long period of time from August 2011 until January 2012.
This document summarizes Professor Carole Goble's presentation on making research more reproducible and FAIR (Findable, Accessible, Interoperable, Reusable) through the use of research objects and related standards and infrastructure. It discusses challenges to reproducibility in computational research and proposes bundling datasets, workflows, software and other research products into standardized research objects that can be cited and shared to help address these challenges.
This presentation was given by Jon Wheeler and Karl Benedict of the University of New Mexico during the joint NISO-NFAIS Virtual Conference held on December 7, 2016
Presentation to the UM Library Emergent Research SeriesSEAD
SEAD is a 5-year project funded by the NSF to develop cyberinfrastructure for sustainable data preservation and access. It is a partnership between the universities of Michigan, Indiana, and Illinois. SEAD aims to serve researchers in sustainability science who work in small teams and have diverse data needs. It provides active curation tools, collaboration spaces, and interfaces that integrate data, publications, and people. Data can be deposited to university repositories through the SEAD Virtual Archive for long-term preservation and discovery. Lessons show more support is needed to bridge data production and long-term infrastructure. Future plans include expanding the user community and repository options.
This document summarizes a presentation on open science and open data. It discusses the importance of open research data for reproducibility and innovation. It outlines key policy developments promoting open data, including funder data policies and journal data policies. It also describes CODATA's activities related to data policies, frameworks for developing open data strategies, and components of the international open science ecosystem.
Similar to FAIR sequencing data repository based on iRODS (20)
Just like life, our code must adapt to the ever changing world we live in. From one day coding for the web, to the next for our tablets or APIs or for running serverless applications. Multi-runtime development is the future of coding, the future is to be dynamic. Let us introduce you to BoxLang.
Photoshop Tutorial for Beginners (2024 Edition)alowpalsadig
Photoshop Tutorial for Beginners (2024 Edition)
Explore the evolution of programming and software development and design in 2024. Discover emerging trends shaping the future of coding in our insightful analysis."
Here's an overview:Introduction: The Evolution of Programming and Software DevelopmentThe Rise of Artificial Intelligence and Machine Learning in CodingAdopting Low-Code and No-Code PlatformsQuantum Computing: Entering the Software Development MainstreamIntegration of DevOps with Machine Learning: MLOpsAdvancements in Cybersecurity PracticesThe Growth of Edge ComputingEmerging Programming Languages and FrameworksSoftware Development Ethics and AI RegulationSustainability in Software EngineeringThe Future Workforce: Remote and Distributed TeamsConclusion: Adapting to the Changing Software Development LandscapeIntroduction: The Evolution of Programming and Software Development
Photoshop Tutorial for Beginners (2024 Edition)Explore the evolution of programming and software development and design in 2024. Discover emerging trends shaping the future of coding in our insightful analysis."Here's an overview:Introduction: The Evolution of Programming and Software DevelopmentThe Rise of Artificial Intelligence and Machine Learning in CodingAdopting Low-Code and No-Code PlatformsQuantum Computing: Entering the Software Development MainstreamIntegration of DevOps with Machine Learning: MLOpsAdvancements in Cybersecurity PracticesThe Growth of Edge ComputingEmerging Programming Languages and FrameworksSoftware Development Ethics and AI RegulationSustainability in Software EngineeringThe Future Workforce: Remote and Distributed TeamsConclusion: Adapting to the Changing Software Development LandscapeIntroduction: The Evolution of Programming and Software Development
The importance of developing and designing programming in 2024
Programming design and development represents a vital step in keeping pace with technological advancements and meeting ever-changing market needs. This course is intended for anyone who wants to understand the fundamental importance of software development and design, whether you are a beginner or a professional seeking to update your knowledge.
Course objectives:
1. **Learn about the basics of software development:
- Understanding software development processes and tools.
- Identify the role of programmers and designers in software projects.
2. Understanding the software design process:
- Learn about the principles of good software design.
- Discussing common design patterns such as Object-Oriented Design.
3. The importance of user experience (UX) in modern software:
- Explore how user experience can improve software acceptance and usability.
- Tools and techniques to analyze and improve user experience.
4. Increase efficiency and productivity through modern development tools:
- Access to the latest programming tools and languages used in the industry.
- Study live examples of applications
Enhanced Screen Flows UI/UX using SLDS with Tom KittPeter Caitens
Join us for an engaging session led by Flow Champion, Tom Kitt. This session will dive into a technique of enhancing the user interfaces and user experiences within Screen Flows using the Salesforce Lightning Design System (SLDS). This technique uses Native functionality, with No Apex Code, No Custom Components and No Managed Packages required.
What’s new in VictoriaMetrics - Q2 2024 UpdateVictoriaMetrics
These slides were presented during the virtual VictoriaMetrics User Meetup for Q2 2024.
Topics covered:
1. VictoriaMetrics development strategy
* Prioritize bug fixing over new features
* Prioritize security, usability and reliability over new features
* Provide good practices for using existing features, as many of them are overlooked or misused by users
2. New releases in Q2
3. Updates in LTS releases
Security fixes:
● SECURITY: upgrade Go builder from Go1.22.2 to Go1.22.4
● SECURITY: upgrade base docker image (Alpine)
Bugfixes:
● vmui
● vmalert
● vmagent
● vmauth
● vmbackupmanager
4. New Features
* Support SRV URLs in vmagent, vmalert, vmauth
* vmagent: aggregation and relabeling
* vmagent: Global aggregation and relabeling
* vmagent: global aggregation and relabeling
* Stream aggregation
- Add rate_sum aggregation output
- Add rate_avg aggregation output
- Reduce the number of allocated objects in heap during deduplication and aggregation up to 5 times! The change reduces the CPU usage.
* Vultr service discovery
* vmauth: backend TLS setup
5. Let's Encrypt support
All the VictoriaMetrics Enterprise components support automatic issuing of TLS certificates for public HTTPS server via Let’s Encrypt service: https://docs.victoriametrics.com/#automatic-issuing-of-tls-certificates
6. Performance optimizations
● vmagent: reduce CPU usage when sharding among remote storage systems is enabled
● vmalert: reduce CPU usage when evaluating high number of alerting and recording rules.
● vmalert: speed up retrieving rules files from object storages by skipping unchanged objects during reloading.
7. VictoriaMetrics k8s operator
● Add new status.updateStatus field to the all objects with pods. It helps to track rollout updates properly.
● Add more context to the log messages. It must greatly improve debugging process and log quality.
● Changee error handling for reconcile. Operator sends Events into kubernetes API, if any error happened during object reconcile.
See changes at https://github.com/VictoriaMetrics/operator/releases
8. Helm charts: charts/victoria-metrics-distributed
This chart sets up multiple VictoriaMetrics cluster instances on multiple Availability Zones:
● Improved reliability
● Faster read queries
● Easy maintenance
9. Other Updates
● Dashboards and alerting rules updates
● vmui interface improvements and bugfixes
● Security updates
● Add release images built from scratch image. Such images could be more
preferable for using in environments with higher security standards
● Many minor bugfixes and improvements
● See more at https://docs.victoriametrics.com/changelog/
Also check the new VictoriaLogs PlayGround https://play-vmlogs.victoriametrics.com/
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSISTier1 app
Are you ready to unlock the secrets hidden within Java thread dumps? Join us for a hands-on session where we'll delve into effective troubleshooting patterns to swiftly identify the root causes of production problems. Discover the right tools, techniques, and best practices while exploring *real-world case studies of major outages* in Fortune 500 enterprises. Engage in interactive lab exercises where you'll have the opportunity to troubleshoot thread dumps and uncover performance issues firsthand. Join us and become a master of Java thread dump analysis!
A neural network is a machine learning program, or model, that makes decisions in a manner similar to the human brain, by using processes that mimic the way biological neurons work together to identify phenomena, weigh options and arrive at conclusions.
Secure-by-Design Using Hardware and Software Protection for FDA ComplianceICS
This webinar explores the “secure-by-design” approach to medical device software development. During this important session, we will outline which security measures should be considered for compliance, identify technical solutions available on various hardware platforms, summarize hardware protection methods you should consider when building in security and review security software such as Trusted Execution Environments for secure storage of keys and data, and Intrusion Detection Protection Systems to monitor for threats.
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...kalichargn70th171
Visual testing plays a vital role in ensuring that software products meet the aesthetic requirements specified by clients in functional and non-functional specifications. In today's highly competitive digital landscape, users expect a seamless and visually appealing online experience. Visual testing, also known as automated UI testing or visual regression testing, verifies the accuracy of the visual elements that users interact with.
Streamlining End-to-End Testing Automation with Azure DevOps Build & Release Pipelines
Automating end-to-end (e2e) test for Android and iOS native apps, and web apps, within Azure build and release pipelines, poses several challenges. This session dives into the key challenges and the repeatable solutions implemented across multiple teams at a leading Indian telecom disruptor, renowned for its affordable 4G/5G services, digital platforms, and broadband connectivity.
Challenge #1. Ensuring Test Environment Consistency: Establishing a standardized test execution environment across hundreds of Azure DevOps agents is crucial for achieving dependable testing results. This uniformity must seamlessly span from Build pipelines to various stages of the Release pipeline.
Challenge #2. Coordinated Test Execution Across Environments: Executing distinct subsets of tests using the same automation framework across diverse environments, such as the build pipeline and specific stages of the Release Pipeline, demands flexible and cohesive approaches.
Challenge #3. Testing on Linux-based Azure DevOps Agents: Conducting tests, particularly for web and native apps, on Azure DevOps Linux agents lacking browser or device connectivity presents specific challenges in attaining thorough testing coverage.
This session delves into how these challenges were addressed through:
1. Automate the setup of essential dependencies to ensure a consistent testing environment.
2. Create standardized templates for executing API tests, API workflow tests, and end-to-end tests in the Build pipeline, streamlining the testing process.
3. Implement task groups in Release pipeline stages to facilitate the execution of tests, ensuring consistency and efficiency across deployment phases.
4. Deploy browsers within Docker containers for web application testing, enhancing portability and scalability of testing environments.
5. Leverage diverse device farms dedicated to Android, iOS, and browser testing to cover a wide range of platforms and devices.
6. Integrate AI technology, such as Applitools Visual AI and Ultrafast Grid, to automate test execution and validation, improving accuracy and efficiency.
7. Utilize AI/ML-powered central test automation reporting server through platforms like reportportal.io, providing consolidated and real-time insights into test performance and issues.
These solutions not only facilitate comprehensive testing across platforms but also promote the principles of shift-left testing, enabling early feedback, implementing quality gates, and ensuring repeatability. By adopting these techniques, teams can effectively automate and execute tests, accelerating software delivery while upholding high-quality standards across Android, iOS, and web applications.
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...The Third Creative Media
"Navigating Invideo: A Comprehensive Guide" is an essential resource for anyone looking to master Invideo, an AI-powered video creation tool. This guide provides step-by-step instructions, helpful tips, and comparisons with other AI video creators. Whether you're a beginner or an experienced video editor, you'll find valuable insights to enhance your video projects and bring your creative ideas to life.
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid
IBM watsonx Code Assistant for Z, our latest Generative AI-assisted mainframe application modernization solution. Mainframe (IBM Z) application modernization is a topic that every mainframe client is addressing to various degrees today, driven largely from digital transformation. With generative AI comes the opportunity to reimagine the mainframe application modernization experience. Infusing generative AI will enable speed and trust, help de-risk, and lower total costs associated with heavy-lifting application modernization initiatives. This document provides an overview of the IBM watsonx Code Assistant for Z which uses the power of generative AI to make it easier for developers to selectively modernize COBOL business services while maintaining mainframe qualities of service.
🏎️Tech Transformation: DevOps Insights from the Experts 👩💻campbellclarkson
Connect with fellow Trailblazers, learn from industry experts Glenda Thomson (Salesforce, Principal Technical Architect) and Will Dinn (Judo Bank, Salesforce Development Lead), and discover how to harness DevOps tools with Salesforce.
The Role of DevOps in Digital Transformation.pdfmohitd6
DevOps plays a crucial role in driving digital transformation by fostering a collaborative culture between development and operations teams. This approach enhances the speed and efficiency of software delivery, ensuring quicker deployment of new features and updates. DevOps practices like continuous integration and continuous delivery (CI/CD) streamline workflows, reduce manual errors, and increase the overall reliability of software systems. By leveraging automation and monitoring tools, organizations can improve system stability, enhance customer experiences, and maintain a competitive edge. Ultimately, DevOps is pivotal in enabling businesses to innovate rapidly, respond to market changes, and achieve their digital transformation goals.
Orca: Nocode Graphical Editor for Container OrchestrationPedro J. Molina
Tool demo on CEDI/SISTEDES/JISBD2024 at A Coruña, Spain. 2024.06.18
"Orca: Nocode Graphical Editor for Container Orchestration"
by Pedro J. Molina PhD. from Metadev
Orca: Nocode Graphical Editor for Container Orchestration
FAIR Sequencing Data Repository based on iRODS
Felipe O. Gutierrez, Paul De Geest, Aldo Jongejan, Sjoerd Repping, J.T. van den
Berg, Antoine H.C. van Kampen, Sílvia D. Olabarriaga
Academic Medical Center of the University of Amsterdam - Amsterdam, NL
{f.oliveiragutierrez, p.f.degeest, a.jongejan, s.repping, j.t.vandenberg, a.h.vankampen,
s.d.olabarriaga}@amc.uva.nl
Diogo F.C. Patrão
A. C. Camargo Cancer Center - São Paulo, Brazil
djogopatrao@gmail.com
ABSTRACT
Research data management (RDM) and the FAIR principles (Findable, Accessible, Interoperable, Reusable) are widely
promoted as the basis for a shared research data infrastructure. Nevertheless, researchers involved in next generation
sequencing (NGS) still lack adequate RDM solutions. The NGS metadata is generally not stored together with the raw
NGS data, but kept by individual researchers in separate files. This situation complicates RDM practice. Moreover,
the (meta)data often does not meet the FAIR principles [6]. Consequently, a central FAIR-compliant repository
is highly desirable to support NGS-related research. We have selected iRODS (integrated Rule-Oriented Data System) [3] as a basis for implementing a sequencing data repository because it allows storing both data and metadata
together. iRODS serves as scalable middleware to access different storage facilities in a centralized and virtualized
way, and supports different types of clients. This repository will be part of an ecosystem of RDM solutions that
cover complementary phases of the research data life cycle in our organization (Academic Medical Center of the
University of Amsterdam). We selected Virtuoso [5] to enrich the metadata from iRODS to enable the management
of a triplestore for linked data. The metadata in the iCAT (iRODS’ metadata catalogue) and the ontology in Virtuoso
are kept synchronized by enforcement of strict data manipulation policies. We have implemented a prototype to
preserve raw sequencing data for one research group. Three iRODS client interfaces are used for different purposes:
Davrods [4] for data and metadata ingestion and data retrieval; Metalnx-web [7] for administration, data curation, and
repository browsing; and iCommands [2] for all tasks by advanced users. Different user profiles are defined (principal
investigator, data curator, repository administrator), with different access rights. New data is ingested by copying raw
sequence files and the corresponding metadata file (a sample sheet) to the landing collection on iRODS. An iRODS
rule is triggered by the sample sheet file; this rule extracts the metadata and registers it in the iCAT as AVUs (Attribute, Value and Unit). Ontology files are registered in Virtuoso. The sequence files are copied to the persistent collection
and are made uniquely identifiable based on metadata. All the steps are recorded into a report file that enables
monitoring and tracking of progress and faults. Here we describe the design and implementation of the prototype,
and discuss the first assessment results. Initial results indicate that the proposed solution is acceptable and fits the
researchers' workflow well.
Keywords
Next Generation Sequencing, Research Data Management, iRODS, FAIR principles, genomics data, ontology.
INTRODUCTION
As part of modern biomedical research many types of (large) data sets are produced. OMICS experiments - and in
particular next generation sequencing (NGS) - are no exception. Biomedical research involving NGS data generally
comprises collaborations between clinical departments, laboratories, and data analysis groups that have different
cultures and procedures for working with data. With the growth of data, it has become challenging to keep track
of data, processes and outcomes of research over long periods of time and across the collaborating units. Adequate
iRODS UGM 2017 June 13-15, 2017, Utrecht, Netherlands
[Authors retain copyright.]
management of research data has become paramount to guaranteeing efficiency and integrity of research, as well
as to enable future data reuse and exploitation. Important initiatives concerning OMICS data management are
supported by, for example, the BBMRI and ELIXIR programs, which are active across Europe. Several of these
initiatives promote and highlight the FAIR principles for research data [6]. According to these principles, research
data should be Findable, Accessible, Interoperable and Reusable, which are properties that demand great care in terms
of collection, annotation and archival. Nevertheless, few solutions exist for NGS data management that accommodate
these principles. In practice, often there is no central repository for sequencing data, and each individual researcher
is responsible for the storage and backup of their data and metadata. Moreover, NGS related metadata is generally
not generated automatically, requiring manual annotation by the researchers. And finally, metadata is typically not
stored together with the NGS data.
With our prototype we aim to develop a repository for raw NGS research data for our organization. The repository
should adhere to the FAIR principles by defining standardized, minimum, interoperable metadata which needs to be
preserved together with the data. iRODS [3] is used as a basis for this solution, in combination with a variety of
clients to fulfill the diverse needs of the different user roles. This paper briefly introduces characteristics of NGS data
before describing the designed solution. Preliminary results of a pilot case study are presented and discussed.
NGS RESEARCH DATA REPOSITORY
NGS data is generated by sequencing the genetic material within biological samples, i.e. measuring the genetic material
and thereby defining the order of its nucleotides. An NGS experiment generally comprises the following steps:
• prepared biological samples are sent to a sequencing facility;
• the sequencing facility performs one or more sequencing runs using a sequencing instrument (sequencer);
• the output of these runs are images that are converted to raw sequencing data (comprising several million reads). The NGS data is stored in a standard file format such as FASTQ, a text-based format that stores both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores;
• information (metadata) about the run(s), sample(s) and output file(s) is described in the so-called sample
sheet, which can take different formats and contain different contents across sequencing platforms and services.
NGS data of this form is being generated in our organization by multiple research groups that, at the moment, need to take care of their own data storage and management needs. Metadata related to these data are maintained locally by the respective researchers. Data and metadata are spread over various systems, using non-standardized protocols, making data retrieval, traceability and reuse extremely difficult. Lastly, because we are an academic hospital, the (meta)data can be highly sensitive and therefore cannot leave the site, which excludes the use of any external repositories. In our project we aim to design a repository that can be used by various research
groups in our organization which would also be a viable option for similar organizations. In order to follow the
FAIR principles we aim to harmonize the format and content of the sample sheets, to represent the metadata in an
ontology, and to store the metadata and data together. Note that the metadata associated with a raw NGS dataset
is potentially extremely rich, describing the whole process from the biological sample acquisition through the wet lab
sample preparation and finally sequencing. Collecting such detailed metadata is not a trivial task for researchers.
Different research groups have their own priorities for NGS-related metadata; therefore, in this project, we allow
each research group to define a subset of the recommended metadata fields, in addition to the minimum of required
metadata for an NGS experiment. This flexibility is intended to make the system more user-friendly and to increase researchers' willingness to provide this metadata, which comes at the expense of some standardization.
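The split between a required minimum and group-specific recommended fields could be implemented as a small validation step when a sample sheet is ingested. The sketch below is illustrative only: the field and group names are hypothetical, since the paper does not list the actual metadata schema.

```python
# Sketch: validate one sample-sheet row against a required minimum that
# applies to all groups, plus the recommended fields chosen by one group.
# All field and group names are hypothetical, not the repository's schema.

REQUIRED_FIELDS = {"run_id", "sample_id", "organism", "sequencer"}

GROUP_RECOMMENDED = {
    "group_a": {"tissue", "library_prep"},
    "group_b": {"tumor_stage"},
}

def validate_row(row, group):
    """Return a list of missing required/recommended fields for one row."""
    problems = []
    for field in sorted(REQUIRED_FIELDS):
        if not row.get(field):
            problems.append(f"missing required field: {field}")
    for field in sorted(GROUP_RECOMMENDED.get(group, set())):
        if not row.get(field):
            problems.append(f"missing recommended field: {field}")
    return problems

row = {"run_id": "run01", "sample_id": "S1", "organism": "human"}
issues = validate_row(row, "group_b")
```

A missing required field would block ingestion, while a missing recommended field could merely be flagged in the report, preserving the intended flexibility.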
REPOSITORY ARCHITECTURE
The main components of the proposed system are presented in Figure 1: iRODS as the main data and metadata
storage server, Virtuoso as the ontology server, and the different access points and connection protocols. OpenLink Virtuoso [5] is a hybrid database engine, allowing not only the management of relational databases but also of a
triplestore for linked data. It provides support for all major ontology formats and allows data-access through the
Virtuoso conductor web-interface.
Figure 1. Main components of our FAIR NGS Data repository: iRODS for files and metadata, Virtuoso for RDF
store, four types of user interfaces (blue), and rules (yellow). Data and metadata storage resources are depicted
in green. A data repository is a collection with two areas called landing (for data ingestion) and persistent (for
retrieval). Rules are triggered by new data on the landing area, and use microservices to create and load
metadata and ontology.
Within iRODS there are currently two storage resources (the iCAT and the data storage space), which are placed in
the same storage pool. The data storage space is further divided into repositories that correspond to collections that
are potentially owned by different research groups. Each repository has two directories (landing and persistent), each with different access permissions. The landing directory is used temporarily for new data and metadata ingestion, and the persistent directory holds data and metadata permanently. Rules in iRODS are used for extracting metadata from files written into the landing directory, loading the metadata into the iCAT and Virtuoso [5], and moving the files to the persistent directory.
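The dual registration described here can be sketched by deriving both iCAT-style AVU triples and RDF statements from the same sample-sheet row, so that the catalogue metadata and the triplestore content stay consistent by construction. The namespace and column names below are illustrative assumptions, not the repository's actual schema.

```python
import csv
import io

# Sketch: derive both iCAT-style (Attribute, Value, Unit) triples and RDF
# statements for Virtuoso from the same sample-sheet row, mirroring how
# the rules keep both stores synchronized. Namespace and column names
# are illustrative.

BASE = "http://example.org/ngs/"  # hypothetical namespace

def row_to_avus(row):
    """One AVU per non-empty column; units are left empty here."""
    return [(attr, value, "") for attr, value in row.items() if value]

def row_to_rdf(row, run_id):
    """Simple N-Triples-style statements about the run."""
    subject = f"<{BASE}run/{run_id}>"
    return [f'{subject} <{BASE}{attr}> "{value}" .'
            for attr, value in row.items() if value]

sheet = "run_id,sample_id,organism\nrun01,S1,human\n"
row = next(csv.DictReader(io.StringIO(sheet)))
avus = row_to_avus(row)
triples = row_to_rdf(row, row["run_id"])
```

In the prototype itself this logic lives inside iRODS rules and microservices; the point of the sketch is only that both representations are produced from a single parse of the sample sheet.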
Users can directly access iRODS in three ways, depending on their needs and knowledge: iCommands [2], Davrods [4] and Metalnx-web [7]. Davrods [4] is an Apache WebDAV interface that provides access to iRODS. It is a bridge between the WebDAV protocol and the iRODS API, implemented as an Apache HTTPD module. It leverages the Apache server implementation of the WebDAV protocol, mod_dav, for compliance with the WebDAV Class 2 standard. Through Davrods the user can ingest and retrieve data from iRODS, mounting it as a file system on his/her workstation. However, the user cannot work with metadata through this client interface. Metalnx-web [7] is a web
application designed to work alongside iRODS. It provides a graphical UI that can help simplify most administration,
collection management, and metadata management tasks, removing the need to memorize the long list of iCommands.
This is the richest interface for accessing iRODS: it includes functionalities to execute all iCommands using a graphical interface and also to retrieve data and metadata.
FUNCTIONAL WORKFLOW
Figure 2 presents the functional workflow. It describes a generic approach for ingesting data into iRODS (through
Davrods), for generating metadata in iRODS and ontology entries in Virtuoso (through iRODS rules), and for retrieving
data and metadata.
Figure 2. Functional workflow to ingest and retrieve (meta)data from iRODS. Downwards arrows indicate data/metadata ingestion flow, upwards arrows indicate retrieval flows and the horizontal arrows indicate automated processing.
Before ingestion into the repository, the sequencing files and the metadata need to be collected into a single directory
that has the name of the sequence run that will be ingested into iRODS. This directory is represented by the name
runXX in Figure 2, and it contains all fastq files and one sample sheet (CSV file). Once the user is logged in through
a mounted Davrods client, they can copy the complete directory to the landing directory. The rules are triggered
when the sample sheet, a CSV file, is copied through Davrods into a landing directory of an iRODS collection and is correctly interpreted. Each rule has its own specific set of functions, and the rules use several microservices from
iRODS. At the end of the rule execution, all results are moved to the persistent directory, which has read-only access, and the metadata are loaded into iRODS and Virtuoso, allowing users to query and find their data easily, improving Findability (see FAIR principles).
The rules implement several steps; each step is monitored along the workflow, and possible failures are reported
in a report file. This report is stored together with the data and can be consulted by the user to check whether the
data and metadata were loaded correctly into the repository. Each line of the report file corresponds to one step in the
workflow, indicating the date, time, the path to the file being handled, and the resulting status. If the
rule executed correctly, the final result is OK; otherwise, an error message explains the specific step that went wrong.
The report file is encoded in HTML, giving a user-friendly view of all steps and messages.
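The per-step report entries described above could be generated along these lines; this is only a sketch of the idea, assuming an HTML table layout, and the helper name and column order are illustrative rather than the actual report format:

```python
from datetime import datetime

def report_line(step: str, path: str, status: str = "OK") -> str:
    """Format one report entry as an HTML table row:
    timestamp, workflow step, file being handled, and resulting status."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return (f"<tr><td>{stamp}</td><td>{step}</td>"
            f"<td>{path}</td><td>{status}</td></tr>")
```

Appending one such row per workflow step yields an HTML file the user can open directly to verify that the ingestion succeeded.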
USER POLICIES AND ROLES
Different research groups need separate repositories for their NGS research data. We create iRODS collections,
users, and groups to organize them on the storage server. One of the policies of our iRODS instance is to
create metadata in iCat and Virtuoso automatically when researchers ingest data (see Figure 1) into their own
landing directories. Additionally, we can define different rules for each research group. Furthermore, the two
directories in the storage space have different permission policies: in the landing directory the user has write
permissions, while the persistent directory is read-only. We have enabled SSL on the iRODS server, so all clients
connect over a secure protocol. iRODS users have the same access rights through Metalnx-web as through
iCommands. Davrods access works at the group level, so all users in a specific group have access to that group's
landing and persistent directories.
We have defined four principal user roles for our system. The most basic user role is the Researcher, who only has
access to the landing and persistent storage places of the groups they belong to. The files ingested into the landing
directory are private to this user. The files and metadata in the persistent directory are readable by all users in
the group. The Data Steward has full access to the user group directory they belong to, including rights to change
the metadata. The Principal Investigator has full control of the repository, being able to include users into groups
and assign roles. Finally, the Repository Administrator has the ability to create rules and manage all services in the
system. The first three roles belong mainly to iRODS, whereas the last one also has control of the server.
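The four roles and the directory permissions described above can be summarized in a small sketch; the role names mirror the text, but the permission labels and helper function are illustrative assumptions, not the actual iRODS ACL values:

```python
# Illustrative role/permission map mirroring the four roles in the text.
# "own" stands for full control; real iRODS ACLs use their own vocabulary.
ROLE_PERMISSIONS = {
    "researcher": {"landing": "write", "persistent": "read"},
    "data_steward": {"landing": "own", "persistent": "own"},
    "principal_investigator": {"landing": "own", "persistent": "own",
                               "manage_users": True},
    "repository_administrator": {"landing": "own", "persistent": "own",
                                 "manage_users": True, "manage_rules": True},
}

def can_write(role: str, directory: str) -> bool:
    """Return True if the given role may write to the given directory."""
    perms = ROLE_PERMISSIONS.get(role, {})
    return perms.get(directory) in ("write", "own")
```

In the prototype these permissions are enforced by iRODS ACLs on the landing and persistent collections rather than by application code.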
USE CASE EVALUATION
The use case of a single research group is currently used for evaluating the prototype environment. A repository
was created for this group with corresponding collection, group and user accounts. A user was assigned by the
research group and trained to use the system. The goal was to store some of the existing raw NGS data into the
repository with minimum metadata. The datasets used in this evaluation covered varied conditions regarding the
number and size of files. The beta user connected to the repository from a Linux desktop using a mounted Davrods
drive for data ingestion and retrieval, and iCommands for metadata search. The metadata and data preparation
was supported by the Data Steward from our team for the first datasets. The user, Data Steward and developer
had close communication during the whole process to discuss difficulties and design improvements which were then
implemented using the agile development methodology [1]. We monitored and tested the user experience along the
phases, detailed in the workflow section, and summarized here:
• Data and metadata preparation: the user prepares a directory with NGS data files and a sample sheet
(CSV file) with metadata that follows a specific format.
• Data and metadata ingestion: the user accesses the Davrods mount point drive to copy data and metadata
to the landing directory.
• Rule processing: the sample sheet triggers an iRODS rule, generating a XML file with AVU metadata
(Attribute, Value, Unit) to be ingested into iRODS and a RDF file to be ingested into Virtuoso.
• Data and metadata retrieval: Metalnx and iCommands are used to search files through metadata. Davrods
can be used to retrieve data only.
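The rule-processing step above, which turns the sample sheet into AVU triples for iRODS and RDF for Virtuoso, can be sketched as follows. This is a simplified illustration: the helper names and the `urn:meta:` namespace are our own assumptions, and the real rules use iRODS microservices rather than Python:

```python
import csv
import io

def sample_sheet_to_avus(csv_text: str) -> list:
    """Turn each sample-sheet column/value pair into an iRODS AVU
    triple (Attribute, Value, Unit); units are left empty here."""
    reader = csv.DictReader(io.StringIO(csv_text))
    avus = []
    for row in reader:
        for attribute, value in row.items():
            if value:
                avus.append((attribute, value, ""))
    return avus

def avus_to_turtle(subject: str, avus) -> str:
    """Serialize AVUs as simple RDF triples in Turtle syntax,
    under a hypothetical urn:meta: namespace."""
    lines = [f'<{subject}> <urn:meta:{a}> "{v}" .' for a, v, _ in avus]
    return "\n".join(lines)
```

The AVU list would be attached to the ingested data objects in iCat, while the Turtle output corresponds to the RDF file loaded into Virtuoso.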
We do this monitoring by asking the user directed questions regarding the following topics:
• clarity for the user: does the user know what to do in each case? For example, we wish to understand whether
the sample sheet requirements were clear enough to enable successful data ingestion for varied datasets.
• feasibility for the user: is it feasible for the user to accomplish the requirements in a practical and workable
manner?
• ease of use: is it easy for the user to use the system? E.g., mounting the landing directory, copying files, etc.
• feedback for the user: is the feedback coming from the system sufficient, clear, easy to find and timely?
• robustness: to determine if the system performs well, not only under ordinary conditions, but also under
unusual conditions that stress its designers' assumptions.
• performance with different workloads: to determine if the system performs well with more than one user
accessing it through different computers, as well with different sets of files.
Below we briefly describe our preliminary results for this usability test in a qualitative manner. We found that during
preparation of the data and metadata the user had only minor problems with user input in the sample sheet and file
names, e.g. the user left certain mandatory fields open. The user was mostly satisfied with the process of ingesting
the data and metadata after mounting the Davrods interface on their personal computer, although it was noted that,
due to the number of data files being ingested, the system can become slower. Additionally, the user had some minor
comments regarding the level of detail of the feedback coming from the system. In terms of rule processing the user
experienced some more serious problems, e.g., macOS systems generate hidden files that caused errors on the system
side. As for retrieving data and metadata, the user experienced minor problems learning iCommands but was mostly
satisfied using Metalnx-web, which implements all available iCommands.
DISCUSSION
Metadata is recognized as useful in all types of RDM systems; however, annotating detailed metadata takes much
effort. Therefore, we opted for a simple approach that is feasible for the users. The simplicity of the current sample
sheet, which is mostly generated already by the sequencing facilities, has been considered an advantage during this
first evaluation. The choice for iRODS as a basis for our NGS repository has also been considered positive in this
first evaluation. It proved to be flexible to accommodate various changes, in particular regarding metadata ingestion,
which is expected to be a dynamic component of any RDM system that evolves along time. We expect that this
technology will become better supported in our organization, leading to long term sustainability for the solution
proposed here.
Last, and most important, is the usability of the system. We have set up the prototype in a way that the user
does not need any iRODS knowledge by: choosing Davrods, which allows users to drag and drop files into
the system; processing the metadata automatically using iRules; using Metalnx-web as an interface for retrieving
data and metadata; having a data steward responsible for helping with successful ingestion of the data and metadata; and
having an administrator responsible for implementing new policies through iRules. We believe that these choices make
our prototype user friendly.
Based on the preliminary results of our usability tests, we found that the prototype was generally a feasible solution
for the management of NGS data and metadata, and we could reach the goals of our prototype even though this first
version is very simple. Additionally, we found that performance improvements of the landing directory could be
made. Finally, since we chose to reuse existing generic clients to access the repository, we observed that they provide
only limited contextual feedback. Therefore more user-friendly tools that are better customized to this application
area might be necessary to address the needs of other user profiles in the future.
ACKNOWLEDGMENTS
This project is financially supported by the AMC Innovation Fund. The authors would like to thank Rudy Scholte,
Lieuwe Kool, Joyce Nijkamp, Barbera van Schaik and Niek de Vries for their insightful contributions that helped
shape our RDM solution.
REFERENCES
[1] Agile methodology. http://agilemethodology.org/. Accessed: 2017-05-01.
[2] iRODS iCommands. https://docs.irods.org/4.1.10/. Accessed: 2017-05-01.
[3] A. Rajasekar, R. Moore, C.-y. Hou, C. A. Lee, R. Marciano, A. de Torcy, M. Wan, W. Schroeder, S.-Y. Chen,
L. Gilbert, et al. iRODS primer: integrated Rule-Oriented Data System. Synthesis Lectures on Information
Concepts, Retrieval, and Services, 2(1):1–143, 2010.
[4] T. Smeele and C. Smeele. Davrods - an Apache WebDAV interface to iRODS. iRODS User Group Meeting 2016
Proceedings, 1:41–47, Dec. 2016.
[5] OpenLink Virtuoso software. https://virtuoso.openlinksw.com/. Accessed: 2017-05-01.
[6] M. D. Wilkinson et al. The FAIR guiding principles for scientific data management and stewardship. Scientific
data, 3, 2016.
[7] S. Worth. An administrative and metadata UI for iRODS. https://github.com/Metalnx/metalnx-web. Accessed:
2017-05-01.