Running Dataverse repository in the European Open Science Cloud (EOSC), by vty
The document discusses Dataverse, an open source data repository software. It explains that Dataverse was developed at Harvard University, has a large community and development team, and is used in many countries as data repository infrastructure. It then describes the SSHOC Dataverse project, which aims to create a multilingual, standardized, and reusable open data infrastructure across several European countries. Finally, it notes that Dataverse is a reliable cloud service that enables FAIR data sharing and can be easily deployed by research organizations.
Numerous scientific teams use the HDF5 format to store very large datasets. Efficient use of this data in a distributed environment depends on client applications being able to read any subset of the data without transferring the entire file to the local machine. The goal of the HDF5-iRODS Project was to develop an HDF5-iRODS module for the iRODS datagrid server that supported this capability, and to apply the technology to an NCSA/SDSC Strategic Applications Program (SAP) project, FLASH.
A joint team from The HDF Group (representing NCSA) and the SDSC SRB group collaborated to accomplish the project goal. The team implemented five HDF5 microservices functions on the iRODS server, and developed an iRODS FLASH slice client application. The client implementation also includes a JNI interface that allows HDFView, a standard tool for browsing HDF5 files, to access HDF5 files stored remotely in iRODS. Finally, three new collection client/server calls were added to the iRODS APIs, making it easier for users to query the content of an iRODS collection.
This document summarizes the work done to enhance the Geospatial Data Abstraction Library (GDAL) to better support NASA Earth Observing System (EOS) data products. It describes three phases of work: 1) a proof-of-concept ArcGIS plugin for product-specific HDF drivers, 2) generalized HDF drivers and an XML format, and 3) collaboration with GDAL developers utilizing HDF drivers and a Virtual Format (VRT) specification. The third phase highlights include enhanced generic functions, coordination with GDAL developers, testing across GIS clients, outreach to other data centers, and building tutorials. Future work areas are also outlined.
Harris Corporation provides geospatial software and analytics tools to access and analyze scientific data from remote sensing platforms. Their ENVI and IDL software support common data formats like HDF and NetCDF and provide capabilities for calibration, bowtie correction, reprojection, and visualization of data from sensors including GOES-16, VIIRS, and ocean and weather satellites. The tools allow scientists and analysts to efficiently process large volumes of earth observation data and extract valuable information to support applications in weather forecasting, agriculture, infrastructure monitoring, and more.
Aashish Chaudhary gave a presentation on Kitware's work with scientific computing and visualization using HDF. HDF is a widely used data format at Kitware for domains like climate modeling, geospatial visualization, and information visualization. Kitware is looking to improve HDF support for cloud and web environments to enable streaming analytics and web-based data analysis. The company also aims to further open source collaboration and scientific computing.
The document proposes the SOLID architecture for managing big semantic data in real-time. It consists of five layers:
1) The Online Layer receives incoming real-time data.
2) The Data Layer stores the compressed semantic data using the RDF/HDT format.
3) The Index Layer provides an index for fast access to the Data Layer.
4) The Service Layer allows federated SPARQL queries across the different layers.
5) The Merge Layer periodically integrates the Online Layer data into a new copy of the Data Layer.
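The layered design above can be sketched in a few lines of Python. This is a minimal, illustrative model only (plain tuples stand in for RDF triples, and a frozenset stands in for the compressed RDF/HDT snapshot; the class and method names are invented for this sketch, not taken from the SOLID implementation):

```python
# Illustrative sketch of the SOLID layering: an online buffer absorbs
# real-time triples, queries see both layers, and a periodic merge
# folds the buffer into a new immutable snapshot of the Data Layer.

class SolidStore:
    def __init__(self):
        self.data_layer = frozenset()   # compressed, read-only snapshot
        self.online_layer = []          # incoming real-time triples

    def ingest(self, subject, predicate, obj):
        """Online Layer: buffer an incoming triple."""
        self.online_layer.append((subject, predicate, obj))

    def query(self, predicate):
        """Service Layer: answer a query over both layers at once."""
        return [t for t in list(self.data_layer) + self.online_layer
                if t[1] == predicate]

    def merge(self):
        """Merge Layer: build a new Data Layer copy, then reset the buffer."""
        self.data_layer = self.data_layer | frozenset(self.online_layer)
        self.online_layer = []

store = SolidStore()
store.ingest("sensor1", "hasReading", "21.5")
store.merge()                           # first triple moves to the Data Layer
store.ingest("sensor1", "hasReading", "22.0")
print(len(store.query("hasReading")))   # both layers are visible: 2
```

The key property the sketch tries to capture is that queries remain answerable during ingestion, because the Service Layer reads the immutable snapshot and the live buffer together, while the merge only ever replaces the snapshot wholesale.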
Building COVID-19 Museum as Open Science Project, by vty
This document discusses building a COVID-19 Museum as an open science project. It describes the speaker's background working on various data management projects. It discusses moving towards open science and sharing data according to FAIR principles. It outlines the Time Machine project for digitizing historical documents and its approach to data management. The rest of the document discusses using the Dataverse platform to build repositories, linking metadata to ontologies, using tools like Weblate for translations, and exploring the use of artificial intelligence and machine learning to enhance metadata and facilitate human-in-the-loop review processes.
This document outlines a project between the Odum Institute and IQSS Dataverse team to integrate the Dataverse data repository system with iRODS, an open source data management system. The goals are to expand storage options for Dataverse, integrate curation workflows, and connect Dataverse to national research data infrastructure. A prototype will be developed to enable automated ingest of data from Dataverse to iRODS using rules and APIs. Challenges include migrating both systems to newer versions while maintaining authentication between them. An initial prototype is expected in August 2015.
Kitware uses HDF as a widely adopted data format for scientific computing and visualization across several domains. HDF supports climate modeling, geospatial data, medical imaging, and more. Kitware is looking to improve HDF support for streaming big data, cloud computing, and web applications to enable more advanced analytics and sharing of scientific data. Future work may include pure JavaScript implementations of HDF tools and optimizing performance for cloud storage.
Generating Executable Mappings from RDF Data Cube Data Structure Definitions, by Christophe Debruyne
Data processing is increasingly the subject of various internal and external regulations, such as GDPR which has recently come into effect. Instead of assuming that such processes avail of data sources (such as files and relational databases), we approach the problem in a more abstract manner and view these processes as taking datasets as input. These datasets are then created by pulling data from various data sources. Taking a W3C Recommendation for prescribing the structure of and for describing datasets, we investigate an extension of that vocabulary for the generation of executable R2RML mappings. This results in a top-down approach where one prescribes the dataset to be used by a data process and where to find the data, and where that prescription is subsequently used to retrieve the data for the creation of the dataset “just in time”. We argue that this approach to the generation of an R2RML mapping from a dataset description is the first step towards policy-aware mappings, where the generation takes into account regulations to generate mappings that are compliant. In this paper, we describe how one can obtain an R2RML mapping from a data structure definition in a declarative manner using SPARQL CONSTRUCT queries, and demonstrate it using a running example. Some of the more technical aspects are also described.
Reference: Christophe Debruyne, Dave Lewis, Declan O'Sullivan: Generating Executable Mappings from RDF Data Cube Data Structure Definitions. OTM Conferences (2) 2018: 333-350
iRODS is an open source data management software developed by DICE at UNC and UCSD as a follow-on to SRB. It provides a customizable, policy-driven framework for implementing data grids and managing data across heterogeneous storage resources. Key features include modularity, extensibility through microservices and rules, and interoperability with systems like HDF5, NetCDF, and storage systems through integration extensions. RENCI provides support and commercial offerings around iRODS through their E-iRODS distribution.
SSHOC Dataverse in the European Open Science Cloud, by vty
This project summary covers the SSHOC project which aims to create a social sciences and humanities section of the European Open Science Cloud by maximizing data reuse through open science principles. The project will interconnect existing and new infrastructures through a clustered cloud, establish governance for SSH-EOSC, and provide a research data repository service for SSH institutions through further developing the Dataverse platform on EOSC. The project involves 47 partners across 20 beneficiaries and 27 linked third parties with a budget of €14,455,594.08 over 40 months to achieve these objectives.
This document discusses how HDF Product Designer (HPD) uses templates to achieve interoperability. HPD is an application for consistently developing interoperable data content in HDF5 files. It has a client-server architecture and desktop app. Templates allow users to copy design examples that incorporate best practices and are curated by the HPD development team. Available templates include NCEI collections and CF templates, with more to be added based on community review and suggestions. Templates allow users to initialize new designs by mixing and matching content from different template examples.
This presentation discusses moving data and applications from HDF4 to HDF5/netCDF-4. It covers the differences between HDF4 and HDF5 data models and capabilities, tools for converting HDF4 data to HDF5, advantages of HDF5 like unlimited dimensions and compression, and ways to ensure compatibility with netCDF-4 like avoiding HDF5-specific features. The work was supported by a NASA contract.
Flexible metadata schemes for research data repositories - Clarin Conference..., by Vyacheslav Tykhonov
The development of the Common Framework in Dataverse and the CMDI use case. Building an AI/ML-based workflow for predicting and linking concepts from external controlled vocabularies to CMDI metadata values.
This document discusses the 5 year evolution of Dataverse, an open source data repository platform. It began as a tool for collaborative data curation and sharing within research teams. Over time, features were added like dataset version control, APIs, and integration with other systems. The document outlines challenges around maintenance and sustainability. It also covers efforts to improve Dataverse's interoperability, such as integrating metadata standards and controlled vocabularies, and making datasets FAIR compliant. The goal is to establish Dataverse as a core component of the European Open Science Cloud by improving areas like software quality, integration with tools, and standardization.
Dataverse can be deployed using Docker containers to improve maintainability and portability. The document discusses how Docker can isolate applications and their dependencies into portable containers. It provides an example of deploying Dataverse as a set of microservices within Docker containers. Instructions are included on building Docker images, running containers, and managing the containers and images through commands and tools like Docker Desktop, Docker Hub, and Docker Compose.
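As a rough illustration of the decomposition just described, a Compose file might split Dataverse into separate application, database, and search containers. This is a hypothetical sketch: the image names, ports, and credentials below are placeholders, not the official Dataverse images or a supported configuration.

```yaml
# Hypothetical docker-compose.yml: one container per service.
services:
  dataverse:
    image: example/dataverse-app     # placeholder application image
    ports:
      - "8080:8080"
    depends_on:
      - postgres
      - solr
  postgres:
    image: postgres:13               # metadata database
    environment:
      POSTGRES_USER: dataverse
      POSTGRES_PASSWORD: changeme    # placeholder credential
  solr:
    image: solr:8                    # search index
```

With a file like this in place, the whole stack would be started with `docker compose up -d` and inspected with `docker compose ps`; isolating each service in its own container is what makes individual components upgradable and portable across hosts.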
Technical integration of data repositories: status and challenges, by vty
This document discusses technical integration of data repositories, including:
- Previous integration initiatives focused on metadata integration using OAI-PMH and ResourceSync protocols, as well as aggregators like OpenAIRE.
- Challenges to integration include different levels of software/service maturity, maintenance of distributed applications, and use of common standards and vocabularies.
- Potential integration efforts could focus on improving FAIRness, metadata/data flexibility, and connections between repositories, software, and computing resources to better enable reuse of EOSC data and services.
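To make the OAI-PMH side of the metadata-integration work concrete: a harvester issues plain HTTP GET requests whose entire semantics live in query parameters. A minimal request builder might look like the sketch below (the repository endpoint URL is a placeholder):

```python
from urllib.parse import urlencode

def oai_request(base_url, verb, **kwargs):
    """Build an OAI-PMH request URL.

    OAI-PMH encodes everything in query parameters: 'verb' selects the
    operation (Identify, ListRecords, ...), and optional arguments such
    as metadataPrefix, set, or resumptionToken refine it.
    """
    params = {"verb": verb}
    params.update(kwargs)
    return f"{base_url}?{urlencode(params)}"

# Harvest Dublin Core records from a (hypothetical) repository endpoint:
url = oai_request("https://repo.example.org/oai", "ListRecords",
                  metadataPrefix="oai_dc")
print(url)  # https://repo.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

A real harvester would then fetch the URL, parse the XML response, and follow `resumptionToken` values until the record list is exhausted; aggregators like OpenAIRE build on exactly this loop.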
The presentation for the W3C Semantic Web in Health Care and Life Sciences community group by Slava Tykhonov, DANS-KNAW, the Royal Netherlands Academy of Arts and Sciences (October 2020). The recording is available at https://www.youtube.com/watch?v=G9oiyNM_RHc
Automated CI/CD testing, installation and deployment of Dataverse infrastruct..., by vty
This document summarizes a presentation about automating CI/CD testing, installation, and deployment of Dataverse in the European Open Science Cloud. It discusses using Docker and Kubernetes for deployment, a community-driven QA plan using pyDataverse for test automation, and providing quality assurance as a service. The presentation also covers topics like the CESSDA maturity model, integrating Dataverse on Google Cloud, and using serverless computing for some Dataverse applications and services.
Ontologies, controlled vocabularies and Dataverse, by vty
Presentation on Semantic Web technologies for the Dataverse Metadata Working Group, run by the Institute for Quantitative Social Science (IQSS) at Harvard University.
DataverseEU: Building Multilingual infrastructure for the Social Sciences in..., by vty
This document discusses the DataverseEU project, which aims to build a multilingual infrastructure for social science data in Europe using the Dataverse platform. Key points:
- The project is led by DANS and funded by CESSDA to promote sharing of social science research data across Europe.
- Technical development includes a Docker module to deploy Dataverse in the cloud, multilingual interfaces in several European languages, and a plugin to integrate various persistent identifier services.
- The Docker module allows hosting unlimited Dataverses on different ports and building multilingual interfaces. It decomposes Dataverse into separate database, search, and application containers.
- The da|ra PID plugin will allow service providers to switch between identifier
Flexible metadata schemes for research data repositories - CLARIN Conference'21, by vty
The development of the Common Framework in Dataverse and the CMDI use case. Building an AI/ML-based workflow for predicting and linking concepts from external controlled vocabularies to CMDI metadata values.
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and the..., by Andrea Scharnhorst
Presentation given at the ISKO UK Research Observatory, November 24, 2021
RESEARCH REPOSITORIES AND DATAVERSE: NEGOTIATING METADATA, VOCABULARIES AND DOMAIN NEEDS
Vyacheslav Tykhonov, Jerry de Vries, Eko Indarto, Femmy Admiraal, Mike Priddy, and Andrea Scharnhorst: Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and the DANS EASY Research Data Repository
Abstract:
The development of metadata schemes in data repositories (and other content providers) has always been a process of negotiation between the needs of the designated user communities and the content of the collection on the one side, and standards developed in the field on the other. Automatisation has both enabled and enforced standardisation and alignment of metadata schemes. But, while designated user communities turned from local users into global ones (due to web services), their specific needs have not vanished. Technology offers possibilities to give the aforementioned negotiation a new form. In this presentation, we present the Dataverse platform, used by many data repositories. We show - using the case of the CMDI metadata and the CLARIN (Common Language Resources and Technology Infrastructure) community - how the Dataverse common core set of metadata, called the Citation Block, can be extended with custom fields defined as a discipline-specific metadata block. In particular, we show how these custom fields can be connected to a distributed network of authoritative controlled vocabularies, so that, ultimately, semantic search is possible. The presentation highlights opportunities and challenges, based on our own experiences. Related work has been presented at the CLARIN Annual Conference 2021 (see Proceedings).
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN..., by vty
Presentation at ISKO Knowledge Organisation Research Observatory. RESEARCH REPOSITORIES AND DATAVERSE: NEGOTIATING METADATA, VOCABULARIES AND DOMAIN NEEDS
Dataverse repository for research data in the COVID-19 Museum, by vty
The COVID-19 Museum has the ambition to create a platform to deposit, consult, aggregate, and study heterogeneous data about the pandemic using the features of a distributed web service. To achieve this, Dataverse has been selected as a reliable FAIR data repository with a built-in search engine and functionality for attaching computing resources to explore archived resources at both the data and metadata level. Presentation by Slava Tykhonov, DANS-KNAW (the Royal Netherlands Academy of Arts and Sciences), Université Paris Cité, 19 April 2022.
Presentation for CLARIAH IG Linked Open Data on the latest developments for the Dataverse FAIR data repository. Building a SEMAF workflow with external controlled vocabulary support and a Semantic API.
CLARIN CMDI use case and flexible metadata schemes, by vty
Presentation for CLARIAH IG Linked Open Data on the latest developments for the Dataverse FAIR data repository. Building a SEMAF workflow with external controlled vocabulary support and a Semantic API, and using TRIZ, the theory of inventive problem solving, for further innovation in Linked Data.
EUDAT Collaborative Data Infrastructure: Data Access and Re-use Service Area, by EUDAT
The document summarizes services provided by EUDAT, a collaborative data infrastructure funded by the European Union. It describes EUDAT's B2ACCESS identity and access management service, which allows users to access EUDAT and other services using their existing credentials. It also summarizes EUDAT's B2FIND data discovery service, B2SHARE data sharing and preservation service, and B2DROP file sharing service. It outlines their key features and integration with each other. Future plans include further deployments, improved interoperability, and enhanced user experiences across EUDAT services.
AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3..., by Stavros Papadopoulos
Slides used in the webinar TileDB hosted with participation from Spire Maritime, describing the use and accessibility of massive time series maritime data on TileDB Cloud.
The document discusses modernizing a data warehouse using the Microsoft Analytics Platform System (APS). APS is described as a turnkey appliance that allows organizations to integrate relational and non-relational data in a single system for enterprise-ready querying and business intelligence. It provides a scalable solution for growing data volumes and types that removes limitations of traditional data warehousing approaches.
Etosha - Data Asset Manager : Status and road mapDr. Mirko Kämpf
The document provides an overview and roadmap for the first release of an open data asset manager called Etosha MDS. Key points include:
- Etosha MDS will expose metadata about datasets to enable discovery, exploration, and risk analysis of data assets.
- The first release will focus on collecting and exposing schema, statistics, and semantic annotations about datasets using tools like SPARQL and a graph browser.
- Future releases will integrate datasets across Hadoop clusters using a shared semantic knowledge graph and dataset integration layer following the data as a service paradigm.
This document discusses different options for deploying a Hadoop cluster, including using an appliance like Oracle's Big Data Appliance, deploying on cloud infrastructure through Amazon EMR, or building your own "do-it-yourself" cluster. It provides details on the hardware, software, and costs associated with each option. The conclusion compares the pros and cons of each approach, noting that appliances provide high performance and integration but may be less flexible, while cloud deployments offer scalability and pay-per-use but require consideration of data privacy. Building your own cluster gives more control but requires more work to set up and manage.
Meeting today’s dissemination challenges – Implementing International Standar...Jonathan Challener
This document discusses the .Stat system, which serves as a central repository for validated statistics and metadata. .Stat connects data production, sharing, and dissemination processes. It provides three key functional areas: a data upload engine, a data delivery engine, and a data browser. .Stat can be mapped to stages in the Generic Statistical Business Process Model and incorporates standards like SDMX for dissemination, data exchange, and internal data sharing. The document outlines .Stat's current role and future plans to further support SDMX artifacts, ingest, registries, and semantic web opportunities.
BigDataEurope @BDVA Summit2016 1: The BDE PlatformBigData_Europe
The document discusses the Big Data Integrator (BDI) platform, a one-stop solution for big and smart data management developed by the BigDataEurope project. The BDI is a flexible, generic platform that supports a variety of big data components through its Docker-based architecture. It addresses requirements from multiple stakeholders and goes beyond existing solutions by incorporating semantic capabilities and enabling easy deployment of customized data pipelines. A demo of the BDI platform shows how different big data stacks can be deployed through its user-friendly interfaces.
EUDAT receives funding from the European Union's Horizon 2020 programme. The document discusses several EUDAT services including B2STAGE, B2NOTE, the Generic Execution Framework (GEF), and the Data Subscription Service. It outlines achievements in integrating these services and future plans to further develop and apply them in projects beyond EUDAT such as EOSC-hub.
This presentation was part of the Cloudify and XLAB Research Webinar about DevOps for Data Intensive Applications.
In this webinar we discussed how to leverage automation for your big data applications, using DICE tools based on the Cloudify Open Source Orchestration.
We want to make sure that developers use the time to develop their big data applications and not have to worry about deployment and operations, and have the shortest time to delivery possible.
We also cover using the DICE deployment tools for automated deployment of Spark, Storm, Cassandra or Hadoop.
Stargate, the gateway for some multi-models data APIData Con LA
Cedrick Lunven presents on the gateway for multi-model Data APIs. The presentation discusses why data gateways are rising in popularity, the architecture and implementations of gateways like Stargate, how Apache Cassandra can be used as a multi-model database, and demos Astra which is a Cassandra-as-a-Service. The presentation aims to explain the benefits of data gateways for both developers and database administrators.
Equinix Big Data Platform and Cassandra - A view into the journeyPraveen Kumar
Story of building Big Data Platform in Equinix to cater a number of use cases. It explains journey and selection of Cassandra for NoSQL solution sitting in the heart of the platform. Storm , flume, AMQ, Drools, Solr technologies playing an important role in the platform. Platform processing large amounts of data in real-time.
Reliable Data Intestion in BigData / IoTGuido Schmutz
Many of the Big Data and IoT use cases are based on combing data from multiple data sources and to make them available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files, databases to high-volume event streams from sensors (IoT devices). It’s important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real-time (stream processing) as well as in batch (typical big data processing). In past some new tools have emerged, which are especially capable of handling the process of integrating data from outside, often called Data Ingestion. From an outside perspective, they are very similar to a traditional Enterprise Service Bus infrastructures, which in larger organization are often in use to handle message-driven and service-oriented systems. But there are also important differences, they are typically easier to scale in a horizontal fashion, offer a more distributed setup, are capable of handling high-volumes of data/messages, provide a very detailed monitoring on message level and integrate very well with the Hadoop ecosystem. This session will present and compare Apache Flume, Apache NiFi, StreamSets and the Kafka Ecosystem and show how they handle the data ingestion in a Big Data solution architecture.
Similar to DataverseEU as multilingual repository (20)
Decentralised identifiers and knowledge graphs vty
Building an Operating System for Open Science: data integration challenges, Dataverse data repository and knowledge graphs. Lecture by Slava Tykhonov, DANS-KNAW, for the Journées Scientifiques de Rochebrune 2023 (JSR'23).
Decentralised identifiers for CLARIAH infrastructure vty
Slides of the presentation for CLARIAH community on the ideas how to make controlled vocabularies sustainable and FAIR (Findable, Accessible, Interoperable, Reusable) with the help of Decentralized Identifiers (DIDs).
Building collaborative Machine Learning platform for Dataverse network. Lecture by Slava Tykhonov (DANS-KNAW, the Netherlands), DANS seminar series, 29.03.2022
Controlled vocabularies and ontologies in Dataverse data repositoryvty
This document discusses supporting external controlled vocabularies in Dataverse. It proposes implementing a JavaScript interface to allow linking metadata fields to terms from external vocabularies accessed via SKOSMOS APIs. Several challenges are identified, such as applying support to any field, backward compatibility, and ensuring vocabularies come from authoritative sources. Caching concepts and linking dataset files directly to terms are also proposed to improve interoperability.
External controlled vocabularies support in Dataversevty
This presentation discusses adding support for external controlled vocabularies to the Dataverse data repository platform. It describes how ontologies like SKOS can be used to represent vocabularies and allow linking metadata fields in Dataverse to terms. The presentation proposes developing a Semantic Gateway plugin for Dataverse that would allow browsing and linking to external vocabularies hosted in the SKOSMOS framework via its API. This could improve metadata by allowing standardized, linked terms and help make data more FAIR.
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataversevty
This presentation is about external CVs support in Dataverse, Open Source data repository. Data Archiving and Networked Services (DANS-KNAW) decided to use Dataverse as a basic technology to build Data Stations and provide FAIR data services for various Dutch research communities.
Data standardization process for social sciences and humanitiesvty
This document discusses data standardization processes at DANS-KNAW. It describes how DANS-KNAW standardizes metadata during data deposit and harvesting through controlled vocabularies. It also discusses how DANS is developing the SSHOC DataverseEU project to standardize metadata across several European countries. The document concludes by emphasizing the importance of tracking provenance information and developing standardization pipelines and services to improve data and metadata access.
This document summarizes the development process for the DataverseSSHOC project. It outlines two parallel development tracks - a core development team modifying the Dataverse core functionality, and an application development team creating new tools. Tasks are managed using Trello. Code is stored in GitHub and BitBucket. The development follows a SCRUM process with Docker images available on Docker Hub and a Kubernetes cluster for deployment. Testing includes unit, integration, performance, and A/B testing to comply with CESSDA maturity standards.
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...Sérgio Sacani
Context. With a mass exceeding several 104 M⊙ and a rich and dense population of massive stars, supermassive young star clusters
represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions
among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate
the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars.
The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically,
the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec.
Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within
and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation
were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a
photon flux threshold of approximately 2 × 10−8 photons cm−2
s
−1
. The X-ray sources exhibit a highly concentrated spatial distribution,
with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known
massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfSelcen Ozturkcan
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
The cost of acquiring information by natural selectionCarl Bergstrom
This is a short talk that I gave at the Banff International Research Station workshop on Modeling and Theory in Population Biology. The idea is to try to understand how the burden of natural selection relates to the amount of information that selection puts into the genome.
It's based on the first part of this research paper:
The cost of information acquisition by natural selection
Ryan Seamus McGee, Olivia Kosterlitz, Artem Kaznatcheev, Benjamin Kerr, Carl T. Bergstrom
bioRxiv 2022.07.02.498577; doi: https://doi.org/10.1101/2022.07.02.498577
Current Ms word generated power point presentation covers major details about the micronuclei test. It's significance and assays to conduct it. It is used to detect the micronuclei formation inside the cells of nearly every multicellular organism. It's formation takes place during chromosomal sepration at metaphase.
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...Scintica Instrumentation
Targeting Hsp90 and its pathogen Orthologs with Tethered Inhibitors as a Diagnostic and Therapeutic Strategy for cancer and infectious diseases with Dr. Timothy Haystead.
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Leonel Morgado
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersion learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills MN
By harnessing the power of High Flux Vacuum Membrane Distillation, Travis Hills from MN envisions a future where clean and safe drinking water is accessible to all, regardless of geographical location or economic status.
3. Content of this presentation
Background of Dataverse software
CESSDA SaW, pilot projects
DataverseEU, CESSDA workplan 2018 project
SSHOC project
GUIDE proposal
4. Research data repository software
Open source
Institute for Quantitative Social Science (IQSS), Harvard University
Gary King
Started in 2007
Harvard has published 26,991 datasets with 314,934 files (November 2018)
https://dataverse.org/
5. Dataverse for CESSDA SP
Docker installation on CESSDA google cloud
Translation of UI and metadata fields in national languages
CESSDA Metadata Model (CMM) compliance
Optional PID plugin for the da|ra service
Incorporation of CESSDA Controlled vocabularies
DataverseEU, workplan 2018 project
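Translating the UI and metadata fields in practice means maintaining per-language label bundles with a fallback to English for untranslated entries. The sketch below illustrates that fallback behaviour; the labels and field names are invented for illustration, not taken from the actual Dataverse property bundles.

```python
# Hedged sketch: resolving a translated metadata-field label with a
# fallback to the default locale, mimicking how per-language property
# bundles behave. All labels here are illustrative.
LABELS = {
    "en": {"title": "Title", "author": "Author"},
    "fr": {"title": "Titre", "author": "Auteur"},
    "nl": {"title": "Titel"},  # deliberately incomplete translation
}

def label_for(field, locale, default_locale="en"):
    """Return the label for `field` in `locale`, falling back to the default."""
    bundle = LABELS.get(locale, {})
    return bundle.get(field, LABELS[default_locale].get(field, field))

print(label_for("title", "fr"))   # Titre
print(label_for("author", "nl"))  # falls back to "Author"
```

The fallback chain (requested locale, then default locale, then the raw field name) is what lets partners ship partial translations without breaking the interface.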
6. Participants of the DataverseEU project
With funding:
ADP (Slovenia)
AUSSDA (Austria)
DANS (Netherlands, lead)
GESIS (Germany)
SND (Sweden)
TARKI (Hungary)
Without funding:
PROGEDO (France)
SOHDA (Belgium)
UKDA (UK)
UNIDATA (Italy)
7. DataverseEU development model
Our goal isn't to create a new fork of Dataverse; all contributions should go to the IQSS repository
Delivered as Docker images and deployed in Google Cloud as the CESSDA DataverseEU repository
Any Service Provider can host a separate Dataverse instance in its own cloud if required (Archive in a box)
Metadata from other CESSDA repositories will be harvested by the central DataverseEU repository
Easy to add new languages without changes to core functionality if more partners join during or after the project
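Harvesting of partner metadata by a central repository is typically done over OAI-PMH. As a minimal sketch, the snippet below builds an OAI-PMH `ListRecords` request URL; the endpoint and set name are placeholders, not real service addresses.

```python
# Hedged sketch: constructing an OAI-PMH ListRecords request of the kind
# a central DataverseEU instance could use to harvest partner metadata.
# The base URL and set name are placeholders.
from urllib.parse import urlencode

def listrecords_url(base_url, metadata_prefix="oai_dc", oai_set=None):
    """Build a ListRecords URL; `oai_set` restricts the harvest to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if oai_set:
        params["set"] = oai_set
    return f"{base_url}?{urlencode(params)}"

url = listrecords_url("https://partner.example.org/oai", oai_set="cessda")
print(url)
# https://partner.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=cessda
```

A real harvester would also follow `resumptionToken`s across batches; that bookkeeping is omitted here.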
8. DataverseEU Docker module
Extremely powerful configuration management tool
Allows installing the software on any platform
Dataverse can be installed from Docker as a standalone container or as a platform delivering Docker-based microservices (database, search engine, core service)
Docker allows hosting an unlimited number of Dataverse instances on different ports
Docker can be used to organize multilingual interfaces
Published as Open Source software
http://github.com/IQSS/dataverse-docker
9. Docker Compose configuration
We've decomposed all Dataverse components and created separate containers for:
Postgres database, SOLR search engine, Glassfish application, da-ra PID plugin
Optional containers: Handle server, Swift storage, TwoRavens, iRODS
… actually, any tool that already has a Docker image can be connected as a microservice
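The decomposition described above can be pictured as a Compose file with one service per component. The fragment below is an illustrative sketch only; the image tags and settings are assumptions, not the project's actual configuration.

```yaml
# Illustrative sketch - image tags and settings are assumptions,
# not the DataverseEU project's actual compose file.
services:
  postgres:
    image: postgres:9.6
    environment:
      POSTGRES_DB: dvndb
  solr:
    image: solr:7
  dataverse:
    image: iqss/dataverse      # hypothetical image name
    ports:
      - "8080:8080"
    depends_on:
      - postgres
      - solr
```

Optional services (Handle server, Swift storage, TwoRavens) would be added as further entries under `services:` in the same way.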
15. Ready for the Linked Open Data (LOD) Cloud
We've developed a pipeline to export metadata records as RDF
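The slides do not show the export pipeline itself, so as a hedged sketch, the snippet below serializes a few Dataverse-style citation fields as N-Triples with Dublin Core predicates; the field mapping and the PID are invented for illustration.

```python
# Hedged sketch of a metadata-to-RDF export step: one N-Triples line per
# recognized (field, value) pair, using Dublin Core terms as predicates.
# Field names and the example PID are illustrative.
DC = "http://purl.org/dc/terms/"

def to_ntriples(pid, metadata):
    """Serialize known metadata fields of a dataset PID as N-Triples."""
    mapping = {"title": "title", "author": "creator", "subject": "subject"}
    lines = []
    for field, value in metadata.items():
        pred = mapping.get(field)
        if pred:
            lines.append(f'<{pid}> <{DC}{pred}> "{value}" .')
    return "\n".join(lines)

record = {"title": "Survey data 2018", "author": "Doe, J."}
print(to_ntriples("https://doi.org/10.5072/FK2/EXAMPLE", record))
```

A production pipeline would escape literals and handle multi-valued and compound fields, but the shape of the output is the same: subject (the PID), predicate (a vocabulary term), object (the metadata value).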
16. DataverseEU with da|ra PID plugin
DataverseEU requires a flexible way to switch between PID service providers (da|ra, DataCite, Handle)
Registering DOIs with the da|ra service (GESIS) will give data providers greater visibility and recognition, as data references will be integrated in the da|ra search index
Implemented as a microservice that can be plugged into Dataverse separately and has its own Docker image
EU data archives can get separate prefixes within the same Dataverse instance and increase their visibility and recognition
The PID plugin can be used in combination with an external storage configuration (based on Swift) to host data locally in national infrastructures
http://www.da-ra.de
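"Separate prefixes within the same Dataverse instance" means each archive's DOIs carry its own registrant prefix. As a minimal sketch, the snippet below combines an archive-specific prefix with a local identifier before registration; the prefixes shown are placeholders (10.5072 is the well-known DataCite test prefix), not allocations used by the project.

```python
# Hedged sketch: minting per-archive DOIs from archive-specific prefixes
# before registering them with a PID service such as da|ra.
# Prefixes and the identifier scheme are placeholders.
ARCHIVE_PREFIXES = {
    "DANS": "10.5072",    # DataCite test prefix, used here as a stand-in
    "AUSSDA": "10.5073",  # placeholder
}

def mint_doi(archive, local_id):
    """Return a DOI string namespaced by the archive's own prefix."""
    prefix = ARCHIVE_PREFIXES[archive]
    return f"doi:{prefix}/{archive}/{local_id}"

print(mint_doi("DANS", "ABC123"))  # doi:10.5072/DANS/ABC123
```

Keeping the prefix lookup outside the core service is what lets the plugin switch providers (da|ra, DataCite, Handle) without changes to Dataverse itself.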
17. Issues
Besides the UI and metadata fields, the SOLR search index also has to be translated
Every update of the software adds more lines to be translated
The API of the CESSDA CV service is very experimental
Still some issues with PID plugin mappings
For institutions interested in the Archive in a box solution, both the technical side and governance are important
18. Results and future plans
Development server: https://dataverse-dev.cessda.eu/
Docker images available at the end of the year for each partner
All developments are published on:
https://bitbucket.org/cessda/cessda.dataverseeu/src/develop/ (closed)
https://github.com/IQSS/dataverse-docker
Technical developments will be continued in the SSHOC project
Outreach and support for SPs in the GUIDE proposal
19. Social Sciences & Humanities Open Cloud
January 2019 – April 2022
Task 5.1
CESSDA/AUSSDA, CESSDA/NSD, DARIAH/PSNC, DARIAH/UGOE, CLARIN ERIC, CNR, Lead: CESSDA/DANS-KNAW
Deliverables:
Development of new functionality
Report on sustainability of the service
20. CESSDA GUIDE proposal
Development of support packages for aspiring/non-member SPs/DAs
Dataverse as an 'Archive in a box' package (software images with instructions and training material) as one example
Work Package 4, Task 4.3
What kind of assistance do YOU need?