At ING we needed a way to move data science models from exploration into production. I will give this talk from my experience as a senior Ops engineer on the exploration and production Hadoop environments. For this we use OpenShift to run Docker containers that connect to the big data Hadoop environment.
During this talk I will explain why we need this and how it is done at ING, and how to set up a Docker container running a data science model using Hive, Python, and Spark. I'll explain how to use Dockerfiles to build Docker images, how to add all the needed components inside a Docker image, and how to run different versions of software in different containers.
At the end I will also give a demo of how it runs and how it is automated: a Git webhook triggers Jenkins, which starts the Docker service that connects to the big data Hadoop environment.
This is going to be a great technical talk for engineers and data scientists.
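A minimal sketch of the kind of Dockerfile such a setup might use. This is purely illustrative: the base image, package versions, and paths are assumptions, not ING's actual configuration.

```dockerfile
# Hypothetical sketch: an image for a PySpark model that talks to Hive.
FROM python:3.9-slim

# The Spark client libraries need a Java runtime
RUN apt-get update && apt-get install -y --no-install-recommends default-jre-headless \
    && rm -rf /var/lib/apt/lists/*

# Pin versions per image so different models can run different software versions
RUN pip install --no-cache-dir pyspark==3.3.1 "pyhive[hive]"

# Cluster connection details (hive-site.xml, core-site.xml) are copied in
COPY conf/ /opt/hadoop-conf/
ENV HADOOP_CONF_DIR=/opt/hadoop-conf

COPY model/ /app/
WORKDIR /app
CMD ["python", "run_model.py"]
```

Because the Hadoop client configuration and the software versions live inside the image, the same container can be promoted unchanged from the exploration environment to production.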
Speaker
Lennard Cornelis, Ops Engineer, ING
Data Virtualization Reference Architectures: Correctly Architecting your Solu... - Denodo
Correctly Architecting your Solutions for Analytical & Operational Uses reviews the two main types of use cases that can be solved with the Denodo Platform. Both high concurrency scenarios and big reporting use cases are discussed in this presentation in a comparative way, explaining the different approaches that you must take to be successful in any situation.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/wdZgpo.
Extending DSpace 7: DSpace-CRIS and DSpace-GLAM for empowered repositories an... - 4Science
DSpace-CRIS is an extended version of DSpace that offers a powerful and flexible data model to describe not only publications but all research entities and their relationships. DSpace-CRIS 7 will feature a new Angular UI and REST API in addition to functionality for compliance with OpenAire, integrating publications from external sources, bidirectional ORCID integration, and synchronizing with other systems. DSpace-CRIS also extends data modeling capabilities and provides tools for data quality, metadata management, and extensibility.
Continuous Optimization for Distributed BigData Analysis - Kai Sasaki
This document discusses challenges with distributed data analysis and Treasure Data's approach to addressing them. Some key points:
- Distributed data analysis faces challenges around network bandwidth, throughput, data consistency, and reliability.
- Treasure Data uses a columnar storage format based on MessagePack to more efficiently save bandwidth and storage space.
- They implement time index pushdown to enable reading only relevant data within a time range, reducing network usage.
- Automatic optimization of partitioning layout and repartitioning aims to balance partition file size, time ranges, and keys to maximize performance and throughput while minimizing memory pressure.
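The time-index pushdown idea in the bullets above can be sketched in a few lines. This is an illustrative sketch, not Treasure Data's implementation; the partition paths and timestamps are made up.

```python
# Each partition file records the time range it covers, so a query over a
# time range only has to read partitions whose range overlaps it.
from dataclasses import dataclass

@dataclass
class Partition:
    path: str
    min_time: int  # earliest record timestamp in this partition (epoch seconds)
    max_time: int  # latest record timestamp

def partitions_to_read(partitions, query_start, query_end):
    """Return only partitions whose time range overlaps [query_start, query_end)."""
    return [p for p in partitions
            if p.max_time >= query_start and p.min_time < query_end]

partitions = [
    Partition("day=2020-01-01/part-0", 1577836800, 1577923199),
    Partition("day=2020-01-02/part-0", 1577923200, 1578009599),
    Partition("day=2020-01-03/part-0", 1578009600, 1578095999),
]

# A query over Jan 2 only needs to touch one of the three files.
selected = partitions_to_read(partitions, 1577923200, 1578009600)
```

The skipped partitions are never fetched over the network, which is where the bandwidth saving comes from; the repartitioning mentioned above keeps these time ranges and file sizes balanced so the pruning stays effective.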
Denodo Data Virtualization Platform: Scalability (session 3 from Architect to... - Denodo
Enterprise-wide deployments require an architecture that scales horizontally and can work in a geographically distributed environment. The Denodo Platform can scale for a single instance used for departmental projects all the way to enterprise-wide distributed clusters. This webinar will explain how the Denodo Platform can scale to handle the most demanding requirements and will provide examples of some actual deployment configurations.
More information and free registration for this webinar: http://goo.gl/ma3U5h
To learn more, click this link: http://go.denodo.com/a2a
Join the conversation at #Architect2Architect
Agenda:
Deployment Configurations
HA and Clustering
Geographically Distributed Configurations
Development Configurations
Globus: A Data Management Platform for Collaborative Research (CHPC 2019 - So... - Globus
Globus is a non-profit data management platform developed by the University of Chicago to increase the efficiency of data-driven research. It allows researchers to easily transfer large datasets, securely share data with collaborators across different storage systems, and automate the movement of data from instruments and compute facilities. Globus sees growing adoption with over 120 subscribers and has transferred over 768 petabytes of data.
Data platform architecture principles - ieee infrastructure 2020 - Julien Le Dem
This document discusses principles for building a healthy data platform, including:
1. Establishing explicit contracts between teams to define dependencies and service level agreements.
2. Abstracting the data platform into services for ingesting, storing, and processing data in motion and at rest.
3. Enabling observability of data pipelines through metadata collection and integration with tools like Marquez to provide lineage, availability, and change management visibility.
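The observability point above can be illustrated with a toy lineage store. This is a hedged sketch of the idea, not the Marquez API; the job and dataset names are invented.

```python
# Each job run reports which datasets it read and wrote; lineage is the
# graph those reports imply, which can then be walked for impact analysis.

class LineageStore:
    def __init__(self):
        # dataset -> (producing job, datasets it was derived from)
        self.writes = {}

    def record_run(self, job, inputs, outputs):
        for ds in outputs:
            self.writes[ds] = (job, list(inputs))

    def upstream(self, dataset):
        """All datasets this one transitively depends on."""
        deps, stack = set(), [dataset]
        while stack:
            entry = self.writes.get(stack.pop())
            if entry:
                for parent in entry[1]:
                    if parent not in deps:
                        deps.add(parent)
                        stack.append(parent)
        return deps

store = LineageStore()
store.record_run("ingest_orders", inputs=[], outputs=["raw.orders"])
store.record_run("clean_orders", inputs=["raw.orders"], outputs=["staging.orders"])
store.record_run("daily_report", inputs=["staging.orders"], outputs=["reports.daily"])
```

With this metadata collected, a question like "which upstream datasets does this report depend on?" becomes a graph walk instead of tribal knowledge, which is what makes change management and availability tracking possible.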
Data Virtualization in the Cloud: Accelerating Data Virtualization Adoption - Denodo
This presentation introduces our new product: Denodo Platform for AWS. You will see the current data virtualization landscape, the new cloud deployment options that are being introduced with the Denodo Platform 6.0 and some examples of when it will be useful to deploy Denodo in the cloud.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/PcvHmj.
The document summarizes a workshop between NASA, software developers, science communities, and data centers to discuss HDF and HDF-EOS tools. Key topics included interactions between these groups, technical details of EOSDIS and HDF-EOS, available and needed tools, resources for developers, and next steps to continue engagement through websites and future meetings.
American Water shares how bringing IoT to fleet management can provide value to the customer. In the utilities industry, fleet management plays a major part in the business. The front line is one of the largest parts of the business whether it is the field employees working on mains, or those working on the customers' property. American Water strives to provide the best customer experience and part of that includes improving the effectiveness of our fleet.
Currently, there is no insight or active feedback on the effectiveness of the routes or driving behaviors. As a PoC, American Water leveraged NiFi to track metrics against a simulated truck, showing the initial values in capturing this type of data.
Technologies: NiFi, Druid, Hive
Improve your SQL workload with observability - OVHcloud
Most of OVH's information systems run on relational databases (PostgreSQL, MySQL, MariaDB). In terms of volume, that is 400 databases holding more than 20 TB of data, spread across 60 clusters in two geographic regions and powering 3,000 applications.
How do we get visibility into the whole fleet? Better still, how do we let everyone follow the activity of their own database? That is the challenge we set for ourselves; one year later, we can share our experience.
What if observability were not just a buzzword, but had a real impact on production?
Running Dataverse repository in the European Open Science Cloud (EOSC) - vty
The document discusses Dataverse, an open source data repository software. It summarizes that Dataverse was developed by Harvard University, has a large community and development team, and is used by many countries as a data repository infrastructure. It then describes the SSHOC Dataverse project which aims to create a multilingual, standardized, and reusable open data infrastructure across several European countries. Finally, it notes that Dataverse is a reliable cloud service that enables FAIR data sharing and can be easily deployed by research organizations.
This document summarizes the work done to enhance the Geospatial Data Abstraction Library (GDAL) to better support NASA Earth Observing System (EOS) data products. It describes three phases of work: 1) a proof-of-concept ArcGIS plugin for product-specific HDF drivers, 2) generalized HDF drivers and an XML format, and 3) collaboration with GDAL developers utilizing HDF drivers and a Virtual Format (VRT) specification. The third phase highlights include enhanced generic functions, coordination with GDAL developers, testing across GIS clients, outreach to other data centers, and building tutorials. Future work areas are also outlined.
High Performance Data Lake with Apache Hudi and Alluxio at T3Go - Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
Trevor Zhang & Vino Yang (T3Go)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... - DataWorks Summit
Advanced Big Data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe) based SSDs, these designs, along with the default Big Data processing models, need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads in modern data centers.
The document describes the HDF Product Designer software tool. It was created to facilitate the design of interoperable scientific data products in HDF5 format. The tool allows intuitive editing of HDF5 objects and supports conventions like CF and ACDD. It also provides validation services to test file compliance. The goal is to help scientists design data products that follow standards and are easy for others to use.
Slides shown in Hedvig booth at VMworld 2016. Highlight scale-out, software-defined storage - both hyperscale and hyperconverged - for large VMware vSphere environments.
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - Optum - Data Driven Innovation
This document provides an overview and demonstration of Streamsets Data Collector (SDC) and SDC Edge for ingesting data from IoT devices and the edge. It discusses the challenges of ingesting data from distributed edge locations. It then describes the key features of SDC for designing flexible data flows with minimal coding. It also introduces SDC Edge, a lightweight agent for running SDC pipelines on edge devices. The presentation includes demonstrations of using SDC with Kafka and using SDC Edge to ingest and analyze data from Android devices and send it to Elasticsearch. It concludes with discussing additional topics and providing useful links.
Efficient and effective: can we combine both to realize high-value, open, sca... - Research Data Alliance
The document discusses the INDIGO-DataCloud project, which aims to develop an open source cloud platform for computing and data management tailored for science. It seeks to address gaps in interoperability, scalability, and data handling across public and private clouds. The project defined requirements from various scientific communities and developed components implementing its architecture to provide solutions for distributed computing and data resources.
John Readey presented on HDF5 in the cloud using HDFCloud. HDF5 can provide a cost-effective cloud infrastructure by paying for what is used rather than what may be needed. HDFCloud uses an HDF5 server to enable accessing HDF5 data through a REST API, allowing users to access large datasets without downloading entire files. It maps HDF5 objects to cloud object storage for scalable performance and uses Docker containers for elastic scaling.
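The REST-style access pattern described above can be sketched as follows. The endpoint layout, host name, and query parameters here are assumptions for illustration, not the documented API of any particular HDF5 server.

```python
# A client asks the server for one slice (hyperslab) of a dataset instead of
# downloading the whole HDF5 file; the server reads only the needed chunks
# from object storage.
from urllib.parse import urlencode

def slice_request_url(base_url, dataset_uuid, domain, start, stop):
    """Build the GET URL for one 1-D slice of a dataset (hypothetical layout)."""
    query = urlencode({"domain": domain, "select": f"[{start}:{stop}]"})
    return f"{base_url.rstrip('/')}/datasets/{dataset_uuid}/value?{query}"

url = slice_request_url(
    "https://hdf.example.com", "d-8ba9", "/shared/climate.h5", 0, 1000
)
```

The point of the design is visible in the URL itself: the selection travels with the request, so a client analyzing elements 0..999 of a multi-terabyte dataset never pulls the rest of the file over the network.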
The document summarizes updates on Hierarchical Data Formats (HDF) software releases and tools. It discusses the latest releases of HDF5 1.8.19 and 1.10.1, compatibility issues when moving to newer versions, updates on tools like HDF-Java and HDFView 3.0, supported compilers and systems, and a new compression library for interoperability. It invites readers to provide feedback on their needs.
Big Data Quickstart Series 3: Perform Data Integration - Alibaba Cloud
This document summarizes Derek Meng's presentation on data integration using Alibaba Cloud's MaxCompute big data platform. It discusses the general process of data integration including data acquisition, transformation, and governance. It provides an overview of MaxCompute basics, including its architecture, basic concepts such as projects and tables, and how to use MaxCompute's data channel and SQL. The document concludes with a brief introduction to DataWorks for data integration and a demo.
The document discusses the use of Semantic MediaWiki (SMW) by the IT department of the Lower Austrian provincial government for network documentation, open government data, and other projects. SMW is used to dynamically generate and semantically query documentation about the government's network infrastructure, publish open data on the intranet and internet, and document batch job and server information. The old static documentation methods are replaced with full-text search, reusable content, and generated reports using SMW.
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O... - DataWorks Summit
Today's most forward-thinking enterprises have all been forced to face similar data challenges: reliance on real-time data to better serve their customers and, subsequently, the requirement to comply with regulations that protect that data, one example being the General Data Protection Regulation (GDPR).
The solution to this emerging challenge is a tricky one – for companies like ING, this data governance challenge has been met with metadata, a consistent view across a large heterogeneous ecosystem and collaboration with an active open source community.
In this joint presentation, John Mertic, director of program management for ODPi, and Ferd Scheepers, Global Chief Information Architect of ING, will address the benefits of a vendor-neutral approach to data governance and the need for an open metadata standard, along with insight into how companies such as ING, IBM, Hortonworks, and more are delivering solutions to this challenge as an open source initiative.
Speakers
John Mertic, Director of Program Management for ODPi, R Consortium, and Open Mainframe Project, The Linux Foundation
Maryna Strelchuk, Information Architect, ING
This document discusses the 5 year evolution of Dataverse, an open source data repository platform. It began as a tool for collaborative data curation and sharing within research teams. Over time, features were added like dataset version control, APIs, and integration with other systems. The document outlines challenges around maintenance and sustainability. It also covers efforts to improve Dataverse's interoperability, such as integrating metadata standards and controlled vocabularies, and making datasets FAIR compliant. The goal is to establish Dataverse as a core component of the European Open Science Cloud by improving areas like software quality, integration with tools, and standardization.
DataverseEU: Building Multilingual infrastructure for the Social Sciences in... - vty
This document discusses the DataverseEU project, which aims to build a multilingual infrastructure for social science data in Europe using the Dataverse platform. Key points:
- The project is led by DANS and funded by CESSDA to promote sharing of social science research data across Europe.
- Technical development includes a Docker module to deploy Dataverse in the cloud, multilingual interfaces in several European languages, and a plugin to integrate various persistent identifier services.
- The Docker module allows hosting unlimited Dataverses on different ports and building multilingual interfaces. It decomposes Dataverse into separate database, search, and application containers.
- The da|ra PID plugin will allow service providers to switch between identifier
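The container decomposition described above might look roughly like the following docker-compose sketch. Service names, images, and ports are illustrative assumptions, not the project's actual Docker module.

```yaml
# Hypothetical sketch: Dataverse split into database, search, and application
# containers, each independently deployable and scalable.
version: "3"
services:
  database:
    image: postgres:13
    environment:
      POSTGRES_DB: dataverse
  search:
    image: solr:8
  application:
    image: dataverse/app   # hypothetical image name
    ports:
      - "8080:8080"        # host port can vary per hosted Dataverse instance
    depends_on:
      - database
      - search
```

Separating the concerns this way is what allows several Dataverse instances, each on its own port and with its own language configuration, to share one host.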
Data platform architecture principles - ieee infrastructure 2020Julien Le Dem
This document discusses principles for building a healthy data platform, including:
1. Establishing explicit contracts between teams to define dependencies and service level agreements.
2. Abstracting the data platform into services for ingesting, storing, and processing data in motion and at rest.
3. Enabling observability of data pipelines through metadata collection and integration with tools like Marquez to provide lineage, availability, and change management visibility.
Data Virtualization in the Cloud: Accelerating Data Virtualization AdoptionDenodo
This presentation introduces our new product: Denodo Platform for AWS. You will see the current data virtualization landscape, the new cloud deployment options that are being introduced with the Denodo Platform 6.0 and some examples of when it will be useful to deploy Denodo in the cloud.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/PcvHmj.
The document summarizes a workshop between NASA, software developers, science communities, and data centers to discuss HDF and HDF-EOS tools. Key topics included interactions between these groups, technical details of EOSDIS and HDF-EOS, available and needed tools, resources for developers, and next steps to continue engagement through websites and future meetings.
American Water shares how bringing IoT to fleet management can provide value to the customer. In the utilities industry, fleet management plays a major part in the business. The front line is one of the largest parts of the business whether it is the field employees working on mains, or those working on the customers' property. American Water strives to provide the best customer experience and part of that includes improving the effectiveness of our fleet.
Currently, there is no insight or active feedback on the effectiveness of the routes or driving behaviors. As a PoC, American Water leveraged NiFi to track metrics against a simulated truck, showing the initial values in capturing this type of data.
Technologies: NiFi, Druid, Hive
Improve your SQL workload with observabilityOVHcloud
La majeure partie du SI d'OVH repose sur des bases de données relationnelles (PostgreSQL, MySQL, MariaDB). En termes de volumétrie cela représente 400 bases pesants plus de 20To de données réparties sur 60 clusters dans deux zones géographiques le tout propulsant 3000 applications.
Comment tout voir dans notre parc ? Mieux encore, comment faire pour que tout le monde puisse suivre l'activité de sa base de données ? C'est le challenge que nous nous sommes fixés, un an après nous pouvons partager notre expérience.
Et si l'observability n'était pas juste un buzzword, mais avait un réel impact sur la production ?
Running Dataverse repository in the European Open Science Cloud (EOSC)vty
The document discusses Dataverse, an open source data repository software. It summarizes that Dataverse was developed by Harvard University, has a large community and development team, and is used by many countries as a data repository infrastructure. It then describes the SSHOC Dataverse project which aims to create a multilingual, standardized, and reusable open data infrastructure across several European countries. Finally, it notes that Dataverse is a reliable cloud service that enables FAIR data sharing and can be easily deployed by research organizations.
This document summarizes the work done to enhance the Geospatial Data Abstraction Library (GDAL) to better support NASA Earth Observing System (EOS) data products. It describes three phases of work: 1) a proof-of-concept ArcGIS plugin for product-specific HDF drivers, 2) generalized HDF drivers and an XML format, and 3) collaboration with GDAL developers utilizing HDF drivers and a Virtual Format (VRT) specification. The third phase highlights include enhanced generic functions, coordination with GDAL developers, testing across GIS clients, outreach to other data centers, and building tutorials. Future work areas are also outlined.
High Performance Data Lake with Apache Hudi and Alluxio at T3GoAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
Trevor Zhang & Vino Yang (T3Go)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
The document describes the HDF Product Designer software tool. It was created to facilitate the design of interoperable scientific data products in HDF5 format. The tool allows intuitive editing of HDF5 objects and supports conventions like CF and ACDD. It also provides validation services to test file compliance. The goal is to help scientists design data products that follow standards and are easy for others to use.
Slides shown in Hedvig booth at VMworld 2016. Highlight scale-out, software-defined storage - both hyperscale and hyperconverged - for large VMware vSphere environments.
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - OptumData Driven Innovation
This document provides an overview and demonstration of Streamsets Data Collector (SDC) and SDC Edge for ingesting data from IoT devices and the edge. It discusses the challenges of ingesting data from distributed edge locations. It then describes the key features of SDC for designing flexible data flows with minimal coding. It also introduces SDC Edge, a lightweight agent for running SDC pipelines on edge devices. The presentation includes demonstrations of using SDC with Kafka and using SDC Edge to ingest and analyze data from Android devices and send it to Elasticsearch. It concludes with discussing additional topics and providing useful links.
Efficient and effective: can we combine both to realize high-value, open, sca...Research Data Alliance
The document discusses the INDIGO-DataCloud project, which aims to develop an open source cloud platform for computing and data management tailored for science. It seeks to address gaps in interoperability, scalability, and data handling across public and private clouds. The project defined requirements from various scientific communities and developed components implementing its architecture to provide solutions for distributed computing and data resources.
John Readey presented on HDF5 in the cloud using HDFCloud. HDF5 can provide a cost-effective cloud infrastructure by paying for what is used rather than what may be needed. HDFCloud uses an HDF5 server to enable accessing HDF5 data through a REST API, allowing users to access large datasets without downloading entire files. It maps HDF5 objects to cloud object storage for scalable performance and uses Docker containers for elastic scaling.
The document summarizes updates on Hierarchical Data Formats (HDF) software releases and tools. It discusses the latest releases of HDF5 1.8.19 and 1.10.1, compatibility issues when moving to newer versions, updates on tools like HDF-Java and HDFView 3.0, supported compilers and systems, and a new compression library for interoperability. It invites readers to provide feedback on their needs.
Big Data Quickstart Series 3: Perform Data IntegrationAlibaba Cloud
This document summarizes Derek Meng's presentation on data integration using Alibaba Cloud's MaxCompute big data platform. It discusses the general process of data integration including data acquisition, transformation, and governance. It provides an overview of MaxCompute basics, including its architecture, basic concepts such as projects and tables, and how to use MaxCompute's data channel and SQL. The document concludes with a brief introduction to DataWorks for data integration and a demo.
The document discusses the use of Semantic MediaWiki (SMW) by the IT department of the Lower Austrian provincial government for network documentation, open government data, and other projects. SMW is used to dynamically generate and semantically query documentation about the government's network infrastructure, publish open data on the intranet and internet, and document batch job and server information. The old static documentation methods are replaced with full-text search, reusable content, and generated reports using SMW.
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...DataWorks Summit
Each of today’s most forward-thinking enterprises have been forced to face similar data challenges: the reliance on real-time data to better serve their customers and, subsequently, the requirement of complying with regulations to protect that data – one example being the General Data Protection Regulation (GDPR).
The solution to this emerging challenge is a tricky one – for companies like ING, this data governance challenge has been met with metadata, a consistent view across a large heterogeneous ecosystem and collaboration with an active open source community.
This joint presentation, John Mertic – director of program management for ODPi – and Ferd Scheepers – Global Chief Information Architect of ING – will address the benefits of a vendor-neutral approach to data governance, the need for an open metadata standard, along with insight around how companies ING, IBM, Hortonworks and more are delivering solutions to this challenge as an open source initiative.
Speakers
John Mertic, Director of Program Management for ODPi, R Consortium, and Open Mainframe Project, The Linux Foundation
Maryna Strelchuk, Information Architect, ING
This document discusses the 5 year evolution of Dataverse, an open source data repository platform. It began as a tool for collaborative data curation and sharing within research teams. Over time, features were added like dataset version control, APIs, and integration with other systems. The document outlines challenges around maintenance and sustainability. It also covers efforts to improve Dataverse's interoperability, such as integrating metadata standards and controlled vocabularies, and making datasets FAIR compliant. The goal is to establish Dataverse as a core component of the European Open Science Cloud by improving areas like software quality, integration with tools, and standardization.
DataverseEU: Building Multilingual infrastructure for the Social Sciences in...vty
This document discusses the DataverseEU project, which aims to build a multilingual infrastructure for social science data in Europe using the Dataverse platform. Key points:
- The project is led by DANS and funded by CESSDA to promote sharing of social science research data across Europe.
- Technical development includes a Docker module to deploy Dataverse in the cloud, multilingual interfaces in several European languages, and a plugin to integrate various persistent identifier services.
- The Docker module allows hosting unlimited Dataverses on different ports and building multilingual interfaces. It decomposes Dataverse into separate database, search, and application containers.
- The da|ra PID plugin will allow service providers to switch between identifier services.
Nanda Vijaydev, BlueData - Deploying H2O in Large Scale Distributed Environme...Sri Ambati
This session was recorded in San Francisco on February 5th, 2019 and can be viewed here: https://youtu.be/CgoxjmdyMiU
This session will discuss how to get up and running quickly with containerized H2O environments (H2O Flow, Sparkling Water, and Driverless AI) at scale, in a multi-tenant architecture with a shared pool of resources using CPUs and/or GPUs. See how you can spin up (and tear down) your H2O environments on demand, with just a few mouse clicks. Find out how to enable quota management of GPU resources for greater efficiency, and easily connect your compute to your datasets for large-scale distributed machine learning. Learn how to operationalize your machine learning pipelines and deliver faster time-to-value for your AI initiative — while ensuring enterprise-grade security and high performance.
Bio: Nanda Vijaydev is senior director of solutions at BlueData (now HPE) - where she leverages technologies like Hadoop, Spark, and TensorFlow to build solutions for enterprise analytics and machine learning use cases. Nanda has 10 years of experience in data management and data science. Previously, she worked on data science and big data projects in multiple industries, including healthcare and media; was a principal solutions architect at Silicon Valley Data Science; and served as director of solutions engineering at Karmasphere. Nanda has an in-depth understanding of the data analytics and data management space, particularly in the areas of data integration, ETL, warehousing, reporting, and machine learning.
Tutorial Workgroup - Model versioning and collaborationPascalDesmarets1
Hackolade Studio has native integration with Git repositories to provide state-of-the-art collaboration, versioning, branching, conflict resolution, peer review workflows, change tracking and traceability. Most importantly, it allows teams to co-locate data models and schemas with application code, and to integrate further with DevOps CI/CD pipelines as part of our vision for Metadata-as-Code.
Co-located application code and data models provide the single source-of-truth for business and technical stakeholders.
The document discusses modernizing a data warehouse using the Microsoft Analytics Platform System (APS). APS is described as a turnkey appliance that allows organizations to integrate relational and non-relational data in a single system for enterprise-ready querying and business intelligence. It provides a scalable solution for growing data volumes and types that removes limitations of traditional data warehousing approaches.
QuerySurge Slide Deck for Big Data Testing WebinarRTTS
This is a slide deck from QuerySurge's Big Data Testing webinar.
Learn why Testing is pivotal to the success of your Big Data Strategy .
Learn more at www.querysurge.com
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data, Hadoop and NoSQL. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.
This information is geared towards:
- Big Data & Data Warehouse Architects,
- ETL Developers
- ETL Testers, Big Data Testers
- Data Analysts
- Operations teams
- Business Intelligence (BI) Architects
- Data Management Officers & Directors
You will learn how to:
- Improve your Data Quality
- Accelerate your data testing cycles
- Reduce your costs & risks
- Provide a huge ROI (as high as 1,300%)
Vmware Serengeti - Based on Infochimps IronfanJim Kaskade
This document discusses virtualizing Hadoop for the enterprise. It begins with discussing trends driving changes in enterprise IT like cloud, mobile apps, and big data. It then discusses how Hadoop can address big, fast, and flexible data needs. The rest of the document discusses how virtualizing Hadoop through solutions like Project Serengeti can provide enterprises with elasticity, high availability, and operational simplicity for their Hadoop implementations. It also discusses how virtualization allows enterprises to integrate Hadoop with other workloads and data platforms.
How AD has been re-engineered to extend to the cloudLDAPCon
1. Windows Server Active Directory (AD) has evolved over three main identity models as organizations' needs have changed with technology. Azure Active Directory (AAD) represents the third generation identity ecosystem model.
2. AAD is a cloud-based identity and access management service that is not the same as on-premises AD. It provides identity management as a service and can synchronize with on-premises directories.
3. Key capabilities of AAD include providing a single identity for multiple applications, managing access to cloud apps, monitoring access to enterprise apps, and providing personalized access to applications for users.
Dataverse can be deployed using Docker containers to improve maintainability and portability. The document discusses how Docker can isolate applications and their dependencies into portable containers. It provides an example of deploying Dataverse as a set of microservices within Docker containers. Instructions are included on building Docker images, running containers, and managing the containers and images through commands and tools like Docker Desktop, Docker Hub, and Docker Compose.
FIWARE provides an open standard for managing context and digital twin data to enable the development of smart solutions across multiple sectors. The FIWARE context broker uses NGSI APIs to integrate data from different sources and build a digital twin representation of the real world. Smart data models define common data models for different domains to increase interoperability and reduce development costs when building smart applications. The smart data models initiative is led by several organizations and aims to create a community for defining and maintaining open data models using an agile process.
The Fastest Way to Redis on Pivotal Cloud FoundryVMware Tanzu
What do developers choose when they need a fast performing datastore with a flexible data model? Hands-down, they choose Redis.
But, waiting for a Redis instance to be set up is not a favorite activity for many developers. This is why on-demand services for Redis have become popular. Developers can start building their applications with Redis right away. There is no fiddling around with installing, configuring, and operating the service.
Redis for Pivotal Cloud Foundry offers dedicated and pre-provisioned service plans for Cloud Foundry developers that work in any cloud. These plans are tailored for typical patterns such as application caching and providing an in-memory datastore. These cover the most common requirements for developers creating net new applications or who are replatforming existing Redis applications.
We'd like to invite you to a webinar discussing different ways to use Redis in cloud-native applications. We'll cover:
- Use cases and requirements for developers
- Alternative ways to access and manage Redis in the cloud
- Features and roadmap of Redis for Pivotal Cloud Foundry
- Quick demo
Presenters: Greg Chase, Director of Products, Pivotal and Craig Olrich, Platform Architect, Pivotal
SOLID Programming with Portable Class LibrariesVagif Abilov
Developers often don't pay attention to code portability until they need to target multiple platforms. However, a large amount of non-portable code often hints at violations of clean code principles, so it is worth investigating which parts of the source code base are platform-specific and for what reasons.
In this session we will give an overview of portable class libraries, show how to extract PCL components from a real-world application and go through typical challenges that are faced when writing portable code. We will present the original tool that analyzes assemblies for portability compliance and can be used as a guard to prevent mixing business logic with infrastructure-specific functionality. Finally we will demonstrate how PCLs help targeting platforms such as Windows Store, Android and iOS.
EOSC-hub brings together multiple service providers to create the Hub: a single contact point for European researchers and innovators to discover, access, use and reuse a broad spectrum of resources for advanced data-driven research.
This presentation introduces the services on offer to scientists of all disciplines
This document discusses different options for deploying a Hadoop cluster, including using an appliance like Oracle's Big Data Appliance, deploying on cloud infrastructure through Amazon EMR, or building your own "do-it-yourself" cluster. It provides details on the hardware, software, and costs associated with each option. The conclusion compares the pros and cons of each approach, noting that appliances provide high performance and integration but may be less flexible, while cloud deployments offer scalability and pay-per-use but require consideration of data privacy. Building your own cluster gives more control but requires more work to set up and manage.
A Successful Journey to the Cloud with Data VirtualizationDenodo
Watch full webinar here: https://bit.ly/3mPLIlo
A shift to the cloud is a common element of any current data strategy. However, a successful transition to the cloud is not easy and can take years. It comes with security challenges, changes in downstream and upstream applications, and new ways to operate and deploy software. An abstraction layer that decouples data access from storage and processing can be a key element to enable a smooth journey to the cloud.
Attend this webinar to learn more about:
- How to use Data Virtualization to gradually change data systems without impacting business operations
- How Denodo integrates with the larger cloud ecosystems to enable security
- How simple it is to create and manage a Denodo cloud deployment
Webinar: DataStax Enterprise 5.0 What’s New and How It’ll Make Your Life EasierDataStax
Want help building applications with real-time value at epic scale? How about solving your database performance and availability issues? Then, you want to hear more about DataStax Enterprise 5.0. Join this webinar to learn what’s new in DSE 5.0 ‒ the largest software release to date at DataStax. DSE 5.0 introduces multi-model support including Graph and JSON data models along with a ton of new and enhanced enterprise database capabilities.
View webinar recording here: https://youtu.be/3pfm4ntASJ0
This document discusses Data as a Service (DaaS) in cloud computing. It defines DaaS and explains that it allows users to access data stored in the cloud from any location. The document outlines the components, architecture, pricing models, benefits and drawbacks of DaaS. It provides examples of companies that offer DaaS like Google, Windows Azure, and Amazon.
Similar to Persistent identifiers in DataverseEU project (20)
Decentralised identifiers and knowledge graphs vty
Building an Operating System for Open Science: data integration challenges, Dataverse data repository and knowledge graphs. Lecture by Slava Tykhonov, DANS-KNAW, for the Journées Scientifiques de Rochebrune 2023 (JSR'23).
Decentralised identifiers for CLARIAH infrastructure vty
Slides of the presentation for CLARIAH community on the ideas how to make controlled vocabularies sustainable and FAIR (Findable, Accessible, Interoperable, Reusable) with the help of Decentralized Identifiers (DIDs).
Dataverse repository for research data in the COVID-19 Museumvty
The Covid-19 Museum has an ambition to create a platform to deposit, consult, aggregate and study heterogeneous data about the pandemics using features of a distributed web service. To achieve this purpose, Dataverse has been selected as a reliable FAIR data repository with built-in search engine and functionality that allows adding computing resources to explore archived resources both on data and metadata. Presentation by
Slava Tykhonov, DANS-KNAW (The Royal Netherlands Academy of Arts and Sciences). Université Paris Cité, 19 April 2022.
Building collaborative Machine Learning platform for Dataverse network. Lecture by Slava Tykhonov (DANS-KNAW, the Netherlands), DANS seminar series, 29.03.2022
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...vty
Presentation at ISKO Knowledge Organisation Research Observatory. RESEARCH REPOSITORIES AND DATAVERSE: NEGOTIATING METADATA, VOCABULARIES AND DOMAIN NEEDS
The presentation for the W3C Semantic Web in Health Care and Life Sciences community group by Slava Tykhonov, DANS-KNAW, the Royal Netherlands Academy of Arts and Sciences (October 2020). The recording is available https://www.youtube.com/watch?v=G9oiyNM_RHc
CLARIN CMDI use case and flexible metadata schemes vty
Presentation for CLARIAH IG Linked Open Data on the latest developments for Dataverse FAIR data repository. Building SEMAF workflow with external controlled vocabularies support and Semantic API. Using the theory of inventive problem solving TRIZ for the further innovation in Linked Data.
Flexible metadata schemes for research data repositories - CLARIN Conference'21vty
The development of the Common Framework in Dataverse and the CMDI use case. Building AI/ML based workflow for the prediction and linking concepts from external controlled vocabularies to the CMDI metadata values.
Controlled vocabularies and ontologies in Dataverse data repositoryvty
This document discusses supporting external controlled vocabularies in Dataverse. It proposes implementing a JavaScript interface to allow linking metadata fields to terms from external vocabularies accessed via SKOSMOS APIs. Several challenges are identified, such as applying support to any field, backward compatibility, and ensuring vocabularies come from authoritative sources. Caching concepts and linking dataset files directly to terms are also proposed to improve interoperability.
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...vty
This document summarizes a presentation about automating CI/CD testing, installation, and deployment of Dataverse in the European Open Science Cloud. It discusses using Docker and Kubernetes for deployment, a community-driven QA plan using pyDataverse for test automation, and providing quality assurance as a service. The presentation also covers topics like the CESSDA maturity model, integrating Dataverse on Google Cloud, and using serverless computing for some Dataverse applications and services.
Building COVID-19 Museum as Open Science Projectvty
This document discusses building a COVID-19 Museum as an open science project. It describes the speaker's background working on various data management projects. It discusses moving towards open science and sharing data according to FAIR principles. It outlines the Time Machine project for digitizing historical documents and its approach to data management. The rest of the document discusses using the Dataverse platform to build repositories, linking metadata to ontologies, using tools like Weblate for translations, and exploring the use of artificial intelligence and machine learning to enhance metadata and facilitate human-in-the-loop review processes.
External controlled vocabularies support in Dataversevty
This presentation discusses adding support for external controlled vocabularies to the Dataverse data repository platform. It describes how ontologies like SKOS can be used to represent vocabularies and allow linking metadata fields in Dataverse to terms. The presentation proposes developing a Semantic Gateway plugin for Dataverse that would allow browsing and linking to external vocabularies hosted in the SKOSMOS framework via its API. This could improve metadata by allowing standardized, linked terms and help make data more FAIR.
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataversevty
This presentation is about external CVs support in Dataverse, Open Source data repository. Data Archiving and Networked Services (DANS-KNAW) decided to use Dataverse as a basic technology to build Data Stations and provide FAIR data services for various Dutch research communities.
Ontologies, controlled vocabularies and Dataversevty
Presentation on Semantic Web technologies for Dataverse Metadata Working Group running by Institute for Quantitative Social Science (IQSS) of Harvard University.
1. dans.knaw.nl
DANS is an institute of KNAW and NWO
PIDs in CESSDA DataverseEU
Vyacheslav Tykhonov
Senior Information Scientist (DANS),
DataverseEU lead developer
CESSDA PID workshop,
20.03.2018
2. DataverseEU development model
• We are not going to create a new fork of Dataverse; our contributions should go to the master branch hosted by IQSS at Harvard
• Delivered as Docker images and deployed in Google Cloud as the CESSDA Dataverse repository
• Any service provider can host a separate Dataverse instance in its own cloud if required
• Metadata from other CESSDA repositories will be harvested by the central DataverseEU repository
• Easy to add new languages if more partners join during or after the project
3. DataverseEU tasks overview
• Development of a multilingual web interface (German, Slovenian, Swedish, Hungarian, Italian, French, Spanish)
• Support of a localized metadata model corresponding to CESSDA CMM
• Design and development of a PID plugin mechanism to allow service providers to choose the PID service of their preference
• Development of APIs for CESSDA CVs and Topic Classification based on CESSDA CV Manager services
6. PID structure in Dataverse
Every PID contains:
• Prefix: a unique authority (the ID of an institution or organization)
• Separator
• A sequence of characters or numbers identifying the dataset
Examples:
<PID> ::= <Naming Authority> "/" <Handle Local Name>
doi:10.4232/1.0001 (DOI)
hdl:10411/KL0X8C (handle)
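The prefix/separator/suffix structure above can be sketched as a small parser. This is a hypothetical helper for illustration, not part of Dataverse itself:

```python
def parse_pid(pid: str):
    """Split a PID such as 'doi:10.4232/1.0001' or 'hdl:10411/KL0X8C'
    into (protocol, naming authority, local name)."""
    protocol, _, rest = pid.partition(":")
    if protocol not in ("doi", "hdl"):
        raise ValueError(f"unknown PID protocol: {protocol!r}")
    # The '/' is the separator between the authority prefix and the dataset ID
    authority, sep, local_name = rest.partition("/")
    if not sep:
        raise ValueError(f"missing '/' separator in PID: {pid!r}")
    return protocol, authority, local_name

print(parse_pid("doi:10.4232/1.0001"))  # ('doi', '10.4232', '1.0001')
print(parse_pid("hdl:10411/KL0X8C"))    # ('hdl', '10411', 'KL0X8C')
```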
7. Dataverse PID Plugin requirements
• We need a flexible way to switch between PID service providers (da|ra, DataCite, handle)
• Registering DOIs with da|ra will give data providers greater visibility and recognition, as data references will be integrated in the da|ra search index
• Different data archives can get separate prefixes within the same Dataverse instance and increase their visibility and recognition
• The PID Plugin can be used in combination with an external storage configuration (based on Swift) to host data locally in national infrastructures
8. Current implementation of PID service
• Out-of-the-box support for DOIs (DataCite) and handles (handle.net)
• A single Dataverse instance can be bound either to DOI or to handle, not both
• It is not possible to use separate prefixes for different organisations (DataverseNL is hdl:10411 for all partners)
• Switching between DOIs and handles can be done by executing API requests:
curl -X PUT -d hdl "http://localhost:8080/api/admin/settings/:Protocol"
curl -X PUT -d 10411 "http://localhost:8080/api/admin/settings/:Authority"
curl -X PUT -d doi "http://localhost:8080/api/admin/settings/:Protocol"
curl -X PUT -d 10.5072/FK2 "http://localhost:8080/api/admin/settings/:Authority"
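The same protocol/authority switch can be scripted instead of typed by hand. A minimal Python sketch of those PUT calls, assuming the same local admin endpoint as the curl examples (the request sending is left commented out so the snippet is safe to run without a Dataverse instance):

```python
import urllib.request

# Local Dataverse admin API, as in the curl examples above
BASE = "http://localhost:8080/api/admin/settings"

def setting_request(name: str, value: str) -> urllib.request.Request:
    """Build the PUT request corresponding to one curl call above."""
    return urllib.request.Request(
        f"{BASE}/{name}", data=value.encode(), method="PUT"
    )

def switch_protocol(protocol: str, authority: str):
    """Return the two requests needed to change the PID protocol and authority."""
    return [setting_request(":Protocol", protocol),
            setting_request(":Authority", authority)]

# Switch a test instance to handles under authority 10411:
for req in switch_protocol("hdl", "10411"):
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req)  # uncomment to actually apply the setting
```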
9. The Dataset Lifecycle
(Diagram: create, update, publish, and destroy transitions in the dataset lifecycle. Published versions can be de-accessioned at any time; unpublished versions (drafts) can also be deleted.)
Credits: Felix Bensmann (GESIS). Supporting New PID Providers in Dataverse
10. PID assigning strategies
• A different PID for every new version of a dataset (da|ra)
• The same PID for the dataset, shared by all versions (DOI, handle)
The central idea of the PID plugin: every service provider can choose the PID assignment strategy that best fits their needs.
Warning: different communities need different strategies!
11. PID strategies based on community needs
• Sharing data via the Archive: dataset files are not changing, PIDs are different
• Research data deposit (Ph.D. students): obligation to make the data of a thesis or study publicly available; work in progress, the PID stays the same
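The two assignment strategies above can be sketched in a few lines. All names here are hypothetical, and 10.5072 is used only as a placeholder test prefix:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    base_id: str
    version: int = 1
    pids: list = field(default_factory=list)

def assign_pid(ds: Dataset, strategy: str) -> str:
    """Assign a PID on publish. 'per-version' mints a new PID for each
    version (da|ra style); 'shared' reuses one PID for all versions
    (the DOI/handle behaviour described above)."""
    if strategy == "per-version":
        pid = f"doi:10.5072/{ds.base_id}.{ds.version}"
        ds.pids.append(pid)
    elif strategy == "shared":
        pid = f"doi:10.5072/{ds.base_id}"
        if pid not in ds.pids:
            ds.pids.append(pid)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return pid

archive = Dataset("KL0X8C")
assign_pid(archive, "per-version")   # doi:10.5072/KL0X8C.1
archive.version += 1
assign_pid(archive, "per-version")   # doi:10.5072/KL0X8C.2
print(archive.pids)                  # two distinct PIDs, one per version
```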
12. The same PID for all versions of a dataset:
an example of the difference between versions
At the same time there is no support for version-level granularity; for example:
https://dataverse.nl/dataset.xhtml?persistentId=hdl:10411/KL0X8C&version=2.0
is not https://dataverse.nl/dataset.xhtml?persistentId=hdl:10411/KL0X8C.2
13. PID Plugin features
• Developed by GESIS and designed as an extra module that can be added to a running Dataverse application
• Functionality provided by the PID Plugin is triggered by events:
• Creation (onCreate)
• Update (onUpdate)
• Publication (onPublish)
• Deaccession
• Destruction
• Lookup (lookupDoi, getProviderName)
• All settings are controlled via the Dataverse API (suffix)
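The event-driven design can be sketched as a provider interface with one handler per lifecycle event. This is an illustrative Python sketch modeled on the hook names listed above; the actual plugin is a Java module inside Dataverse, and the class and method names here are assumptions:

```python
class PidProvider:
    """Sketch of a PID plugin: one handler per dataset lifecycle event."""
    def get_provider_name(self) -> str: ...
    def lookup_doi(self, pid: str): ...
    def on_create(self, dataset): ...
    def on_update(self, dataset): ...
    def on_publish(self, dataset): ...
    def on_deaccession(self, dataset): ...
    def on_destroy(self, dataset): ...

class DataCiteProvider(PidProvider):
    """A provider only overrides the events it cares about."""
    def get_provider_name(self) -> str:
        return "DataCite"

    def on_publish(self, dataset) -> str:
        # Register the PID with the external service on publication
        return f"registered {dataset['pid']} with {self.get_provider_name()}"

provider = DataCiteProvider()
print(provider.on_publish({"pid": "doi:10.5072/FK2/ABCDE"}))
```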
14. PID Plugin registration with da|ra
• Own XML schema
• Mandatory and optional fields
• Every dataset update will create a new PID
• The metadata schema of service providers should be synchronized with the da|ra schema
<xml>
<metadata>
<ID>ABCDE</ID>
<version>v1.0</version>
<doi>
auth.ority/DV/ABCDE
</doi>
<title>title</title>
<url>https://…</url>
</metadata>
</xml>
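A registration document like the one above can be generated programmatically. The sketch below reproduces the simplified structure shown on the slide; the real da|ra schema has many more mandatory and optional fields, and the URL is a placeholder since the slide elides it:

```python
import xml.etree.ElementTree as ET

def dara_metadata(dataset_id: str, version: str, doi: str,
                  title: str, url: str) -> str:
    """Build the simplified da|ra registration XML shown on the slide."""
    root = ET.Element("xml")
    md = ET.SubElement(root, "metadata")
    for tag, text in [("ID", dataset_id), ("version", version),
                      ("doi", doi), ("title", title), ("url", url)]:
        ET.SubElement(md, tag).text = text
    return ET.tostring(root, encoding="unicode")

doc = dara_metadata("ABCDE", "v1.0", "auth.ority/DV/ABCDE",
                    "title", "https://example.org/dataset")  # placeholder URL
print(doc)
```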