Data Linkage is an important step that can provide valuable insights for evidence-based decision making, especially for crucial events. Performing sensible queries across heterogeneous databases containing millions of records is a complex task that requires a complete understanding of each contributing database’s schema to define the structure of its information. The key aim is to approximate the structure and content of the induced data into a concise synopsis in order to extract and link meaningful data-driven facts. We identify four major research issues in Data Linkage: the costs associated with pairwise matching, record matching overheads, restrictions on the semantic flow of information, and the limitations of single-order classification. In this paper, we give a literature review of research in Data Linkage. The purpose of this review is to establish a basic understanding of Data Linkage and to discuss the background of the Data Linkage research domain. In particular, we focus on the literature related to recent advancements in Approximate Matching algorithms at the attribute level and the structure level. Their efficiency, functionality and limitations are critically analysed, and open problems are exposed.
The Road to Open Data Enlightenment Is Paved With Nice Excuses (Toon Vanagt)
The road to open data enlightenment is paved with nice excuses! These slides include 11 open data revenue models for government agencies that 'pragmatically' need to keep generating revenue as 'authentic sources'. This presentation was delivered by Toon Vanagt from https://data.be as the opening keynote of the 'opening-up' conference in Brussels on 3/12/2014.
No sql databases new millennium database for big data, big users, cloud compu... (eSAT Publishing House)
IJRET : International Journal of Research in Engineering and Technology is an international peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Presentation delivered by Ludo Hendrickx and Joris Beek on 11 December 2013, in Dutch, at the Ministry of the Interior, The Hague, The Netherlands. More information on: https://joinup.ec.europa.eu/community/ods/description
Role of Data Cleaning in Data Warehouse (Ramakant Soni)
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.
Annotation Approach for Document with Recommendation (ijmpict)
An enormous number of organizations generate and share textual descriptions of their products, facilities, and activities. Such collections of textual data contain a significant amount of structured information, which remains buried in the unstructured text. While information extraction systems simplify the extraction of structured relations, they are frequently expensive and inaccurate, particularly when working on top of text that does not contain any examples of the targeted structured data. We propose an alternative methodology that simplifies structured metadata generation by recognizing documents that are likely to contain information of interest; this data is beneficial for querying the database. Moreover, we present algorithms to extract attribute-value pairs and devise new mechanisms to map such pairs to manually created schemas. We apply a clustering technique to the item content information to complement the user rating information, which improves the accuracy of collaborative similarity and addresses the cold-start problem.
Improving Service Recommendation Method on Map reduce by User Preferences and Reviews (paperpublications3)
Abstract: Service recommender systems have been shown to be valuable tools for providing appropriate recommendations to users. In the last decade, the number of customers, services and the amount of online information has grown rapidly, yielding a big data analysis problem for service recommender systems. Consequently, traditional service recommender systems often suffer from scalability and inefficiency problems. Most existing service recommender systems present the same ratings and rankings of services to different users without considering diverse users' preferences, and therefore fail to meet users' personalized requirements. In this paper, we address the above challenges by presenting a personalized service recommendation list and recommending the most appropriate services to users effectively. Specifically, keywords are used to indicate users' preferences, and a user-based collaborative filtering algorithm is adopted to generate appropriate recommendations. Keywords: recommender system, user preference, keyword, Big Data, MapReduce, Hadoop.
Title: Improving Service Recommendation Method on Map reduce by User Preferences and Reviews
Author: Dayanand Bhovi, Mr. Ashwin Kumar
ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Paper Publications
Enabling Self-service Data Provisioning Through Semantic Enrichment of Data |... (Ahmad Assaf)
Publicly available datasets contain knowledge from various domains such as encyclopedic, government, geographic, entertainment and so on. The increasing diversity of these datasets makes it difficult to annotate them with a fixed number of pre-defined tags. Moreover, manually entered tags are subjective and may not capture their essence and breadth. We propose a mechanism to automatically attach meta information to data objects by leveraging knowledge bases like DBpedia and Freebase which facilitates data search and acquisition for business users.
Linked Open Data (LOD) has emerged as one of the largest collections of interlinked datasets on the web. In order to benefit from this mine of data, one needs access to descriptive information about each dataset (or metadata). This metadata enables dataset discovery, understanding, integration and maintenance. Data portals, which are datasets' access points, offer metadata represented in different and heterogeneous models. We first propose a harmonized dataset model based on a systematic literature survey that enables complete metadata coverage and supports data discovery, exploration and reuse by business users. Second, rich metadata information is currently limited to a few data portals, where it is usually provided manually and is thus often incomplete and inconsistent in terms of quality. We propose a scalable automatic approach for extracting, validating, correcting and generating descriptive linked dataset profiles. This approach applies several techniques in order to check the validity of the metadata provided and to generate descriptive and statistical information for a particular dataset or for an entire data portal.
Traditional data quality is a thoroughly researched field with several benchmarks and frameworks to grasp its dimensions. Ensuring data quality in Linked Open Data is much more complex. It consists of structured information supported by models, ontologies and vocabularies and contains queryable endpoints and links. We propose an objective assessment framework for Linked Data quality based on quality metrics that can be automatically measured. We further present an extensible quality measurement tool implementing this framework that helps on one hand data owners to rate the quality of their datasets and get some hints on possible improvements, and on the other hand data consumers to choose their data sources from a ranked set.
Big data is a term which refers to those data sets, or combinations of data sets, whose volume, complexity, and rate of growth make them difficult to capture, manage, process or analyse with traditional tools and technologies. Big data is a relatively new concept that includes huge quantities of data, social media analytics and real-time data. In recent years, a lot of effort and many studies have been devoted to developing proficient tools for performing various tasks in big data. Because these collections of datasets are so large and complex, they are difficult to process with traditional data processing applications, which makes producing various big data tools all the more necessary. In this survey, a varied collection of big data tools is illustrated and analysed along with their salient features.
Understanding, Planning and Achieving
Data Quality in Your Organization
by Joe Caserta, President of Caserta Concepts
For more information, visit www.casertaconcepts.com or contact us at info@casertaconcepts.com
The huge volume of text documents available on the internet has made it difficult to find valuable information for specific users. In fact, the need for efficient applications to extract knowledge of interest from textual documents is vitally important. This paper addresses the problem of responding to user queries by fetching the most relevant documents from a clustered set of documents. For this purpose, a cluster-based information retrieval framework is proposed in order to design and develop a system for analysing and extracting useful patterns from text documents. In this approach, a pre-processing step is first performed to find frequent and high-utility patterns in the data set. Then a Vector Space Model (VSM) is used to represent the dataset. The system was implemented in two main phases. In phase 1, the clustering analysis process is designed and implemented to group documents into several clusters, while in phase 2, an information retrieval process is implemented to rank clusters according to the user queries in order to retrieve the relevant documents from specific clusters deemed relevant to the query. The results are then evaluated according to the recall and precision (P@5, P@10) of the retrieved results: P@5 was 0.660 and P@10 was 0.655.
Processing the data generated by transactions that occur every day, which amount to nearly thousands of records per day, requires software that enables users to search for the data they need. Data mining is a solution to this problem. To that end, many large industries began creating software that can perform such data processing. Due to the high cost of obtaining data mining software from big industry, some communities, such as universities, provide convenience for users who simply want to learn or deepen their knowledge of data mining by creating open-source software. Meanwhile, many commercial vendors market their own products. WEKA and Salford System are both data mining software packages, each with advantages and disadvantages. This study compares them using several attributes, so that users can select which software is more suitable for their daily activities.
Painting the Future of Big Data with Apache Spark and MongoDB (MongoDB)
MongoDB is the fastest growing non-relational database, while Apache Spark is the fastest growing data processing engine, and the most active big data project in the history of Apache. Databricks, founded by the creators of Spark, will present how they see Spark evolving to address new use cases, and how to combine the power of MongoDB with Spark.
A Generic Model for Student Data Analytic Web Service (SDAWS) (Editor IJCATR)
Any university management system accumulates a large amount of data, and analytics can be applied to it to gather useful information to aid the academic decision-making process. This paper is a novel attempt to demonstrate the significance of a data analytic web service in the education domain. It can easily be integrated with the University Management System or any other application of the university. Analytics as a web service offers many benefits over traditional analysis methods. The web service can be hosted on a web server and accessed over the internet or on the private cloud of the campus. The data from various courses in different departments can be uploaded and analyzed easily. In this paper we design a web service framework to be used in educational data mining that provides analysis as a service.
Data Anonymization for Privacy Preservation in Big Data (rahulmonikasharma)
Cloud computing provides capable, scalable IT infrastructure to support the processing of various big data applications in sectors such as healthcare and business. Such applications, mainly those involving electronic health record data sets, generally contain privacy-sensitive data. The most popular technique for data privacy preservation is anonymizing the data through generalization. The proposal is to examine the issue of proximity privacy breaches in big data anonymization and to identify a scalable solution to this issue. A two-phase scalable clustering approach, consisting of a clustering algorithm and a k-anonymity scheme with generalization and suppression, is intended to address this problem. The algorithms are designed with MapReduce to achieve high scalability by carrying out data-parallel execution in the cloud. Extensive experiments on real data sets substantiate that the method considerably improves the capability of defending against proximity privacy breaches, as well as the scalability and efficiency of anonymization over existing methods. Anonymizing data sets through generalization to satisfy privacy properties such as k-anonymity is a widely used category of privacy-preserving methods. Currently, the scale of data in many clouds is growing enormously in line with Big Data, making it a challenge for commonly used tools to capture, manage, and process such large-scale data within an acceptable time. Hence, it is difficult for prevailing anonymization approaches to achieve privacy preservation for big data private information due to scalability issues.
CLOUD COMPUTING IN THE PUBLIC SECTOR: MAPPING THE KNOWLEDGE DOMAIN (ijmpict)
Cloud computing is a key element in many nations’ pursuit of fast-tracked digital transformation and the
quick implementation of digital tools but is still facing considerable barriers due to the distinct challenges
that information technology adoption faces in public sector environments. Using scientometric data from
the Web of Science database, this study explores the current state of research and the structure of the
public sector cloud computing knowledge domain in a novel way, utilizing the CiteSpace visual analytic
software to produce knowledge maps that visualize public sector cloud computing research in terms of
publication activity, constituent authors, and publication venues, as well as exploring the intellectual base
of the knowledge domain. For public sector cloud computing researchers and practitioners, the study
provides visual insights and analyses that support future research, collaboration, and evidence-based
cloud computing implementation and utilization.
The NIH Data Commons - BD2K All Hands Meeting 2015 (Vivien Bonazzi)
Presentation given at the BD2K All Hands meeting in Bethesda, MD, USA in November 2015
https://datascience.nih.gov/bd2k/events/NOV2015-AllHands
Video cast of this presentation:
http://videocast.nih.gov/summary.asp?Live=17480&bhcp=1
The talk starts at 2 hrs 40 min (it's about 55 mins long) and includes video.
Document describing the Commons : https://datascience.nih.gov/commons
It is an attempt to provide a unified view of open data. In this system, data is collected from different sources in different formats. The data producer defines semantic relationships among datasets, which are input to our DC system.
A data consumer can pick a set of datasets randomly (or depending on his/her interest) and ask the system to generate an HTTP API for it. The system will identify which datasets are linked with each other (connected components) and generate an HTTP API for each component, which will produce unified output in JSON/XML format.
This helps maintain loose coupling between the underlying storage structure and consumer clients built on open data.
AdMap: a framework for advertising using MapReduce pipeline (CSITiaesprime)
There is a vast collection of data for consumers due to the tremendous development of digital marketing. Whether for their ads or to validate nearby services that have already been upgraded to the dataset systems, consumers are increasingly concerned with the amount of data, and a void has formed between the producer and the client. To fill that void, a framework is needed that can facilitate all the requirements for query updating of the data. Present systems have shortcomings when faced with vast amounts of information, which each time lead to a decision tree-based approach. A systematic solution for the automated incorporation of data into a Hadoop distributed file system (HDFS) warehouse includes a data hub server, a generic data loading mechanism and a metadata model. In our framework, the database is able to govern the data processing schema. In the future, as a variety of data is archived, the data lake will play a critical role in managing that data. To carry out a planned loading function, the setup files of the immense catalogue work together with the datahub server to attach the miscellaneous details dynamically to its schemas.
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen... (Citadelh2020)
CITADEL is a H2020 European project that is creating an ecosystem of best practices, tools, and recommendations to transform Public Administrations (PAs) via an inclusive approach in order to provide stakeholders with more efficient, inclusive and citizen-centric services. The CITADEL ecosystem will allow PAs to use what they already know plus new data to implement what really matters to citizens in order to shape and co-create more efficient and inclusive public services. CITADEL innovates by using ICTs to find out why citizens stop using public services, and use this information to re-adjust provision to bring them back in. Also, it identifies why citizens are not using a given public service (due to affordability, accessibility, lack of knowledge, embarrassment, lack of interest, etc.) and, where appropriate, use this information to make public services more attractive, so they start using the services.
The DataTank, a tool designed and developed by IMEC’s IDLab, will be extended to provide the Data Harvesting/Curation/Fusion (DHCF) component of the platform. The DataTank provides an open source, open data platform which not only allows publishing datasets according to standardised guidelines and taxonomies (DCAT-AP), but also transforms the data into a variety of reusable formats. The extension will include an intelligent way of harvesting and fusion of different data sources using semantics and Linked Data mapping technologies developed by IDLab. In the context of CITADEL the new HCF component will enable the visualization and analysis of trends for the usage of public services in European cities, playing a key role in generating personalized recommendations to the citizens as well as to PAs in terms of suggesting improvements to the current suite of public services.
https://twitter.com/Citadelh2020
https://twitter.com/gayane_sedraky
https://twitter.com/imec_int
https://twitter.com/IDLabResearch
DITAS Cloud Platform allows developers to design data-intensive applications, deploy them on a mixed cloud/edge environment and execute the resulting distributed application in an optimal way by exploiting the data and computation movement strategies, no matter the number of different devices, their type and the heterogeneity of runtime environments. It brings to your developer toolbox the best of Cloud & Edge worlds.
Wide access to spatial Citizen Science data - ECSA Berlin 2016 (COBWEB Project)
Authors: Paul van Genuchten, Lieke Verhelst, Clemens Portele
Presented at the European Citizen Science Association conference Berlin, May 2016.
One of the objectives of COBWEB is to publish citizen science data to GEOSS, the Global Earth Observation System of Systems. GEOSS has a focus on spatial standards (CSW, SensorWeb, WMS/WFS). However, a major part of the citizen science community is not aware of these standards, and average users use search engines to discover data and common formats to analyse it. So how do we bridge the gap between services in GEOSS and search engines?
The Census Hub Project can be considered at the moment the most advanced project where Internet technologies and SDMX solutions for data transmission come together for an ambitious goal: the dissemination of the Census 2011 results.
We analyse the Census Hub architecture, where a central Hub at the Eurostat side manages the user interface, transforming all selections made by the user on the screen into an SDMX query. This query is sent to the web service at the NSI side, which parses the query and transforms it into an SQL query that can be used with a database containing census data. Depending on how many countries are involved in the answer, the hub will query the web service provided for each of those countries. Finally, the Hub receives all answers from the NSIs and builds up a final table, putting all answers together. The importance of this implementation is that it is a completely new system that completely changes the way official data is disseminated and exchanged among organizations.
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA), EUDAT
EUDAT and PRACE joined forces to help research communities gain access to high-quality managed e-Infrastructures whose resources can be connected together to enable cross-utilization use cases and make them accessible without any technical barrier. The capability to couple data and compute resources together is considered one of the key factors to accelerate scientific innovation and advance research frontiers. The goal of this session was to present the EUDAT services and the results of the collaboration activity achieved so far, and to deliver a hands-on session on how to write a Data Management Plan (DMP). The DMP is a useful instrument for researchers to reflect on and communicate about the way they will deal with their data. It prompts them to think about how they will generate, analyse and share data during their research project and afterwards.
Visit: https://www.eudat.eu/eudat-summer-school
YGGDRASILL, A CONCEPT FOR A VIRTUAL DATA CENTRE
Hans de Wolf, Pieter Beerthuizen, Camiel Plevier
Dutch Space B.V., Mendelweg 30, 2333 CS Leiden, The Netherlands
h.de.wolf@dutchspace.nl | p.beerthuizen@dutchspace.nl | c.plevier@dutchspace.nl
ABSTRACT
YGGDRASILL is the name for a Virtual Data Centre, an
infrastructural solution to make a large variety of data
available in a simple and uniform way to a user
community, while only requiring minimal effort from
the data providers.
In many cases, science projects want and need to make
their results available to the community, thus acting as
data provider to related projects. However, most of
these projects are focussed on the domain-specific
scientific activities and cannot afford to spend
significant technical and administrative effort on setting
up a facility that provides search and download
functions. It is not sufficient to make it possible to download the data; it must also be possible for users to find the appropriate data.
The concept of a virtual data centre delivers a solution
to this problem, by offering a central web portal that
offers users advanced functions to locate and download
data products. The YGGDRASILL virtual data centre
improves this concept by minimizing the effort to act as
a data provider in the virtual data centre. In addition to
facilitating the delivery of data products that have been
prepared in advance, YGGDRASILL provides also the
means to create customized data products by processing
on-demand.
The development of YGGDRASILL was driven by the needs of the Dutch national programme on climate change, “Climate changes Spatial Planning” (http://www.klimaatvoorruimte.nl).
1. INTRODUCTION
Climate change is one of the major environmental issues
for the coming years, both regionally and globally. The
Netherlands are expected to face climate change
impacts on all land use related sectors and on water
management, and therefore on spatial planning in
general.
The programme “Climate changes Spatial Planning”
(CcSP) (http://www.klimaatvoorruimte.nl) focuses on enhancing joint learning between those communities and people in practice within spatial planning. Its mission is
to make climate change and climate variability one of
the guiding principles for spatial planning in the
Netherlands.
The main objectives of the programme are:
- To offer the Dutch government, the private sector and other stakeholders a clustered, high-quality and accessible knowledge infrastructure on the interface of climate change and spatial planning.
- To engage in a dialogue between stakeholders and scientists in order to support the development of spatially explicit adaptation and mitigation strategies that anticipate climate change and contribute to a safe, sustainable and resilient socio-economic infrastructure in the Netherlands.
Figure 1. Themes of the CcSP programme
The programme is centred on five main themes: Climate
Scenarios, Mitigation, Adaptation, Integration and
Communication (fig. 1). Projects are interactively designed to cover issues relevant to climate and spatial planning and to sectors such as biodiversity and nature, agriculture, fisheries, fresh water, coastal areas, transport on land and water, sustainable energy production, business, finance and insurance, and governmental strategies.
Within the scope of this Dutch national programme, the COM-1 project in theme 5 (communication) is one of a series of projects that are aimed at strengthening the Dutch knowledge infrastructure.
The goal of COM-1 is to create a central portal that offers project managers and external users access to (consolidated) data products from selected projects within the Adaptation, Mitigation and Climate Scenarios themes.
Distribution of data products over the Internet may seem
a straightforward problem to solve. However, in reality
the situation is more complex. In many cases the CcSP
(science) projects do not possess the expertise and/or
resources to make their data more accessible. Project
priorities typically lie with the acquisition of data and
translating it into knowledge, and less on making it
more accessible to others.
The core problem in the development of an
infrastructural solution for this problem is to make a
large variety of data available in a simple and uniform
way to the user community -- while avoiding multiple
similar developments by the data providers. The
solution for this problem was found in the concept of
the Virtual Data Centre (VDC).
2. INTRODUCING THE VDC
Dutch Space developed the YGGDRASILL concept of the
VDC to provide an answer to the challenges presented
above, and demonstrated it successfully. The concept
consists of a central portal that provides access to data
products. Behind this portal, an infrastructure links
together widely dispersed data sources and computer
platforms in a type of cooperative network. It allows the
exchange of data and sharing of knowledge between
participants (and if necessary, each other’s computer
systems and tools). At the same time, the project groups
retain control over their own specific algorithms and
data collections because these remain on their own
computer systems.
The VDC does not contain a centralized repository for
all data products, but instead creates a central, one-stop-
shopping entry point to the data by providing access
through a searchable catalogue. The actual files
containing the data products remain stored at the
facilities of the science projects themselves. These
projects provide meta-data about the data products to
the VDC to update the central catalogue. The provided
meta-data includes all necessary information to obtain
access to the actual data, plus all information that may
be useful (according to the data provider) to discover
datasets in a search action.
In addition to this, the Virtual Data Centre has an index,
which is a cross-project catalogue about types of data
products. The index contains product-independent meta-
information; each catalogue contains product-specific
meta-information.
The VDC is a geographically distributed system. The
‘central’ part (with the webserver, and index and
catalogue databases) is located at a different location from the projects’ data “servers” (which feed the catalogue and provide actual datasets to users). Some projects may share hardware; in other cases a project may have its own servers. The projects’ systems that serve data
products and the systems that feed updates to the
catalogue are not necessarily the same. Sending updates
to the catalogue may be a part of the production process
of the datasets, but may also be a completely separate,
even manual, process.
The following sections of this paper will discuss the
operation of the virtual data centre from different points
of view.
3. THE USER’S VIEW: GET DATA PRODUCTS
From the viewpoint of the user who wants to access data
products from the science projects, the use of the VDC
is similar to buying items from a ‘web shop’ on the
Internet. Users obtain access to datasets in several steps:
- Using a web form (see fig. 2), the user consults an index: this is a list of all types of data products available within the programme. This index provides (searchable) descriptions of all types, with contact information.
- From a web page with a list of all data product types that match the user-specified criteria, the user selects a data product type.
- Once a user has found and selected the type of data product, a new web page is presented with more detailed information on this type of data product. This web page also presents a form in which the user can search for specific data product instances in a catalogue that is specific for the selected type of data products (see fig. 3).
The web forms that are used to search in the index and
in the catalogue are designed such that the user can
build his query incrementally. The user selects a
property, and then provides the condition for that
property (some default values may already be provided).
If desired, the user can specify more conditions,
referring to the same or other properties. The conditions
can be combined using ‘AND’, ‘OR’ or ‘WITHOUT’
(=’AND NOT’) constructions.
The example presented in fig. 2 shows that the user has chosen to select the property “Dataset Title” from the index, and specified for this property the condition “contains” the text “Country”. Additionally, the user has specified a second condition on the property “Topic category”, which “is not any of” the items from a predefined value list (now showing “farming”). These two conditions are connected using the “AND” construction.
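To make the incremental query construction more concrete, the following sketch shows one way such a flat list of conditions could be represented and evaluated in Java (the language used for the YGGDRASILL provider-side software). The class and method names are hypothetical and not taken from the actual YGGDRASILL code; it is a minimal illustration of the AND/OR/WITHOUT chaining described above, using the “Country”/“farming” example from fig. 2.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.function.BiPredicate;

    // Minimal, hypothetical sketch of an incrementally built query: a flat list of
    // (connector, property, operator, value) conditions evaluated left to right.
    public class QuerySketch {

        enum Connector { AND, OR, WITHOUT }          // WITHOUT = AND NOT

        record Condition(Connector connector, String property,
                         BiPredicate<String, String> operator, String value) {}

        private final List<Condition> conditions = new ArrayList<>();

        // The first condition ignores its connector; later ones chain onto the running result.
        public QuerySketch add(Connector c, String property,
                               BiPredicate<String, String> op, String value) {
            conditions.add(new Condition(c, property, op, value));
            return this;
        }

        // Evaluate against one metadata record (property name -> value), left to right,
        // with no nesting -- matching the form-based query building described in the text.
        public boolean matches(Map<String, String> metadata) {
            boolean result = true;
            for (int i = 0; i < conditions.size(); i++) {
                Condition c = conditions.get(i);
                boolean hit = c.operator().test(metadata.getOrDefault(c.property(), ""), c.value());
                if (i == 0)                                  result = hit;
                else if (c.connector() == Connector.AND)     result = result && hit;
                else if (c.connector() == Connector.OR)      result = result || hit;
                else /* WITHOUT */                           result = result && !hit;
            }
            return result;
        }

        public static void main(String[] args) {
            // "Dataset Title contains 'Country' AND Topic category is not 'farming'"
            QuerySketch q = new QuerySketch()
                .add(Connector.AND, "Dataset Title", String::contains, "Country")
                .add(Connector.WITHOUT, "Topic category", String::equalsIgnoreCase, "farming");
            System.out.println(q.matches(Map.of(
                "Dataset Title", "Country Info 2008",
                "Topic category", "society")));   // prints: true
        }
    }

Because the connectors are applied left to right to a single running result, no nesting is possible, which mirrors the limitation on query expressiveness discussed below.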
The user can refine the query in this form in several ways:
- Each of the defined conditions can be removed using the ‘Remove’ buttons at the right-hand side.
- Additional conditions can be added by selecting the name of the property to be used and clicking on the ‘Add’ button.
- The values for the currently defined conditions can be changed by specifying new values in the appropriate field.
- The relation between the conditions can be modified by changing the ‘AND’ relation to ‘OR’ or ‘WITHOUT’.
The query can be executed by clicking on the “Find Now” button.
During the design of the system, this approach was
chosen to make it possible for casual users to use the
system without having to learn any query language,
while advanced users would still have the possibility to
formulate complex queries.
The way in which the queries are constructed does not
have the full power of expression of a full query
language; nested conditions are not possible. However,
by supporting specification of a range of values in each
condition, and the “WITHOUT” connection between the
conditions, the possibilities for query formulation
through web forms are regarded as sufficiently
powerful.
Fig.3 shows the web form that is used to search through
the product-specific catalogue to find one or more
specific dataset instances of the selected type. Again, to
find these instances, the user specifies conditions that
apply to the metadata of the datasets. By comparing the
metadata conditions specified by the user against the
metadata (of the data product instances) in
the catalogue, a list of matching data product instances is
produced. Each member of this list identifies a
single data product instance.
Figure 2. Building a query to search in the index of all data product types
The example shows that the user has selected the Data Product type “Country Info 2008”. Some of the metadata for this type of data product is displayed, such as a short description (“Pictures of Country maps and flags with detailed information”) and the licensing (this information is licensed under the Creative Commons license).
The user is building a search query, and has specified that the population must be between 3 and 50 million.
As with the web form used to search in the index of all
known types of data products, extra conditions can be
added to the query, in this case based on the “country
name”, “capital name”, “area size” and “population”
properties.
Note that none of the properties have been hard-coded:
the complete form is generated automatically from the
meta-information of the data product type.
The “Find now” button uses all information specified in
the form to find all instances of data products that match
these user-specified criteria.
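The remark that none of the properties are hard-coded can be illustrated with a small sketch: given the catalogue properties declared for a data product type, the search form can be rendered generically. The property names below are taken from the “Country Info 2008” example; the rendering code and HTML layout are hypothetical assumptions, not the actual YGGDRASILL implementation.

    import java.util.List;

    // Hypothetical sketch of generating catalogue search form fields from the
    // data product type's meta-information instead of hard-coding them.
    public class CatalogueFormSketch {

        record Property(String name, String type) {}   // type: "text" or "number"

        static String renderFormFields(List<Property> properties) {
            StringBuilder html = new StringBuilder();
            for (Property p : properties) {
                if (p.type().equals("number")) {
                    // Numeric properties get a range ("between ... and ...") condition.
                    html.append("<label>").append(p.name()).append(" between ")
                        .append("<input name='").append(p.name()).append("_min' type='number'>")
                        .append(" and ")
                        .append("<input name='").append(p.name()).append("_max' type='number'>")
                        .append("</label>\n");
                } else {
                    // Text properties get a simple "contains" condition.
                    html.append("<label>").append(p.name()).append(" contains ")
                        .append("<input name='").append(p.name()).append("' type='text'>")
                        .append("</label>\n");
                }
            }
            return html.toString();
        }

        public static void main(String[] args) {
            System.out.println(renderFormFields(List.of(
                new Property("country name", "text"),
                new Property("capital name", "text"),
                new Property("area size", "number"),
                new Property("population", "number"))));
        }
    }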
When the query is executed, it returns a list of data
product instances that match the conditions
specified. This list shows columns with a small
number of (metadata) attributes of the found
instances (here: ‘country name’, ‘capital name’,
‘area size’ and ‘population’). This information is
extracted from the data product’s catalogue. By
clicking on the ‘Show Details’ button, the user can
request a screen with more detailed metadata.
Which metadata properties are displayed in the list
view and the detailed view is not hard-coded, but is
decided by the provider of the data products.
At this moment the user has located data products
that may be interesting or useful – but only the
metadata is available, not the data itself. In order to
obtain the data, the user only has to click on the
‘Order’ button.
Note: if the user is not authorized to obtain the data
product, the ‘Order’ button is dimmed. Access rights are
only checked at login, and when data products are ordered.
During the development of the system it was decided
that all users should have the possibility to search for
data products. After all, the primary purpose for
building the system was to make data available to users.
Figure 3. Building a query to find data product instances in the catalogue
What happens when the user orders a product depends on the facilities of the data provider.
- If the data provider has his own server, the order button just contains the download URL, and the file containing the data product instance is downloaded directly.
- If the data provider cannot provide a server, the YGGDRASILL data product ordering mechanism will be activated. This does not provide a direct download of the data product’s files, but places the order in an order queue, and – invisible to the user – the VDC software contacts the party that provides the data product to fetch it. It is not necessary for the user to stay connected to the portal. When the portal receives the ordered file, it places the file in a ‘parking area’ at the portal. A message will appear when the user revisits the portal, with a link to download the files from the portal.
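A minimal sketch of the order lifecycle just described, as it might be modelled on the portal side: an order is queued, the file is later delivered by the provider and parked, and the user is offered a download link on the next visit. The states and class names are illustrative assumptions, not the actual VDC data model.

    import java.nio.file.Path;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical portal-side model of the YGGDRASILL ordering mechanism.
    public class OrderQueueSketch {

        enum OrderState { QUEUED, PARKED }

        static class Order {
            final String orderId;
            final String dataProductInstanceId;
            final String userId;
            OrderState state = OrderState.QUEUED;
            Path parkedFile;        // set once the provider's agent has delivered the file
            Instant deliveredAt;

            Order(String orderId, String instanceId, String userId) {
                this.orderId = orderId;
                this.dataProductInstanceId = instanceId;
                this.userId = userId;
            }
        }

        private final List<Order> orders = new ArrayList<>();

        // Called when the user clicks 'Order' and no direct download URL is available.
        public Order placeOrder(String instanceId, String userId) {
            Order o = new Order("ord-" + System.nanoTime(), instanceId, userId);
            orders.add(o);
            return o;
        }

        // Called when the data provider's side delivers the ordered file to the portal;
        // the file stays in the 'parking area' until the user revisits the portal.
        public void park(String orderId, Path file) {
            for (Order o : orders) {
                if (o.orderId.equals(orderId)) {
                    o.parkedFile = file;
                    o.deliveredAt = Instant.now();
                    o.state = OrderState.PARKED;
                }
            }
        }

        // On the user's next visit the portal lists the parked orders with download links.
        public List<Order> readyForDownload(String userId) {
            return orders.stream()
                         .filter(o -> o.userId.equals(userId) && o.state == OrderState.PARKED)
                         .toList();
        }
    }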
An additional function is available when the
YGGDRASILL ordering mechanism is used: product
customization and/or processing on demand.
This option is provided by an additional web form that
opens when the ‘Order’ button is clicked. It contains a
number of fields (defined by the data provider) through
which the user can specify custom processing of his
order.
Examples of this customization are:
- Converting the data product instance into a different format (especially for graphics formats)
- Reducing the temporal or spatial resolution of the data
- Extracting a subset of the data from a much larger set.
In principle, any kind of customization can be done –
but this requires the development of additional software
by the data provider because only they have the
necessary domain-specific expertise to build the tools
for this purpose.
Figure 4. List of data product instances that match the user-specified criteria
4. THE DATA PROVIDER’S VIEW
The virtual data centre was designed to be easy to use not only for the users (‘data consumers’) but also for the data providers, because many of them are small science
projects that do not have the level of expertise and/or
resources to operate a real data centre.
In order to distribute data through the YGGDRASILL
virtual data centre the data provider must take the
following steps.
Maintain a repository of data products. As was
explained before, the VDC does not have a centralized
repository that contains copies of all data product files.
The data provider must maintain its own repository for
these files – a task that should require no extra effort.
YGGDRASILL does not impose any restrictions on how
this is organized. The data product files may reside on a
file system or in a database.
Provide computer hardware (with Internet
connection) to feed metadata and (when ordered) data
product files to the portal. YGGDRASILL is agnostic with respect to the platform; operation has been demonstrated on Windows and Linux platforms. All that is required is
a Java Virtual Machine to run the provider-side
YGGDRASILL software and an Internet connection for
outgoing traffic using the http protocol (same
configuration as for a web browser). It is not necessary
to provide dedicated computer hardware. Even the
connection to the Internet may be intermittent, although
this is not recommended for timely delivery of data
products.
Construct a Yggdrasill Dataproduct Definition (YDD) that defines common information about a type of data product. This information consists of several parts:
- Data Product Type index metadata: a mandatory set of data that describes this type of dataset. This consists of generic information like title, publisher and description. This information is used when the user searches through the index for a suitable type of dataset.
- Data Product Type catalogue properties: these define which metadata are available for every instance of this type of dataset. This information is used when the user searches in the catalogue for specific instances of this dataset. It also includes initial settings for access rights, but these can be modified later through interactive web forms.
The creation of a YDD is a process that requires
interaction between the database administrator of the
virtual data centre and the project scientists. Together
they interpret the needs and dataset properties of the
project and translate them into a dataset type definition.
The YDD is stored in an XML file. This file is read by
the software that implements the YGGDRASILL virtual
data centre, in order to:
add a new record (describing this type of dataset) to
the index.
prepare one or more database tables (the
catalogue specific to this data product type) that
can hold the information about instances of this
type of dataset.
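The YDD schema itself is not shown in this paper; the sketch below merely illustrates how software could read such an XML definition using the standard Java DOM API. The element and attribute names (indexMetadata, catalogueField, title, publisher) are assumptions made for this illustration only.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import java.io.File;

/**
 * Hypothetical reader for a YDD file. The element and attribute names used
 * here are assumptions; the actual YDD schema is defined by the YGGDRASILL
 * software, not by this sketch.
 */
public class YddReader {

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("sea_surface_temperature.ydd.xml"));

        // Index metadata: generic information such as title and publisher.
        Element index = (Element) doc.getElementsByTagName("indexMetadata").item(0);
        System.out.println("Title:     "
                + index.getElementsByTagName("title").item(0).getTextContent());
        System.out.println("Publisher: "
                + index.getElementsByTagName("publisher").item(0).getTextContent());

        // Catalogue fields: the attributes that describe each product instance.
        NodeList fields = doc.getElementsByTagName("catalogueField");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            System.out.println("Catalogue field: " + field.getAttribute("name")
                    + " (" + field.getAttribute("type") + ")");
        }
    }
}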
Announce new dataset instances. A dataset instance
announcement informs the catalogue about new or
updated instances of a dataset defined earlier – or may
even delete instances from the catalogue. This is a
routine operation that requires no human interpretation
and should be automated.
Because dataset instance announcements are formatted
as tab-delimited text files, they can easily be produced
by automated processes, or manually in a spreadsheet
application such as Microsoft Excel.
The tab-delimited files containing the announcements
must be consistent with the dataset type definition.
The Yggdrasill virtual data centre provides Java-based
software that handles the sending of the announcements.
This software uses the HTTP protocol to pass through
firewalls easily.
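Because the announcement format is a plain tab-delimited text file, producing one programmatically is straightforward. The sketch below writes a small hypothetical announcement; the column names (identifier, startTime, endTime, region) are assumptions and must in practice match the catalogue fields defined in the YDD.

import java.io.IOException;
import java.nio.file.*;
import java.util.List;

/**
 * Hypothetical producer of a data product instance announcement.
 * The columns are assumptions made for this sketch; the real columns
 * must be consistent with the dataset type definition (YDD).
 */
public class AnnouncementWriter {

    public static void main(String[] args) throws IOException {
        List<String> rows = List.of(
                // Header row with the catalogue field names.
                String.join("\t", "identifier", "startTime", "endTime", "region"),
                // One row per new or updated data product instance.
                String.join("\t", "SST-2014-001", "2014-01-01T00:00Z", "2014-01-31T23:59Z", "North Sea"),
                String.join("\t", "SST-2014-002", "2014-02-01T00:00Z", "2014-02-28T23:59Z", "North Sea"));

        Files.write(Paths.get("announcement.txt"), rows);
    }
}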
Deploy a data product instance delivery service. In
order to deliver ordered data product instances, the data
providing science project must deploy some kind of
delivery service.
For large, multi-year science projects it may be feasible
to operate their own web or ftp server for this purpose.
Aspect | ‘Classical’ Data Centre | Yggdrasill Virtual Data Centre
Provide Data Repository (storage) | Yes | Yes
Provide Hardware (server) | Yes, dedicated | Lightweight hardware (shared)
Provide Internet Connection | Yes, including support for server (ftp and/or http) protocols | Yes, only http client protocols; optionally intermittent
Build Portal | Yes | Not necessary
Install file server | Yes | Only lightweight Yggdrasill software
Configure Firewall | Yes | Not necessary
Account administration | Yes | Optional
Develop Search functions | Yes | No, provided by Yggdrasill
Scalability and Redundancy | Complex | Simple
Table 1. Comparison of provider’s effort for a classical data centre and the YGGDRASILL Virtual Data Centre
If this is available, the data product instance
announcements contain a URL to these instances, and
the Order button in the portal of the virtual data centre
links directly to this URL, providing the user with a
direct download opportunity.
For smaller projects that cannot afford their own server
infrastructure, YGGDRASILL offers another solution. At
the data provider’s side, ‘data delivery agent’ software
is installed that communicates with the
central part of the VDC. This software handles orders
placed at the central portal to deliver the actual data
products.
The Java-based Data Delivery Agent (DDA) software
periodically polls (over http) the central part of the
virtual data centre in order to discover new orders for
data product instances. If it can fulfil an order, it sends
the file containing the ordered data product instance to
the portal, from where it can be downloaded by the user
who ordered it.
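The following sketch illustrates the polling idea only, not the actual DDA implementation: the agent issues outgoing HTTP requests to ask whether orders are waiting. The portal URL, query parameter and response format are assumptions made for this sketch.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/**
 * Minimal sketch of the polling pattern behind a Data Delivery Agent:
 * only outgoing http traffic is used, so no firewall changes are needed.
 */
public class PollingSketch {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://portal.example.org/orders?provider=sst-project"))
                .GET()
                .build();

        while (true) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (!response.body().isBlank()) {
                // The body is assumed to contain the identifier of an ordered
                // product instance; hand it over to the preparation script.
                System.out.println("Order received: " + response.body());
            }
            Thread.sleep(Duration.ofMinutes(5).toMillis());
        }
    }
}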
The DDA does not prepare the files containing the data
product instance by itself. For this, it calls a product-
specific script that must be created by the data provider
(the DDA provides information to this script, such as
product identification and optional customization
parameters). This script can be very simple, typically
using the provided identifier for the ordered data
product instance to build a filename and copying this
file from a repository to a working directory.
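A sketch of such a minimal preparation step is given below, assuming the identifier arrives as a command-line argument and that the repository files follow a simple naming convention; the directory paths and the .nc extension are hypothetical.

import java.io.IOException;
import java.nio.file.*;

/**
 * Sketch of the simplest possible provider-side preparation step:
 * use the ordered instance identifier to build a file name and copy the
 * file from the repository to a working directory. Paths are assumptions.
 */
public class PrepareProduct {

    public static void main(String[] args) throws IOException {
        String identifier = args[0];                  // e.g. "SST-2014-001"
        Path repository = Paths.get("/data/repository");
        Path workDir    = Paths.get("/data/outgoing");

        Path source = repository.resolve(identifier + ".nc");
        Path target = workDir.resolve(source.getFileName());
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }
}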
In contrast to this simple approach, the script can also
be very complex in order to prepare a complete custom
data product by running a mathematical model. In this
case it is not even necessary to have a real data product
instance available, as it is created on-the-fly.
Solutions of intermediate complexity could involve
extracting the data product instance from a repository,
and doing some processing on it (format conversion,
visualization, or temporal/spatial resampling).
The time needed to deploy a data providing service
proves to be very short. For a data product type that
requires only a simple script, typically a single day is
sufficient. This includes definition of the metadata that
describes the data product, installing the announcer for
new data product instances, setting up the Data Delivery
Agent and doing some tests.
5. THE DEVELOPER’S VIEW
Now that we have seen how the Virtual Data Centre looks
from the outside (from the viewpoint of a user and of a
data provider), we can take a look at the internal
workings. This description is clarified by fig. 5; the
numbers (1), (2), (3), etc. refer to the activities in that
illustration. A UML diagram of the operation is
presented in fig. 6.
As explained before, the deployment of a service that
delivers a new type of data product starts (1) with the
definition of the metadata by constructing a Yggdrasill
Dataproduct Definition (YDD). This is an XML
document that defines common information about a
type of dataset.
This information consists of several parts:
Type index metadata: a mandatory set of data that
describes this type of dataset. This consists of
generic information like product name, publisher
contact information and description. This
information is used when the user searches through
the index for a suitable type of data product.
Definition of Data Product Type catalogue fields:
defines which attributes apply to every instance of
this specific type of data product. This information
is used when the user searches in the catalogue
for specific instances of this dataset. Several types
of attributes are supported, including text strings,
numbers and enumerations. The presentation of
these attributes in the search forms on the web
pages (as shown in fig. 3) is also defined here
(default values, pop-up lists, etc.). The YDD also
contains simple validation rules (range checks) for
the values of these attributes.
Definition of access rules. The YDD can contain
an initial definition of access rules. These rules
grant access to a user or a group of users as they are
defined in the Yggdrasill portal. A rule may include
a delay period, meaning that data product instances
become available to the specified user or group
only when the data product instance is older than
the specified period (see the sketch after this list).
This option was implemented because scientists may
prefer to release data to a large audience only when
they have had time to prepare their publications.
Definition of ordering mechanism. This specifies
how orders are fulfilled. If a direct download is
offered by the data provider, this setting
provides the base URL for the download (the
identifier present in a product instance
announcement is used to build the complete URL).
If no direct download is offered, the YGGDRASILL
order fulfilment mechanism is used and the YDD
contains the information needed for the Data
Delivery Agent.
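The delayed-release rule mentioned above boils down to a simple time comparison. The sketch below shows that check in isolation; the delay length and the method names are assumptions made for this illustration, not part of the YDD specification.

import java.time.Duration;
import java.time.Instant;

/**
 * Sketch of the delayed-release idea in the access rules: an instance
 * becomes visible to a group only after a configured delay has passed.
 */
public class EmbargoCheck {

    static boolean isReleased(Instant instanceTime, Duration delay, Instant now) {
        return instanceTime.plus(delay).isBefore(now);
    }

    public static void main(String[] args) {
        Instant instanceTime = Instant.parse("2014-01-01T00:00:00Z");
        Duration delay = Duration.ofDays(180);   // roughly six months (assumed)
        System.out.println(isReleased(instanceTime, delay, Instant.now()));
    }
}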
The metadata used in the Yggdrasill virtual data centre
is based on the ISO 19115 standard, a schema for
describing geographic information and services. It
provides information about the identification, the extent,
the quality, the spatial and temporal schema, the spatial
reference, and the distribution of digital geographic data.
The creation of a YDD is a process that requires
interaction between the administrator of the
virtual data centre and the project scientists. Together
they interpret the needs and dataset properties of the
project and translate them into a dataset
type definition.
After the YDD has been prepared, the YDD file is
installed (2). The metadata from the dataset type
definition is automatically used to define a new type of
dataset by:
Adding a new record (describing this type of data
product) to the index.
Creating one or more database tables (extending
the catalogue) that can hold the information about
instances of this type of data product (a minimal
sketch of this step follows below).
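As a rough illustration of the second step, the sketch below generates a CREATE TABLE statement from a set of catalogue field definitions. The field names, their SQL types and the table naming convention are assumptions; the real mapping is determined by the YGGDRASILL software from the YDD.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Sketch of catalogue table creation driven by field definitions.
 * Field names, SQL types and the table name are illustrative only.
 */
public class CatalogueTableSketch {

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("identifier", "VARCHAR(64)");
        fields.put("start_time", "TIMESTAMP");
        fields.put("end_time",   "TIMESTAMP");
        fields.put("region",     "VARCHAR(128)");

        String columns = fields.entrySet().stream()
                .map(e -> e.getKey() + " " + e.getValue())
                .collect(Collectors.joining(", "));

        String ddl = "CREATE TABLE catalogue_sea_surface_temperature (" + columns + ")";
        System.out.println(ddl);
    }
}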
The preparatory activities are continued by installing the
Data Product Announcer and Data Delivery Agent at
the site of the data provider. Both are provided as part
of the YGGDRASILL deployment in the form of platform-
independent Java software. Finally, the script that
prepares ordered data products is written.
The operational use of the virtual data centre starts
when the data provider starts creating data products (3).
These data products are placed in a repository (4).
YGGDRASILL does not impose any requirements on how
this is implemented. As a side effect of the data
production, or as a separate action, the data provider
creates data product announcements (5). These are sent
by the Data Product Announcer to the Data Product
Ingester running at the portal (6). A data product
instance announcement informs the catalogue about new
or updated instances of a dataset defined earlier – or
may even delete instances from the catalogue. This
results in changes in the catalogue’s database tables (7)
created from the definitions in the YDD. Prior to this
update, the contents of the data product instance
announcement will be validated against acceptance rules
stated in the YDD.
This is a routine operation that requires no
human interpretation and runs as an automated process.
Because data product instance announcements are
formatted as tab-delimited text files, they can easily be
produced by automated processes, or manually in a
spreadsheet application such as Microsoft Excel.
When these actions have been taken, the Virtual Data
Centre is ready to provide data to the users.
As described earlier in this document, the user uses the
portal to search for data products (8). This search through
the index (9) returns a list of relevant data product types
(10).
After the user selects a data product type from the list,
the portal:
Figure 5. Operation of the virtual data centre
Uses the metadata from the YDD to build a custom
web form by which the user can specify data
product-specific search criteria.
Uses the metadata from the YDD to build a
database query that searches through the product-
specific catalogue, which contains the metadata that
describe the instances of the data product (a minimal
sketch of such a query follows after this list).
Presents the results from this search in a web page.
This web page contains links to obtain the dataset
instance files.
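A minimal sketch of the query-building step is given below: user-entered criteria are turned into a parameterized WHERE clause against the product-specific catalogue table. The table and column names are assumptions; a real implementation would take them from the YDD.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Sketch of turning user-specified search criteria into a parameterized
 * SQL query against the product-specific catalogue table.
 */
public class CatalogueQuerySketch {

    public static void main(String[] args) {
        // Criteria the user entered in the generated web form (assumed names).
        Map<String, Object> criteria = new LinkedHashMap<>();
        criteria.put("region", "North Sea");
        criteria.put("start_time", "2014-01-01T00:00Z");

        String where = criteria.keySet().stream()
                .map(column -> column + " = ?")
                .collect(Collectors.joining(" AND "));
        List<Object> parameters = new ArrayList<>(criteria.values());

        String sql = "SELECT * FROM catalogue_sea_surface_temperature WHERE " + where;
        System.out.println(sql);          // the statement to be prepared
        System.out.println(parameters);   // the values to bind
    }
}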
To prepare for ordering a data product, the portal:
Uses both type-specific metadata (from the index)
and instance-specific metadata (from the catalogue)
to determine access rights.
Uses information from the dataset instance
announcement to determine the source from which
the data product instance can be obtained.
This information is used to generate an Order button on
the web form. Data providers that operate their own
servers (HTTP or FTP) can include the appropriate
direct download links in their data product instance
announcements.
Data providers (typically smaller organisations) that do
not provide access to their server to the outside world
can use another approach. When the user places the
order (11), the portal places it in an order queue (12).
When the data provider’s Data Delivery Agent polls the
portal to check if any orders for data product instances
are waiting, the portal responds by sending (13) an
identification of the ordered data product instance
(obtained from the catalogue for this data product type).
On receiving this identification, the Data Delivery
Agent launches the provider’s custom script to prepare
(14) the ordered data product instance and sends (15) it
to the portal, where it is received by the Data Product
Receiver (16) and placed in a parking area of the central
portal, from which it can be downloaded by the user
who ordered the product (17).
6. ADVANTAGES OF THE YGGDRASILL VDC
For users: the YGGDRASILL VDC offers several
advantages to users. A single portal provides access to a
range of data products, presented in the familiar ‘web
shop’ set-up. Because the products share this portal, the
user does not have to learn different search and
download methods, and only needs a single account.
Figure 6. UML Diagram of Data Product Order and Delivery
Easy deployment: the YGGDRASILL VDC was designed
for easy deployment. Installation of the Java-based
software at the data provider’s facilities is simple. There
is no need to configure a firewall to allow incoming
traffic to pass through. The hardest work is the domain-
specific setup: determining the meta-data for the YDD
and (sometimes) the creation of the script that extracts
the ordered data product from a repository. Typically,
adding a provider with a new type of data product to the
VDC takes about one day of work (more may be needed
when custom data products are generated ad hoc).
No dedicated server hardware required: especially
small science projects may have difficulties assigning
dedicated hardware for data dissemination. The Data
Delivery Agent can run as a background task on a
normal PC while it is used for other activities. It is
no problem even when it runs on a laptop which has an
intermittent connection to the Internet: because the
Data Delivery Agent uses a polling mechanism to
retrieve orders from the portal, no failures occur
from the portal trying to connect to the Data Delivery
Agent. When the DDA is disconnected, it simply does
not poll the portal, and when the connection is
restored, it can continue asking for new orders to
be fulfilled.
Of course, for projects that intend to provide frequent
or high-priority data products, assigning dedicated
hardware is preferred, but this can be lightweight
hardware – a typical PC is adequate.
Note: because YGGDRASILL provides no centralized
storage, the data provider must always provide its own
resources to store the repository of its own data
products.
Secure Solution: Because the Data Delivery Agent,
running on the data provider’s computer(s), takes the
initiative of polling for the orders, the http network
traffic passes easily through firewalls. In fact, the Data
Delivery Agent appears to a firewall as just a web
browser. Nearly all firewalls are already configured to
allow this kind of traffic.
Security: The fact that the Data Delivery Agent takes
the initiative makes this also a safe solution – there is no
need to respond to incoming traffic, or modify firewall
settings.
Scalability and robustness: in the description given
before, only a single Data Delivery Agent is running to
deliver the data product instances, on just one computer.
The design of YGGDRASILL also supports other models.
It is possible to run several DDAs simultaneously on the
same computer (each delivering a single data product),
or run a single DDA that handles the delivery of several
types of data products. This model is most suitable for a
science project that delivers multiple types of data
products that are ordered infrequently. The opposite
model, intended for projects that have to supply data
products frequently and in a timely manner, is to run the same DDA
on several computers simultaneously. Each of them
polls the YGGDRASILL portal independently, and
receives different orders to fulfil.
Because the software running on the portal re-queues
orders when a DDA does not respond after a set period
of time, these orders will automatically be picked up by
another computer, thus providing a robust solution.
Optional User Administration: the VDC handles the
authentication of users. The data provider does not have
to create and manage user accounts. Yggdrasill supports
authorization of user groups; administration of these
groups can be done through simple web forms.
The above advantages make the virtual data centre very
useful for small science projects, even when they are
executed by large institutions (which may not be willing
to support decentralized servers on their network).