YGGDRASILL, A CONCEPT FOR A VIRTUAL DATA CENTRE
Hans de Wolf, Pieter Beerthuizen, Camiel Plevier
Dutch Space B.V., Mendelweg 30, 2333 CS Leiden, The Netherlands
h.de.wolf@dutchspace.nl | p.beerthuizen@dutchspace.nl | c.plevier@dutchspace.nl
ABSTRACT
YGGDRASILL is the name for a Virtual Data Centre, an
infrastructural solution to make a large variety of data
available in a simple and uniform way to a user
community, while only requiring minimal effort from
the data providers.
In many cases, science projects want and need to make
their results available to the community, thus acting as
data provider to related projects. However, most of
these projects are focussed on the domain-specific
scientific activities and cannot afford to spend
significant technical and administrative effort on setting
up a facility that provides search and download
functions. It is not sufficient to make it possible to
download the data; it must also be possible for users to
find the appropriate data.
The concept of a virtual data centre delivers a solution
to this problem, by offering a central web portal that
offers users advanced functions to locate and download
data products. The YGGDRASILL virtual data centre
improves this concept by minimizing the effort to act as
a data provider in the virtual data centre. In addition to
facilitating the delivery of data products that have been
prepared in advance, YGGDRASILL also provides the
means to create customized data products by processing
on-demand.
The development of YGGDRASILL was driven by the
needs of the Dutch national programme on climate
change, “Climate changes Spatial Planning”
(http://www.klimaatvoorruimte.nl).
1. INTRODUCTION
Climate change is one of the major environmental issues
for the coming years, both regionally and globally. The
Netherlands is expected to face climate change
impacts on all land use related sectors and on water
management, and therefore on spatial planning in
general.
The programme “Climate changes Spatial Planning”
(CcSP) (http://www.klimaatvoorruimte.nl) focuses on
enhancing joint learning between the research community
and practitioners within spatial planning. Its mission is
to make climate change and climate variability one of
the guiding principles for spatial planning in the
Netherlands.
The main objectives of the programme are:
To offer the Dutch government, the private sector
and other stakeholders a clustered, high-quality and
accessible knowledge infrastructure on the
interface of climate change and spatial planning.
To engage in a dialogue between stakeholders and
scientists in order to support the development
of spatially explicit adaptation and mitigation
strategies that anticipate climate change and
contribute to a safe, sustainable and resilient socio-
economic infrastructure in the Netherlands.
Figure 1. Themes of the CcSP programme
The programme is centred on five main themes: Climate
Scenarios, Mitigation, Adaptation, Integration and
Communication (fig. 1). Projects are interactively
designed to cover issues relevant to climate and spatial
planning and for sectors such as biodiversity and
nature, agriculture, fisheries, fresh water, coastal areas,
transport on land and water, sustainable energy
production, business, finance and insurance and
governmental strategies.
Within the scope of this Dutch national programme, the
COM-1 project in theme 5 (Communication) is one of a
series of projects that are aimed at strengthening the
Dutch knowledge infrastructure.
The goal of the COM-1 project is to create a central portal that
offers project managers and external users access to
(consolidated) data products from selected projects
within the Adaptation, Mitigation and Climate
Scenarios themes.
Distribution of data products over the Internet may seem
a straightforward problem to solve. However, in reality
the situation is more complex. In many cases the CcSP
(science) projects do not possess the expertise and/or
resources to make their data more accessible. Project
priorities typically lie with the acquisition of data and
translating it into knowledge, and less on making it
more accessible to others.
The core problem in the development of an
infrastructural solution for this problem is to make a
large variety of data available in a simple and uniform
way to the user community -- while avoiding multiple
similar developments by the data providers. The
solution for this problem was found in the concept of
the Virtual Data Centre (VDC).
2. INTRODUCING THE VDC
Dutch Space developed the YGGDRASILL concept of the
VDC to provide an answer to the challenges presented
above, and demonstrated it successfully. The concept
consists of a central portal that provides access to data
products. Behind this portal, an infrastructure links
together widely dispersed data sources and computer
platforms in a type of cooperative network. It allows the
exchange of data and sharing of knowledge between
participants (and if necessary, each other’s computer
systems and tools). At the same time, the project groups
retain control over their own specific algorithms and
data collections because these remain on their own
computer systems.
The VDC does not contain a centralized repository for
all data products, but instead creates a central, one-stop-
shopping entry point to the data by providing access
through a searchable catalogue. The actual files
containing the data products remain stored at the
facilities of the science projects themselves. These
projects provide meta-data about the data products to
the VDC to update the central catalogue. The provided
meta-data includes all necessary information to obtain
access to the actual data, plus all information that may
be useful (according to the data provider) to discover
datasets in a search action.
In addition to this, the Virtual Data Centre has an index,
which is a cross-project catalogue about types of data
products. The index contains product-independent meta-
information; each catalogue contains product-specific
meta-information.
The VDC is a geographically distributed system. The
‘central’ part (with the webserver and the index and
catalogue databases) is located at a different location
from the projects’ data servers (which feed the catalogue
and provide actual datasets to users). Some projects may
share hardware, while others have their own servers.
The projects’ systems that serve data
products and the systems that feed updates to the
catalogue are not necessarily the same. Sending updates
to the catalogue may be a part of the production process
of the datasets, but may also be a completely separate,
even manual, process.
The following sections of this paper will discuss the
operation of the virtual data centre from different points
of view.
3. THE USER’S VIEW: GET DATA PRODUCTS
From the viewpoint of the user who wants to access data
products from the science projects, the use of the VDC
is similar to buying items from a ‘web shop’ on the
Internet. Users obtain access to datasets in several steps:
Using a web form (see fig.2), the user consults an
index: this is a list of all types of data products
available within the programme. This
index provides (searchable) descriptions of all
types, with contact information.
From a web page with a list of all data product
types that match the user-specified criteria, the user
selects a data product type.
Once a user has found and selected the type of data
product, a new web page is presented with more
detailed information on this type of data product.
This web page also presents a form in which the
user can search for specific data product instances
in a catalogue that is specific to the selected type
of data products (see fig.3).
The web forms that are used to search in the index and
in the catalogue are designed such that the user can
build his query incrementally. The user selects a
property, and then provides the condition for that
property (some default values may already be provided).
If desired, the user can specify more conditions,
referring to the same or other properties. The conditions
can be combined using ‘AND’, ‘OR’ or ‘WITHOUT’
(=’AND NOT’) constructions.
The example presented in fig.2 shows that the user has
chosen to select the property “Dataset Title” from the
index, and specified for this property the condition
“contains” the text “Country”.
Additionally, the user has specified a second condition
on the property “Topic category”, which “is not any of”
the items from a predefined value list (now showing
“farming”).
These two conditions are connected using the “AND”
construction.
The user can refine the query in this form in several
ways:
Each of the defined conditions can be removed
using the ‘Remove’ buttons at the right-hand side.
Additional conditions can be added by selecting
the name of the property to be used and clicking
on the ‘Add’ button.
The values for the currently defined conditions can
be changed by specifying new values in the
appropriate field.
The relation between the conditions can be
modified by changing the ‘AND’ relation to ‘OR’ or
‘WITHOUT’.
The query can be executed by clicking on the “Find
Now” button.
During the design of the system, this approach was
chosen to make it possible for casual users to use the
system without having to learn any query language,
while advanced users would still have the possibility to
formulate complex queries.
The way in which the queries are constructed does not
have the full expressive power of a query language;
nested conditions are not possible. However,
by supporting specification of a range of values in each
condition, and the “WITHOUT” connection between the
conditions, the possibilities for query formulation
through web forms are regarded as sufficiently
powerful.
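The left-to-right combination of conditions described above can be sketched as follows. This is an illustrative fragment only: the function, property names and operator set are invented here, not taken from the actual portal software.

```python
# Hypothetical sketch: translate the incrementally built web-form conditions
# into a flat (non-nested) query string, applying connectors left to right.
# 'WITHOUT' is rendered as 'AND NOT', as in the web forms.

def build_where(conditions):
    """conditions: list of (connector, property, operator, value) tuples;
    the connector of the first tuple is ignored."""
    parts = []
    for i, (connector, prop, operator, value) in enumerate(conditions):
        if operator == "contains":
            clause = f"{prop} LIKE '%{value}%'"
        elif operator == "between":
            low, high = value
            clause = f"{prop} BETWEEN {low} AND {high}"
        elif operator == "is not":
            clause = f"{prop} <> '{value}'"
        else:  # plain equality
            clause = f"{prop} = '{value}'"
        if i == 0:
            parts.append(clause)
        elif connector == "WITHOUT":  # 'WITHOUT' means 'AND NOT'
            parts.append(f"AND NOT {clause}")
        else:  # 'AND' or 'OR'
            parts.append(f"{connector} {clause}")
    return " ".join(parts)

# The example of fig.2: title contains "Country" AND topic is not "farming".
print(build_where([
    (None, "dataset_title", "contains", "Country"),
    ("AND", "topic_category", "is not", "farming"),
]))
# dataset_title LIKE '%Country%' AND topic_category <> 'farming'
```

Because connectors are applied strictly left to right, no operator precedence or nesting is involved, which matches the deliberately simple form-based query model.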
Fig.3 shows the web form that is used to search through
the product-specific catalogue to find one or more
specific dataset instances of the selected type. Again, to
find these instances, the user specifies conditions that
apply to the metadata of the datasets. By comparing the
metadata conditions specified by the user against the
metadata (of the data product instances) in
the catalogue, a list of matching data product instances is
produced. Each member of this list identifies a
single data product instance.
Figure 2. Building a query to search in the index of all data product types
The example shows that the user has selected the Data
Product type “Country Info 2008”. Some of the
metadata for this type of data product is displayed, such
as a short description (“Pictures of Country maps and
flags with detailed information”) and the licensing (this
information is licensed under the Creative Commons
license).
The user is building a search query, and has specified
that the Population must be between 3 and 50
million.
As with the web form used to search in the index of all
known types of data products, extra conditions can be
added to the query, in this case based on the “country
name”, “capital name”, “area size” and “population”
properties.
Note that none of the properties have been hard-coded:
the complete form is generated automatically from the
meta-information of the data product type.
The “Find now” button uses all information specified in
the form to find all instances of data products that match
these user-specified criteria.
When the query is executed, it returns a list of data
product instances that match the conditions
specified. This list shows columns with a small
number of (metadata) attributes of the found
instances (here: ‘country name’, ‘capital name’,
‘area size’ and ‘population’). This information is
extracted from the data product’s catalogue. By
clicking on the ‘Show Details’ button, the user can
request a screen with more detailed metadata.
Which metadata properties are displayed in the list
view and the detailed view is not hard-coded, but is
decided by the provider of the data products.
At this moment the user has located data products
that may be interesting or useful – but only the
metadata is available, not the data itself. In order to
obtain the data, the user only has to click on the
‘Order’ button.
Note: if the user is not authorized to obtain the data
product, the ‘Order’ button is dimmed. Access rights are
only checked at login, and when data products are ordered.
During the development of the system it was decided
that all users should have the possibility to search for
data products. After all, the primary purpose for
building the system was to make data available to users.
Figure 3. Building a query to find data product instances in the catalogue
What happens when the user orders a product depends
on the facilities of the data provider.
If the data provider has his own server, the order
button just contains the download URL, and the file
containing the data product instance is downloaded
directly.
If the data provider cannot provide a server, the
YGGDRASILL data product ordering mechanism will
be activated. This does not provide a direct
download of the data product’s files, but places the
order in an order queue, and – invisible to the user
– the VDC software contacts the party that provides
the data product to fetch it. It is not necessary for
the user to stay connected to the portal. When the
portal receives the ordered file, it places the file in a
‘parking area’ at the portal. A message will appear
when the user revisits the portal, with a link to
download the files from the portal.
An additional function is available when the
YGGDRASILL ordering mechanism is used: product
customization and/or processing on demand.
This option is provided by an additional web form that
opens when the ‘Order’ button is clicked. It contains a
number of fields (defined by the data provider) through
which the user can specify custom processing of his
order.
Examples of this customization are:
Converting the data product instance to a different
format (especially for graphics formats)
Reducing temporal or spatial resolution of the data
Extracting a subset of the data from a much larger
set.
In principle, any kind of customization can be done –
but this requires the development of additional software
by the data provider because only they have the
necessary domain-specific expertise to build the tools
for this purpose.
Figure 4. List of data product instances that match the user-specified criteria
4. THE DATA PROVIDER’S VIEW
The virtual data centre was designed to be easy to use
not only for the users (‘data consumers’) but also for the
data providers, because many of them are small science
projects that do not have the level of expertise and/or
resources to operate a real data centre.
In order to distribute data through the YGGDRASILL
virtual data centre the data provider must take the
following steps.
Maintain a repository of data products. As was
explained before, the VDC does not have a centralized
repository that contains copies of all data product files.
The data provider must maintain its own repository for
these files – a task that should require no extra effort.
YGGDRASILL does not impose any restrictions on how
this is organized. The data product files may reside on a
file system or in a database.
Provide computer hardware (with Internet
connection) to feed metadata and (when ordered) data
product files to the portal. YGGDRASILL is agnostic with
respect to the platform; operation has been demonstrated
on Windows and Linux platforms. All that is required is
a Java Virtual Machine to run the provider-side
YGGDRASILL software and an Internet connection for
outgoing traffic using the http protocol (same
configuration as for a web browser). It is not necessary
to provide dedicated computer hardware. Even the
connection to the Internet may be intermittent, although
this is not recommended for timely delivery of data
products.
Construct a Yggdrasill Dataproduct Definition
(YDD) that defines common information about a type of
data product. This information consists of several parts:
Data Product Type index metadata: a mandatory
set of data that describes this type of dataset. This
consists of generic information like title, publisher
and description. This information is used when the
user searches through the index for a suitable type
of datasets.
Data Product Type catalogue properties: define
which metadata are available for every instance of
this type of dataset. This information is used when
the user searches in the catalogue for specific
instances of this dataset. It also includes initial
settings for access rights, but these can be modified
later through interactive web forms.
The creation of a YDD is a process that requires
interaction between the database administrator of the
virtual data centre and the project scientists. Together
they interpret the needs and dataset properties of the
project and translate them into a dataset type definition.
The YDD is stored in an XML file. This file is read by
the software that implements the YGGDRASILL virtual
data centre, in order to:
add a new record (describing this type of dataset) to
the index.
prepare one or more database tables (the
catalogue specific for this data product type) that
can hold the information about instances of this
type of dataset.
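As an illustration of the second step, a much simplified, invented YDD could be mapped to a catalogue table as follows. The element names, attribute names and SQL type mapping are all assumptions; the real YDD schema and database layout are defined by the YGGDRASILL software.

```python
import xml.etree.ElementTree as ET

# A much simplified, invented YDD fragment; the real YDD is considerably
# richer (index metadata, access rules, ordering mechanism, etc.).
YDD = """
<ydd type="country_info_2008">
  <index title="Country Info 2008" publisher="COM-1"/>
  <catalogue>
    <property name="country_name" kind="text"/>
    <property name="population"   kind="number"/>
  </catalogue>
</ydd>
"""

SQL_TYPES = {"text": "VARCHAR(255)", "number": "NUMERIC"}

def catalogue_ddl(ydd_xml):
    """Derive the product-specific catalogue table from the YDD."""
    root = ET.fromstring(ydd_xml)
    columns = ", ".join(
        f"{prop.get('name')} {SQL_TYPES[prop.get('kind')]}"
        for prop in root.find("catalogue"))
    return f"CREATE TABLE cat_{root.get('type')} ({columns})"

print(catalogue_ddl(YDD))
# CREATE TABLE cat_country_info_2008 (country_name VARCHAR(255), population NUMERIC)
```

The point of the mechanism is that the portal can generate both the database schema and the search forms from the same definition, so none of the product-specific properties need to be hard-coded.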
Announce new dataset instances. A dataset instance
announcement informs the catalogue about new or
updated instances of a dataset defined earlier – or may
even delete instances from the catalogue. This operation
is a routine operation that requires no human
interpretation and should be an automated process.
Because dataset instance announcements are formatted
as tab-delimited text files, they can be produced easily
by automated processes, or manually, using a
spreadsheet such as Microsoft Excel.
The tab-delimited files containing the announcements
must be consistent with the dataset type definition.
The Yggdrasill virtual data centre provides Java-based
software that handles the sending of the announcements.
This software uses the HTTP protocol to pass through
firewalls easily.
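Producing such an announcement, and checking it for consistency with the type definition, could look like the sketch below. The column names and the consistency rule are invented for illustration; the real check is performed against the properties defined in the YDD.

```python
import csv, io

# Invented column set; a real announcement must match the catalogue
# properties defined in the YDD for this data product type.
CATALOGUE_COLUMNS = ["id", "country_name", "population"]

def make_announcement(rows):
    """Render dataset instance announcements as a tab-delimited text
    block, checking each row against the type definition."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(CATALOGUE_COLUMNS)  # header row
    for row in rows:
        if set(row) != set(CATALOGUE_COLUMNS):
            raise ValueError(f"row does not match the type definition: {row}")
        writer.writerow(row[c] for c in CATALOGUE_COLUMNS)
    return buf.getvalue()

text = make_announcement(
    [{"id": "NL", "country_name": "Netherlands", "population": 16400000}])
print(text)
```

Because the format is plain tab-delimited text, the same file can equally well come out of a processing pipeline or be exported from a spreadsheet.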
Deploy a data product instance delivery service. In
order to deliver ordered data product instances, the data
providing science project must deploy some kind of
delivery service.
For large, multi-year science projects it may be feasible
to operate their own web or ftp server for this purpose.
Aspect                            | ‘Classical’ Data Centre           | Yggdrasill Virtual Data Centre
Provide Data Repository (storage) | Yes                               | Yes
Provide Hardware (server)         | Yes, dedicated                    | Lightweight hardware (shared)
Provide Internet Connection       | Yes, including support for server | Yes, only http client protocols,
                                  | (ftp and/or http) protocols       | optionally intermittent
Build Portal                      | Yes                               | Not necessary
Install file server               | Yes                               | Only lightweight Yggdrasill software
Configure Firewall                | Yes                               | Not necessary
Account administration            | Yes                               | Optional
Develop Search functions          | Yes                               | No, provided by Yggdrasill
Scalability and Redundancy        | Complex                           | Simple
Table 1. Comparison of provider’s effort for a classical data centre and YGGDRASILL Virtual Data Centre
If this is available, the data product instance
announcements contain a URL to these instances, and
the Order button in the portal of the virtual data centre
links directly to this URL, providing the user with a
direct download opportunity.
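This URL construction can be sketched as follows (the function name, base URL and identifier are assumptions for illustration):

```python
# Hypothetical sketch: for providers with their own server, the Order
# button is just a download URL built from the base URL in the YDD and
# the identifier from the data product instance announcement.
def order_url(base_url, instance_id):
    return f"{base_url.rstrip('/')}/{instance_id}"

print(order_url("http://data.example.org/country-info/", "NL2008"))
# http://data.example.org/country-info/NL2008
```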
For smaller projects that cannot afford their own server
infrastructure, YGGDRASILL offers another solution. At
the side of the data-providing project ‘data delivery
agent’ software is installed that communicates with the
central part of the VDC. This software handles orders
placed at the central portal to deliver the actual data
products.
The Java-based Data Delivery Agent (DDA) software
periodically polls (over http) the central part of the
virtual data centre in order to discover new orders for
data product instances. If it can fulfil an order, it sends
the file containing the ordered data product instance to
the portal, from where it can be downloaded by the user
that ordered it.
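One polling cycle of the DDA can be sketched as follows. The real DDA is Java software delivered with YGGDRASILL; the function names and order format below are invented, and the HTTP calls are abstracted as injected functions.

```python
# Sketch of one Data Delivery Agent polling cycle. Note that both calls
# are *outgoing* requests from the provider's side, which is why no
# incoming firewall rules are needed.

def poll_once(fetch_orders, prepare, upload):
    """fetch_orders: outgoing GET asking the portal for waiting orders.
    prepare: the provider's product-specific script.
    upload: outgoing POST sending the prepared file to the portal."""
    delivered = []
    for order in fetch_orders():
        path = prepare(order["product_id"], order.get("customization"))
        upload(order["order_id"], path)
        delivered.append(order["order_id"])
    return delivered

# In operation this cycle would repeat forever with a pause in between:
#   while True: poll_once(...); time.sleep(poll_interval)
```

Abstracting the transport also makes clear why an intermittent connection is tolerable: a missed cycle simply means the orders wait in the portal's queue until the next poll.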
The DDA does not prepare the files containing the data
product instance by itself. For this, it calls a product-
specific script that must be created by the data provider
(the DDA provides information to this script, such as
product identification and optional customization
parameters). This script can be very simple, typically
using the provided identifier for the ordered data
product instance to build a filename, and copying this file
from a repository to a working directory.
In contrast to this simple approach, the script can also
be very complex in order to prepare a complete custom
data product by running a mathematical model. In this
case it is not even necessary to have a real data product
instance available, as it is created on-the-fly.
Solutions of intermediate complexity could involve
extracting the data product instance from a repository,
and doing some processing on it (format conversion,
visualization, or temporal/spatial resampling).
The time needed to deploy a data providing service
proves to be very short. For a data product type that
requires only a simple script, typically a single day is
sufficient. This includes definition of the metadata that
describes the data product, installing the announcer for
new data product instances, setting up the Data Delivery
Agent and doing some tests.
5. THE DEVELOPER’S VIEW
Now that we have seen how the Virtual Data Centre looks
from the outside (from the viewpoint of a user and
from a data provider), we can take a look at the internal
workings. This description is clarified by fig. 5; the
numbers (1), (2), (3), etc. refer to the activities in that
illustration. A UML diagram of the operation is
presented in fig.6.
As explained before, the deployment of a service that
delivers a new type of data product starts (1) with the
definition of the metadata by constructing a Yggdrasill
Dataproduct Definition (YDD). This is an XML
document that defines common information about a
type of dataset.
This information consists of several parts:
Type index metadata: a mandatory set of data that
describes this type of dataset. This consists of
generic information like product name, publisher
contact information and description. This
information is used when the user searches through
the index for a suitable type of data product.
Definition of Data Product Type catalogue fields:
these define which attributes apply to every instance of
this specific type of data product. This information
is used when the user searches in the catalogue
for specific instances of this dataset. Several types
of attributes are supported, including text strings,
numbers and enumerations. The presentation of
these attributes in the search forms on the web
pages (as shown in fig. 3) is also defined here (default
values, pop-up lists, etc.). The YDD also contains
simple validation rules (range checks) for the
values of these attributes.
Definition of access rules. The YDD can contain
an initial definition of access rules. These rules
grant access to a user or a group of users as they are
defined in the Yggdrasill portal. A rule may include
a delay period, meaning that data product instances
become available to the specified user or group
only when the data product instance is older than
the specified period. This option was implemented
because scientists may prefer to release data to a
large audience only when they have had time to
prepare their publications.
Definition of ordering mechanism. This specifies
how orders are fulfilled. If a direct download is
offered by the data provider, this setting
provides the base URL for the download (the
identifier present in a product instance
announcement is used to build the complete URL).
If no direct download is offered, the YGGDRASILL
order fulfilment mechanism is used and the YDD
contains the information needed for the Data
Delivery Agent.
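The delayed-release access rule described above can be sketched as follows (the field names and rule representation are invented for illustration):

```python
from datetime import date, timedelta

# Hypothetical sketch of the delayed-release rule: an instance is visible
# to the group named in the rule only once it is older than the configured
# delay, e.g. to give scientists time to publish first.
def is_visible(instance_date, rule, today):
    delay = timedelta(days=rule.get("delay_days", 0))
    return today - instance_date >= delay

rule = {"group": "public", "delay_days": 365}   # release after one year
print(is_visible(date(2007, 1, 1), rule, date(2008, 6, 1)))   # True
print(is_visible(date(2008, 5, 1), rule, date(2008, 6, 1)))   # False
```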
The metadata used in the Yggdrasill virtual data centre
is based on the ISO 19115 standard (a schema required
for describing geographic information and services. It
provides information about the identification, the extent,
the quality, the spatial and temporal schema, spatial
reference, and distribution of digital geographic data.)
The creation of a YDD is a process that requires
interaction between the virtual data centre’s
administrator and the project scientists. Together
they interpret the needs and dataset properties of the
project and translate them into a dataset
type definition.
After the YDD has been prepared, the YDD file is
installed (2). The metadata from the dataset type
definition is automatically used to define a new type of
dataset by:
Adding a new record (describing this type of data
product) to the index.
Creating one or more database tables (extending
the catalogue) that can hold the information about
instances of this type of data product.
The preparatory activities are continued by installing the
Data Product Announcer and Data Delivery Agent at
the site of the data provider. Both are provided as part
of the YGGDRASILL deployment in the form of platform-
independent Java software. Finally, the script is
prepared that prepares ordered data products.
The operational use of the virtual data centre starts
when the data provider starts creating data products (3).
These data products are placed in a repository (4).
YGGDRASILL does not impose any requirements on how
this is implemented. As a side effect of the data
production, or as a separate action, the data provider
creates data product announcements (5). These are sent
by the Data Product Announcer to the Data Product
Ingester running at the portal (6). A data product
instance announcement informs the catalogue about new
or updated instances of a dataset defined earlier – or
may even delete instances from the catalogue. This
results in changes in the catalogue’s database tables (7)
created from the definitions in the YDD. Prior to this
update, the contents of the data product instance
announcement will be validated against acceptance rules
stated in the YDD.
This is a routine operation that requires no
human interpretation and runs as an automated process.
Because data product instance announcements are
formatted as tab-delimited text files, they can be
produced easily by automated processes, or manually,
using a spreadsheet such as Microsoft Excel.
When these actions have been taken, the Virtual Data
Centre is ready to provide data to the users.
As described earlier in this document, the user uses the
portal to search for data products (8). This search through
the index (9) returns a list of relevant data product types
(10).
Figure 5. Operation of the virtual data centre
After the user selects a data product type from the list,
the portal:
Uses the metadata from the YDD to build a custom
web form by which the user can specify data
product-specific search criteria.
Uses the metadata from the YDD to build a
database query that searches through the product-
specific catalogue, which contains the metadata that
describe the instances of the data product.
Presents the results from this search in a web page.
This web page contains links to obtain the dataset
instance files.
To prepare for ordering data products, the portal:
Uses both type-specific metadata (from the index)
and instance-specific metadata (from the catalogue)
to determine access rights.
Uses information from the dataset instance
announcement to determine the source from which
the data product instance can be obtained.
This information is used to generate an Order button on
the web form. Data providers that operate their own
servers (HTTP or FTP) can include the appropriate
direct download links in their data product instance
announcements.
Data providers (typically smaller organisations) that do
not provide access to their server to the outside world
can use another approach. When the user places the
order (11), the portal places it in an order queue (12).
When the data provider’s Data Delivery Agent polls the
portal to check if any orders for data product instances
are waiting, the portal responds by sending (13) an
identification of the ordered data product instance
(obtained from the catalogue for this data product type).
On receiving this identification, the Data Delivery
Agent launches the provider's custom script to prepare
(14) the ordered data product instance and sends (15) the
ordered data product instance to the portal, where it is
received by the Data Product Receiver (16) and placed in
a parking area of the central portal, from which it can be
downloaded by the user who ordered the product (17).
6. ADVANTAGES OF THE YGGDRASILL VDC
For users: the YGGDRASILL VDC offers several
advantages. A single portal provides access to a
range of data products, presented in the familiar ‘web
shop’ set-up. Because the products share this portal, the
user does not have to learn different search and
download methods, and only needs a single account.
Figure 6. UML Diagram of Data Product Order and Delivery
Easy deployment: the YGGDRASILL VDC was designed
for easy deployment. Installation of the Java-based
software at the data provider’s facilities is simple. There
is no need to configure a firewall to allow incoming
traffic to pass through. The hardest work is the domain-
specific setup: determining the meta-data for the YDD
and (sometimes) the creation of the script that extracts
the ordered data product from a repository. Typically,
adding a provider with a new type of data product to the
VDC takes about one day of work (more may be needed
when custom data products are generated ad-hoc).
No dedicated server hardware required: especially
small science projects may have difficulties assigning
dedicated hardware for data dissemination. The Data
Delivery Agent can run as a background task while on
a normal PC while it is used for other activities. It is
even no problem when it runs on a laptop which has an
intermittent connection to the Internet: because the
Data Delivery Agent uses a polling mechanism to
retrieve orders from the portal, no failures occur
because the portal tries to connect to the Data Delivery
Agent. When the DDA is disconnected, it just does not
poll the portal anymore, and when the connection is
restored, it just can continue asking for new orders to
be fulfilled.
Of course, for projects that intend to provide frequent
or high-priority data products, assigning dedicated
hardware is preferred, but this can be lightweight
hardware – a typical PC is adequate.
Note: because YGGDRASILL provides no centralized
storage, the data provider must always provide its own
resources to store the repository of its own data
products.
Secure Solution: because the Data Delivery Agent,
running on the data provider’s computer(s), takes the
initiative of polling for the orders, the http network
traffic passes easily through firewalls. In fact, the Data
Delivery Agent appears to a firewall as just a web
browser, and nearly all firewalls are already configured
to allow this kind of traffic. Because the Data Delivery
Agent takes the initiative, there is also no need to
respond to incoming traffic, or to modify firewall
settings – a safe solution.
Scalability and robustness: in the description given
before, only a single Data Delivery Agent is running to
deliver the data product instances, on just one computer.
The design of YGGDRASILL also supports other models.
It is possible to run several DDAs simultaneously on the
same computer (each delivering a single data product),
or run a single DDA that handles the delivery of several
types of data products. The latter model is most suitable for a
science project that delivers multiple types of data
products that are ordered infrequently. The opposite
model, intended for projects that have to supply data
products frequently and timely, is to run the same DDA
on several computers simultaneously. Each of them
polls the YGGDRASILL portal independently, and
receives different orders to fulfil.
Because the software running on the portal re-queues orders when a DDA does not respond within a set period of time, these orders will automatically be picked up by another computer, providing a robust solution.
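The re-queueing behaviour can be sketched as follows (class and method names and the lease period are illustrative assumptions, not the real portal code):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of the portal-side order queue: an order claimed by one DDA
// returns to the queue if that agent does not deliver within the lease
// period, so another polling agent can pick it up.
public class RequeueSketch {
    static final long LEASE_MILLIS = 60_000;   // assumed timeout value

    private final Deque<String> pending = new ArrayDeque<>();
    private final Map<String, Long> claimed = new HashMap<>();  // order -> claim time

    // A user's order enters the queue.
    void submit(String orderId) { pending.addLast(orderId); }

    // A polling DDA claims the next order, or null when the queue is empty.
    String claim(long now) {
        String order = pending.pollFirst();
        if (order != null) claimed.put(order, now);
        return order;
    }

    // Delivery succeeded: the order leaves the system.
    void complete(String orderId) { claimed.remove(orderId); }

    // Orders whose lease expired go back to the front of the queue.
    void requeueExpired(long now) {
        claimed.entrySet().removeIf(e -> {
            if (now - e.getValue() > LEASE_MILLIS) {
                pending.addFirst(e.getKey());
                return true;
            }
            return false;
        });
    }
}
```

No coordination between the agents is needed: each order is leased to exactly one DDA at a time, and an unresponsive agent merely delays, never loses, an order.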
Optional User Administration: the VDC handles the
authentication of users. The data provider does not have
to create and manage user accounts. Yggdrasill supports
authorization of user groups; administration of these
groups can be done through simple web forms.
The above advantages make the virtual data centre very useful for small science projects, even when they are executed by large institutions (which may not be willing to support decentralized servers on their networks).

DASIA2009 Yggdrasyll

The programme “Climate changes Spatial Planning” (CcSP) (http://www.klimaatvoorruimte.nl) focuses on enhancing joint learning between these communities and practitioners within spatial planning. Its mission is to make climate change and climate variability one of the guiding principles for spatial planning in the Netherlands. The main objectives of the programme are:
- To offer the Dutch government, the private sector and other stakeholders a clustered, high-quality and accessible knowledge infrastructure on the interface of climate change and spatial planning.
- To engage in a dialogue between stakeholders and scientists in order to support the development of spatially explicit adaptation and mitigation strategies that anticipate climate change and contribute to a safe, sustainable and resilient socio-economic infrastructure in the Netherlands.

Figure 1. Themes of the CcSP programme

The programme is centred on five main themes: Climate Scenarios, Mitigation, Adaptation, Integration and Communication (fig. 1). Projects are interactively designed to cover issues relevant to climate and spatial planning, and for sectors such as biodiversity and nature, agriculture, fisheries, fresh water, coastal areas, transport on land and water, sustainable energy production, business, finance and insurance, and governmental strategies.

Within the scope of this Dutch national programme, the COM-1 project in theme 5 (Communication) is one of a series of projects aimed at strengthening the Dutch knowledge infrastructure. The goal of the COM-1 project is to create a central portal that offers project managers and external users access to (consolidated) data products from selected projects within the Adaptation, Mitigation and Climate Scenarios themes.

Distribution of data products over the Internet may seem a straightforward problem to solve. However, in reality the situation is more complex. In many cases the CcSP (science) projects do not possess the expertise and/or resources to make their data more accessible. Project priorities typically lie with the acquisition of data and translating it into knowledge, and less with making it more accessible to others.

The core problem in the development of an infrastructural solution for this problem is to make a large variety of data available in a simple and uniform way to the user community, while avoiding multiple similar developments by the data providers. The solution for this problem was found in the concept of the Virtual Data Centre (VDC).

2. INTRODUCING THE VDC

Dutch Space developed the YGGDRASILL concept of the VDC to provide an answer to the challenges presented above, and demonstrated it successfully. The concept consists of a central portal that provides access to data products. Behind this portal, an infrastructure links together widely dispersed data sources and computer platforms in a type of cooperative network. It allows the exchange of data and sharing of knowledge between participants (and, if necessary, each other’s computer systems and tools). At the same time, the project groups retain control over their own specific algorithms and data collections, because these remain on their own computer systems.

The VDC does not contain a centralized repository for all data products, but instead creates a central, one-stop-shopping entry point to the data by providing access through a searchable catalogue. The actual files containing the data products remain stored at the facilities of the science projects themselves. These projects provide meta-data about the data products to the VDC to update the central catalogue.
The provided meta-data includes all information necessary to obtain access to the actual data, plus all information that may be useful (according to the data provider) to discover datasets in a search action. In addition, the Virtual Data Centre has an index, which is a cross-project catalogue of the types of data products. The index contains product-independent meta-information; each catalogue contains product-specific meta-information.

The VDC is a geographically distributed system. The ‘central’ part (with the web server, and the index and catalogue databases) is located at a different location from the projects’ data servers (which feed the catalogue and provide actual datasets to users). Some projects may share hardware; in other cases a project may have its own servers. The projects’ systems that serve data products and the systems that feed updates to the catalogue are not necessarily the same. Sending updates to the catalogue may be part of the production process of the datasets, but may also be a completely separate, even manual, process.

The following sections of this paper discuss the operation of the virtual data centre from different points of view.

3. THE USER’S VIEW: GET DATA PRODUCTS

From the viewpoint of a user who wants to access data products from the science projects, the use of the VDC is similar to buying items from a ‘web shop’ on the Internet. Users obtain access to datasets in several steps:
- Using a web form (see fig. 2), the user consults an index: a list of all types of data products available within the programme. The index provides (searchable) descriptions of all types, with contact information.
- From a web page with a list of all data product types that match the user-specified criteria, the user selects a data product type.
- Once the user has found and selected the type of data product, a new web page is presented with more detailed information on this type of data product.
This web page also presents a form in which the user can search for specific data product instances in a catalogue that is specific to the selected type of data products (see fig. 3).

The web forms that are used to search in the index and in the catalogue are designed such that the user can build a query incrementally. The user selects a property, and then provides the condition for that property (some default values may already be provided). If desired, the user can specify more conditions, referring to the same or other properties. The conditions can be combined using ‘AND’, ‘OR’ or ‘WITHOUT’ (= ‘AND NOT’) constructions.
The example presented in fig. 2 shows that the user has chosen to select the property “Dataset Title” from the index, and specified for this property the condition “contains” the text “Country”. Additionally, the user has specified a second condition on the property “Topic category”, which “is not any of” the items from a predefined value list (now showing “farming”). These two conditions are connected using the “AND” construction.

The user can refine the query in this form in several ways:
- Each of the defined conditions can be removed using the ‘Remove’ buttons at the right-hand side.
- Additional conditions can be added by selecting the name of the property to be used and clicking on the ‘Add’ button.
- The values for the currently defined conditions can be changed by specifying new values in the appropriate fields.
- The relation between the conditions can be modified by changing the ‘AND’ relation to ‘OR’ or ‘WITHOUT’.
The query can be executed by clicking on the “Find Now” button.

During the design of the system, this approach was chosen to make it possible for casual users to use the system without having to learn any query language, while advanced users would still have the possibility to formulate complex queries. The way in which the queries are constructed does not have the full power of expression of a complete query language; nested conditions are not possible. However, by supporting the specification of a range of values in each condition, and the “WITHOUT” connection between the conditions, the possibilities for query formulation through web forms are regarded as sufficiently powerful.

Fig. 3 shows the web form that is used to search through the product-specific catalogue to find one or more specific dataset instances of the selected type. Again, to find these instances, the user specifies conditions that apply to the metadata of the datasets.
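The flat, non-nested way in which these conditions combine can be sketched as follows; this is an illustrative model, not the actual portal implementation. Conditions fold left to right with AND, OR and WITHOUT (= AND NOT):

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of evaluating the web form's flat condition list. Nesting is
// deliberately unsupported, matching the design trade-off described above.
public class QuerySketch {
    enum Op { AND, OR, WITHOUT }

    // One row of the form: a connective plus a condition on the item.
    record Clause<T>(Op op, Predicate<T> condition) {}

    // The first clause's connective is ignored; later clauses fold
    // left to right onto the running result.
    static <T> boolean matches(T item, List<Clause<T>> clauses) {
        boolean result = clauses.get(0).condition().test(item);
        for (Clause<T> c : clauses.subList(1, clauses.size())) {
            boolean v = c.condition().test(item);
            result = switch (c.op()) {
                case AND -> result && v;
                case OR -> result || v;
                case WITHOUT -> result && !v;
            };
        }
        return result;
    }
}
```

For instance, the query of fig. 2 ("title contains 'Country'" WITHOUT "topic is farming") becomes a two-clause list evaluated against each catalogue record.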
By comparing the metadata conditions specified by the user against the metadata (of the data product instances) in the catalogue, a list of matching data product instances is produced. Each member of this list identifies a single data product instance.

Figure 2. Building a query to search in the index of all data product types
The example shows that the user has selected the data product type “Country Info 2008”. Some of the metadata for this type of data product is displayed, such as a short description (“Pictures of Country maps and flags with detailed information”) and the licensing (this information is licensed under the Creative Commons license). The user is building a search query, and has specified that the population must be between 3 and 50 million. As with the web form used to search in the index of all known types of data products, extra conditions can be added to the query, in this case based on the “country name”, “capital name”, “area size” and “population” properties. Note that none of the properties have been hard-coded: the complete form is generated automatically from the meta-information of the data product type.

The “Find now” button uses all information specified in the form to find all instances of data products that match these user-specified criteria. When the query is executed, it returns a list of data product instances that match the specified conditions. This list shows columns with a small number of (metadata) attributes of the found instances (here: ‘country name’, ‘capital name’, ‘area size’ and ‘population’). This information is extracted from the data product’s catalogue. By clicking on the ‘Show Details’ button, the user can request a screen with more detailed metadata. Which metadata properties are displayed in the list view and the detailed view is not hard-coded, but is decided by the provider of the data products.

At this moment the user has located data products that may be interesting or useful – but only the metadata is available, not the data itself. To obtain the data, the user only has to click on the ‘Order’ button. Note: if the user is not authorized to obtain the data product, the ‘Order’ button is dimmed. Access rights are only checked at login, and when data products are ordered.
During the development of the system it was decided that all users should have the possibility to search for data products. After all, the primary purpose for building the system was to make data available to users.

Figure 3. Building a query to find data product instances in the catalogue
What happens when a user orders a product depends on the facilities of the data provider. If the data provider has its own server, the order button simply contains the download URL, and the file containing the data product instance is downloaded directly.

If the data provider cannot provide a server, the YGGDRASILL data product ordering mechanism is activated. This does not provide a direct download of the data product’s files; instead it places the order in an order queue, and – invisibly to the user – the VDC software contacts the party that provides the data product to fetch it. It is not necessary for the user to stay connected to the portal. When the portal receives the ordered file, it places the file in a ‘parking area’ at the portal. A message will appear when the user revisits the portal, with a link to download the files from the portal.

An additional function is available when the YGGDRASILL ordering mechanism is used: product customization and/or processing on demand. This option is provided by an additional web form that opens when the ‘Order’ button is clicked. It contains a number of fields (defined by the data provider) through which the user can specify custom processing of the order. Examples of this customization are:
- converting the data product instance to a different format (especially for graphics formats);
- reducing the temporal or spatial resolution of the data;
- extracting a subset of the data from a much larger set.
In principle, any kind of customization can be done – but this requires the development of additional software by the data provider, because only they have the necessary domain-specific expertise to build the tools for this purpose.

Figure 4. List of data product instances that match the user-specified criteria
4. THE DATA PROVIDER’S VIEW

The virtual data centre was designed to be easy to use not only for the users (‘data consumers’) but also for the data providers, because many of them are small science projects that do not have the expertise and/or resources to operate a real data centre. In order to distribute data through the YGGDRASILL virtual data centre, the data provider must take the following steps.

Maintain a repository of data products. As explained before, the VDC does not have a centralized repository that contains copies of all data product files. The data provider must maintain its own repository for these files – a task that should require no extra effort. YGGDRASILL does not impose any restrictions on how this is organized; the data product files may reside on a file system or in a database.

Provide computer hardware (with an Internet connection) to feed metadata and (when ordered) data product files to the portal. YGGDRASILL is agnostic with respect to the platform; operation has been demonstrated on Windows and Linux platforms. All that is required is a Java Virtual Machine to run the provider-side YGGDRASILL software and an Internet connection for outgoing traffic using the http protocol (the same configuration as for a web browser). It is not necessary to provide dedicated computer hardware. Even the connection to the Internet may be intermittent, although this is not recommended for timely delivery of data products.

Construct a Yggdrasill Dataproduct Definition (YDD) that defines common information about a type of data product. This information consists of several parts:
- Data Product Type index metadata: a mandatory set of data that describes this type of dataset. It consists of generic information like title, publisher and description, and is used when the user searches through the index for a suitable type of dataset.
- Data Product Type catalogue properties: define which metadata are available for every instance of this type of dataset. This information is used when the user searches the catalogue for specific instances of the dataset. It also includes initial settings for access rights, but these can be modified later through interactive web forms.

The creation of a YDD is a process that requires interaction between the database administrator of the virtual data centre and the project scientists. Together, the needs and dataset properties of the project are interpreted and translated into a dataset type definition. The YDD is stored in an XML file. This file is read by the software that implements the YGGDRASILL virtual data centre in order to:
- add a new record (describing this type of dataset) to the index;
- prepare one or more database tables (the catalogue specific to this data product type) that can hold the information about instances of this type of dataset.

Announce new dataset instances. A dataset instance announcement informs the catalogue about new or updated instances of a dataset defined earlier – or may even delete instances from the catalogue. This is a routine operation that requires no human interpretation and should be an automated process. Because dataset instance announcements are formatted as tab-delimited text files, they can be produced easily by automated processes, or manually using a Microsoft Excel spreadsheet. The tab-delimited files containing the announcements must be consistent with the dataset type definition. The Yggdrasill virtual data centre provides Java-based software that handles the sending of the announcements. This software uses the HTTP protocol to pass through firewalls easily.

Deploy a data product instance delivery service. In order to deliver ordered data product instances, the data-providing science project must deploy some kind of delivery service.
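The consistency requirement between a tab-delimited announcement and the YDD can be sketched as follows (illustrative code and property names, not the shipped YGGDRASILL parser): each announcement line must supply exactly one value per declared catalogue property, in the declared order.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of checking one tab-delimited announcement line against the
// catalogue properties declared in the YDD.
public class AnnouncementSketch {

    // Split one announcement line and verify the field count.
    static List<String> parse(String line, List<String> yddProperties) {
        String[] fields = line.split("\t", -1);   // -1 keeps trailing empty fields
        if (fields.length != yddProperties.size()) {
            throw new IllegalArgumentException(
                "expected " + yddProperties.size() + " fields, got " + fields.length);
        }
        return Arrays.asList(fields);
    }
}
```

A line with a missing or extra field is rejected rather than silently shifting values into the wrong catalogue columns.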
For large, multi-year science projects it may be feasible to operate their own web or ftp server for this purpose.

Aspect                            | ‘Classical’ Data Centre                                        | Yggdrasill Virtual Data Centre
Provide Data Repository (storage) | Yes                                                            | Yes
Provide Hardware (server)         | Yes, dedicated                                                 | Lightweight hardware (shared)
Provide Internet Connection       | Yes, including support for server (ftp and/or http) protocols  | Yes, only http client protocols; optionally intermittent
Build Portal                      | Yes                                                            | Not necessary
Install file server               | Yes                                                            | Only lightweight Yggdrasill software
Configure Firewall                | Yes                                                            | Not necessary
Account administration            | Yes                                                            | Optional
Develop Search functions          | Yes                                                            | No, provided by Yggdrasill
Scalability and Redundancy        | Complex                                                        | Simple

Table 1. Comparison of provider’s effort for a classical data centre and the YGGDRASILL Virtual Data Centre
If such a server is available, the data product instance announcements contain a URL to these instances, and the Order button in the portal of the virtual data centre links directly to this URL, giving the user a direct download opportunity.

For smaller projects that cannot afford their own server infrastructure, YGGDRASILL offers another solution. At the side of the data-providing project, ‘data delivery agent’ software is installed that communicates with the central part of the VDC. This software handles orders placed at the central portal and delivers the actual data products. The Java-based Data Delivery Agent (DDA) software periodically polls (over HTTP) the central part of the virtual data centre to discover new orders for data product instances. If it can fulfil an order, it sends the file containing the ordered data product instance to the portal, from where it can be downloaded by the user who ordered it.

The DDA does not prepare the files containing the data product instance by itself. For this, it calls a product-specific script that must be created by the data provider (the DDA passes information to this script, such as the product identification and optional customization parameters). This script can be very simple, typically using the provided identifier of the ordered data product instance to build a filename and copying this file from a repository to a working directory. At the other end of the spectrum, the script can be very complex and prepare a fully customized data product by running a mathematical model; in this case it is not even necessary to have a real data product instance available, as it is created on-the-fly. Solutions of intermediate complexity could involve extracting the data product instance from a repository and doing some processing on it (format conversion, visualization, or temporal/spatial resampling).

The time needed to deploy a data-providing service proves to be very short.
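As a minimal sketch of the simple case described above – using the provided identifier to build a filename and copying the file from a repository to a working directory – the preparation step could look as follows. This is an illustrative assumption, not the actual YGGDRASILL code; the class name, file-naming scheme and directory layout are invented for the example.

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical sketch of the product-specific preparation step that the
// Data Delivery Agent invokes: it maps an ordered product identifier to a
// file in the provider's repository and copies it to a working directory.
public class PrepareProduct {

    // Build the repository filename from the ordered instance identifier
    // (assumption for this sketch: one file per instance, named <id>.nc).
    static Path locate(Path repository, String instanceId) {
        return repository.resolve(instanceId + ".nc");
    }

    // Copy the ordered instance to the working directory from which the
    // Data Delivery Agent picks it up for upload to the portal.
    static Path prepare(Path repository, Path workDir, String instanceId)
            throws IOException {
        Path source = locate(repository, instanceId);
        Path target = workDir.resolve(source.getFileName());
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // The DDA passes the identifier (and optional customization
        // parameters) when it launches the script.
        Path repo = Paths.get(args[0]);
        Path work = Paths.get(args[1]);
        System.out.println(prepare(repo, work, args[2]));
    }
}
```

A more elaborate script would replace the copy step with a format conversion, a resampling operation, or a model run, as described above.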
For a data product type that requires only a simple script, a single day is typically sufficient. This includes defining the metadata that describe the data product, installing the announcer for new data product instances, setting up the Data Delivery Agent and doing some tests.

5. THE DEVELOPER’S VIEW

Now that we have seen how the Virtual Data Centre looks from the outside (from the viewpoints of a user and of a data provider), we can take a look at its internal workings. This description is clarified by fig. 5; the numbers (1), (2), (3), etc. refer to the activities in that illustration. A UML diagram of the operation is presented in fig. 6.

As explained before, the deployment of a service that delivers a new type of data product starts (1) with the definition of the metadata by constructing a Yggdrasill Dataproduct Definition (YDD). This is an XML document that defines common information about a type of dataset. This information consists of several parts:

- Type index metadata: a mandatory set of data that describes this type of dataset. It consists of generic information like the product name, publisher contact information and description, and is used when the user searches through the index for a suitable type of data product.

- Definition of Data Product Type catalogue fields: these define which attributes apply to every instance of this specific type of data product. This information is used when the user searches the catalogue for specific instances of the dataset. Several types of attributes are supported, including text strings, numbers and enumerations. The presentation of these attributes in the search forms on the web pages (as shown in fig. 3) is also defined here (default values, pop-up lists, etc.). The YDD also contains simple validation rules (range checks) for the values of these attributes.

- Definition of access rules. The YDD can contain an initial definition of access rules.
These rules grant access to a user or a group of users as defined in the Yggdrasill portal. A rule may include a delay period, meaning that data product instances become available to the specified user or group only when the instance is older than the specified period. This option was implemented because scientists may prefer to release data to a larger audience only after they have had time to prepare their publications.

- Definition of the ordering mechanism. This specifies how orders are fulfilled. If a direct download is offered by the data provider, this setting provides the base URL for the download (the identifier present in a product instance announcement is used to build the complete URL). If no direct download is offered, the YGGDRASILL order fulfilment mechanism is used and the YDD contains the information needed by the Data Delivery Agent.

The metadata used in the Yggdrasill virtual data centre are based on the ISO 19115 standard, a schema for describing geographic information and services. It provides information about the identification, the extent, the quality, the spatial and temporal schema, the spatial reference, and the distribution of digital geographic data.

The creation of a YDD is a process that requires interaction between the administrator of the virtual data centre and the project scientists. Together they interpret the needs and dataset properties of the project and translate them into a dataset type definition.
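As an illustration, a YDD for a hypothetical precipitation data product might look roughly like the fragment below. All element and attribute names are assumptions made for this example; the actual YDD schema is defined by the YGGDRASILL software.

```xml
<!-- Illustrative sketch of a Yggdrasill Dataproduct Definition (YDD);
     element and attribute names are hypothetical, not the actual schema. -->
<ydd>
  <!-- Index metadata: describes the data product type as a whole -->
  <index>
    <title>Daily precipitation grids</title>
    <publisher>Example Climate Project</publisher>
    <description>Gridded daily precipitation for the Netherlands</description>
  </index>
  <!-- Catalogue fields: attributes stored per product instance -->
  <catalogue>
    <field name="station" type="enumeration" values="DeBilt,Twenthe"/>
    <field name="year" type="number" min="1990" max="2010"/>
  </catalogue>
  <!-- Access rules: optional delay before release to a wider group -->
  <access group="public" delay="P1Y"/>
  <!-- Ordering: direct download base URL, or DDA-based fulfilment -->
  <ordering mode="direct" baseUrl="http://data.example.org/precip/"/>
</ydd>
```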
After the YDD has been prepared, the YDD file is installed (2). The metadata from the dataset type definition are automatically used to define a new type of dataset by:

- adding a new record (describing this type of data product) to the index;
- creating one or more database tables (extending the catalogue) that can hold the information about instances of this type of data product.

The preparatory activities are continued by installing the Data Product Announcer and the Data Delivery Agent at the site of the data provider. Both are provided as part of the YGGDRASILL deployment in the form of platform-independent Java software. Finally, the script that prepares ordered data products is created.

The operational use of the virtual data centre starts when the data provider starts creating data products (3). These data products are placed in a repository (4); YGGDRASILL does not impose any requirements on how this is implemented. As a side effect of the data production, or as a separate action, the data provider creates data product announcements (5). These are sent by the Data Product Announcer to the Data Product Ingester running at the portal (6). A data product instance announcement informs the catalogue about new or updated instances of a dataset defined earlier – or may even delete instances from the catalogue. This results in changes in the catalogue’s database tables (7) created from the definitions in the YDD. Prior to this update, the contents of the data product instance announcement are validated against the acceptance rules stated in the YDD. This is a routine operation that requires no human interpretation and runs as an automated process. Because data product instance announcements are formatted as tab-delimited text files, they can easily be produced by automated processes, or manually using a spreadsheet program such as Microsoft Excel.

When these actions have been taken, the Virtual Data Centre is ready to provide data to the users.
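The tab-delimited announcement format and its validation can be sketched as follows. The column names, the parsing code and the range check are assumptions for this example, not the actual Yggdrasill ingester; in practice the columns and acceptance rules would follow the catalogue fields defined in the YDD.

```java
import java.util.*;

// Hypothetical sketch of parsing a tab-delimited data product instance
// announcement as ingested by the portal. The first line holds the column
// names; each further line announces one product instance.
public class AnnouncementParser {

    static List<Map<String, String>> parse(String text) {
        String[] lines = text.split("\n");
        String[] columns = lines[0].split("\t");
        List<Map<String, String>> records = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isBlank()) continue;
            String[] values = lines[i].split("\t");
            Map<String, String> record = new LinkedHashMap<>();
            for (int c = 0; c < columns.length; c++) {
                record.put(columns[c], values[c]);
            }
            records.add(record);
        }
        return records;
    }

    // A simple acceptance rule of the kind a YDD could state: a range check
    // on a numeric attribute (here an assumed "year" column).
    static boolean yearInRange(Map<String, String> record, int min, int max) {
        int year = Integer.parseInt(record.get("year"));
        return year >= min && year <= max;
    }

    public static void main(String[] args) {
        String announcement =
            "id\tstation\tyear\n" +
            "precip_debilt_2007\tDeBilt\t2007\n" +
            "precip_twenthe_2008\tTwenthe\t2008\n";
        for (Map<String, String> r : parse(announcement)) {
            System.out.println(r.get("id") + " valid=" + yearInRange(r, 1990, 2010));
        }
    }
}
```

The same file could equally well be produced by hand in a spreadsheet and exported as tab-delimited text, as the paper notes.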
As described earlier in this document, the user uses the portal to search for data products (8). This search through the index (9) returns a list of relevant data product types (10). After the user selects a data product type from the list, the portal:

Figure 5. Operation of the virtual data centre
- Uses the metadata from the YDD to build a custom web form by which the user can specify data product-specific search criteria;

- Uses the metadata from the YDD to build a database query that searches through the product-specific catalogue, which contains the metadata that describe the instances of the data product;

- Presents the results of this search in a web page. This web page contains links to obtain the dataset instance files.

To prepare for ordering a data product, the portal:

- Uses both type-specific metadata (from the index) and instance-specific metadata (from the catalogue) to determine access rights;

- Uses information from the dataset instance announcement to determine the source from which the data product instance can be obtained. This information is used to generate an Order button on the web form.

Data providers that operate their own servers (HTTP or FTP) can include the appropriate direct download links in their data product instance announcements. When the user places the order (11), the Order button links directly to the provider’s server for a direct download.

Data providers (typically smaller organisations) that do not expose their servers to the outside world can use another approach. When the user places the order (11), the portal places it in an order queue (12). When the data provider’s Data Delivery Agent polls the portal to check whether any orders for data product instances are waiting, the portal responds by sending (13) an identification of the ordered data product instance (obtained from the catalogue for this data product type). On receiving this identification, the Data Delivery Agent launches the provider’s custom script to prepare (14) the ordered data product instance and sends (15) it to the portal, where it is received by the Data Product Receiver (16) and placed in a parking area of the central portal, from which it can be downloaded by the user who ordered the product (17).
6. ADVANTAGES OF THE YGGDRASILL VDC

For users: the YGGDRASILL VDC offers several advantages to users. A single portal provides access to a range of data products, presented in the familiar ‘web shop’ set-up.

Figure 6. UML Diagram of Data Product Order and Delivery

Because the products share this portal, the
user does not have to learn different search and download methods, and only needs a single account.

Easy deployment: the YGGDRASILL VDC was designed for easy deployment. Installation of the Java-based software at the data provider’s facilities is simple, and there is no need to configure a firewall to allow incoming traffic. The hardest work is the domain-specific set-up: determining the metadata for the YDD and (sometimes) creating the script that extracts the ordered data product from a repository. Typically, adding a provider with a new type of data product to the VDC takes about one day of work (more may be needed when custom data products are generated on demand).

No dedicated server hardware required: especially small science projects may have difficulties assigning dedicated hardware for data dissemination. The Data Delivery Agent can run as a background task on a normal PC while that PC is used for other activities. It is not even a problem when it runs on a laptop with an intermittent connection to the Internet: because the Data Delivery Agent uses a polling mechanism to retrieve orders from the portal, no failures occur from the portal trying to connect to the Data Delivery Agent. When the DDA is disconnected, it simply stops polling the portal; when the connection is restored, it continues asking for new orders to fulfil. Of course, for projects that intend to provide frequent or high-priority data products, assigning dedicated hardware is preferred, but this can be lightweight hardware – a typical PC is adequate. Note: because YGGDRASILL provides no centralized storage, the data provider must always provide its own resources to store the repository of its own data products.

Secure solution: because the Data Delivery Agent, running on the data provider’s computer(s), takes the initiative of polling for orders, the HTTP network traffic passes easily through firewalls.
In fact, to a firewall the Data Delivery Agent appears as just a web browser, and nearly all firewalls are already configured to allow this kind of traffic. The fact that the Data Delivery Agent takes the initiative also makes this a safe solution – there is no need to respond to incoming traffic or to modify firewall settings.

Scalability and robustness: in the description given before, only a single Data Delivery Agent is running to deliver the data product instances, on just one computer. The design of YGGDRASILL also supports other models. It is possible to run several DDAs simultaneously on the same computer (each delivering a single type of data product), or to run a single DDA that handles the delivery of several types of data products. This model is most suitable for a science project that delivers multiple types of data products that are ordered infrequently. The opposite model, intended for projects that have to supply data products frequently and in a timely manner, is to run the same DDA on several computers simultaneously. Each of them polls the YGGDRASILL portal independently and receives different orders to fulfil. Because the software running on the portal re-queues orders when a DDA does not respond within a set period of time, these orders will automatically be picked up by another computer, thus providing a robust solution.

Optional user administration: the VDC handles the authentication of users, so the data provider does not have to create and manage user accounts. Yggdrasill supports authorization of user groups; administration of these groups can be done through simple web forms.

The above advantages make the virtual data centre very useful for small science projects, even when they are executed by large institutions (which may not be willing to support decentralized servers on their network).
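The re-queueing behaviour described above can be modelled as a leased order queue: an order handed to a DDA is leased for a limited period and returns to the queue if no delivery arrives in time. The sketch below is an illustrative model with invented names and timings, not the actual portal code.

```java
import java.util.*;

// Illustrative model of the portal's order queue: when a Data Delivery
// Agent picks up an order, the order is leased for a limited period; if no
// product is delivered before the lease expires, the order becomes
// available again and another polling DDA can pick it up.
public class OrderQueue {
    private final Deque<String> waiting = new ArrayDeque<>();
    private final Map<String, Long> leased = new HashMap<>(); // order -> deadline
    private final long leaseMillis;

    OrderQueue(long leaseMillis) { this.leaseMillis = leaseMillis; }

    synchronized void place(String orderId) { waiting.addLast(orderId); }

    // Called when a DDA polls: first re-queue expired leases, then hand
    // out the next waiting order (or null when none is waiting).
    synchronized String poll(long now) {
        leased.entrySet().removeIf(e -> {
            if (e.getValue() <= now) { waiting.addLast(e.getKey()); return true; }
            return false;
        });
        String next = waiting.pollFirst();
        if (next != null) leased.put(next, now + leaseMillis);
        return next;
    }

    // Called when the DDA has delivered the product for this order.
    synchronized void delivered(String orderId) { leased.remove(orderId); }

    public static void main(String[] args) {
        OrderQueue q = new OrderQueue(1000);
        q.place("order-1");
        String first = q.poll(0);      // DDA A picks up order-1
        String none = q.poll(500);     // DDA B polls: nothing waiting (null)
        String again = q.poll(2000);   // lease expired: order-1 re-queued
        System.out.println(first + " " + none + " " + again);
    }
}
```

Running several DDAs against such a queue gives the robustness described above: an unresponsive DDA merely lets its lease expire, and another computer fulfils the order.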