Data Linkage is an important step that can provide valuable insights for evidence-based decision making, especially for crucial events. Performing sensible queries across heterogeneous databases containing millions of records is a complex task that requires a complete understanding of each contributing database’s schema to define the structure of its information. The key aim is to approximate the structure and content of the induced data into a concise synopsis in order to extract and link meaningful data-driven facts. We identify four major research issues in Data Linkage: the costs associated with pairwise matching, record matching overheads, restrictions on the semantic flow of information, and the limitations of single-order classification. In this paper, we give a literature review of research in Data Linkage. The purpose of this review is to establish a basic understanding of Data Linkage and to discuss the background of the Data Linkage research domain. In particular, we focus on the literature related to recent advancements in Approximate Matching algorithms at the attribute level and the structure level. Their efficiency, functionality and limitations are critically analysed, and open problems are exposed.
The Road to Open Data Enlightenment Is Paved With Nice Excuses (Toon Vanagt)
The road to open data enlightenment is paved with nice excuses! These slides include 11 open data revenue models for government agencies that 'pragmatically' need to keep generating revenue as 'authentic sources'. This presentation was delivered by Toon Vanagt from https://data.be as the opening keynote of the 'opening-up' conference in Brussels on 3/12/2014.
No sql databases new millennium database for big data, big users, cloud compu... (eSAT Publishing House)
IJRET : International Journal of Research in Engineering and Technology is an international peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Presentation delivered by Ludo Hendrickx and Joris Beek on 11 December 2013, in Dutch, at the Ministry of the Interior, The Hague, The Netherlands. More information on: https://joinup.ec.europa.eu/community/ods/description
Role of Data Cleaning in Data Warehouse (Ramakant Soni)
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.
Annotation Approach for Document with Recommendation (ijmpict)
An enormous number of organizations generate and share textual descriptions of their products, facilities, and activities. Such collections of textual data contain a significant amount of structured information, which remains buried in the unstructured text. While information extraction systems simplify the extraction of structured relations, they are frequently expensive and inaccurate, particularly when working on top of text that does not contain any examples of the targeted structured data. We propose an alternative methodology that simplifies structured metadata generation by recognizing documents that are likely to contain information of interest; this data is beneficial for querying the database. Moreover, we present algorithms to extract attribute-value pairs and devise new mechanisms to map such pairs to manually created schemas. We apply a clustering technique to the item content information to complement the user rating information, which improves the accuracy of collaborative similarity and addresses the cold-start problem.
Improving Service Recommendation Method on Map reduce by User Preferences and Reviews (paperpublications3)
Abstract: Service recommender systems have been shown to be valuable tools for providing appropriate recommendations to users. In the last decade, the number of customers, services and the amount of online information has grown rapidly, yielding a big data analysis problem for service recommender systems. Consequently, traditional service recommender systems often suffer from scalability and inefficiency problems. Most existing service recommender systems present the same ratings and rankings of services to different users without considering diverse users' preferences, and therefore fail to meet users' personalized requirements. In this paper, we address the above challenges by presenting a personalized service recommendation list and recommending the most appropriate services to users effectively. Specifically, keywords are used to indicate users' preferences, and a user-based collaborative filtering algorithm is adopted to generate appropriate recommendations. Keywords: recommender system, user preference, keyword, Big Data, MapReduce, Hadoop.
Title: Improving Service Recommendation Method on Map reduce by User Preferences and Reviews
Author: Dayanand Bhovi, Mr. Ashwin Kumar
ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Paper Publications
Enabling Self-service Data Provisioning Through Semantic Enrichment of Data |... (Ahmad Assaf)
Publicly available datasets contain knowledge from various domains such as encyclopedic, government, geographic, entertainment and so on. The increasing diversity of these datasets makes it difficult to annotate them with a fixed number of pre-defined tags. Moreover, manually entered tags are subjective and may not capture their essence and breadth. We propose a mechanism to automatically attach meta information to data objects by leveraging knowledge bases like DBpedia and Freebase which facilitates data search and acquisition for business users.
Linked Open Data (LOD) has emerged as one of the largest collections of interlinked datasets on the web. In order to benefit from this mine of data, one needs access to descriptive information about each dataset (or metadata). This metadata enables dataset discovery, understanding, integration and maintenance. Data portals, which are datasets' access points, offer metadata represented in different and heterogeneous models. We first propose a harmonized dataset model based on a systematic literature survey that enables complete metadata coverage and supports data discovery, exploration and reuse by business users. Second, rich metadata information is currently limited to a few data portals, where it is usually provided manually and is thus often incomplete and inconsistent in terms of quality. We propose a scalable automatic approach for extracting, validating, correcting and generating descriptive linked dataset profiles. This approach applies several techniques in order to check the validity of the metadata provided and to generate descriptive and statistical information for a particular dataset or for an entire data portal.
Traditional data quality is a thoroughly researched field with several benchmarks and frameworks to grasp its dimensions. Ensuring data quality in Linked Open Data is much more complex. It consists of structured information supported by models, ontologies and vocabularies and contains queryable endpoints and links. We propose an objective assessment framework for Linked Data quality based on quality metrics that can be automatically measured. We further present an extensible quality measurement tool implementing this framework that helps on one hand data owners to rate the quality of their datasets and get some hints on possible improvements, and on the other hand data consumers to choose their data sources from a ranked set.
Big data is a term which refers to those data sets, or combinations of data sets, whose volume, complexity, and rate of growth make them difficult to capture, manage, process or analyse with traditional tools and technologies. Big data is a relatively new concept that includes huge quantities of data, social media analytics and real-time data. In recent years, a lot of effort and many studies have been devoted to developing proficient tools for performing various tasks in big data. Because these collections of datasets are so large and complex, they are difficult to process with traditional data processing applications, which makes producing various big data tools all the more necessary. In this survey, a varied collection of big data tools is illustrated and analysed along with their salient features.
Understanding, Planning and Achieving
Data Quality in Your Organization
by Joe Caserta, President of Caserta Concepts
For more information, visit www.casertaconcepts.com or contact us at info@casertaconcepts.com
The huge volume of text documents available on the internet has made it difficult to find valuable information for specific users. In fact, the need for efficient applications to extract knowledge of interest from textual documents is vitally important. This paper addresses the problem of responding to user queries by fetching the most relevant documents from a clustered set of documents. For this purpose, a cluster-based information retrieval framework is proposed in order to design and develop a system for analysing and extracting useful patterns from text documents. In this approach, a pre-processing step is first performed to find frequent and high-utility patterns in the data set. Then a Vector Space Model (VSM) is used to represent the dataset. The system was implemented in two main phases. In phase 1, the clustering analysis process is designed and implemented to group documents into several clusters, while in phase 2, an information retrieval process is implemented to rank clusters according to the user queries in order to retrieve the relevant documents from specific clusters deemed relevant to the query. The results are then evaluated according to the recall and precision (P@5, P@10) of the retrieved results: P@5 was 0.660 and P@10 was 0.655.
Processing the data generated by transactions that occur every day, which amount to nearly thousands of records per day, requires software that enables users to search for the data they need. Data mining is a solution to this problem. To that end, many large industries began creating software that can perform such data processing. Due to the high cost of obtaining data mining software from big industry, some communities, such as universities, provide convenience for users who simply want to learn or deepen their knowledge of data mining by creating open-source software. Meanwhile, many commercial vendors market their own products. WEKA and Salford System are both data mining software packages, each with advantages and disadvantages. This study compares them using several attributes, so that users can select which software is more suitable for their daily activities.
Painting the Future of Big Data with Apache Spark and MongoDB (MongoDB)
MongoDB is the fastest growing non-relational database, while Apache Spark is the fastest growing data processing engine, and the most active big data project in the history of Apache. Databricks, founded by the creators of Spark, will present how they see Spark evolving to address new use cases, and how to combine the power of MongoDB with Spark.
A Generic Model for Student Data Analytic Web Service (SDAWS) (Editor IJCATR)
Any university management system accumulates a large amount of data, and analytics can be applied to it to gather useful information to aid the academic decision-making process. This paper is a novel attempt to demonstrate the significance of a data analytic web service in the education domain. It can easily be integrated with the University Management System or any other application of the university. Analytics as a web service offers many benefits over traditional analysis methods. The web service can be hosted on a web server and accessed over the internet or on the private cloud of the campus. The data from various courses in different departments can be uploaded and analyzed easily. In this paper we design a web service framework to be used in educational data mining that provides analysis as a service.
Data Anonymization for Privacy Preservation in Big Data (rahulmonikasharma)
Cloud computing provides capable, scalable IT infrastructure to support the processing of various big data applications in sectors such as healthcare and business. Such applications, mainly those involving electronic health record data sets, generally contain privacy-sensitive data. The most popular technique for data privacy preservation is anonymizing the data through generalization. The proposal is to examine the issue of proximity privacy breaches in big data anonymization and to identify a scalable solution to this issue. A two-phase scalable clustering approach, consisting of a clustering algorithm and a k-anonymity scheme with generalization and suppression, is intended to address this problem. The algorithms are designed with MapReduce to achieve high scalability by carrying out data-parallel execution in the cloud. Extensive experiments on real data sets substantiate that the method considerably improves the capability of defending against proximity privacy breaches, as well as the scalability and efficiency of anonymization over existing methods. Anonymizing data sets through generalization to satisfy privacy properties such as k-anonymity is a widely used category of privacy-preserving methods. Currently, the scale of data in many clouds is growing enormously in line with Big Data, making it a challenge for commonly used tools to capture, manage, and process such large-scale data within an acceptable time. Hence, it is difficult for prevailing anonymization approaches to achieve privacy preservation for big data private information due to scalability issues.
CLOUD COMPUTING IN THE PUBLIC SECTOR: MAPPING THE KNOWLEDGE DOMAIN (ijmpict)
Cloud computing is a key element in many nations’ pursuit of fast-tracked digital transformation and the
quick implementation of digital tools but is still facing considerable barriers due to the distinct challenges
that information technology adoption faces in public sector environments. Using scientometric data from
the Web of Science database, this study explores the current state of research and the structure of the
public sector cloud computing knowledge domain in a novel way, utilizing the CiteSpace visual analytic
software to produce knowledge maps that visualize public sector cloud computing research in terms of
publication activity, constituent authors, and publication venues, as well as exploring the intellectual base
of the knowledge domain. For public sector cloud computing researchers and practitioners, the study
provides visual insights and analyses that support future research, collaboration, and evidence-based
cloud computing implementation and utilization.
The NIH Data Commons - BD2K All Hands Meeting 2015 (Vivien Bonazzi)
Presentation given at the BD2K All Hands meeting in Bethesda, MD, USA in November 2015
https://datascience.nih.gov/bd2k/events/NOV2015-AllHands
Video cast of this presentation:
http://videocast.nih.gov/summary.asp?Live=17480&bhcp=1
The talk starts at 2 hrs 40 min (it's about 55 mins long) and includes video.
Document describing the Commons : https://datascience.nih.gov/commons
It is an attempt to provide a unified view of open data. In this system, data is collected from different sources in different formats. The data producer defines semantic relationships among datasets, which are input to our DC system.
A data consumer can pick a set of datasets randomly (or depending on his/her interest) and ask the system to generate an HTTP API for it. The system will identify which datasets are linked with each other (connected components) and generate an HTTP API for each component, which will produce unified output in JSON/XML format.
This helps maintain loose coupling between the underlying storage structure and consumer clients built on open data.
AdMap: a framework for advertising using MapReduce pipeline (CSITiaesprime)
There is a vast collection of data for consumers due to the tremendous development of digital marketing. Whether for their ads or to validate nearby services that have already been upgraded to the dataset systems, consumers are increasingly concerned with the amount of data, and a void has formed between the producer and the client. To fill that void, a framework is needed that can facilitate all the requirements for query updating of the data. Present systems have shortcomings when faced with vast amounts of information, which each time lead to a decision tree-based approach. A systematic solution for the automated incorporation of data into a Hadoop distributed file system (HDFS) warehouse includes a data hub server, a generic data loading mechanism and a metadata model. In our framework, the database is able to govern the data processing schema. In the future, as a variety of data is archived, the data lake will play a critical role in managing that data. To carry out a planned loading function, the setup files of the immense catalogue work together with the datahub server to attach the miscellaneous details dynamically to its schemas.
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen... (Citadelh2020)
CITADEL is a H2020 European project that is creating an ecosystem of best practices, tools, and recommendations to transform Public Administrations (PAs) via an inclusive approach in order to provide stakeholders with more efficient, inclusive and citizen-centric services. The CITADEL ecosystem will allow PAs to use what they already know plus new data to implement what really matters to citizens in order to shape and co-create more efficient and inclusive public services. CITADEL innovates by using ICTs to find out why citizens stop using public services, and use this information to re-adjust provision to bring them back in. Also, it identifies why citizens are not using a given public service (due to affordability, accessibility, lack of knowledge, embarrassment, lack of interest, etc.) and, where appropriate, use this information to make public services more attractive, so they start using the services.
The DataTank, a tool designed and developed by IMEC’s IDLab, will be extended to provide the Data Harvesting/Curation/Fusion (DHCF) component of the platform. The DataTank provides an open source, open data platform which not only allows publishing datasets according to standardised guidelines and taxonomies (DCAT-AP), but also transforms the data into a variety of reusable formats. The extension will include an intelligent way of harvesting and fusion of different data sources using semantics and Linked Data mapping technologies developed by IDLab. In the context of CITADEL the new HCF component will enable the visualization and analysis of trends for the usage of public services in European cities, playing a key role in generating personalized recommendations to the citizens as well as to PAs in terms of suggesting improvements to the current suite of public services.
https://twitter.com/Citadelh2020
https://twitter.com/gayane_sedraky
https://twitter.com/imec_int
https://twitter.com/IDLabResearch
DITAS Cloud Platform allows developers to design data-intensive applications, deploy them on a mixed cloud/edge environment and execute the resulting distributed application in an optimal way by exploiting the data and computation movement strategies, no matter the number of different devices, their type and the heterogeneity of runtime environments. It brings to your developer toolbox the best of Cloud & Edge worlds.
Wide access to spatial Citizen Science data - ECSA Berlin 2016 (COBWEB Project)
Authors: Paul van Genuchten, Lieke Verhelst, Clemens Portele
Presented at the European Citizen Science Association conference Berlin, May 2016.
One of the objectives of COBWEB is to publish citizen science data to GEOSS, the Global Earth Observation System of Systems. GEOSS has a focus on spatial standards (CSW, SensorWeb, WMS/WFS). However, a major part of the citizen science community is not aware of these standards, and average users use search engines to discover data and common formats to analyse it. So how do we bridge the gap between services in GEOSS and search engines?
The Census Hub Project can be considered at the moment the most advanced project where Internet technologies and SDMX solutions for data transmission come together for an ambitious goal: the dissemination of the Census 2011 results.
We analyse the Census Hub architecture, where a central Hub at the Eurostat side manages the user interface, transforming all selections made by the user on the screen into an SDMX query. This query is sent to the web service at the NSI side, which parses the query and transforms it into an SQL query that can be used with a database containing census data. Depending on how many countries are involved in the answer, the hub will query the web service provided for each of those countries. Finally, the Hub receives all answers from the NSIs and builds up a final table, putting all answers together. The importance of this implementation is that it is a completely new system that completely changes the way official data is disseminated and exchanged among organizations.
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA), EUDAT
EUDAT and PRACE joined forces to help research communities gain access to high-quality managed e-Infrastructures whose resources can be connected together to enable cross-utilization use cases and make them accessible without any technical barrier. The capability to couple data and compute resources together is considered one of the key factors to accelerate scientific innovation and advance research frontiers. The goal of this session was to present the EUDAT services and the results of the collaboration activity achieved so far, and to deliver a hands-on session on how to write a Data Management Plan (DMP). The DMP is a useful instrument for researchers to reflect on and communicate about the way they will deal with their data. It prompts them to think about how they will generate, analyse and share data during their research project and afterwards.
Visit: https://www.eudat.eu/eudat-summer-school
YGGDRASILL, A CONCEPT FOR A VIRTUAL DATA CENTRE
Hans de Wolf, Pieter Beerthuizen, Camiel Plevier
Dutch Space B.V., Mendelweg 30, 2333 CS Leiden, The Netherlands
h.de.wolf@dutchspace.nl | p.beerthuizen@dutchspace.nl | c.plevier@dutchspace.nl
ABSTRACT
YGGDRASILL is the name for a Virtual Data Centre, an
infrastructural solution to make a large variety of data
available in a simple and uniform way to a user
community, while only requiring minimal effort from
the data providers.
In many cases, science projects want and need to make
their results available to the community, thus acting as
data provider to related projects. However, most of
these projects are focussed on the domain-specific
scientific activities and cannot afford to spend
significant technical and administrative effort on setting
up a facility that provides search and download
functions. It is not sufficient to make it possible to download the data; it must also be possible for users to find the appropriate data.
The concept of a virtual data centre delivers a solution
to this problem, by offering a central web portal that
offers users advanced functions to locate and download
data products. The YGGDRASILL virtual data centre
improves this concept by minimizing the effort to act as
a data provider in the virtual data centre. In addition to
facilitating the delivery of data products that have been
prepared in advance, YGGDRASILL provides also the
means to create customized data products by processing
on-demand.
The development of YGGDRASILL was driven by the needs of the Dutch national programme on climate change, “Climate changes Spatial Planning” (http://www.klimaatvoorruimte.nl).
1. INTRODUCTION
Climate change is one of the major environmental issues
for the coming years, both regionally and globally. The
Netherlands are expected to face climate change
impacts on all land use related sectors and on water
management, and therefore on spatial planning in
general.
The programme “Climate changes Spatial Planning”
(CcSP) (http://www.klimaatvoorruimte.nl) focuses on enhancing joint learning between those communities and people in practice within spatial planning. Its mission is
to make climate change and climate variability one of
the guiding principles for spatial planning in the
Netherlands.
The main objectives of the programme are:
- To offer the Dutch government, the private sector and other stakeholders a clustered, high-quality and accessible knowledge infrastructure on the interface of climate change and spatial planning.
- To engage in a dialogue between stakeholders and scientists in order to support the development of spatially explicit adaptation and mitigation strategies that anticipate climate change and contribute to a safe, sustainable and resilient socio-economic infrastructure in the Netherlands.
Figure 1. Themes of the CcSP programme
The programme is centred on five main themes: Climate
Scenarios, Mitigation, Adaptation, Integration and
Communication (fig. 1). Projects are interactively designed to cover issues relevant to climate and spatial planning and to sectors such as biodiversity and nature, agriculture, fisheries, fresh water, coastal areas, transport on land and water, sustainable energy production, business, finance and insurance, and governmental strategies.
Within the scope of this Dutch national programme, the COM-1 project in theme 5 (communication) is one of a series of projects that are aimed at strengthening the Dutch knowledge infrastructure.
The goal of COM-1 is to create a central portal that offers project managers and external users access to (consolidated) data products from selected projects within the Adaptation, Mitigation and Climate Scenarios themes.
Distribution of data products over the Internet may seem
a straightforward problem to solve. However, in reality
the situation is more complex. In many cases the CcSP
(science) projects do not possess the expertise and/or
resources to make their data more accessible. Project
priorities typically lie with the acquisition of data and
translating it into knowledge, and less on making it
more accessible to others.
The core problem in the development of an
infrastructural solution for this problem is to make a
large variety of data available in a simple and uniform
way to the user community -- while avoiding multiple
similar developments by the data providers. The
solution for this problem was found in the concept of
the Virtual Data Centre (VDC).
2. INTRODUCING THE VDC
Dutch Space developed the YGGDRASILL concept of the
VDC to provide an answer to the challenges presented
above, and demonstrated it successfully. The concept
consists of a central portal that provides access to data
products. Behind this portal, an infrastructure links
together widely dispersed data sources and computer
platforms in a type of cooperative network. It allows the
exchange of data and sharing of knowledge between
participants (and if necessary, each other’s computer
systems and tools). At the same time, the project groups
retain control over their own specific algorithms and
data collections because these remain on their own
computer systems.
The VDC does not contain a centralized repository for
all data products, but instead creates a central, one-stop-
shopping entry point to the data by providing access
through a searchable catalogue. The actual files
containing the data products remain stored at the
facilities of the science projects themselves. These
projects provide meta-data about the data products to
the VDC to update the central catalogue. The provided
meta-data includes all necessary information to obtain
access to the actual data, plus all information that may
be useful (according to the data provider) to discover
datasets in a search action.
In addition to this, the Virtual Data Centre has an index,
which is a cross-project catalogue about types of data
products. The index contains product-independent meta-
information; each catalogue contains product-specific
meta-information.
The VDC is a geographically distributed system. The
‘central’ part (with the webserver, and index and
catalogue databases) is located at a different location from the projects’ data “servers” (which feed the catalogue and provide actual datasets to users). Some projects may share hardware; in other cases a project may have its own servers. The projects’ systems that serve data
products and the systems that feed updates to the
catalogue are not necessarily the same. Sending updates
to the catalogue may be a part of the production process
of the datasets, but may also be a completely separate,
even manual, process.
The following sections of this paper will discuss the
operation of the virtual data centre from different points
of view.
3. THE USER’S VIEW: GET DATA PRODUCTS
From the viewpoint of the user who wants to access data
products from the science projects, the use of the VDC
is similar to buying items from a ‘web shop’ on the
Internet. Users obtain access to datasets in several steps:
- Using a web form (see fig. 2), the user consults an index: this is a list of all types of data products available within the programme. This index provides (searchable) descriptions of all types, with contact information.
- From a web page with a list of all data product types that match the user-specified criteria, the user selects a data product type.
- Once a user has found and selected the type of data product, a new web page is presented with more detailed information on this type of data product. This web page also presents a form in which the user can search for specific data product instances in a catalogue that is specific for the selected type of data products (see fig. 3).
The web forms that are used to search in the index and
in the catalogue are designed such that the user can
build his query incrementally. The user selects a
property, and then provides the condition for that
property (some default values may already be provided).
If desired, the user can specify more conditions,
referring to the same or other properties. The conditions
can be combined using ‘AND’, ‘OR’ or ‘WITHOUT’
(=’AND NOT’) constructions.
The example presented in fig. 2 shows that the user has chosen to select the property “Dataset Title” from the index, and specified for this property the condition “contains” the text “Country”. Additionally, the user has specified a second condition on the property “Topic category”, which “is not any of” the items from a predefined value list (now showing “farming”). These two conditions are connected using the “AND” construction.
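To make the incremental query construction more concrete, the following sketch shows one way such a flat list of conditions could be represented and evaluated in Java (the language used for the YGGDRASILL provider-side software). The class and method names are hypothetical and not taken from the actual YGGDRASILL code; it is a minimal illustration of the AND/OR/WITHOUT chaining described above, using the “Country”/“farming” example from fig. 2.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.function.BiPredicate;

    // Minimal, hypothetical sketch of an incrementally built query: a flat list of
    // (connector, property, operator, value) conditions evaluated left to right.
    public class QuerySketch {

        enum Connector { AND, OR, WITHOUT }          // WITHOUT = AND NOT

        record Condition(Connector connector, String property,
                         BiPredicate<String, String> operator, String value) {}

        private final List<Condition> conditions = new ArrayList<>();

        // The first condition ignores its connector; later ones chain onto the running result.
        public QuerySketch add(Connector c, String property,
                               BiPredicate<String, String> op, String value) {
            conditions.add(new Condition(c, property, op, value));
            return this;
        }

        // Evaluate against one metadata record (property name -> value), left to right,
        // with no nesting -- matching the form-based query building described in the text.
        public boolean matches(Map<String, String> metadata) {
            boolean result = true;
            for (int i = 0; i < conditions.size(); i++) {
                Condition c = conditions.get(i);
                boolean hit = c.operator().test(metadata.getOrDefault(c.property(), ""), c.value());
                if (i == 0)                                  result = hit;
                else if (c.connector() == Connector.AND)     result = result && hit;
                else if (c.connector() == Connector.OR)      result = result || hit;
                else /* WITHOUT */                           result = result && !hit;
            }
            return result;
        }

        public static void main(String[] args) {
            // "Dataset Title contains 'Country' AND Topic category is not 'farming'"
            QuerySketch q = new QuerySketch()
                .add(Connector.AND, "Dataset Title", String::contains, "Country")
                .add(Connector.WITHOUT, "Topic category", String::equalsIgnoreCase, "farming");
            System.out.println(q.matches(Map.of(
                "Dataset Title", "Country Info 2008",
                "Topic category", "society")));   // prints: true
        }
    }

Because the connectors are applied left to right to a single running result, no nesting is possible, which mirrors the limitation on query expressiveness discussed below.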
The user can refine the query in this form in several ways:
- Each of the defined conditions can be removed using the ‘Remove’ buttons at the right-hand side.
- Additional conditions can be added by selecting the name of the property to be used and clicking on the ‘Add’ button.
- The values for the currently defined conditions can be changed by specifying new values in the appropriate field.
- The relation between the conditions can be modified by changing the ‘AND’ relation to ‘OR’ or ‘WITHOUT’.
The query can be executed by clicking on the “Find Now” button.
During the design of the system, this approach was
chosen to make it possible for casual users to use the
system without having to learn any query language,
while advanced users would still have the possibility to
formulate complex queries.
The way in which the queries are constructed does not
have the full power of expression of a full query
language; nested conditions are not possible. However,
by supporting specification of a range of values in each
condition, and the “WITHOUT” connection between the
conditions, the possibilities for query formulation
through web forms are regarded as sufficiently
powerful.
Fig.3 shows the web form that is used to search through
the product-specific catalogue to find one or more
specific dataset instances of the selected type. Again, to
find these instances, the user specifies conditions that
apply to the metadata of the datasets. By comparing the
metadata conditions specified by the user against the
metadata (of the data product instances) in
the catalogue, a list of matching data product instances is
produced. Each member of this list identifies a
single data product instance.
Figure 2. Building a query to search in the index of all data product types
The example shows that the user has selected the Data Product type “Country Info 2008”. Some of the metadata for this type of data product is displayed, such as a short description (“Pictures of Country maps and flags with detailed information”) and the licensing (this information is licensed under the Creative Commons license).
The user is building a search query, and has specified that the population must be between 3 and 50 million.
As with the web form used to search in the index of all
known types of data products, extra conditions can be
added to the query, in this case based on the “country
name”, “capital name”, “area size” and “population”
properties.
Note that none of the properties have been hard-coded:
the complete form is generated automatically from the
meta-information of the data product type.
The “Find now” button uses all information specified in
the form to find all instances of data products that match
these user-specified criteria.
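The remark that none of the properties are hard-coded can be illustrated with a small sketch: given the catalogue properties declared for a data product type, the search form can be rendered generically. The property names below are taken from the “Country Info 2008” example; the rendering code and HTML layout are hypothetical assumptions, not the actual YGGDRASILL implementation.

    import java.util.List;

    // Hypothetical sketch of generating catalogue search form fields from the
    // data product type's meta-information instead of hard-coding them.
    public class CatalogueFormSketch {

        record Property(String name, String type) {}   // type: "text" or "number"

        static String renderFormFields(List<Property> properties) {
            StringBuilder html = new StringBuilder();
            for (Property p : properties) {
                if (p.type().equals("number")) {
                    // Numeric properties get a range ("between ... and ...") condition.
                    html.append("<label>").append(p.name()).append(" between ")
                        .append("<input name='").append(p.name()).append("_min' type='number'>")
                        .append(" and ")
                        .append("<input name='").append(p.name()).append("_max' type='number'>")
                        .append("</label>\n");
                } else {
                    // Text properties get a simple "contains" condition.
                    html.append("<label>").append(p.name()).append(" contains ")
                        .append("<input name='").append(p.name()).append("' type='text'>")
                        .append("</label>\n");
                }
            }
            return html.toString();
        }

        public static void main(String[] args) {
            System.out.println(renderFormFields(List.of(
                new Property("country name", "text"),
                new Property("capital name", "text"),
                new Property("area size", "number"),
                new Property("population", "number"))));
        }
    }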
When the query is executed, it returns a list of data
product instances that match the conditions
specified. This list shows columns with a small
number of (metadata) attributes of the found
instances (here: ‘country name’, ‘capital name’,
‘area size’ and ‘population’). This information is
extracted from the data product’s catalogue. By
clicking on the ‘Show Details’ button, the user can
request a screen with more detailed metadata.
Which metadata properties are displayed in the list
view and the detailed view is not hard-coded, but is
decided by the provider of the data products.
At this moment the user has located data products
that may be interesting or useful – but only the
metadata is available, not the data itself. In order to
obtain the data, the user only has to click on the
‘Order’ button.
Note: if the user is not authorized to obtain the data
product, the ‘Order’ button is dimmed. Access rights are
only checked at login, and when data products are ordered.
During the development of the system it was decided
that all users should have the possibility to search for
data products. After all, the primary purpose for
building the system was to make data available to users.
Figure 3. Building a query to find data product instances in the catalogue
What happens when the user orders a product depends on the facilities of the data provider.
- If the data provider has his own server, the order button just contains the download URL, and the file containing the data product instance is downloaded directly.
- If the data provider cannot provide a server, the YGGDRASILL data product ordering mechanism will be activated. This does not provide a direct download of the data product’s files, but places the order in an order queue, and – invisible to the user – the VDC software contacts the party that provides the data product to fetch it. It is not necessary for the user to stay connected to the portal. When the portal receives the ordered file, it places the file in a ‘parking area’ at the portal. A message will appear when the user revisits the portal, with a link to download the files from the portal.
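A minimal sketch of the order lifecycle just described, as it might be modelled on the portal side: an order is queued, the file is later delivered by the provider and parked, and the user is offered a download link on the next visit. The states and class names are illustrative assumptions, not the actual VDC data model.

    import java.nio.file.Path;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical portal-side model of the YGGDRASILL ordering mechanism.
    public class OrderQueueSketch {

        enum OrderState { QUEUED, PARKED }

        static class Order {
            final String orderId;
            final String dataProductInstanceId;
            final String userId;
            OrderState state = OrderState.QUEUED;
            Path parkedFile;        // set once the provider's agent has delivered the file
            Instant deliveredAt;

            Order(String orderId, String instanceId, String userId) {
                this.orderId = orderId;
                this.dataProductInstanceId = instanceId;
                this.userId = userId;
            }
        }

        private final List<Order> orders = new ArrayList<>();

        // Called when the user clicks 'Order' and no direct download URL is available.
        public Order placeOrder(String instanceId, String userId) {
            Order o = new Order("ord-" + System.nanoTime(), instanceId, userId);
            orders.add(o);
            return o;
        }

        // Called when the data provider's side delivers the ordered file to the portal;
        // the file stays in the 'parking area' until the user revisits the portal.
        public void park(String orderId, Path file) {
            for (Order o : orders) {
                if (o.orderId.equals(orderId)) {
                    o.parkedFile = file;
                    o.deliveredAt = Instant.now();
                    o.state = OrderState.PARKED;
                }
            }
        }

        // On the user's next visit the portal lists the parked orders with download links.
        public List<Order> readyForDownload(String userId) {
            return orders.stream()
                         .filter(o -> o.userId.equals(userId) && o.state == OrderState.PARKED)
                         .toList();
        }
    }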
An additional function is available when the
YGGDRASILL ordering mechanism is used: product
customization and/or processing on demand.
This option is provided by an additional web form that
opens when the ‘Order’ button is clicked. It contains a
number of fields (defined by the data provider) through
which the user can specify custom processing of his
order.
Examples of this customization are:
- Converting the data product instance into a different format (especially for graphics formats)
- Reducing the temporal or spatial resolution of the data
- Extracting a subset of the data from a much larger set.
In principle, any kind of customization can be done –
but this requires the development of additional software
by the data provider because only they have the
necessary domain-specific expertise to build the tools
for this purpose.
Figure 4. List of data product instances that match the user-specified criteria
4. THE DATA PROVIDER’S VIEW
The virtual data centre was designed to be easy to use not only for the users (‘data consumers’) but also for the data providers, because many of them are small science
projects that do not have the level of expertise and/or
resources to operate a real data centre.
In order to distribute data through the YGGDRASILL
virtual data centre the data provider must take the
following steps.
Maintain a repository of data products. As was
explained before, the VDC does not have a centralized
repository that contains copies of all data product files.
The data provider must maintain its own repository for
these files – a task that should require no extra effort.
YGGDRASILL does not impose any restrictions on how
this is organized. The data product files may reside on a
file system or in a database.
Provide computer hardware (with Internet
connection) to feed metadata and (when ordered) data
product files to the portal. YGGDRASILL is agnostic with respect to the platform; operation has been demonstrated on Windows and Linux platforms. All that is required is
a Java Virtual Machine to run the provider-side
YGGDRASILL software and an Internet connection for
outgoing traffic using the http protocol (same
configuration as for a web browser). It is not necessary
to provide dedicated computer hardware. Even the
connection to the Internet may be intermittent, although
this is not recommended for timely delivery of data
products.
Construct a Yggdrasill Dataproduct Definition (YDD) that defines common information about a type of data product. This information consists of several parts:
- Data Product Type index metadata: a mandatory set of data that describes this type of dataset. This consists of generic information like title, publisher and description. This information is used when the user searches through the index for a suitable type of dataset.
- Data Product Type catalogue properties: these define which metadata are available for every instance of this type of dataset. This information is used when the user searches in the catalogue for specific instances of this dataset. It also includes initial settings for access rights, but these can be modified later through interactive web forms.
The creation of a YDD is a process that requires
interaction between the database administrator of the
virtual data centre and the project scientists. Together
they interpret the needs and dataset properties of the
project and translate them into a dataset type definition.
The YDD is stored in an XML file. This file is read by
the software that implements the YGGDRASILL virtual
data centre, in order to:
add a new record (describing this type of dataset) to
the index.
prepare one or more database tables (the
catalogue specific to this data product type) that
can hold the information about instances of this
type of dataset.
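The YDD schema itself is not shown in this paper; the sketch below merely illustrates how software could read such an XML definition using the standard Java DOM API. The element and attribute names (indexMetadata, catalogueField, title, publisher) are assumptions made for this illustration only.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import java.io.File;

/**
 * Hypothetical reader for a YDD file. The element and attribute names used
 * here are assumptions; the actual YDD schema is defined by the YGGDRASILL
 * software, not by this sketch.
 */
public class YddReader {

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("sea_surface_temperature.ydd.xml"));

        // Index metadata: generic information such as title and publisher.
        Element index = (Element) doc.getElementsByTagName("indexMetadata").item(0);
        System.out.println("Title:     "
                + index.getElementsByTagName("title").item(0).getTextContent());
        System.out.println("Publisher: "
                + index.getElementsByTagName("publisher").item(0).getTextContent());

        // Catalogue fields: the attributes that describe each product instance.
        NodeList fields = doc.getElementsByTagName("catalogueField");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            System.out.println("Catalogue field: " + field.getAttribute("name")
                    + " (" + field.getAttribute("type") + ")");
        }
    }
}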
Announce new dataset instances. A dataset instance
announcement informs the catalogue about new or
updated instances of a dataset defined earlier – or may
even delete instances from the catalogue. This is a
routine operation that requires no human interpretation
and should be automated.
Because dataset instance announcements are formatted
as tab-delimited text files, they can easily be produced
by automated processes, or manually in a spreadsheet
application such as Microsoft Excel.
The tab-delimited files containing the announcements
must be consistent with the dataset type definition.
The Yggdrasill virtual data centre provides Java-based
software that handles the sending of the announcements.
This software uses the HTTP protocol to pass through
firewalls easily.
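Because the announcement format is a plain tab-delimited text file, producing one programmatically is straightforward. The sketch below writes a small hypothetical announcement; the column names (identifier, startTime, endTime, region) are assumptions and must in practice match the catalogue fields defined in the YDD.

import java.io.IOException;
import java.nio.file.*;
import java.util.List;

/**
 * Hypothetical producer of a data product instance announcement.
 * The columns are assumptions made for this sketch; the real columns
 * must be consistent with the dataset type definition (YDD).
 */
public class AnnouncementWriter {

    public static void main(String[] args) throws IOException {
        List<String> rows = List.of(
                // Header row with the catalogue field names.
                String.join("\t", "identifier", "startTime", "endTime", "region"),
                // One row per new or updated data product instance.
                String.join("\t", "SST-2014-001", "2014-01-01T00:00Z", "2014-01-31T23:59Z", "North Sea"),
                String.join("\t", "SST-2014-002", "2014-02-01T00:00Z", "2014-02-28T23:59Z", "North Sea"));

        Files.write(Paths.get("announcement.txt"), rows);
    }
}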
Deploy a data product instance delivery service. In
order to deliver ordered data product instances, the data
providing science project must deploy some kind of
delivery service.
For large, multi-year science projects it may be feasible
to operate their own web or ftp server for this purpose.
Aspect | ‘Classical’ Data Centre | Yggdrasill Virtual Data Centre
Provide Data Repository (storage) | Yes | Yes
Provide Hardware (server) | Yes, dedicated | Lightweight hardware (shared)
Provide Internet Connection | Yes, including support for server (ftp and/or http) protocols | Yes, only http client protocols; optionally intermittent
Build Portal | Yes | Not necessary
Install file server | Yes | Only lightweight Yggdrasill software
Configure Firewall | Yes | Not necessary
Account administration | Yes | Optional
Develop Search functions | Yes | No, provided by Yggdrasill
Scalability and Redundancy | Complex | Simple
Table 1. Comparison of provider’s effort for a classical data centre and the YGGDRASILL Virtual Data Centre
If this is available, the data product instance
announcements contain a URL to these instances, and
the Order button in the portal of the virtual data centre
links directly to this URL, providing the user with a
direct download opportunity.
For smaller projects that cannot afford their own server
infrastructure, YGGDRASILL offers another solution. At
the data provider’s side, ‘data delivery agent’ software
is installed that communicates with the
central part of the VDC. This software handles orders
placed at the central portal to deliver the actual data
products.
The Java-based Data Delivery Agent (DDA) software
periodically polls (over http) the central part of the
virtual data centre in order to discover new orders for
data product instances. If it can fulfil an order, it sends
the file containing the ordered data product instance to
the portal, from where it can be downloaded by the user
who ordered it.
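The following sketch illustrates the polling idea only, not the actual DDA implementation: the agent issues outgoing HTTP requests to ask whether orders are waiting. The portal URL, query parameter and response format are assumptions made for this sketch.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/**
 * Minimal sketch of the polling pattern behind a Data Delivery Agent:
 * only outgoing http traffic is used, so no firewall changes are needed.
 */
public class PollingSketch {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://portal.example.org/orders?provider=sst-project"))
                .GET()
                .build();

        while (true) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (!response.body().isBlank()) {
                // The body is assumed to contain the identifier of an ordered
                // product instance; hand it over to the preparation script.
                System.out.println("Order received: " + response.body());
            }
            Thread.sleep(Duration.ofMinutes(5).toMillis());
        }
    }
}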
The DDA does not prepare the files containing the data
product instance by itself. For this, it calls a product-
specific script that must be created by the data provider
(the DDA provides information to this script, such as
product identification and optional customization
parameters). This script can be very simple, typically
using the provided identifier for the ordered data
product instance to build a filename and copying this
file from a repository to a working directory.
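A sketch of such a minimal preparation step is given below, assuming the identifier arrives as a command-line argument and that the repository files follow a simple naming convention; the directory paths and the .nc extension are hypothetical.

import java.io.IOException;
import java.nio.file.*;

/**
 * Sketch of the simplest possible provider-side preparation step:
 * use the ordered instance identifier to build a file name and copy the
 * file from the repository to a working directory. Paths are assumptions.
 */
public class PrepareProduct {

    public static void main(String[] args) throws IOException {
        String identifier = args[0];                  // e.g. "SST-2014-001"
        Path repository = Paths.get("/data/repository");
        Path workDir    = Paths.get("/data/outgoing");

        Path source = repository.resolve(identifier + ".nc");
        Path target = workDir.resolve(source.getFileName());
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }
}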
In contrast to this simple approach, the script can also
be very complex in order to prepare a complete custom
data product by running a mathematical model. In this
case it is not even necessary to have a real data product
instance available, as it is created on-the-fly.
Solutions of intermediate complexity could involve
extracting the data product instance from a repository,
and doing some processing on it (format conversion,
visualization, or temporal/spatial resampling).
The time needed to deploy a data providing service
proves to be very short. For a data product type that
requires only a simple script, typically a single day is
sufficient. This includes definition of the metadata that
describes the data product, installing the announcer for
new data product instances, setting up the Data Delivery
Agent and doing some tests.
5. THE DEVELOPER’S VIEW
Now that we have seen how the Virtual Data Centre looks
from the outside (from the viewpoint of a user and of a
data provider), we can take a look at the internal
workings. This description is clarified by fig. 5; the
numbers (1), (2), (3), etc. refer to the activities in that
illustration. A UML diagram of the operation is
presented in fig. 6.
As explained before, the deployment of a service that
delivers a new type of data product starts (1) with the
definition of the metadata by constructing a Yggdrasill
Dataproduct Definition (YDD). This is an XML
document that defines common information about a
type of dataset.
This information consists of several parts:
Type index metadata: a mandatory set of data that
describes this type of dataset. This consists of
generic information like product name, publisher
contact information and description. This
information is used when the user searches through
the index for a suitable type of data product.
Definition of Data Product Type catalogue fields:
defines which attributes apply to every instance of
this specific type of data product. This information
is used when the user searches in the catalogue
for specific instances of this dataset. Several types
of attributes are supported, including text strings,
numbers and enumerations. The presentation of
these attributes in the search forms on the web
pages (as shown in fig. 3) is also defined here
(default values, pop-up lists, etc.). The YDD also
contains simple validation rules (range checks) for
the values of these attributes.
Definition of access rules. The YDD can contain
an initial definition of access rules. These rules
grant access to a user or a group of users as they are
defined in the Yggdrasill portal. A rule may include
a delay period, meaning that data product instances
become available to the specified user or group
only when the data product instance is older than
the specified period (see the sketch after this list).
This option was implemented because scientists may
prefer to release data to a large audience only when
they have had time to prepare their publications.
Definition of ordering mechanism. This specifies
how orders are fulfilled. If a direct download is
offered by the data provider, this setting
provides the base URL for the download (the
identifier present in a product instance
announcement is used to build the complete URL).
If no direct download is offered, the YGGDRASILL
order fulfilment mechanism is used and the YDD
contains the information needed for the Data
Delivery Agent.
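The delayed-release rule mentioned above boils down to a simple time comparison. The sketch below shows that check in isolation; the delay length and the method names are assumptions made for this illustration, not part of the YDD specification.

import java.time.Duration;
import java.time.Instant;

/**
 * Sketch of the delayed-release idea in the access rules: an instance
 * becomes visible to a group only after a configured delay has passed.
 */
public class EmbargoCheck {

    static boolean isReleased(Instant instanceTime, Duration delay, Instant now) {
        return instanceTime.plus(delay).isBefore(now);
    }

    public static void main(String[] args) {
        Instant instanceTime = Instant.parse("2014-01-01T00:00:00Z");
        Duration delay = Duration.ofDays(180);   // roughly six months (assumed)
        System.out.println(isReleased(instanceTime, delay, Instant.now()));
    }
}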
The metadata used in the Yggdrasill virtual data centre
is based on the ISO 19115 standard, a schema for
describing geographic information and services. It
provides information about the identification, the extent,
the quality, the spatial and temporal schema, the spatial
reference, and the distribution of digital geographic data.
The creation of a YDD is a process that requires
interaction between the administrator of the
virtual data centre and the project scientists. Together
they interpret the needs and dataset properties of the
project and translate them into a dataset
type definition.
After the YDD has been prepared, the YDD file is
installed (2). The metadata from the dataset type
definition is automatically used to define a new type of
dataset by:
Adding a new record (describing this type of data
product) to the index.
Creating one or more database tables (extending
the catalogue) that can hold the information about
instances of this type of data product (a minimal
sketch of this step follows below).
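As a rough illustration of the second step, the sketch below generates a CREATE TABLE statement from a set of catalogue field definitions. The field names, their SQL types and the table naming convention are assumptions; the real mapping is determined by the YGGDRASILL software from the YDD.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Sketch of catalogue table creation driven by field definitions.
 * Field names, SQL types and the table name are illustrative only.
 */
public class CatalogueTableSketch {

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("identifier", "VARCHAR(64)");
        fields.put("start_time", "TIMESTAMP");
        fields.put("end_time",   "TIMESTAMP");
        fields.put("region",     "VARCHAR(128)");

        String columns = fields.entrySet().stream()
                .map(e -> e.getKey() + " " + e.getValue())
                .collect(Collectors.joining(", "));

        String ddl = "CREATE TABLE catalogue_sea_surface_temperature (" + columns + ")";
        System.out.println(ddl);
    }
}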
The preparatory activities are continued by installing the
Data Product Announcer and Data Delivery Agent at
the site of the data provider. Both are provided as part
of the YGGDRASILL deployment in the form of platform-
independent Java software. Finally, the script that
prepares ordered data products is written.
The operational use of the virtual data centre starts
when the data provider starts creating data products (3).
These data products are placed in a repository (4).
YGGDRASILL does not impose any requirements on how
this is implemented. As a side effect of the data
production, or as a separate action, the data provider
creates data product announcements (5). These are sent
by the Data Product Announcer to the Data Product
Ingester running at the portal (6). A data product
instance announcement informs the catalogue about new
or updated instances of a dataset defined earlier – or
may even delete instances from the catalogue. This
results in changes in the catalogue’s database tables (7)
created from the definitions in the YDD. Prior to this
update, the contents of the data product instance
announcement will be validated against acceptance rules
stated in the YDD.
This is a routine operation that requires no
human interpretation and runs as an automated process.
Because data product instance announcements are
formatted as tab-delimited text files, they can easily be
produced by automated processes, or manually in a
spreadsheet application such as Microsoft Excel.
When these actions have been taken, the Virtual Data
Centre is ready to provide data to the users.
As described earlier in this document, the user uses the
portal to search for data products (8). This search through
the index (9) returns a list of relevant data product types
(10).
After the user selects a data product type from the list,
the portal:
Figure 5. Operation of the virtual data centre
Uses the metadata from the YDD to build a custom
web form by which the user can specify data
product-specific search criteria.
Uses the metadata from the YDD to build a
database query that searches through the product-
specific catalogue, which contains the metadata that
describe the instances of the data product (a minimal
sketch of such a query follows after this list).
Presents the results from this search in a web page.
This web page contains links to obtain the dataset
instance files.
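A minimal sketch of the query-building step is given below: user-entered criteria are turned into a parameterized WHERE clause against the product-specific catalogue table. The table and column names are assumptions; a real implementation would take them from the YDD.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Sketch of turning user-specified search criteria into a parameterized
 * SQL query against the product-specific catalogue table.
 */
public class CatalogueQuerySketch {

    public static void main(String[] args) {
        // Criteria the user entered in the generated web form (assumed names).
        Map<String, Object> criteria = new LinkedHashMap<>();
        criteria.put("region", "North Sea");
        criteria.put("start_time", "2014-01-01T00:00Z");

        String where = criteria.keySet().stream()
                .map(column -> column + " = ?")
                .collect(Collectors.joining(" AND "));
        List<Object> parameters = new ArrayList<>(criteria.values());

        String sql = "SELECT * FROM catalogue_sea_surface_temperature WHERE " + where;
        System.out.println(sql);          // the statement to be prepared
        System.out.println(parameters);   // the values to bind
    }
}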
To prepare for ordering a data product, the portal:
Uses both type-specific metadata (from the index)
and instance-specific metadata (from the catalogue)
to determine access rights.
Uses information from the dataset instance
announcement to determine the source from which
the data product instance can be obtained.
This information is used to generate an Order button on
the web form. Data providers that operate their own
servers (HTTP or FTP) can include the appropriate
direct download links in their data product instance
announcements.
Data providers (typically smaller organisations) that do
not provide access to their server to the outside world
can use another approach. When the user places the
order (11), the portal places it in an order queue (12).
When the data provider’s Data Delivery Agent polls the
portal to check if any orders for data product instances
are waiting, the portal responds by sending (13) an
identification of the ordered data product instance
(obtained from the catalogue for this data product type).
On receiving this identification, the Data Delivery
Agent launches the provider’s custom script to prepare
(14) the ordered data product instance and sends (15) it
to the portal, where it is received by the Data Product
Receiver (16) and placed in a parking area of the central
portal, from which it can be downloaded by the user
who ordered the product (17).
6. ADVANTAGES OF THE YGGDRASILL VDC
For users: the YGGDRASILL VDC offers several
advantages to users. A single portal provides access to a
range of data products, presented in the familiar ‘web
shop’ set-up. Because the products share this portal, the
user does not have to learn different search and
download methods, and only needs a single account.
Figure 6. UML Diagram of Data Product Order and Delivery
Easy deployment: the YGGDRASILL VDC was designed
for easy deployment. Installation of the Java-based
software at the data provider’s facilities is simple. There
is no need to configure a firewall to allow incoming
traffic to pass through. The hardest work is the domain-
specific setup: determining the meta-data for the YDD
and (sometimes) the creation of the script that extracts
the ordered data product from a repository. Typically,
adding a provider with a new type of data product to the
VDC takes about one day of work (more may be needed
when custom data products are generated ad hoc).
No dedicated server hardware required: especially
small science projects may have difficulties assigning
dedicated hardware for data dissemination. The Data
Delivery Agent can run as a background task on a
normal PC while it is used for other activities. It is
no problem even when it runs on a laptop which has an
intermittent connection to the Internet: because the
Data Delivery Agent uses a polling mechanism to
retrieve orders from the portal, no failures occur
from the portal trying to connect to the Data Delivery
Agent. When the DDA is disconnected, it simply does
not poll the portal, and when the connection is
restored, it can continue asking for new orders to
be fulfilled.
Of course, for projects that intend to provide frequent
or high-priority data products, assigning dedicated
hardware is preferred, but this can be lightweight
hardware – a typical PC is adequate.
Note: because YGGDRASILL provides no centralized
storage, the data provider must always provide its own
resources to store the repository of its own data
products.
Secure Solution: Because the Data Delivery Agent,
running on the data provider’s computer(s), takes the
initiative of polling for the orders, the http network
traffic passes easily through firewalls. In fact, the Data
Delivery Agent appears to a firewall as just a web
browser. Nearly all firewalls are already configured to
allow this kind of traffic.
Security: The fact that the Data Delivery Agent takes
the initiative makes this also a safe solution – there is no
need to respond to incoming traffic, or modify firewall
settings.
Scalability and robustness: in the description given
before, only a single Data Delivery Agent is running to
deliver the data product instances, on just one computer.
The design of YGGDRASILL also supports other models.
It is possible to run several DDAs simultaneously on the
same computer (each delivering a single data product),
or run a single DDA that handles the delivery of several
types of data products. This model is most suitable for a
science project that delivers multiple types of data
products that are ordered infrequently. The opposite
model, intended for projects that have to supply data
products frequently and in a timely manner, is to run the same DDA
on several computers simultaneously. Each of them
polls the YGGDRASILL portal independently, and
receives different orders to fulfil.
Because the software running on the portal re-queues
orders when a DDA does not respond after a set period
of time, these orders will automatically be picked up by
another computer, thus providing a robust solution.
Optional User Administration: the VDC handles the
authentication of users. The data provider does not have
to create and manage user accounts. Yggdrasill supports
authorization of user groups; administration of these
groups can be done through simple web forms.
The above advantages make the virtual data centre very
useful for small science projects, even when they are
executed by large institutions (which may not be willing
to support decentralized servers on their network).