Elag 2012 - Under the hood of 3TU.Datacentrum.

Under the hood of 3TU.Datacentrum,
a repository for research data.

Egbert Gramsbergen
TU Delft Library /
3TU.Datacentrum
e.f.gramsbergen@tudelft.nl

ELAG, 2012-05-17

3TU.Datacentrum
• 3 Dutch technical universities (TUs): Delft, Eindhoven, Twente
• Project 2008-2011, going concern 2012-
• Data archive
– 2008-
– “finished” data
– preserve but do not forget usability
– metadata harvestable (OAI-PMH)
– metadata crawlable (OAI-ORE linked data)
– data citable (by DataCite DOIs)
• Data labs
– Just starting
– Unfinished data + software/scripts

Technology

• Fedora
Repository software

• THREDDS / OPeNDAP
Repository software

?
http://commons.wikimedia.org/wiki/File:Engine_of_Trabant_601_S_of_Trabi_Safari_in_Dresden.jpg

Fedora digital objects

XML container with “datastreams” containing /
pointing to (meta)data

•3 special RDF datastreams
indexed in triple store
-> query with REST API / SPARQL

•Any number of content datastreams

xml datastreams may be inline,
other datastreams are on a location managed by Fedora

Fedora Content Model Architecture
Content Model object: links to Service Definition(s)
optionally defines datastreams + mime-types
Service Definition object: defines operations (methods) on data objects
incl parameters + validity constraints
Service Deployment object: implements the methods
Requests are handled by some service whose location is known to the Service Deployment

URL: /objects/<data object pid>/methods/<service definition pid>/<method name>[?<params>]
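
A minimal sketch of how such a method URL can be put together, following the pattern above. The base URL, pids, method name and parameter in the example are hypothetical and not taken from 3TU.DC.

```python
from urllib.parse import quote, urlencode


def dissemination_url(base, object_pid, sdef_pid, method, params=None):
    """Build <base>/objects/<pid>/methods/<sdef pid>/<method>[?<params>]."""
    path = "/objects/{}/methods/{}/{}".format(
        quote(object_pid, safe=":"),  # keep the ':' that separates namespace and id
        quote(sdef_pid, safe=":"),
        quote(method),
    )
    url = base.rstrip("/") + path
    if params:
        url += "?" + urlencode(params)
    return url


# Hypothetical pids and method name, for illustration only
print(dissemination_url("http://localhost:8080/fedora",
                        "demo:1", "demo:sdef", "getHtml", {"lang": "en"}))
```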

Fedora API & Saxon xslt2 service
APIs for viewing and manipulating objects
View API (REST, GET method)
– findObjects
– getDissemination
– getObjectHistory
– listDatastreams
– risearch (query the triple store with iTQL or SPARQL)
– …
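
As an illustration of the risearch method, the sketch below builds a query URL for the triple store. The parameter names (type, lang, format, query, limit) are the ones used by the Fedora 3 Resource Index Search interface as far as I know; the base URL, the predicate and the object pid are hypothetical.

```python
from urllib.parse import urlencode


def risearch_url(base, sparql, fmt="Sparql", limit=None):
    """Build a GET URL that sends a SPARQL tuple query to risearch."""
    params = {"type": "tuples", "lang": "sparql", "format": fmt, "query": sparql}
    if limit is not None:
        params["limit"] = str(limit)
    return base.rstrip("/") + "/risearch?" + urlencode(params)


# Hypothetical predicate and pid, for illustration only
QUERY = """SELECT ?part WHERE {
  ?part <http://example.org/rel#isPartOf> <info:fedora/demo:1> .
}"""

url = risearch_url("http://localhost:8080/fedora", QUERY, limit=10)
print(url)
```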

So everything has a url and returns xml
All methods so far have to return xml or (x)html
xslt is a natural fit
(remember: from within xslt you can easily open secondary documents, i.e. call the REST API)
xslt2.0 is much more powerful than xslt1.0
With Saxon, you can use Java classes/methods from within xslt
(rarely needed, in 3TU.DC only for spherical trigonometry in geographical calculations)
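
The slides do not say which geographical calculations are involved. As one example of the kind of spherical trigonometry meant, here is a sketch of the haversine great-circle distance, written in Python rather than Java; the function name and the choice of formula are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the equator
print(great_circle_km(0.0, 0.0, 0.0, 90.0))
```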

3TU.DC architecture

Saxon for:
•html pages
•rdf for linked data (OAI-ORE)
•KML for maps
•Faceted search forms
•csv, cdl, Excel for datasets
•xml for indexing by SOLR
•xml for Datacite
•xml for PROAI
•… and more

Not in picture:
•PROAI (OAI-PMH data provider)
•DOI registration (Datacite)

3TU.DC architecture [2]
Content Model Architecture and xslt’s in detail
•10 content models
•7 service definition objects with 19 methods
•14 service deployment objects using 32 xslt’s

Left to right: content models, service deployments, methods aka xslt’s, service definitions
Lines: CMA, xslt imports, xml includes. All xslt’s are datastreams of one special xslt object.

rdf relations in 3TU.DC

Example relations (namespaces are omitted for brevity)

UI as rdf / linked data viewer

[Screenshot annotations]
•This dataset has some metadata
•and is part of this dataset, with these metadata
•It was calculated from this dataset, with these metadata
•measured by this instrument, with these metadata

UI as rdf / linked data viewer [2]

Dilemmas - how far will you go?

•Which relations must be expanded?
•How many levels deep?
•Which inverse relations will you show?
•Show repetitions?

Answer: trial and error

Set of rules for each type of relation

Show enough for context but not too much… it’s a delicate balance
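
One way to picture "a set of rules for each type of relation" is a table that says how many levels deep each relation may be followed. The sketch below is only an illustration: the relation names, the depth limits and the toy graph are hypothetical, and the real rules live in the xslt’s.

```python
# Hypothetical rules: relation -> number of levels (from the start object)
# within which this relation is still expanded.
RULES = {"isPartOf": 2, "calculatedFrom": 1, "measuredBy": 2}

# Hypothetical toy graph as (subject, relation, object) triples
TRIPLES = [
    ("ds:1", "isPartOf", "ds:collection"),
    ("ds:1", "calculatedFrom", "ds:raw"),
    ("ds:raw", "measuredBy", "instr:7"),
    ("ds:raw", "calculatedFrom", "ds:rawer"),
    ("ds:collection", "isPartOf", "ds:top"),
]


def expand(subject, triples, rules, depth=0, seen=None):
    """Return nested (relation, object, children) tuples allowed by the rules."""
    seen = {subject} if seen is None else seen
    shown = []
    for s, rel, obj in triples:
        if s != subject:
            continue
        if depth >= rules.get(rel, 0):  # too deep for this relation type
            continue
        if obj in seen:                 # do not repeat objects along one path
            continue
        shown.append((rel, obj, expand(obj, triples, rules, depth + 1, seen | {obj})))
    return shown


tree = expand("ds:1", TRIPLES, RULES)
print(tree)
```

Here "calculatedFrom" is followed one level only, so the dataset that ds:raw was itself calculated from stays hidden, while the instrument is still shown.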

Reminder

What about this
part?

NetCDF

NetCDF: data format + data model

•Developed by UCAR (University Corporation for Atmospheric Research, USA),
roots at NASA, 1987.
•Comes with set of software tools / interfaces for programming
languages.
•Binary format, but data can be dumped in ASCII or xml
•Used mainly in geosciences (e.g. climate forecast models)
•BUT: fit for almost any type of numeric data + metadata
•Core data type: multidimensional array

>90% of 3TU.DC data is in NetCDF

NetCDF [2]
Example: T(x,y,z,t) - what can we say in NetCDF?

Variable T (4D array)
Variables x,y,z,t (1D arrays)
Dimensions x,y,z,t
Attributes: creator=‘me’
Attributes: x.units=‘m’, y.units=‘m’, z.units=‘m’, t.units=‘s’, T.units=‘deg_C’
T.name=‘Temperature’, T.error=0.1, etc…
You may invent your own attributes or use conventions (e.g. CF, the Climate and Forecast conventions)
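
To make the T(x,y,z,t) example concrete, the sketch below renders it as a CDL header, the text notation that ncdump prints. It uses plain Python and writes no NetCDF file. The dimension sizes and the float data type are assumptions; the attribute names are the ones from the example above.

```python
def to_cdl(name, dims, variables, global_attrs):
    """Render dimensions, variables and attributes as a CDL header."""
    def fmt(value):
        return '"{}"'.format(value) if isinstance(value, str) else repr(value)

    lines = ["netcdf {} {{".format(name), "dimensions:"]
    for dim, size in dims.items():
        lines.append("\t{} = {} ;".format(dim, size))
    lines.append("variables:")
    for var, (dtype, var_dims, attrs) in variables.items():
        lines.append("\t{} {}({}) ;".format(dtype, var, ", ".join(var_dims)))
        for key, value in attrs.items():
            lines.append("\t\t{}:{} = {} ;".format(var, key, fmt(value)))
    lines.append("")
    lines.append("// global attributes:")
    for key, value in global_attrs.items():
        lines.append("\t\t:{} = {} ;".format(key, fmt(value)))
    lines.append("}")
    return "\n".join(lines)


dims = {"x": 10, "y": 10, "z": 5, "t": 100}  # sizes are made up
variables = {
    "x": ("float", ["x"], {"units": "m"}),
    "y": ("float", ["y"], {"units": "m"}),
    "z": ("float", ["z"], {"units": "m"}),
    "t": ("float", ["t"], {"units": "s"}),
    "T": ("float", ["x", "y", "z", "t"],
          {"units": "deg_C", "name": "Temperature", "error": 0.1}),
}
cdl = to_cdl("example", dims, variables, {"creator": "me"})
print(cdl)
```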

newer NetCDF versions:
•More complex / irregular / nested structures
•built-in compression by variable
boost compression with “leastSignificantDigit=n”
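
Why does dropping insignificant digits boost compression? The sketch below imitates the idea with the standard library only: values are rounded to a power-of-two grid that still preserves the requested number of decimals, which leaves many zero bits in the stored floats. This mirrors, to my understanding, what the least_significant_digit option of the netCDF4 Python library does; the data are random numbers, not 3TU.DC data.

```python
import math
import random
import struct
import zlib


def quantize(values, least_significant_digit):
    """Round values to a 1/2**bits grid that keeps the given decimals."""
    bits = math.ceil(math.log2(10 ** least_significant_digit))
    scale = 2 ** bits
    return [round(v * scale) / scale for v in values]


def compressed_size(values):
    """Size in bytes of the values stored as 32-bit floats and deflated."""
    raw = struct.pack("<{}f".format(len(values)), *values)
    return len(zlib.compress(raw))


random.seed(0)
temps = [random.uniform(0.0, 30.0) for _ in range(10000)]  # made-up temperatures
size_full = compressed_size(temps)
size_quantized = compressed_size(quantize(temps, 1))
print(size_full, size_quantized)
```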

OPeNDAP

OPeNDAP: protocol to talk to NetCDF (and similar) data over internet
THREDDS: server that speaks OPeNDAP

•Internal metadata directly visible on site
•APIs for all main programming languages
•Queries to obtain:
– cross-sections (slices, blocks)
– samples (take only 1 in n points)
– aggregated datasets (e.g. glue together consecutive time series)

Queries are handled server-side
(Datafiles in 3TU.DC are up to 100GB)
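
Slices and samples are requested by appending a constraint expression to the dataset URL; in the DAP2 syntax each dimension gets a [start:stride:stop] range with an inclusive stop. The helper below is a sketch that translates Python-style ranges into that syntax. The helper names are made up, and nothing is fetched from the server.

```python
def hyperslab(variable, *ranges):
    """Build a DAP2 constraint from (start, stop, step) ranges.

    stop is exclusive, as in Python slices; DAP2 expects an inclusive stop.
    """
    parts = ["[{}:{}:{}]".format(start, step, stop - 1)
             for start, stop, step in ranges]
    return variable + "".join(parts)


def ascii_url(dataset_url, constraint):
    """URL that asks the server for the selected data as ASCII text."""
    return dataset_url + ".ascii?" + constraint


DATASET = ("http://opendap.tudelft.nl/thredds/dodsC/data2/darelux/maisbich/"
           "Tcalibrated/2008/08/Tcalibrated2008_08.nc")

block = hyperslab("temperature", (0, 2000, 1), (0, 150, 1))    # a block
sample = hyperslab("temperature", (0, 2000, 10), (0, 150, 1))  # 1 in 10 time steps
print(ascii_url(DATASET, block))
print(ascii_url(DATASET, sample))
```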

OPeNDAP python example
import matplotlib.pyplot as plt
from pydap.client import open_url

year = '2008'
month = '08'
myurl = ('http://opendap.tudelft.nl/thredds/dodsC/data2/darelux/maisbich/Tcalibrated/'
         + year + '/' + month + '/Tcalibrated' + year + '_' + month + '.nc')
dataset = open_url(myurl)    # make connection
print(list(dataset.keys()))  # inspect dataset
T = dataset['temperature']   # choose a variable
print(T.shape)               # inspect the dimensions of this variable
T_red = T[:2000, :150]       # take only a part (subsetting is done server-side)
T_temp = T_red.array         # the data array of the grid
T_time = T_red.time          # coordinate along the first dimension
T_dist = T_red.distance      # coordinate along the second dimension
mesh = plt.pcolormesh(T_dist[:], T_time[:], T_temp[:])  # let's make a nice plot
mesh.axes.set_title('water temperature Maisbich [deg C]')
mesh.axes.set_xlabel('distance [m]')
mesh.axes.set_ylabel('time [days since ' + year + '-' + month + '-01T00:00:00]')
mesh.figure.colorbar(mesh)
mesh.figure.savefig('maisbich-' + year + '-' + month + '.png')
mesh.figure.clf()

OPeNDAP catalogs

Datasets are organized in catalogs (catalog.xml)
•Usually (not necessarily) maps to folder
•Contains location, size, date, available services of datasets

Catalogs are our hook to Fedora
Each catalog.xml corresponds to a Fedora object
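
As an illustration of what a catalog offers as a hook, the sketch below reads location, size and date of the datasets from a catalog.xml. The namespace is that of the THREDDS InvCatalog 1.0 schema; the sample catalog and the function name are made up, and how 3TU.DC actually turns catalogs into Fedora objects is not shown here.

```python
import xml.etree.ElementTree as ET

NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Made-up catalog, for illustration only
SAMPLE = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" name="example">
  <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="2008/08" ID="example/2008/08">
    <dataset name="a.nc" ID="example/2008/08/a.nc" urlPath="example/2008/08/a.nc">
      <dataSize units="Mbytes">12.5</dataSize>
      <date type="modified">2008-09-01T00:00:00Z</date>
    </dataset>
  </dataset>
</catalog>"""


def list_datasets(catalog_xml):
    """Return location, size and date of every dataset that has a urlPath."""
    root = ET.fromstring(catalog_xml)
    found = []
    for ds in root.iter("{%s}dataset" % NS["t"]):
        url_path = ds.get("urlPath")
        if url_path is None:
            continue  # a container (folder), not a data file
        size = ds.find("t:dataSize", NS)
        date = ds.find("t:date", NS)
        found.append({
            "urlPath": url_path,
            "size": None if size is None else "{} {}".format(size.text, size.get("units")),
            "modified": None if date is None else date.text,
        })
    return found


datasets = list_datasets(SAMPLE)
print(datasets)
```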

OPeNDAP – Fedora integration

Typical bulk ingest

For predictable data structures (e.g. a 2TB disk with data delivered every 3
months, structured in a well-agreed manner):

Bulk ingest from datalab [future?]

Less predictable data structures (e.g. a data lab that lifts its access barrier
after an embargo period):

THANK YOU
QQ?

data.3tu.nl
