Elag 2012 - Under the hood of 3TU.Datacentrum.



The 3TU.Datacentrum repository of research data hosts datasets as well as other objects representing measuring devices, locations, time periods and the like. Virtually all metadata is in rdf so the repository can be approached as an rdf graph. We will show how this is implemented with Fedora Commons, heavily leaning on rdf queries and xslt2.0. As a result of this architecture, it is relatively easy to make the repository linked-data-enabled by generating OAI/ORE resource maps.
While most of the metadata is rdf, most of the data is in NetCDF. Although not very well known in the library world, this is a very popular format in various fields of science and engineering. It comes with its own data server, OPeNDAP, which offers a rich API to interact with the data. Our repository is therefore a hybrid Fedora + OPeNDAP setup, and we will show how the two are integrated into a unified view and how they are kept in sync on ingest.
This was presented at the ELAG conference, Palma de Mallorca 2012.

Published in: Technology, Education

  1. Under the hood of 3TU.Datacentrum, a repository for research data
     Egbert Gramsbergen, TU Delft Library / 3TU.Datacentrum
     e.f.gramsbergen@tudelft.nl
     ELAG, 2012-05-17
  2. 3TU.Datacentrum
     • 3 Dutch TUs: Delft, Eindhoven, Twente
     • Project 2008-2011, going concern 2012-
     • Data archive
       – 2008-
       – “finished” data
       – preserve but do not forget usability
       – metadata harvestable (OAI-PMH)
       – metadata crawlable (OAI-ORE linked data)
       – data citable (by DataCite DOIs)
     • Data labs
       – Just starting
       – Unfinished data + software/scripts
  3. Technology
     • Fedora Repository software
     • THREDDS / OPeNDAP Repository software ?
     Image: http://commons.wikimedia.org/wiki/File:Engine_of_Trabant_601_S_of_Trabi_Safari_in_Dresden.jpg
  4. Fedora digital objects
     XML container with “datastreams” containing / pointing to (meta)data
     • 3 special RDF datastreams, indexed in triple store -> query with REST API / SPARQL
     • Any number of content datastreams
     XML datastreams may be inline; other datastreams are at a location managed by Fedora
  5. Fedora Content Model Architecture
     • Content Model object: links to Service Definition(s); optionally defines datastreams + mime-types
     • Service Definition object: defines operations (methods) on data objects, incl. parameters + validity constraints
     • Service Deployment object: implements the methods
     Requests are handled by some service whose location is known to the Service Deployment
     URL: /objects/<data object pid>/methods/<service definition pid>/<method name>[?<params>]
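The URL pattern on this slide can be sketched as a small helper. This is only an illustration of the pattern: the base URL, the PIDs, the method name and the parameter below are all hypothetical, not taken from the 3TU.DC installation.

```python
from urllib.parse import quote, urlencode


def dissemination_url(base, data_pid, sdef_pid, method, params=None):
    """Build /objects/<data pid>/methods/<sdef pid>/<method>[?<params>]."""
    # Fedora PIDs look like "namespace:id"; keep the colon readable in the path.
    path = "/objects/{}/methods/{}/{}".format(
        quote(data_pid, safe=":"),
        quote(sdef_pid, safe=":"),
        quote(method, safe=""),
    )
    url = base.rstrip("/") + path
    if params:
        # Sort for a stable, reproducible query string.
        url += "?" + urlencode(sorted(params.items()))
    return url


# Hypothetical values, for illustration only.
example = dissemination_url(
    "http://localhost:8080/fedora",
    "demo:dataset1",
    "demo:sdef-view",
    "getHtml",
    {"lang": "en"},
)
print(example)
```

A GET request on such a URL would be answered by whatever service the Service Deployment points to, so the caller never needs to know where that service lives.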
  6. Fedora API & Saxon xslt2 service
     APIs for viewing and manipulating objects
     View API (REST, GET method)
       – findObjects
       – getDissemination
       – getObjectHistory
       – listDatastreams
       – risearch (query triple store (ITQL, SPARQL))
       – …
     So everything has a url and returns xml
     All methods so far have to return xml or (x)html
     xslt is a natural fit (remember: you can easily open secondary documents, aka use the REST API)
     xslt2.0 is much more powerful than xslt1.0
     With Saxon, you can use Java classes/methods from within xslt (rarely needed; in 3TU.DC only for spherical trigonometry in geographical calculations)
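As a rough sketch of what a triple store query through risearch looks like, the helper below builds a request URL with the `type`, `lang`, `format` and `query` parameters of the Fedora 3 Resource Index Search interface. The base URL and PID are hypothetical, and `isPartOf` from Fedora's external relations namespace is used only as an example predicate; which predicates 3TU.DC actually uses is not stated here.

```python
from urllib.parse import urlencode

# Namespace of Fedora's built-in RELS-EXT relationship ontology.
RELS_EXT = "info:fedora/fedora-system:def/relations-external#"


def parts_of_query(pid):
    """SPARQL query for all objects that declare isPartOf the given object."""
    return "SELECT ?part WHERE {{ ?part <{}isPartOf> <info:fedora/{}> }}".format(
        RELS_EXT, pid
    )


def risearch_url(base, sparql, fmt="CSV", limit=None):
    """Build a risearch GET url for a SPARQL tuple query."""
    params = {"type": "tuples", "lang": "sparql", "format": fmt, "query": sparql}
    if limit is not None:
        params["limit"] = str(limit)
    return base.rstrip("/") + "/risearch?" + urlencode(params)


# Hypothetical base url and PID, for illustration only.
query = parts_of_query("demo:dataset1")
url = risearch_url("http://localhost:8080/fedora", query, limit=100)
print(url)
```

Because the result is just a URL that returns xml (with `format=Sparql`) or csv, it can be opened from within an xslt as a secondary document in the same way as any other REST call.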
  7. 3TU.DC architecture
     Saxon for:
     • html pages
     • rdf for linked data (OAI-ORE)
     • KML for maps
     • Faceted search forms
     • csv, cdl, Excel for datasets
     • xml for indexing by SOLR
     • xml for Datacite
     • xml for PROAI
     • … and more
     Not in picture:
     • PROAI (OAI-PMH service provider)
     • DOI registration (Datacite)
  8. 3TU.DC architecture [2]
     Content Model Architecture and xslt’s in detail
     • 10 content models
     • 7 service definition objects with 19 methods
     • 14 service deployment objects using 32 xslt’s
     Left to right: content models, service deployments, methods aka xslt’s, service definitions
     Lines: CMA, xslt imports, xml includes
     All xslt’s are datastreams of one special xslt object.
  9. rdf relations in 3TU.DC
     Example relations (namespaces are omitted for brevity)
  10. UI as rdf / linked data viewer
      This dataset has some metadata and is part of this dataset with these metadata
      It was calculated from this dataset with these metadata
      measured by this instrument with these metadata
  11. UI as rdf / linked data viewer [2]
      Dilemmas - how far will you go?
      • Which relations must be expanded?
      • How many levels deep?
      • Which inverse relations will you show?
      • Show repetitions?
      Answer: trial and error
      Set of rules for each type of relation
      Show enough for context but not too much… it’s a delicate balance
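One way to picture "a set of rules for each type of relation" is a depth-limited walk over the graph in which every relation type carries its own maximum depth. The sketch below is an assumption about how such rules could look, not the actual 3TU.DC rule set (which lives in xslt); the graph, the relation names and the depth limits are made up.

```python
# Hypothetical graph: subject -> list of (relation, object).
GRAPH = {
    "dataset:A": [("isPartOf", "dataset:B"), ("calculatedFrom", "dataset:C")],
    "dataset:B": [("isPartOf", "dataset:Root")],
    "dataset:C": [("measuredBy", "instrument:X"), ("isPartOf", "dataset:B")],
    "instrument:X": [],
}

# Hypothetical rules: a relation is followed only while depth < its limit.
MAX_DEPTH = {"isPartOf": 1, "calculatedFrom": 1, "measuredBy": 2}


def expand(node, graph, max_depth, depth=0, seen=None):
    """Return a nested list of (relation, object, subtree) for display."""
    seen = {node} if seen is None else seen
    tree = []
    for relation, target in graph.get(node, []):
        if depth >= max_depth.get(relation, 0):
            continue  # rule: this relation is not shown at this level
        if target in seen:
            # Repetition: show a bare link instead of expanding again.
            tree.append((relation, target, None))
            continue
        seen.add(target)
        subtree = expand(target, graph, max_depth, depth + 1, seen)
        tree.append((relation, target, subtree))
    return tree


view = expand("dataset:A", GRAPH, MAX_DEPTH)
print(view)
```

Tuning then comes down to editing the limits per relation, which matches the "trial and error" answer on the slide. Inverse relations could be handled the same way by adding reversed edges under their own relation names.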
  12. Reminder
      What about this part?
  13. NetCDF
      NetCDF: data format + data model
      • Developed by UCAR (University Corporation for Atmospheric Research, USA), roots at NASA, 1987.
      • Comes with a set of software tools / interfaces for programming languages.
      • Binary format, but data can be dumped in ASCII or xml
      • Used mainly in geosciences (e.g. climate forecast models)
      • BUT: fit for almost any type of numeric data + metadata
      • Core data type: multidimensional array
      >90% of 3TU.DC data is in NetCDF
  14. NetCDF [2]
      Example: T(x,y,z,t) - what can we say in NetCDF?
      • Variable T (4D array)
      • Variables x,y,z,t (1D arrays)
      • Dimensions x,y,z,t
      • Attributes: creator=‘me’
      • Attributes: x.units=‘m’, y.units=‘m’, z.units=‘m’, t.units=‘s’, T.units=‘deg_C’, T.name=‘Temperature’, T.error=0.1, etc…
      You may invent your own attributes or use conventions (e.g. CF4)
      Newer NetCDF versions:
      • More complex / irregular / nested structures
      • Built-in compression by variable
      Boost compression with “leastSignificantDigit=n”
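To make the T(x,y,z,t) example concrete, the sketch below writes its header in CDL, the text notation that the NetCDF tools `ncdump` and `ncgen` read and write. The file name, data types and dimension sizes are assumptions chosen for illustration; the variables and attributes are the ones listed on the slide.

```python
def to_cdl(name, dims, variables, global_attrs):
    """Render a NetCDF header (no data section) as CDL text."""

    def fmt(value):
        # Strings are quoted in CDL, numbers are written as-is.
        return '"{}"'.format(value) if isinstance(value, str) else repr(value)

    lines = ["netcdf {} {{".format(name), "dimensions:"]
    for dim, size in dims.items():
        lines.append("\t{} = {} ;".format(dim, size))
    lines.append("variables:")
    for var, (dtype, var_dims, attrs) in variables.items():
        lines.append("\t{} {}({}) ;".format(dtype, var, ", ".join(var_dims)))
        for key, value in attrs.items():
            lines.append("\t\t{}:{} = {} ;".format(var, key, fmt(value)))
    lines.append("// global attributes:")
    for key, value in global_attrs.items():
        lines.append("\t\t:{} = {} ;".format(key, fmt(value)))
    lines.append("}")
    return "\n".join(lines)


# Dimension sizes and data types are hypothetical.
dims = {"x": 10, "y": 10, "z": 5, "t": 100}
variables = {
    "x": ("double", ["x"], {"units": "m"}),
    "y": ("double", ["y"], {"units": "m"}),
    "z": ("double", ["z"], {"units": "m"}),
    "t": ("double", ["t"], {"units": "s"}),
    "T": ("float", ["x", "y", "z", "t"],
          {"units": "deg_C", "name": "Temperature", "error": 0.1}),
}
cdl = to_cdl("example", dims, variables, {"creator": "me"})
print(cdl)
```

Each coordinate variable shares its name with a dimension, which is how NetCDF ties the 1D arrays x, y, z, t to the axes of the 4D array T.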
  15. OPeNDAP
      OPeNDAP: protocol to talk to NetCDF (and similar) data over internet
      THREDDS: server that speaks OPeNDAP
      • Internal metadata directly visible on site
      • APIs for all main programming languages
      • Queries to obtain:
        – cross-sections (slices, blocks)
        – samples (take only 1 in n points)
        – aggregated datasets (e.g. glue together consecutive time series)
      Queries are handled server-side (datafiles in 3TU.DC are up to 100GB)
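Slices and samples are expressed in an OPeNDAP (DAP2) request as a constraint expression of the form `variable[start:stride:stop]`, one bracket per dimension, with an inclusive stop index. The sketch below translates Python slices into that notation. The dataset URL, variable name and array shape are hypothetical.

```python
def hyperslab(s, size):
    """Translate one Python slice into DAP2 [start:stride:stop] (stop inclusive)."""
    start, stop, step = s.indices(size)
    if step <= 0 or stop <= start:
        raise ValueError("only non-empty forward slices are supported")
    # Python's stop is exclusive; DAP2 wants the last index actually selected.
    last = start + ((stop - 1 - start) // step) * step
    return "[{}:{}:{}]".format(start, step, last)


def constraint(variable, slices, shape):
    """Constraint expression for one variable, one hyperslab per dimension."""
    return variable + "".join(hyperslab(s, n) for s, n in zip(slices, shape))


def subset_url(dataset_url, variable, slices, shape, response="ascii"):
    """Url asking the server for a subset in the given response format."""
    return "{}.{}?{}".format(dataset_url, response, constraint(variable, slices, shape))


# Hypothetical dataset of shape (time, distance) = (50000, 1000).
shape = (50000, 1000)
block = constraint("temperature", (slice(None, 2000), slice(None, 150)), shape)
sample = constraint("temperature", (slice(None, None, 10), slice(None)), shape)
url = subset_url("http://example.org/thredds/dodsC/some/file.nc",
                 "temperature", (slice(None, 2000), slice(None, 150)), shape)
print(block)
print(sample)
print(url)
```

Only the selected cells travel over the network, which is what makes working with files of up to 100GB practical. Client libraries such as pydap build these expressions behind the scenes when a remote variable is indexed.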
  16. OPeNDAP python example

      import matplotlib.pyplot as plt
      from pydap.client import open_url

      # year and month are strings: they are concatenated into the url and the labels
      year = '2008'
      month = '08'
      myurl = ('http://opendap.tudelft.nl/thredds/dodsC/data2/darelux/maisbich/Tcalibrated/'
               + year + '/' + month + '/Tcalibrated' + year + '_' + month + '.nc')

      dataset = open_url(myurl)       # make connection
      print(dataset.keys())           # inspect dataset
      T = dataset['temperature']      # choose a variable
      print(T.shape)                  # inspect the dimensions of this variable
      T_red = T[:2000, :150]          # take only a part
      T_temp = T_red.array
      T_time = T_red.time
      T_dist = T_red.distance

      # let's make a nice plot
      mesh = plt.pcolormesh(T_dist[:], T_time[:], T_temp[:])
      mesh.axes.set_title('water temperature Maisbich [deg C]')
      mesh.axes.set_xlabel('distance [m]')
      mesh.axes.set_ylabel('time [days since ' + year + '-' + month + '-01T00:00:00]')
      mesh.figure.colorbar(mesh)
      mesh.figure.savefig('maisbich-' + year + '-' + month + '.png')
      mesh.figure.clf()
  17. OPeNDAP catalogs
      Datasets are organized in catalogs (catalog.xml)
      • Usually (not necessarily) maps to folder
      • Contains location, size, date, available services of datasets
      Catalogs are our hook to Fedora
      catalog.xml -> Fedora object
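A minimal sketch of reading location, size, date and services from a THREDDS catalog, assuming the standard InvCatalog 1.0 namespace. The sample catalog is invented for illustration, and how 3TU.DC turns these records into Fedora objects is not shown here.

```python
import xml.etree.ElementTree as ET

NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Invented sample catalog; real catalogs are served as catalog.xml by THREDDS.
SAMPLE = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" name="example">
  <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="2008/08" ID="example/2008/08">
    <dataset name="file.nc" urlPath="example/2008/08/file.nc">
      <dataSize units="Mbytes">12.5</dataSize>
      <date type="modified">2011-03-01T12:00:00Z</date>
    </dataset>
  </dataset>
</catalog>
"""


def list_datasets(catalog_xml):
    """Return one record per dataset that has a urlPath (i.e. actual data)."""
    root = ET.fromstring(catalog_xml)
    services = [(s.get("serviceType"), s.get("base"))
                for s in root.iter("{%s}service" % NS["t"])]
    records = []
    for ds in root.iter("{%s}dataset" % NS["t"]):
        url_path = ds.get("urlPath")
        if url_path is None:
            continue  # container dataset (typically a folder)
        size = ds.find("t:dataSize", NS)
        date = ds.find("t:date", NS)
        records.append({
            "name": ds.get("name"),
            "size": None if size is None else (float(size.text), size.get("units")),
            "date": None if date is None else date.text,
            # One access path per service offered by the catalog.
            "access": {stype: base + url_path for stype, base in services},
        })
    return records


records = list_datasets(SAMPLE)
print(records)
```

Records like these carry enough information to create or update a matching repository object and to notice when a file on the OPeNDAP side has changed.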
  18. OPeNDAP – Fedora integration
  19. Typical bulk ingest
      For predictable data structures (e.g. a 2TB disk with data delivered every 3 months, structured in a well-agreed manner):
  20. Bulk ingest from datalab [future?]
      Less predictable data structures (e.g. a datalab which lifts the barrier after an embargo period):
  21. THANK YOU
      Questions?
      data.3tu.nl