Under the hood of 3TU.Datacentrum,
             a repository for research data.
                             abstract




Egbert Gramsbergen
TU Delft Library /
3TU.Datacentrum
e.f.gramsbergen@tudelft.nl




ELAG, 2012-05-17
3TU.Datacentrum
• 3 Dutch TUs: Delft, Eindhoven, Twente
• Project 2008-2011, going concern 2012-
• Data archive
   –   2008-
   –   “finished” data
   –   preserve but do not forget usability
   –   metadata harvestable (OAI-PMH)
   –   metadata crawlable (OAI-ORE linked data)
   –   data citable (by DataCite DOIs)
• Data labs
   – Just starting
   – Unfinished data + software/scripts
Technology

• Fedora
     Repository software


• THREDDS / OPeNDAP
     Repository software




                                      ?
http://commons.wikimedia.org/wiki/File:Engine_of_Trabant_601_S_of_Trabi_Safari_in_Dresden.jpg
Fedora digital objects

     XML container with “datastreams” containing /
     pointing to (meta)data

     •3 special RDF datastreams
     indexed in triple store
     -> query with REST API / SPARQL




     •Any number of content datastreams



     XML datastreams may be stored inline;
     other datastreams are stored at a location managed by Fedora
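
The triple-store query mentioned on this slide can be sketched as a plain GET request. This is a minimal sketch, not the 3TU.DC code: the base URL is a placeholder, and the query (list objects with their content models) is only an illustration. The `type`, `lang`, `format` and `query` parameters are those of the Fedora 3 resource index search interface.

```python
from urllib.parse import urlencode

# Hypothetical base URL of a Fedora 3 installation; replace with a real one.
FEDORA_BASE = "http://localhost:8080/fedora"


def risearch_url(sparql, base=FEDORA_BASE):
    """Build a GET URL that sends a SPARQL query to the resource index."""
    params = {
        "type": "tuples",    # ask for variable bindings
        "lang": "sparql",    # query language (iTQL is the alternative)
        "format": "Sparql",  # SPARQL XML results
        "query": sparql,
    }
    return base + "/risearch?" + urlencode(params)


# Illustrative query: every object and the content model it declares.
QUERY = (
    "SELECT ?obj ?model WHERE { "
    "?obj <info:fedora/fedora-system:def/model#hasModel> ?model }"
)

url = risearch_url(QUERY)
print(url)
```

Fetching this URL returns XML, so the result can be fed straight into a stylesheet.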
Fedora Content Model Architecture
Content Model object: links to Service Definition(s)
optionally defines datastreams + mime-types
Service Definition object: defines operations (methods) on data objects
incl. parameters + validity constraints
Service Deployment object: implements the methods
Requests are handled by some service whose location is known to the Service Deployment




URL: /objects/<data object pid>/methods/<service definition pid>/<method name>[?<params>]
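
The URL pattern above can be assembled mechanically. A minimal sketch: the host, the pids and the method name are invented for illustration and do not refer to real 3TU.DC objects.

```python
from urllib.parse import quote, urlencode


def dissemination_url(base, data_pid, sdef_pid, method, params=None):
    """Build /objects/<pid>/methods/<sdef pid>/<method>[?<params>]."""
    path = "/objects/{}/methods/{}/{}".format(
        quote(data_pid, safe=":"), quote(sdef_pid, safe=":"), quote(method)
    )
    query = "?" + urlencode(params) if params else ""
    return base.rstrip("/") + path + query


# All identifiers below are made up for illustration.
url = dissemination_url(
    "http://localhost:8080/fedora",
    "demo:dataset-1",
    "demo:sdef-view",
    "getHtml",
    {"lang": "en"},
)
print(url)
```

The content model decides which service definitions (and so which method names) are valid for a given data object.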
Fedora API & Saxon xslt2 service
APIs for viewing and manipulating objects
View API (REST, GET method)
     –   findObjects
     –   getDissemination
     –   getObjectHistory
     –   listDatastreams
     –   risearch (query triple store (iTQL, SPARQL))
     –   …

So everything has a URL and returns XML
All methods so far have to return XML or (X)HTML
XSLT is a natural fit
(remember: XSLT can easily open secondary documents, i.e. call the REST API)
XSLT 2.0 is much more powerful than XSLT 1.0
With Saxon, you can use Java classes/methods from within XSLT
(rarely needed; in 3TU.DC only for spherical trigonometry in geographical calculations)
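
To give an idea of the kind of spherical trigonometry meant here, below is a great-circle distance using the haversine formula. This is only an illustration in Python; the deck does not say which calculations 3TU.DC performs, and the Earth radius is an assumed mean value.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the equator: from (0, 0) to (0, 90 degrees east).
quarter = great_circle_km(0, 0, 0, 90)
print(quarter)
```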
3TU.DC architecture




               Saxon for:
               •html pages
               •rdf for linked data (OAI-ORE)
               •KML for maps
               •Faceted search forms
               •csv, cdl, Excel for datasets
               •xml for indexing by SOLR
               •xml for DataCite
               •xml for PROAI
               •… and more

               Not in picture:
               •PROAI (OAI-PMH service
               provider)
               •DOI registration (DataCite)
3TU.DC architecture [2]
Content Model Architecture and xslt’s in detail
•10 content models
•7 service definition objects with 19 methods
•14 service deployment objects using 32 xslt’s




 Figure: left to right: content models, service deployments, methods aka xslt’s, service definitions.
 Lines: CMA, xslt imports, xml includes. All xslt’s are datastreams of one special xslt object.
rdf relations in 3TU.DC




Example relations (namespaces are omitted for brevity)
UI as rdf / linked data viewer


    [Screenshot annotations]
    This dataset has some metadata
    and is part of this dataset, with these metadata
    It was calculated from this dataset, with these metadata
    measured by this instrument, with these metadata
UI as rdf / linked data viewer [2]

Dilemmas - how far will you go?

•Which relations must be expanded?
•How many levels deep?
•Which inverse relations will you show?
•Show repetitions?



Answer: trial and error

Set of rules for each type of relation

Show enough for context but not too much… it’s a delicate balance
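
A "set of rules for each type of relation" could look like the sketch below: each relation type gets a maximum depth, and nodes already on the current path are not shown again. The relation names, depths and triples are invented; the real 3TU.DC rules are not given in these slides.

```python
# Hypothetical rules: up to which level each relation type is followed.
MAX_DEPTH = {"isPartOf": 1, "wasCalculatedFrom": 1, "measuredBy": 2}

# Toy triples (subject, predicate, object); all names are invented.
TRIPLES = [
    ("ds:1", "isPartOf", "ds:collection"),
    ("ds:1", "wasCalculatedFrom", "ds:raw"),
    ("ds:raw", "measuredBy", "instr:radar"),
    ("ds:raw", "wasCalculatedFrom", "ds:rawer"),
    ("ds:collection", "isPartOf", "ds:1"),  # a cycle, to exercise repetition handling
]


def expand(node, triples, depth=1, seen=None):
    """Return a nested list of (predicate, object, children) to display."""
    seen = (seen or set()) | {node}
    result = []
    for s, p, o in triples:
        if s != node or o in seen:       # other subject, or a repetition
            continue
        if depth > MAX_DEPTH.get(p, 0):  # the rule for this relation says stop
            continue
        result.append((p, o, expand(o, triples, depth + 1, seen)))
    return result


tree = expand("ds:1", TRIPLES)
print(tree)
```

Tuning then comes down to editing the `MAX_DEPTH` table, which fits the trial-and-error approach.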
Reminder

           What about this part?
NetCDF

NetCDF: data format + data model

•Developed by UCAR (University Corporation for Atmospheric Research, USA), roots at NASA, 1987.
•Comes with a set of software tools / interfaces for programming languages.
•Binary format, but data can be dumped in ASCII or XML
•Used mainly in geosciences (e.g. climate forecast models)
•BUT: fit for almost any type of numeric data + metadata
•Core data type: multidimensional array


>90% of 3TU.DC data is in NetCDF
NetCDF [2]
Example: T(x,y,z,t) - what can we say in NetCDF?

Variable T (4D array)
Variables x,y,z,t (1D arrays)
Dimensions x,y,z,t
Attributes: creator=‘me’
Attributes: x.units=‘m’, y.units=‘m’, z.units=‘m’, t.units=‘s’, T.units=‘deg_C’
        T.name=‘Temperature’, T.error=0.1, etc…
You may invent your own attributes or use conventions (e.g. CF4)
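
The T(x,y,z,t) example can be written down as dimensions, variables and attributes, and printed in CDL, the text form that NetCDF dumps use. A minimal sketch with plain Python dictionaries rather than a NetCDF library; the dimension sizes and the `double` type are arbitrary assumptions.

```python
# Model of the slide's T(x,y,z,t) example; sizes are arbitrary assumptions.
dimensions = {"x": 10, "y": 10, "z": 5, "t": 100}
variables = {
    "x": (("x",), {"units": "m"}),
    "y": (("y",), {"units": "m"}),
    "z": (("z",), {"units": "m"}),
    "t": (("t",), {"units": "s"}),
    "T": (
        ("x", "y", "z", "t"),
        {"units": "deg_C", "name": "Temperature", "error": 0.1},
    ),
}
global_attributes = {"creator": "me"}


def fmt(value):
    """Quote strings, leave numbers as they are."""
    return '"{}"'.format(value) if isinstance(value, str) else repr(value)


def to_cdl(name):
    """Render the model as a CDL header."""
    lines = ["netcdf {} {{".format(name), "dimensions:"]
    lines += ["\t{} = {} ;".format(d, n) for d, n in dimensions.items()]
    lines.append("variables:")
    for var, (dims, attrs) in variables.items():
        lines.append("\tdouble {}({}) ;".format(var, ", ".join(dims)))
        lines += ["\t\t{}:{} = {} ;".format(var, k, fmt(v)) for k, v in attrs.items()]
    lines.append("// global attributes:")
    lines += ["\t\t:{} = {} ;".format(k, fmt(v)) for k, v in global_attributes.items()]
    lines.append("}")
    return "\n".join(lines)


cdl = to_cdl("example")
print(cdl)
```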


Newer NetCDF versions:
•More complex / irregular / nested structures
•Built-in compression per variable
•Boost compression with “leastSignificantDigit=n” (keep only n decimal places of precision)
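
Why does dropping insignificant digits boost compression? The sketch below rounds synthetic temperatures to a power-of-two grid that still preserves 0.1 precision, then compares deflate sizes. The rounding rule is an assumption modeled on how NetCDF tooling is commonly described; the data are random, not 3TU.DC data.

```python
import math
import random
import struct
import zlib


def quantize(values, least_significant_digit):
    """Round so that 10**-least_significant_digit precision is kept."""
    bits = math.ceil(math.log2(10 ** least_significant_digit))
    scale = 2 ** bits
    return [round(v * scale) / scale for v in values]


def compressed_size(values):
    """Size in bytes after packing as 64-bit floats and deflating."""
    raw = struct.pack("{}d".format(len(values)), *values)
    return len(zlib.compress(raw))


random.seed(0)  # reproducible synthetic temperatures around 15 deg C
temps = [15 + 5 * math.sin(i / 50) + random.gauss(0, 0.05) for i in range(10000)]

size_full = compressed_size(temps)
size_quantized = compressed_size(quantize(temps, 1))
print(size_full, size_quantized)
```

After rounding, most mantissa bits are zero, so the compressor finds far more repetition.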
OPeNDAP

OPeNDAP: protocol to talk to NetCDF (and similar) data over internet
THREDDS: server that speaks OPeNDAP

•Internal metadata directly visible on site
•APIs for all main programming languages
•Queries to obtain:
     – cross-sections (slices, blocks)
     – samples (take only 1 in n points)
     – aggregated datasets (e.g. glue together consecutive time series)

       Queries are handled server-side
       (Datafiles in 3TU.DC are up to 100GB)
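
Slices and samples are requested by adding a constraint expression to the dataset URL, so only the selected block travels over the network. A minimal sketch of building such a URL; the dataset URL and variable name are placeholders.

```python
def hyperslab(variable, ranges):
    """Build a DAP2 constraint such as temperature[0:1:1999][0:10:149].

    Each range is (start, stride, stop); stop is inclusive in DAP2.
    """
    return variable + "".join("[{}:{}:{}]".format(*r) for r in ranges)


def subset_url(dataset_url, variable, ranges, suffix=".ascii"):
    """URL that asks the server for only the requested block."""
    return dataset_url + suffix + "?" + hyperslab(variable, ranges)


# Hypothetical dataset URL and variable name.
url = subset_url(
    "http://example.org/thredds/dodsC/demo/data.nc",
    "temperature",
    [(0, 1, 1999), (0, 10, 149)],  # rows 0..1999, every 10th column of 0..149
)
print(url)
```

A stride greater than 1 gives the "1 in n points" sampling; client libraries normally generate these expressions for you.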
OPeNDAP python example
import matplotlib.pyplot as plt
from pydap.client import open_url

year = '2008'
month = '08'
# build the dataset URL (one file per month)
myurl = ('http://opendap.tudelft.nl/thredds/dodsC/data2/darelux/maisbich/Tcalibrated/'
         + year + '/' + month + '/Tcalibrated' + year + '_' + month + '.nc')
dataset = open_url(myurl)      # make connection
print(list(dataset.keys()))    # inspect dataset
T = dataset['temperature']     # choose a variable
print(T.shape)                 # inspect the dimensions of this variable
T_red = T[:2000,:150]          # take only a part (subset is made server-side)
T_temp = T_red.array           # the temperature values
T_time = T_red.time            # coordinate along the first dimension
T_dist = T_red.distance        # coordinate along the second dimension
mesh = plt.pcolormesh(T_dist[:],T_time[:],T_temp[:]) # let's make a nice plot
mesh.axes.set_title('water temperature Maisbich [deg C]')
mesh.axes.set_xlabel('distance [m]')
mesh.axes.set_ylabel('time [days since '+year+'-'+month+'-01T00:00:00]')
mesh.figure.colorbar(mesh)
mesh.figure.savefig('maisbich-'+year+'-'+month+'.png')
mesh.figure.clf()
OPeNDAP catalogs

Datasets are organized in catalogs (catalog.xml)
•Usually (not necessarily) maps to folder
•Contains location, size, date, available services of datasets

Catalogs are our hook to Fedora
catalog.xml -> Fedora object
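
Reading a catalog is ordinary XML processing. The sketch below extracts the fields named above (location, size, date) from a made-up catalog fragment; what is then stored in the Fedora object is not specified in these slides, so the output records are only a hypothetical starting point.

```python
import xml.etree.ElementTree as ET

# Namespace of THREDDS catalog documents.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Made-up catalog fragment, reduced to the elements used here.
CATALOG = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="2008-08" ID="demo/2008/08">
    <dataset name="T2008_08.nc" urlPath="demo/2008/08/T2008_08.nc">
      <dataSize units="Mbytes">12.5</dataSize>
      <date type="modified">2012-01-15T10:00:00Z</date>
    </dataset>
  </dataset>
</catalog>"""


def list_datasets(catalog_xml):
    """Return one dict per dataset that has a urlPath (i.e. is a data file)."""
    root = ET.fromstring(catalog_xml)
    found = []
    for ds in root.iter("{%s}dataset" % NS["t"]):
        if "urlPath" not in ds.attrib:
            continue  # a container (folder), not a file
        size = ds.find("t:dataSize", NS)
        date = ds.find("t:date", NS)
        found.append({
            "name": ds.get("name"),
            "urlPath": ds.get("urlPath"),
            "size": None if size is None else (float(size.text), size.get("units")),
            "modified": None if date is None else date.text,
        })
    return found


datasets = list_datasets(CATALOG)
print(datasets)
```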
OPeNDAP – Fedora integration
Typical bulk ingest

For predictable data structures (e.g. a 2TB disk with data delivered every 3
months, structured in a well-agreed manner):
Bulk ingest from datalab [future?]

For less predictable data structures (e.g. a datalab that lifts the barrier after
an embargo period):
THANK YOU
   QQ?


 data.3tu.nl

Elag 2012 - Under the hood of 3TU.Datacentrum.

  • 1. Under the hood of 3TU.Datacentrum, a repository for research data. abstract Egbert Gramsbergen TU Delft Library / 3TU.Datacentrum e.f.gramsbergen@tudelft.nl ELAG, 2012-05-17
  • 2. 3TU.Datacentrum • 3 Dutch TU’s: Delft, Eindhoven, Twente • Project 2008-2011, going concern 2012- • Data archive – 2008- – “finished” data – preserve but do not forget usability – metadata harvestable (OAI-PMH) – metadata crawlable (OAI-ORE linked data) – data citable (by DataCite DOI’s) • Data labs – Just starting – Unfinished data + software/scripts
  • 3. Technology • Fedora Repository software • THREDDS / OPeNDAP Repository software ? http://commons.wikimedia.org/wiki/File:Engine_of_Trabant_601_S_of_Trabi_Safari_in_Dresden.jpg
  • 4. Fedora digital objects XML container with “datastreams” containing / pointing to (meta)data •3 special RDF datastreams indexed in triple store -> query with REST API / SPARQL •Any number of content datastreams xml datastreams may be inline, other datastreams are on a location managed by Fedora
  • 5. Fedora Content Model Architecture Content Model object: links to Service Definition(s) optionally defines datastreams + mime-types Service Definition object: defines operations (methods) on data objects incl parameters + validity constraints Service Deployment object: implements the methods Requests are handled by some service whose location is known to the Service Deployment URL: /objects/<data object pid>/methods/<service definition pid>/<method name>[?<params>]
  • 6. Fedora API & Saxon xslt2 service
      API’s for viewing and manipulating objects
      View API (REST, GET method)
          – findObjects
          – getDissemination
          – getObjectHistory
          – listDatastreams
          – risearch (query triple store (ITQL, SPARQL))
          – …
      So everything has a url and returns xml
      All methods so far have to return xml or (x)html
      xslt is a natural fit (remember: you can easily open secondary documents, i.e. use the REST API)
      xslt2.0 is much more powerful than xslt1.0
      With Saxon, you can use Java classes/methods from within xslt (rarely needed, in 3TU.DC only for spherical trigonometry in geographical calculations)
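To illustrate "everything has a url", here is a sketch of how a triple-store query could be turned into a risearch URL. It assumes the Fedora 3.x resource index search parameters (`type`, `lang`, `format`, `query`) and should be checked against the installed version. The host and the content model PID `demo:cmodel-dataset` are made up.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: a local Fedora 3.x instance; adjust to the real host.
FEDORA_BASE = "http://localhost:8080/fedora"

# Hypothetical query: all objects that declare a (made-up) content model.
QUERY = (
    "SELECT ?obj WHERE { ?obj "
    "<info:fedora/fedora-system:def/model#hasModel> "
    "<info:fedora/demo:cmodel-dataset> }"
)


def risearch_url(base, query, lang="sparql", fmt="Sparql"):
    """Return a GET URL for a tuple query against the resource index."""
    params = {"type": "tuples", "lang": lang, "format": fmt, "query": query}
    return base.rstrip("/") + "/risearch?" + urlencode(params)


url = risearch_url(FEDORA_BASE, QUERY)
# Decode the query string again to confirm the query survives URL encoding.
decoded = parse_qs(urlsplit(url).query)
print(url)
```

The same URL could be opened as a secondary document from within an xslt, which is the pattern the slide describes.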
  • 7. 3TU.DC architecture
      Saxon for:
      • html pages
      • rdf for linked data (OAI-ORE)
      • KML for maps
      • Faceted search forms
      • csv, cdl, Excel for datasets
      • xml for indexing by SOLR
      • xml for Datacite
      • xml for PROAI
      • … and more
      Not in picture:
      • PROAI (OAI-PMH service provider)
      • DOI registration (Datacite)
  • 8. 3TU.DC architecture [2]
      Content Model Architecture and xslt’s in detail
      • 10 content models
      • 7 service definition objects with 19 methods
      • 14 service deployment objects using 32 xslt’s
      Left to right: content models, service deployments, methods aka xslt’s, service definitions
      Lines: CMA, xslt imports, xml includes.
      All xslt’s are datastreams of one special xslt object.
  • 9. rdf relations in 3TU.DC Example relations (namespaces are omitted for brevity)
  • 10. UI as rdf / linked data viewer
      This dataset has some metadata and is part of this dataset with these metadata.
      It was calculated from this dataset with these metadata,
      measured by this instrument with these metadata.
  • 11. UI as rdf / linked data viewer [2]
      Dilemmas - how far will you go?
      • Which relations must be expanded?
      • How many levels deep?
      • Which inverse relations will you show?
      • Show repetitions?
      Answer: trial and error
      Set of rules for each type of relation
      Show enough for context but not too much… it’s a delicate balance
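A "set of rules for each type of relation" could take the form of a depth limit per relation. This is only a sketch of the idea, not the rules used in 3TU.DC: the graph, the relation names and the depth limits are invented.

```python
# Hypothetical mini-graph: subject -> list of (relation, object).
GRAPH = {
    "dataset:A": [("isPartOf", "dataset:B"), ("isDerivedFrom", "dataset:C")],
    "dataset:B": [("isPartOf", "collection:X")],
    "dataset:C": [("isMeasuredBy", "instrument:I")],
    "instrument:I": [],
    "collection:X": [],
}

# Hypothetical rule per relation: deepest level at which it is still shown.
MAX_DEPTH = {"isPartOf": 1, "isDerivedFrom": 2, "isMeasuredBy": 2}


def expand(node, depth=1, path=None):
    """Return nested (relation, object, children) tuples to display for node."""
    path = {node} if path is None else path
    shown = []
    for relation, target in GRAPH.get(node, []):
        if depth > MAX_DEPTH.get(relation, 0):
            continue  # rule: this relation is not followed this deep
        if target in path:
            continue  # already on the current path: avoid cycles
        shown.append((relation, target, expand(target, depth + 1, path | {target})))
    return shown


tree = expand("dataset:A")
print(tree)
```

With these limits the viewer shows the parent dataset but not the parent's parent, while the derivation chain is followed one level further to the instrument.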
  • 12. Reminder What about this part?
  • 13. NetCDF
      NetCDF: data format + data model
      • Developed by UCAR (University Corporation for Atmospheric Research, USA), roots at NASA, 1987.
      • Comes with a set of software tools / interfaces for programming languages.
      • Binary format, but data can be dumped in ASCII or XML
      • Used mainly in geosciences (e.g. climate forecast models)
      • BUT: fit for almost any type of numeric data + metadata
      • Core data type: multidimensional array
      >90% of 3TU.DC data is in NetCDF
  • 14. NetCDF [2]
      Example: T(x,y,z,t) - what can we say in NetCDF?
      Variable T (4D array)
      Variables x,y,z,t (1D arrays)
      Dimensions x,y,z,t
      Attributes: creator=‘me’
      Attributes: x.units=‘m’, y.units=‘m’, z.units=‘m’, t.units=‘s’, T.units=‘deg_C’, T.name=‘Temperature’, T.error=0.1, etc…
      You may invent your own attributes or use conventions (e.g. CF4)
      Newer NetCDF versions:
      • More complex / irregular / nested structures
      • Built-in compression by variable
      • Boost compression with “leastSignificantDigit=n”
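The T(x,y,z,t) example can be written out in CDL, the text notation that `ncdump` prints. The sketch below generates that text with plain Python and does not use the NetCDF library. The dimension sizes and the dataset name `example` are invented; the attribute names follow the slide.

```python
def to_cdl(name, dims, variables, global_attrs):
    """Render a dataset description as CDL text."""
    def fmt(value):
        # Strings are quoted in CDL, numbers are written as-is.
        return '"{}"'.format(value) if isinstance(value, str) else repr(value)

    lines = ["netcdf {} {{".format(name), "dimensions:"]
    for dim, size in dims.items():
        lines.append("\t{} = {} ;".format(dim, size))
    lines.append("variables:")
    for var, (dtype, var_dims, attrs) in variables.items():
        lines.append("\t{} {}({}) ;".format(dtype, var, ", ".join(var_dims)))
        for key, value in attrs.items():
            lines.append("\t\t{}:{} = {} ;".format(var, key, fmt(value)))
    lines.append("// global attributes:")
    for key, value in global_attrs.items():
        lines.append("\t\t:{} = {} ;".format(key, fmt(value)))
    lines.append("}")
    return "\n".join(lines)


dims = {"x": 10, "y": 10, "z": 5, "t": 100}  # sizes are made up
variables = {
    "x": ("double", ["x"], {"units": "m"}),
    "y": ("double", ["y"], {"units": "m"}),
    "z": ("double", ["z"], {"units": "m"}),
    "t": ("double", ["t"], {"units": "s"}),
    "T": ("float", ["x", "y", "z", "t"],
          {"units": "deg_C", "name": "Temperature", "error": 0.1}),
}
cdl = to_cdl("example", dims, variables, {"creator": "me"})
print(cdl)
```

The output separates the three building blocks named on the slide: dimensions, variables with their attributes, and global attributes.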
  • 15. OPeNDAP
      OPeNDAP: protocol to talk to NetCDF (and similar) data over the internet
      THREDDS: server that speaks OPeNDAP
      • Internal metadata directly visible on site
      • APIs for all main programming languages
      • Queries to obtain:
          – cross-sections (slices, blocks)
          – samples (take only 1 in n points)
          – aggregated datasets (e.g. glue together consecutive time series)
      Queries are handled server-side (data files in 3TU.DC are up to 100GB)
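Slices and 1-in-n samples map onto DAP2 constraint expressions of the form `var[start:stride:stop]`, appended to the dataset URL. The sketch below only builds such a URL. The dataset URL is a placeholder and the helper names are invented. Some clients percent-encode the square brackets, which is not done here.

```python
def hyperslab(variable, *ranges):
    """Build a DAP2 constraint from Python-style (start, stop, step) tuples.

    Python ranges exclude the stop index; DAP2 includes it, hence stop - 1.
    """
    parts = []
    for start, stop, step in ranges:
        parts.append("[{}:{}:{}]".format(start, step, stop - 1))
    return variable + "".join(parts)


def subset_url(dataset_url, constraint, response="ascii"):
    """Append a response suffix (e.g. ascii, dods) and the constraint."""
    return "{}.{}?{}".format(dataset_url, response, constraint)


# Placeholder URL, not a real dataset.
DATASET = "http://example.org/thredds/dodsC/demo/temperature.nc"

# First 2000 time steps, every 10th point of the first 150 along distance.
constraint = hyperslab("temperature", (0, 2000, 1), (0, 150, 10))
url = subset_url(DATASET, constraint)
print(url)
```

Since the server evaluates the constraint, only the requested block travels over the network, which matters for files of the size mentioned on the slide.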
  • 16. OPeNDAP python example
      import matplotlib.pyplot as plt
      from pydap.client import open_url

      year = '2008'
      month = '08'
      myurl = ('http://opendap.tudelft.nl/thredds/dodsC/data2/darelux/maisbich/Tcalibrated/'
               + year + '/' + month + '/Tcalibrated' + year + '_' + month + '.nc')
      dataset = open_url(myurl)           # make connection
      print(dataset.keys())               # inspect dataset
      T = dataset['temperature']          # choose a variable
      print(T.shape)                      # inspect the dimensions of this variable
      T_red = T[:2000, :150]              # take only a part (subset is made server-side)
      T_temp = T_red.array
      T_time = T_red.time
      T_dist = T_red.distance
      mesh = plt.pcolormesh(T_dist[:], T_time[:], T_temp[:])   # let’s make a nice plot
      mesh.axes.set_title('water temperature Maisbich [deg C]')
      mesh.axes.set_xlabel('distance [m]')
      mesh.axes.set_ylabel('time [days since ' + year + '-' + month + '-01T00:00:00]')
      mesh.figure.colorbar(mesh)
      mesh.figure.savefig('maisbich-' + year + '-' + month + '.png')
      mesh.figure.clf()
  • 17. OPeNDAP catalogs
      Datasets are organized in catalogs (catalog.xml)
      • Usually (not necessarily) maps to a folder
      • Contains location, size, date, available services of datasets
      Catalogs are our hook to Fedora: catalog.xml -> Fedora object
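The catalog fields mentioned above (location, size, date) can be read with a standard XML parser. The sketch below parses an invented sample catalog. The namespace is that of the THREDDS InvCatalog 1.0 schema, while the dataset names, sizes and dates are made up and the real 3TU.DC catalogs may carry more elements.

```python
import xml.etree.ElementTree as ET

THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
NS = {"t": THREDDS_NS}

# Invented sample; real catalogs are served by THREDDS as catalog.xml.
SAMPLE = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" name="demo">
  <service name="odap" serviceType="OpenDAP" base="/thredds/dodsC/"/>
  <dataset name="2008" ID="demo/2008">
    <dataset name="T2008_08.nc" ID="demo/2008/T2008_08.nc" urlPath="demo/2008/T2008_08.nc">
      <dataSize units="Mbytes">12.5</dataSize>
      <date type="modified">2012-01-15T10:00:00Z</date>
    </dataset>
  </dataset>
</catalog>
"""


def list_datasets(catalog_xml):
    """Return one dict per dataset that has a urlPath (i.e. an actual file)."""
    root = ET.fromstring(catalog_xml.strip())
    found = []
    for ds in root.iter("{%s}dataset" % THREDDS_NS):
        url_path = ds.get("urlPath")
        if url_path is None:
            continue  # container dataset (folder level), no data of its own
        size = ds.find("t:dataSize", NS)
        date = ds.find("t:date", NS)
        found.append({
            "name": ds.get("name"),
            "urlPath": url_path,
            "size": None if size is None else (float(size.text), size.get("units")),
            "modified": None if date is None else date.text,
        })
    return found


datasets = list_datasets(SAMPLE)
print(datasets)
```

A record like this holds what an ingest script would need to create or update the matching Fedora object.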
  • 18. OPeNDAP – Fedora integration
  • 19. Typical bulk ingest For predictable data structures (e.g. a 2TB disk with data delivered every 3 months, structured in a well-agreed manner):
  • 20. Bulk ingest from datalab [future?] Less predictable data structures (e.g. datalab which lifts barrier after embargo period):
  • 21. THANK YOU QQ? data.3tu.nl