Ontology-based multi-domain metadata for research data management using triple stores

A presentation given at the IDEAS 2014 conference on database modelling with triple stores for research data management.

IDEAS '14, July 07 - 09 2014, Porto, Portugal.

Paper Abstract:

Most current research data management solutions rely on a fixed set of descriptors (e.g. Dublin Core Terms) for the description of the resources that they manage. While these are easy to understand and use, their semantics are limited to general concepts, leaving out domain-specific metadata and representing values as sets of text values. While this enables retrieval through free-text search, faceted search and dataset interlinking become limited. From the point of view of the relational database schema modeler, designing a more flexible metadata model represents a non-trivial challenge because of the open nature of the model. This work demonstrates the approaches followed by current open-source platforms and proposes a graph-based model for achieving modular, ontology-based metadata for interlinked data assets in the Semantic Web. This proposed model was implemented in a collaborative research data management platform currently under development at the University of Porto.

  1. Ontology-based multi-domain metadata for research data management using triple stores
     João Rocha da Silva, joaorosilva@gmail.com, Faculdade de Engenharia da Universidade do Porto / INESC TEC
     Cristina Ribeiro, mcr@fe.up.pt, DEI, Faculdade de Engenharia da Universidade do Porto / INESC TEC
     João Correia Lopes, jlopes@fe.up.pt
     IDEAS '14, July 07 - 09 2014, Porto, Portugal
  2. Contents
     • Diverse metadata: relational modeling challenges
     • Current approaches built on relational databases
     • Dendro: graph-based research data management
     • Live demo
     • Conclusions
  3. Problem: diverse metadata
     Relational modeling challenges
  4. Analytical Chemistry Dataset
     • Generic: Author, Description, Creation date, …
     • Domain Specific: Sample Count, Analysed Substance, …
     Mechanical Engineering Dataset
     • Generic: Author, Description, Creation date, …
     • Domain Specific: Initial Crack Length, Specimen Type, …
     …
  5. Common challenges in RDB schema modeling
     • Entities with unknown attributes at time of modeling
     • Time-variant attribute values
     • Inheritance / sub-class mapping
     • Resource hierarchies (parents of parents…)
     • Schemas rely on external documentation
  6. Data management and description platforms
     Study of relational models
  7. DSpace
     • Academic publications management platform
     • Not targeted specifically at data
     • More than 1000 active installations
     • Mature open-source codebase
  8. DSpace
     • Designed for self-deposit by common users
     • Good deposit workflow (validation, licensing…)
  9. [Screenshot] U.Porto Open Repository homepage (http://repositorio-aberto.up.pt), powered by DSpace
  10. [Screenshot] A thesis record in the repository (http://repositorio-aberto.up.pt/handle/10216/58508), powered by DSpace
  11. [Diagram] Entities: Bitstream, Metadata Schema, Metadata Descriptor, Item, metadata value (associations labeled with 1 and * multiplicities)
  12. DSpace
  13. New requirements
     • Metadata profiles for objects other than Items
     • Descriptor hierarchy for specialization
     • Collaborative schema derivation
     • Validation of metadata completeness against different schemas
     • Restricting possible metadata for each type of resource
  15. CKAN
     • Open-source data publishing platform
     • Deposit requires minimal metadata at first
     • Flexible metadata model
     • Open-Source
  18. [Figure, built up over slides 18 to 24; source: CKAN] Database schema, annotated with:
     • Entity with variable, time-dependent attributes
     • Fixed attrs.
     • Attribute name
     • Value (always varchar)
     • Timestamps
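
The pattern annotated on these slides (an entity with fixed attributes, plus a side table holding attribute name, a varchar value and timestamps) can be sketched roughly as follows. This is only an illustration using Python's built-in `sqlite3`; the table and column names (`package`, `package_extra`, `created`) and the sample values are assumptions made for the example, not CKAN's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Entity with fixed attributes (hypothetical names).
    CREATE TABLE package (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL
    );
    -- One row per variable attribute value; every value is stored as text.
    CREATE TABLE package_extra (
        package_id INTEGER NOT NULL REFERENCES package(id),
        key        TEXT NOT NULL,   -- attribute name
        value      TEXT,            -- always text, whatever the real type
        created    TEXT NOT NULL    -- timestamp of this value
    );
""")

conn.execute("INSERT INTO package (id, name) VALUES (1, 'dcb-base-data')")
conn.executemany(
    "INSERT INTO package_extra VALUES (?, ?, ?, ?)",
    [
        (1, "initialCrackLength", "270", "2014-01-01T10:00:00"),
        (1, "initialCrackLength", "280", "2014-01-02T09:30:00"),
        (1, "specimenWidth", "120", "2014-01-01T10:00:00"),
    ],
)

def current_extras(package_id):
    """Return the latest value of each attribute, still as strings."""
    rows = conn.execute(
        """
        SELECT e.key, e.value
        FROM package_extra AS e
        WHERE e.package_id = ?
          AND e.created = (
              SELECT MAX(created) FROM package_extra
              WHERE package_id = e.package_id AND key = e.key
          )
        ORDER BY e.key
        """,
        (package_id,),
    ).fetchall()
    return dict(rows)

extras = current_extras(1)
```

Because every value comes back as a string, any numeric comparison or unit handling has to be done by the application, which is one of the compromises this kind of model makes.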
  25. Invenio
     • Software behind Zenodo, a data publishing portal
     • Static metadata model
     • Very complex relational schema generated by business logic code
     • Tight coupling between DB and code
     • Open-Source
  27. 541 tables, no FKs
  30. Ontologies
     Semantic annotation for richer metadata
  32. [Figure, built up over slides 32 to 37] Description of the resource http://dendro.fe.up.pt/project/datanotes/data/base%20data.xls as a graph:
     • nie:isLogicalPartOf: http://dendro.fe.up.pt/project/datanotes/data
     • rdf:type: nie:File
     • dc:title: “Base data of the DCB experiments”
     • nie:title: base data.xls
     • dcb:initialCrackLength: base data.xls (label as shown on the slide)
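
The description built up on these slides amounts to a handful of subject-predicate-object statements. A minimal sketch in plain Python (no RDF library; triples are just tuples, and the `describe` helper is hypothetical) suggests why adding a domain-specific descriptor needs no schema change. The value "280mm" is used purely for illustration.

```python
FILE = "http://dendro.fe.up.pt/project/datanotes/data/base%20data.xls"
FOLDER = "http://dendro.fe.up.pt/project/datanotes/data"

# Each statement is a (subject, predicate, object) tuple, as in the slides.
triples = {
    (FILE, "nie:isLogicalPartOf", FOLDER),
    (FILE, "rdf:type", "nie:File"),
    (FILE, "dc:title", "Base data of the DCB experiments"),
    (FILE, "nie:title", "base data.xls"),
}

def describe(subject, store):
    """Collect every predicate/object pair recorded for one subject."""
    description = {}
    for s, p, o in store:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description

# Adding a domain-specific descriptor is just one more triple
# (the value "280mm" is illustrative).
triples.add((FILE, "dcb:initialCrackLength", "280mm"))

description = describe(FILE, triples)
```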
  38. Semantic MediaWiki
     • Semantic extension of MediaWiki, the code behind Wikipedia
     • Semantic links between pages
     • Uses ontologies
     • Strong emphasis on page versioning
     • DB schema built around the time dimension
  39. Loading an ontology
  40. Describing a resource
  41. Semantic Forms (slides 41 to 43)
     From DataNotes + UPBox
     http://purl.pt/24107/1/iPres2013_PDF/UPBox%20and%20DataNotes%20a%20collaborative%20data%20management%20environment%20for%20the%20long%20tail%20of%20research%20data.pdf
  44. [Figure; source: MediaWiki]
  45. [Figure; source: MediaWiki] “Old Versions” aka “copy everything and add a timestamp”
  46. [Figure; source: MediaWiki] Now imagine we want images of different kinds, with different attributes…
  47. [Diagram] Redundancy…: Relational Database (MySQL), Mapping Logic, Triple Store (Apache Jena)
  48. [Table] CKAN, DSpace, Invenio and Semantic MediaWiki compared on: Time, Flexible attributes, Wide use, DB-code coupling
  49. Issues review
     • Entities with unknown attributes at time of modeling
     • Time-variant attribute values
     • Inheritance / sub-classing
     • Hierarchies (parents of parents of parents…)
     • Need for external documentation
  50. Dendro
     A graph-based data management platform
  51. Graph databases
     • Represent entities (Users, Products, Places…) as vertices (entity types are called classes)
     • Connections between them are directed graph edges (edge types are called properties)
     • The meaning of these connections is expressed in ontologies that can be shared and reused
  52. Getting all my Projects
     • Will fetch all the projects created by the user
     • Will also return their attributes (“database columns”)
     • Different projects may have different attributes
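
The query shown on this slide is not part of the transcript, so what follows is only a sketch of the idea in plain Python: find every resource typed as a project and created by a given user, then return whatever attributes each one happens to have. All identifiers (`ex:Project`, `ex:user1`, the sample projects and their titles) are hypothetical.

```python
# Hypothetical data: two projects by the same user with different attributes.
triples = [
    ("ex:p1", "rdf:type", "ex:Project"),
    ("ex:p1", "dc:creator", "ex:user1"),
    ("ex:p1", "dc:title", "DCB experiments"),
    ("ex:p1", "dcb:specimenWidth", "120"),
    ("ex:p2", "rdf:type", "ex:Project"),
    ("ex:p2", "dc:creator", "ex:user1"),
    ("ex:p2", "dc:title", "Chemistry samples"),
    ("ex:p3", "rdf:type", "ex:Project"),
    ("ex:p3", "dc:creator", "ex:user2"),
]

def match(store, s=None, p=None, o=None):
    """Yield triples matching a pattern; None acts as a wildcard."""
    for t in store:
        if (s is None or t[0] == s) and (p is None or t[1] == p) \
                and (o is None or t[2] == o):
            yield t

def projects_of(store, user):
    """Return {project: {predicate: object}} for projects created by user."""
    typed = {s for s, _, _ in match(store, p="rdf:type", o="ex:Project")}
    mine = {s for s, _, _ in match(store, p="dc:creator", o=user)}
    return {
        proj: {p: o for _, p, o in match(store, s=proj)}
        for proj in sorted(typed & mine)
    }

my_projects = projects_of(triples, "ex:user1")
```

The two projects come back with different sets of keys, which is the point of the slide: no shared column list has to be agreed on in advance.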
  53. Inference
     • Transitive properties
     • Subclasses
     • Multiple inheritance (a resource can be a Folder and a Dataset at the same time)
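
As a rough illustration of these inference patterns, the sketch below computes a transitive closure over a containment property and resolves types through subclass links, in plain Python. The class hierarchy and resource names (`ex:Dataset`, `ex:Folder`, `ex:Resource`, `ex:d1` and so on) are assumptions made for the example; a real triple store would apply such rules through its own reasoner.

```python
def transitive_closure(edges, start):
    """All nodes reachable from start by following edges repeatedly."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# nie:isLogicalPartOf treated as transitive: file -> folder -> project.
part_of = {
    "ex:file1": ["ex:d1"],
    "ex:d1": ["ex:project1"],
}

# Subclass links (hypothetical hierarchy).
sub_class_of = {
    "ex:Dataset": ["ex:Resource"],
    "ex:Folder": ["ex:Resource"],
}

# Multiple inheritance: one resource asserted with two types.
asserted_types = {"ex:d1": ["ex:Folder", "ex:Dataset"]}

def inferred_types(resource):
    """Asserted types plus every superclass reachable from them."""
    types = set(asserted_types.get(resource, ()))
    for t in list(types):
        types |= transitive_closure(sub_class_of, t)
    return types

ancestors = transitive_closure(part_of, "ex:file1")
types_of_d1 = inferred_types("ex:d1")
```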
  54. Loading an ontology
     • Load ontology straight from the web
     • No platform-specific syntax (like in SMW)
  55. Nothing comes for free
     • Aggregation operators are slow
     • No ACID properties
     • Transactions are not supported in standard SPARQL
     • (“SPARQL 1.1 Query/Update Services should be atomic but that they are not required to be atomic.”)
     • Graph DBMS solutions are in early stages (many bugs, many “beta”s, many mailing lists…)
  56. Dendro
     • Dropbox and File/Folder description platform
     • Variable descriptions
     • Time-dependent values
     • Directory structures (hierarchy)
     • Need for simple querying…
  57. [Figure, shown on slides 57 to 62] Graph of a versioned dataset description. Labels on the figure:
     • Nodes: Pn, Dn, Dn-1, Fn, Un, C
     • Properties and values: nie:isLogicalPartOf, dc:title (“DCB Base Data”), dcb:specimenWidth (120), dcb:initialCrackLength (280mm), dc:isReferencedBy, dc:isVersionOf, dc:created (01/01/2014^^xsd:date), dc:modified (01/01/2014^^xsd:date), dc:creator, ddr:pertainsTo, ddr:changedDescriptor (ddr:initialCrackLength), ddr:operation (“add”)
     • Annotations: Current dataset version, Past Revisions, Change recording, Added property instance, Changed modification timestamp, Revision creation timestamp
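
One way to read this figure: when a descriptor is added to a dataset, the previous state is kept as a revision linked by dc:isVersionOf, and a change record notes which descriptor changed and how. The sketch below follows that reading in plain Python. Which property attaches to which node is an assumption (the transcript preserves labels, not edges), and the identifiers (`ex:Dn`, `ex:Un`, the revision and change URIs) and the `add_descriptor` helper are hypothetical.

```python
import datetime

def add_descriptor(store, dataset, predicate, value, user, today, rev_id):
    """Add a descriptor to `dataset`, archiving its previous state first."""
    revision = f"{dataset}/revision/{rev_id}"
    change = f"{revision}/change/1"
    # Copy the current description into the revision node.
    snapshot = [(revision, p, o) for s, p, o in store if s == dataset]
    store.extend(snapshot)
    store += [
        (revision, "dc:isVersionOf", dataset),
        (revision, "dc:created", today.isoformat()),
        (revision, "dc:creator", user),
        # Change record: which descriptor changed and how.
        (change, "ddr:pertainsTo", revision),
        (change, "ddr:changedDescriptor", predicate),
        (change, "ddr:operation", "add"),
    ]
    # Update the current version: new value, refreshed modification date.
    store[:] = [t for t in store
                if not (t[0] == dataset and t[1] == "dc:modified")]
    store += [(dataset, predicate, value),
              (dataset, "dc:modified", today.isoformat())]
    return revision

store = [
    ("ex:Dn", "dc:title", "DCB Base Data"),
    ("ex:Dn", "dcb:specimenWidth", "120"),
]
rev = add_descriptor(store, "ex:Dn", "dcb:initialCrackLength", "280mm",
                     "ex:Un", datetime.date(2014, 1, 1), 1)
```

After the call, the current version carries the new descriptor while the revision still holds only the title and specimen width, so past states remain queryable like any other resource.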
  63. Demo
     Dendroβ
  64. Conclusions
     • Recording rich metadata requires data model flexibility
     • Unknown attributes, time-variant information or hierarchies can be hard to model in a relational database
     • Several current solutions make compromises due to their relational database layer
  65. Conclusions (cont’d)
     • Graph-based models are more flexible and easily extensible through ontology loading
     • Ontologies are shareable on the web, and document the database “schema”
     • Queries become simpler because the graph model handles scenarios that are challenging for RDBs
     • Dendro is a collaborative data management platform fully built on a graph model
  66. João Rocha da Silva (Research Data Management and Semantic Web Researcher, Web & iPhone Developer) is an Informatics Engineering PhD student at the Faculty of Engineering of the University of Porto. He specializes in research data management, applying the latest Semantic Web technologies to the adequate preservation and discovery of research data assets. He is also an experienced freelance iOS developer with several apps published on the App Store, and a self-taught DIY mechanic with a special interest in classic cars, particularly his 1987 Toyota Corolla GT Twin Cam, also known as Hachi-Roku or AE86.

     João Correia Lopes is an Assistant Professor in Informatics Engineering at Universidade do Porto and a researcher at INESC TEC. He graduated in Electrical Engineering from the University of Porto in 1984 and holds a PhD in Computing Science from Glasgow University (1997). His teaching includes undergraduate and graduate courses in databases and web applications, software engineering and object-oriented programming, markup languages and semantic web. He has been involved in research projects in the area of long-term preservation, service-oriented architectures and e-Science. Currently his main research interests are e-Science and the management of research data.

     Cristina Ribeiro is an Assistant Professor in Informatics Engineering at Universidade do Porto and a researcher at INESC TEC. She graduated in Electrical Engineering, and holds a Master in Electrical and Computer Engineering and a Ph.D. in Informatics. Her teaching includes undergraduate and graduate courses in information retrieval, digital libraries, knowledge representation and markup languages. She has been involved in research projects in the areas of cultural heritage, multimedia databases and information retrieval. Currently her main research interests are information retrieval, digital preservation and the management of research data.
