Incremental Export of Relational Database Contents into RDF Graphs
Nikolaos Konstantinou, Dimitris Kouis, Nikolas Mitrou

In addition to tools offering RDF views over databases, a variety of tools exist that allow exporting database contents into RDF graphs; in many cases, these tools have been shown to perform better than the former. However, when database contents are exported into RDF, it is not always optimal or even necessary to dump the whole database contents every time. This paper investigates the problem of incremental generation and storage of the resulting RDF graph. An implementation of the R2RML standard is used to express mappings that associate tuples from the source database with triples in the resulting RDF graph. Next, a methodology is proposed that enables incremental generation and storage of an RDF graph based on a source relational database, and it is evaluated through a set of performance measurements. Finally, the authors’ most important findings and conclusions are discussed.


  1. Incremental Export of Relational Database Contents into RDF Graphs
     Nikolaos Konstantinou, Dimitris Kouis, Nikolas Mitrou
     By Dr. Nikolaos Konstantinou
     National Technical University of Athens, School of Electrical and Computer Engineering
     Multimedia, Communications & Web Technologies
     4th International Conference on Web Intelligence, Mining and Semantics
  2. Outline
     • Introduction
     • Proposed Approach and Measurements
     • Discussion and Conclusions
  3. Introduction
     • Information collection, maintenance and update do not always take place directly at a triplestore, but at an RDBMS
     • Triplestores are often kept as an alternative content delivery channel
     • It can be difficult to change established methodologies and systems
     • Newer technologies need to operate side by side with existing ones before migration
  4. Mapping Relational Data to RDF
     • No one-size-fits-all approach
     • Synchronous vs. asynchronous RDF views
     • Real-time vs. ad hoc RDF views
     • Real-time SPARQL-to-SQL vs. querying the RDF dump using SPARQL
       • Queries on the RDF dump are faster in certain conditions, compared to round-trips to the database
       • The difference in performance is more visible when SPARQL queries involve numerous triple patterns (which translate to expensive JOIN statements)
     • In this paper, we focus on the asynchronous approach
       • Exporting (dumping) relational database contents into an RDF graph
  5. Incremental Export into RDF (1/2)
     • Problem
       • Avoid dumping the whole database contents every time
       • When only a small part of the data changes in the source database, it is not necessary to dump the entire database
     • Approach
       • Every time the RDF export is materialized
         • Detect the changes in the source database or the mapping definition
         • Insert/delete/update only the necessary triples, in order to reflect these changes in the resulting RDF graph
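
The goal stated on this slide (touch only the triples that changed) can be illustrated with a minimal sketch. It assumes both versions of the graph fit in memory as sets of (subject, predicate, object) tuples, and it models an update as one deletion plus one insertion. The names `diff_triples`, `old_graph` and `new_graph` are hypothetical; this is not the R2RML Parser's own code, which, as a later slide notes, works at the granularity of whole Triples Maps.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def diff_triples(old: Set[Triple], new: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (to_delete, to_insert): the changes that turn `old` into `new`."""
    return old - new, new - old


# Hypothetical graph before and after one name changed in the source database.
old_graph = {
    ("ex:person/1", "foaf:name", '"John Smith"'),
    ("ex:person/2", "foaf:name", '"Jane Doe"'),
}
new_graph = {
    ("ex:person/1", "foaf:name", '"John A. Smith"'),
    ("ex:person/2", "foaf:name", '"Jane Doe"'),
}

to_delete, to_insert = diff_triples(old_graph, new_graph)
```

Only one triple is deleted and one inserted; the unchanged triple for `ex:person/2` is never rewritten.
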
  6. Incremental Export into RDF (2/2)
     • Incremental transformation
       • Each time the transformation is executed, only the information that changed should be transformed into RDF, not all of the information that lies in the database
     • Incremental storage
       • Storing (persisting) to the destination RDF graph only the triples that were modified and not the whole graph
       • Only when the resulting RDF graph is stored in a relational database or using Jena TDB
       • Regardless of whether the transformation took place fully or incrementally
  7. R2RML and Triples Maps
     • RDB to RDF Mapping Language
     • A W3C Recommendation, as of 2012
     • Triples Map: a reusable mapping definition
       • Specifies a rule for translating each row of a logical table to zero or more RDF triples
       • A logical table is a tabular SQL query result set that is to be mapped to RDF triples
       • Execution of a triples map generates the triples that originate from a specific result set (logical table)
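
The idea of translating each row of a logical table to zero or more triples can be sketched in a few lines. This is a simplified illustration, not an R2RML processor: the function `apply_triples_map`, the column names and the subject template are all hypothetical, rows are plain dictionaries, and every object is emitted as a plain literal. It assumes the column used in the subject template is never NULL.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Optional[object]]
Triple = Tuple[str, str, str]


def apply_triples_map(
    rows: Iterable[Row],
    subject_template: str,
    predicate_columns: List[Tuple[str, str]],
) -> List[Triple]:
    """Apply one mapping rule to every row of a logical table."""
    triples: List[Triple] = []
    for row in rows:
        # Build the subject IRI from the row (the key column is assumed non-NULL).
        subject = subject_template.format(**row)
        for predicate, column in predicate_columns:
            value = row.get(column)
            if value is None:
                # A NULL cell yields no triple, hence "zero or more" per row.
                continue
            triples.append((subject, predicate, f'"{value}"'))
    return triples


# Hypothetical logical table (the result set of some SELECT query).
logical_table = [
    {"eperson_id": 1, "firstname": "John", "email": None},
    {"eperson_id": 2, "firstname": "Jane", "email": "jane@example.org"},
]

triples = apply_triples_map(
    logical_table,
    "<http://data.example.org/repository/person/{eperson_id}>",
    [("foaf:firstName", "firstname"), ("foaf:mbox", "email")],
)
```

The first row contributes one triple because its `email` is NULL, and the second row contributes two.
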
  8. The R2RML Parser
     • An R2RML implementation
     • Command-line tool that can export relational database contents as RDF graphs, based on an R2RML mapping document
     • Open-source (CC BY-NC), written in Java
       • Publicly available at https://github.com/nkons/r2rml-parser
       • Tested against MySQL and PostgreSQL
     • Output can be written in RDF/OWL
       • N3, Turtle, N-Triples, TTL, RDF/XML notation, or Jena TDB backend
     • Covers most (not all) of the R2RML constructs (see the wiki)
     • Does not offer SPARQL-to-SQL translations
  9. Outline
     • Introduction
     • Proposed Approach and Measurements
     • Discussion and Conclusions
  10. Information Flow
     • Parse the source database contents into result sets
     • According to the R2RML Mapping File, the Parser generates a set of instructions to the Generator
     • The Generator instantiates in-memory the resulting RDF graph
     • Persist the generated RDF graph into
       • An RDF file on the hard disk, or
       • Jena’s relational database, or
       • Jena’s TDB (Tuple Data Base, a custom implementation of B+ trees)
     • Log the results
     [Diagram labels: R2RML Parser (Parser, Generator); Mapping file; Source database; RDF graph; Hard Disk; Target database; TDB]
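
The flow on this slide (query the source database, build the graph in memory, persist it) can be sketched end to end. The sketch makes several assumptions: an in-memory SQLite database stands in for the MySQL or PostgreSQL source, the mapping is hard-coded instead of being read from an R2RML file, and the output is written as N-Triples to a text stream. The function `export_ntriples` and the two-column `eperson` table are hypothetical.

```python
import io
import sqlite3

FOAF_NAME = "<http://xmlns.com/foaf/0.1/name>"


def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples requires inside a quoted literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def export_ntriples(conn: sqlite3.Connection, out: io.TextIOBase) -> int:
    """Run the SELECT, map each row to one triple, write one line per triple."""
    rows = conn.execute(
        "SELECT eperson_id, firstname, lastname FROM eperson ORDER BY eperson_id"
    ).fetchall()
    for eperson_id, firstname, lastname in rows:
        subject = f"<http://data.example.org/repository/person/{eperson_id}>"
        name = escape_literal(f"{firstname} {lastname}")
        out.write(f'{subject} {FOAF_NAME} "{name}" .\n')
    return len(rows)


# Stand-in source database with two rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eperson (eperson_id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT)"
)
conn.executemany(
    "INSERT INTO eperson VALUES (?, ?, ?)", [(1, "John", "Smith"), (2, "Jane", "Doe")]
)

buffer = io.StringIO()
written = export_ntriples(conn, buffer)
ntriples = buffer.getvalue()
```

Replacing `buffer` with an open file gives the "RDF file on the hard disk" case; the database and TDB targets would need a triplestore API and are not shown.
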
  11. Incremental RDF Triple Generation
     • Basic challenge
       • Discover, since the last time the incremental RDF generation took place
         • Which database tuples were modified
         • Which Triples Maps were modified
       • Then, perform the mapping only for this altered subset
     • Ideally, we should detect the exact changed database cells and modify only the respectively generated elements in the RDF graph
       • However, using R2RML, the atom of the mapping definition becomes the Triples Map
     [Diagram: a result set (columns a to e) is translated by a Triples Map into triples (s, p, o)]
  12. Keeping Track
     • Reification
       • Allows assertions about RDF statements
       • “Reified” model
         • A model that contains only reified statements
         • Stores the Triples Map URI that produced each triple
     • Logging
       • Store MD5 hashes of
         • Triples Maps, SELECT queries, Result sets
       • A change in any of the hashes triggers execution of the Triples Map
     [Diagram: a triple (subject, property, object) and its reified form, a blank node of rdf:type rdf:Statement with rdf:subject, rdf:predicate, rdf:object, and dc:source pointing to the Triples Map URI]
     Example triple:
       <http://data.example.org/repository/person/1> foaf:name "John Smith" .
     Its reified form:
       [] a rdf:Statement ;
          rdf:subject <http://data.example.org/repository/person/1> ;
          rdf:predicate foaf:name ;
          rdf:object "John Smith" ;
          dc:source map:persons .
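
The reified model on this slide can be sketched as follows. The sketch represents every statement as a plain tuple of strings and generates blank node labels from a counter; `reify` and `triples_from` are hypothetical helper names, not functions of the R2RML Parser or of Jena. The second helper shows what the provenance link makes possible: finding every triple that a given Triples Map produced, so that only those need to be replaced when that map is executed again.

```python
from itertools import count
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

_blank_ids = count(1)  # source of fresh blank node labels


def reify(triple: Triple, triples_map_uri: str) -> List[Triple]:
    """Describe `triple` as an rdf:Statement that records its Triples Map."""
    subject, predicate, obj = triple
    node = f"_:stmt{next(_blank_ids)}"
    return [
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", subject),
        (node, "rdf:predicate", predicate),
        (node, "rdf:object", obj),
        (node, "dc:source", triples_map_uri),
    ]


def triples_from(reified_model: List[Triple], triples_map_uri: str) -> List[Triple]:
    """Recover the plain triples that one Triples Map produced."""
    by_node: Dict[str, Dict[str, str]] = {}
    for node, prop, value in reified_model:
        by_node.setdefault(node, {})[prop] = value
    return [
        (d["rdf:subject"], d["rdf:predicate"], d["rdf:object"])
        for d in by_node.values()
        if d.get("dc:source") == triples_map_uri
    ]


person_triple = ("<http://data.example.org/repository/person/1>", "foaf:name", '"John Smith"')
item_triple = ("<http://data.example.org/repository/item/7>", "dc:title", '"A title"')

model = reify(person_triple, "map:persons") + reify(item_triple, "map:items")
from_persons = triples_from(model, "map:persons")
```

Each original triple costs five statements in the reified model, which is consistent with the later observation that the initial incremental export takes longer than a non-incremental one.
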
  13. Proposed Approach
     • For each Triples Map in the Mapping Document
       • Decide whether we have to produce the resulting triples, based on the logged MD5 hashes
     • Dumping to the Hard Disk
       • Initially, generate a number of RDF triples
       • RDF triples are logged as reified statements, followed by a provenance note
       • Incremental generation
         • In subsequent executions, modify the existing reified model, by reflecting only the changes in the source database
     • Dumping to the Database or TDB
       • No log is needed, storage is incremental by default
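
The per-map decision described above can be sketched with Python's `hashlib`. The sketch assumes the log is a dictionary from Triples Map URI to three MD5 hashes (mapping definition, SELECT query, result set) and that result rows arrive in a stable order. The names `fingerprint` and `maps_to_execute`, the log layout, and the way rows are serialized before hashing are all hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def fingerprint(map_definition: str, select_query: str, rows: Iterable[Tuple]) -> Dict[str, str]:
    """Hash the three things whose change should trigger a Triples Map."""
    return {
        "map": md5_hex(map_definition),
        "query": md5_hex(select_query),
        "rows": md5_hex(repr(list(rows))),  # rows assumed to be in a stable order
    }


def maps_to_execute(
    current: Dict[str, Dict[str, str]], logged: Dict[str, Dict[str, str]]
) -> List[str]:
    """Return the Triples Maps whose hashes differ from the logged ones."""
    return [uri for uri, hashes in current.items() if logged.get(uri) != hashes]


# Hypothetical state logged by the previous run.
logged = {
    "map:persons": fingerprint("persons map v1", "SELECT * FROM eperson", [(1, "John Smith")]),
    "map:items": fingerprint("items map v1", "SELECT * FROM item", [(7, "Old title")]),
}
# State observed now: one item title changed in the source database.
current = {
    "map:persons": fingerprint("persons map v1", "SELECT * FROM eperson", [(1, "John Smith")]),
    "map:items": fingerprint("items map v1", "SELECT * FROM item", [(7, "New title")]),
}

pending = maps_to_execute(current, logged)
```

A Triples Map that is missing from the log (for example on the first run) has no logged hashes to compare against, so it is always selected.
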
  14. Measurements Setup
     • An Ubuntu server, 2GHz dual-core, 4GB RAM
     • Oracle Java 1.7, PostgreSQL 9.1, MySQL 5.5.32
     • 7 DSpace (dspace.org) repositories
       • 1k, 5k, 10k, 50k, 100k, 500k, 1m items, respectively
     • A set of complicated, a set of simplified, and a set of simple queries
     • In order to deal with database caching effects, the queries were run several times prior to performing the measurements
  15. Query Sets
     • Complicated
       • 3 expensive JOIN conditions among 4 tables
       • 4 WHERE clauses
         SELECT i.item_id AS item_id, mv.text_value AS text_value
         FROM item AS i, metadatavalue AS mv, metadataschemaregistry AS msr, metadatafieldregistry AS mfr
         WHERE msr.metadata_schema_id=mfr.metadata_schema_id
           AND mfr.metadata_field_id=mv.metadata_field_id
           AND mv.text_value IS NOT NULL
           AND i.item_id=mv.item_id
           AND msr.namespace='http://dublincore.org/documents/dcmi-terms/'
           AND mfr.element='coverage'
           AND mfr.qualifier='spatial'
     • Simplified
       • 2 JOIN conditions among 3 tables
       • 2 WHERE clauses
         SELECT i.item_id AS item_id, mv.text_value AS text_value
         FROM item AS i, metadatavalue AS mv, metadatafieldregistry AS mfr
         WHERE mfr.metadata_field_id=mv.metadata_field_id
           AND i.item_id=mv.item_id
           AND mfr.element='coverage'
           AND mfr.qualifier='spatial'
     • Simple
       • No JOIN or WHERE conditions
         SELECT "language", "netid", "phone", "sub_frequency", "last_active", "self_registered", "require_certificate", "can_log_in", "lastname", "firstname", "digest_algorithm", "salt", "password", "email", "eperson_id"
         FROM "eperson"
         ORDER BY "language"
  16. Measurements (1/3)
     • Export to the Hard Disk
       • Simple and complicated queries, initial export
       • Initial incremental dumps take more time than non-incremental, as the reified model also has to be created
     [Chart: x-axis source database rows (1000, 5000, 10000); series non-incremental, incremental]
     [Chart: x-axis result model triples (3000, 15000, 30000, 150000, 300000); series non-incremental, incremental]
  17. Measurements (2/3)
     • 12 Triples Maps
       a. non-incremental mapping transformation
       b. incremental, for the initial time
       Data change:
       c. 0/12
       d. 1/12
       e. 3/12
       f. 6/12
       g. 9/12
       h. 12/12
     [Chart: cases a to h; series 1000 items, 5000 items, 10000 items]
     [Chart: cases a to h; series complicated mappings, simpler mappings]
  18. Measurements (3/3)
     • Similar overall behavior in cases of up to 3 million triples
     • Poor relational database performance
     • Jena TDB is the optimal approach regarding scalability
     [Charts, labelled ~180k triples and ~1.8m triples: cases a to h with series RDF file, database, Jena TDB; cases b to h with series database, Jena TDB]
  19. Outline
     • Introduction
     • Proposed Approach and Measurements
     • Discussion and Conclusions
  20. Discussion (1/2)
     • The approach is efficient when data freshness is not crucial and/or selection queries over the contents are more frequent than the updates
     • The task of exposing database contents as RDF could be considered similar to the task of maintaining search indexes next to text content
     • Third party software systems can operate completely based on the exported graph
       • E.g. using Fuseki, Sesame, Virtuoso
     • Updates can be pushed or pulled from the database
     • TDB is the optimal solution regarding scaling
     • Caution is still needed in producing de-referenceable URIs
  21. Discussion (2/2)
     • On the efficiency of the approach for storing on the Hard Disk
       • Good results for mappings (or queries) that include (or lead to) expensive SQL queries
         • E.g. with numerous JOIN statements
       • For changes that can affect as much as ¾ of the source data
     • Limitations
       • By physical memory
         • Scales up to several millions of triples, does not qualify as “Big Data”
       • Formatting of the logged reified model did affect performance
         • RDF/XML and TTL try to pretty-print the result, consuming extra resources; N-Triples is the optimal choice
  22. Room for Improvement
     • Hashing result sets is expensive
       • Requires re-running the query and adds an “expensive” ORDER BY clause
     • Further study of the impact of SQL complexity on performance
     • Reification is currently being reconsidered in RDF 1.1 semantics
       • Named graphs being the successor
     • Investigation of two-way updates
       • Send changes from the triplestore back to the database
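
The slide does not say why hashing a result set needs an ORDER BY clause. One plausible reason, assumed here, is that a hash computed over rows in the order they are returned changes whenever the database returns the same rows in a different order, which would trigger a Triples Map even though no data changed. The sketch below illustrates that effect; `hash_rows` and the sample rows are hypothetical, and Python's `sorted` stands in for the SQL ORDER BY.

```python
import hashlib
from typing import Iterable, Tuple


def hash_rows(rows: Iterable[Tuple]) -> str:
    """MD5 over the rows in the order given; the order is part of the input."""
    digest = hashlib.md5()
    for row in rows:
        digest.update(repr(tuple(row)).encode("utf-8"))
    return digest.hexdigest()


# The same two rows, returned in a different order by two executions.
rows_run_1 = [(1, "Athens"), (2, "Patras")]
rows_run_2 = [(2, "Patras"), (1, "Athens")]

unordered_equal = hash_rows(rows_run_1) == hash_rows(rows_run_2)
ordered_equal = hash_rows(sorted(rows_run_1)) == hash_rows(sorted(rows_run_2))
```

Without a fixed order the two hashes differ although the data is identical; with a fixed order they match, at the price of sorting the whole result set on every run.
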
  23. Thank you for your attention! Questions?
