RDF-Gen: Generating RDF from streaming and archival data
Georgios M. Santipantakis, Konstantinos I. Kotis, George A. Vouros, Christos Doulkeridis
Department of Digital Systems
University of Piraeus, Greece
3. Problem Definition
Given a set of data sources, archival or streaming, in various formats, we want a
framework capable of generating ontology-annotated RDF graphs with high throughput
and low latency.
We mainly focus on the following objectives:
O1 Inherently support the RDF generation of both streaming and archival datasets.
O2 Provide facilities for close-to-source data processing tasks, e.g. for data cleansing,
data manipulation and conversion, and generation of URIs.
O3 Support close-to-source link discovery functionality.
O4 Demonstrate computational efficiency in terms of high throughput and low
data-generation latency.
O5 Demonstrate the scalability necessary for the transformation of big data.
O6 Demonstrate extensibility, in the sense that (i) it can integrate custom data
processing and manipulation functions, and (ii) it can be instantiated to new data
formats.
O7 Support reusability of solutions across data sources of the same domain.
4. Related work
1. Anastasia Dimou, Miel Vander Sande, Pieter Colpaert, Ruben Verborgh, Erik Mannens, and Rik Van de Walle. 2014. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In Proceedings of the 7th Workshop on Linked Data on the Web (CEUR Workshop Proceedings).
2. Maxime Lefrançois, Antoine Zimmermann, and Noorani Bakerally. 2017. A SPARQL Extension for Generating RDF from Heterogeneous Formats. In Proceedings of ESWC 2017.
3. Jason Slepicka, Chengye Yin, Pedro Szekely, and Craig Knoblock. 2015. KR2RML: An Alternative Interpretation of R2RML for Heterogenous Sources. In Proceedings of the 6th International Workshop on Consuming Linked Data (COLD 2015).
4. Ademar Crotti Junior, Christophe Debruyne, and Declan O’Sullivan. 2016. Incorporating Functions in Mappings to Facilitate the Uplift of CSV Files into RDF. In The Semantic Web - ESWC 2016 Satellite Events.
5. François Scharffe, Ghislain Atemezing, Raphaël Troncy, Fabien Gandon, Serena Villata, Bénédicte Bucher, Fayçal Hamdi, Laurent Bihanic, Gabriel Képéklian, Franck Cotton, Jérôme Euzenat, Zhengjie Fan, Pierre-Yves Vandenbussche, and Bernard Vatant. 2012. Enabling linked data publication with the Datalift platform. In Proceedings of AAAI 2012, 26th Conference on Artificial Intelligence.
5. Related work
[Table: RML [1], SPARQL-Generate [2], KR2RML [3], RMLProcessor [4], DataLift [5] and RDF-Gen (rows) compared against objectives O1 to O7 (columns); the cell values were not preserved.]
6. RDF-Gen Architecture
• RDF-Gen consists of the following components:
• Data Connectors
• Triple Generator
• Link Discovery
7. RDF-Gen (Data Connectors)
• Configuration:
• Connector type for the given data source (file, stream, database, remote SPARQL endpoint, etc.)
• Data source URI (local or remote)
• Data-source-specific attributes, e.g. user credentials for database sources
• Tasks:
• Clean data on the fly, e.g. by dropping outliers according to rules given in the configuration
• Apply mappings between XML nodes found in separate XML files (XML data sources only)
• Output:
• Iterate through the data source records and output a uniform vector of values
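The connector behaviour described above can be illustrated with a minimal sketch. It is not the RDF-Gen implementation: the function `csv_connector`, the rule `plausible_altitude`, the field names and the sample records are all hypothetical, and a CSV source is assumed purely for brevity.

```python
import csv
import io
from typing import Callable, Iterable, Iterator, List

# Hypothetical cleaning rule: returns True when a record should be kept.
Rule = Callable[[dict], bool]


def csv_connector(source: Iterable[str], fields: List[str],
                  rules: List[Rule]) -> Iterator[List[str]]:
    """Iterate over CSV records and yield one uniform vector of values each."""
    for record in csv.DictReader(source):
        # On-the-fly cleaning: drop records that violate any configured rule.
        if all(rule(record) for rule in rules):
            # Missing fields become empty strings, so every vector has the same shape.
            yield [(record.get(f) or "").strip() for f in fields]


def plausible_altitude(record: dict) -> bool:
    """Illustrative outlier rule: keep altitudes between 0 and 60000 feet."""
    try:
        return 0 <= float(record["alt"]) <= 60000
    except (KeyError, TypeError, ValueError):
        return False


# Made-up sample data; the second record is an outlier and is dropped.
sample = io.StringIO(
    "id,lat,lon,alt\n"
    "A1,37.9,23.7,35000\n"
    "A2,38.0,23.8,999999\n"
    "A3,38.1,23.9,12000\n"
)
vectors = list(csv_connector(sample, ["id", "lat", "lon", "alt"], [plausible_altitude]))
```

Because the connector is a generator, it works the same way on a file that is read once and on a stream that keeps producing lines, which is the property the slides rely on for handling archival and streaming sources uniformly.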
10. RDF-Gen (Triple Generator)
• Configuration:
• a vector of variables V,
• an RDF graph template G conforming to the given Ontology Scheme, and
• a set of functions (available to all instances of the Triple Generator)
• Tasks:
• consumes the vector of values provided by the Data Connectors,
• generates triples by binding variables to their corresponding values using the graph template, and
• evaluates the pre-compiled functions (if any appear in the template) with the bound values as arguments
• enables linking of resources from different data sources w.r.t. their values (i.e. common URI-constructing functions generate the same URIs for the same values processed by different RDF-Gen instances on different data sources)
• Output:
• A set of triples corresponding to the consumed vector of values
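A rough sketch of the binding step follows. The template encoding (tuples for function calls, `?name` strings for variables), the functions `makeURI` and `literal`, and the `ex:` vocabulary are assumptions made for illustration; they are not RDF-Gen's actual template syntax or function library.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import quote

Triple = Tuple[str, str, str]


def make_uri(prefix: str, value: str) -> str:
    """Hypothetical URI constructor: the same value always yields the same URI."""
    return f"<{prefix}{quote(value, safe='')}>"


# Hypothetical function registry shared by all generator instances.
FUNCTIONS: Dict[str, Callable[..., str]] = {
    "makeURI": make_uri,
    "literal": lambda value: f'"{value}"',
}


def resolve(term, bindings: Dict[str, str]) -> str:
    """Resolve a template term: function call, variable, or constant."""
    if isinstance(term, tuple):
        name, *args = term
        return FUNCTIONS[name](*(resolve(a, bindings) for a in args))
    if term.startswith("?"):
        return bindings[term]
    return term


def generate(variables: List[str], template: list, values: List[str]) -> List[Triple]:
    """Bind the vector of values to the variables and instantiate the template."""
    bindings = dict(zip(variables, values))
    return [tuple(resolve(t, bindings) for t in pattern) for pattern in template]


VARIABLES = ["?id", "?lat"]
AIRCRAFT = ("makeURI", "http://example.org/aircraft/", "?id")
TEMPLATE = [
    (AIRCRAFT, "rdf:type", "ex:Aircraft"),
    (AIRCRAFT, "ex:latitude", ("literal", "?lat")),
]
triples = generate(VARIABLES, TEMPLATE, ["A1", "37.9"])
```

Since `make_uri` is deterministic, two generator instances working on different sources would produce the same subject URI for the identifier `A1`, which is the implicit linking effect mentioned in the tasks above.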
15. RDF-Gen (Link Discovery)
• Configuration:
• URIs of the streams S to be consumed (i.e. this or remote RDF-Gen instances, or local or
remote archived RDF triples)
• Link(s) L to be discovered (as specified in the given Ontology Scheme), under the
conditions C given in the configuration
• Data organizing method M to be applied
• Tasks:
• Organize resources in S according to M, and evaluate conditions C for each pair of
candidates in S, to discover links in L
• Output:
• A set of triples reporting the linked resources
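The interplay of S, M, C and L can be sketched as follows. This is an illustration under stated assumptions, not RDF-Gen's link discovery component: M is taken to be a uniform grid over planar coordinates, C a Euclidean distance threshold, and L a hypothetical `ex:nearTo` property; the function name and the sample resources are made up.

```python
from collections import defaultdict
from itertools import product
from math import floor, hypot
from typing import Iterable, List, Tuple

Resource = Tuple[str, float, float]  # (URI, x, y) in planar coordinates
Triple = Tuple[str, str, str]


def discover_links(resources: Iterable[Resource], max_dist: float,
                   link: str = "ex:nearTo") -> List[Triple]:
    """Link each incoming resource to earlier ones within max_dist (> 0)."""
    cell = max_dist  # cell size equals the threshold, so matches lie in adjacent cells
    grid = defaultdict(list)
    triples: List[Triple] = []
    for uri, x, y in resources:
        cx, cy = floor(x / cell), floor(y / cell)
        # Only candidates in the 3x3 neighbourhood are checked against the condition.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for other, ox, oy in grid.get((cx + dx, cy + dy), ()):
                if hypot(x - ox, y - oy) <= max_dist:
                    triples.append((other, link, uri))
        # Organize the resource so that later arrivals can be compared with it.
        grid[(cx, cy)].append((uri, x, y))
    return triples


stream = [("<ex:a>", 0.0, 0.0), ("<ex:b>", 0.5, 0.5),
          ("<ex:c>", 5.0, 5.0), ("<ex:d>", 1.2, 0.0)]
links = discover_links(stream, max_dist=1.0)
```

Organizing resources by grid cell avoids evaluating the condition for every pair, and processing resources one at a time keeps the sketch applicable to a stream as well as to an archived set.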
16. Evaluation
• Several data sets have been evaluated in the datAcron project.
• We present evaluation results for three different data sets, for typical or
large volumes of data varying between 100 and 1,000,000 entries:
• An artificial data set of Persons, generated by GenerateData.com, mapping 8 properties
• A real-life archival data set of aircraft, mapping 9 properties
• An aircraft surveillance streaming data set, mapping 5 properties
• We compare RDF-Gen to the state-of-the-art tools RML and SPARQL-Generate.
Configurations and executables used for the experiments are currently available at:
https://github.com/datAcron-project/RDF-Gen/
20. Conclusions
• This work proposes a new approach to generating RDF knowledge
graphs from multiple heterogeneous streaming and archival data sources, in a
uniform, efficient and scalable way.
• By separating the Data Connector from the Triple Generator, the RDF-Gen
approach outperforms the state-of-the-art tools RML and SPARQL-Generate
in terms of throughput, scalability and usability.
• This is achieved by implementing data access and close-to-the-source data
processing facilities in the Data Connectors, which pass data record by record
to the Triple Generators; these use graph templates as a generic way to
map data to RDF.
• RDF-Gen needs no further knowledge of a specific vocabulary, and it can be
used by anyone who can write simple SPARQL queries.
• It requires no underlying SPARQL engine, and it inherently supports
distribution of processing and the exploitation of streaming data sources.
21. Outlook
• Future work includes (but is not limited to) employing or extending RDF-Gen to:
• Implement a stream mashup, combining streams into a single stream of RDF triples
• Link contents of live open streams, and serve real time Linked Open Data to the
web
• Implement generic templates for commonly used Content Management Systems,
to allow client-side conversion of web content. RDF-Gen has been successfully used
as a web site crawler (non-CMS content) on https://doc8643.com to retrieve
aircraft model specifications as RDF triples and integrate them with our aircraft
data sets
• Introduce fully automated construction of the mappings/templates (i.e. a facility that
provides a set of suggestions for data-to-vocabulary mappings, variables, and
bindings to data)
22. Thank you!
For documentation, tutorials and source code please visit:
http://datacron-project.eu/
Acknowledgment
This work is supported by project datAcron, which has received funding from the European
Union’s Horizon 2020 research and innovation programme under grant agreement No 687591
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives
4.0 International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc-nd/4.0/
(c) AI-Group/UNIVERSITY OF PIRAEUS RESEARCH CENTER (UPRC)
Questions, comments and suggestions to
gsant@unipi.gr