RDF-Gen: Generating RDF from
Streaming and Archival Data
Georgios M. Santipantakis, Konstantinos I. Kotis, George A. Vouros, Christos Doulkeridis
Department of Digital Systems
University of Piraeus, Greece
Contents
• Problem Definition
• Related work
• RDF-Gen
• Experimental Results
• Outlook
Problem Definition
Given a set of data sources, archival or streaming, in various formats, we want a
framework capable of generating ontology-annotated RDF graphs with high throughput
and low latency.
We focus mainly on the following objectives:
O1 Inherently support the RDF generation of both streaming and archival datasets.
O2 Provide facilities for close-to-source data processing tasks, e.g. for data cleansing,
data manipulation and conversion, and generation of URIs.
O3 Support close-to-source link discovery functionality.
O4 Demonstrate computational efficiency in terms of high throughput and low data-
generation latency.
O5 Demonstrate the scalability which is necessary for the transformation of big data.
O6 Demonstrate extensibility, in the sense that (i) it can integrate custom data
processing and manipulation functions, and (ii) it can be instantiated to new data
formats.
O7 Support reusability of solutions across data sources of the same domain.
Related work
1. Anastasia Dimou, Miel Vander Sande, Pieter Colpaert, Ruben Verborgh, Erik Mannens, and Rik Van de Walle. 2014. RML: A Generic Language for Integrated
RDF Mappings of Heterogeneous Data. In Proceedings of the 7th Workshop on Linked Data on the Web (CEUR Workshop Proceedings)
2. Maxime Lefrançois, Antoine Zimmermann, and Noorani Bakerally. 2017. A SPARQL Extension for Generating RDF from Heterogeneous Formats. In
Proceedings of ESWC 2017.
3. Jason Slepicka, Chengye Yin, Pedro Szekely, and Craig Knoblock. 2015. KR2RML: An Alternative Interpretation of R2RML for Heterogenous Sources. In
Proceedings of the 6th International Workshop on Consuming Linked Data (COLD 2015)
4. Ademar Crotti Junior, Christophe Debruyne, and Declan O’Sullivan. 2016. Incorporating Functions in Mappings to Facilitate the Uplift of CSV Files into RDF. In
The Semantic Web - ESWC 2016 Satellite Events
5. François Scharffe, Ghislain Atemezing, Raphaël Troncy, Fabien Gandon, Serena Villata, Bénédicte Bucher, Fayçal Hamdi, Laurent Bihanic, Gabriel Képéklian,
Franck Cotton, Jérôme Euzenat, Zhengjie Fan, Pierre-Yves Vandenbussche, and Bernard Vatant. 2012. Enabling linked data publication with the Datalift
platform. In Proceedings of AAAI 2012, 26th Conference on Artificial Intelligence
O1 O2 O3 O4 O5 O6 O7
RML [1]     
SPARQL-Generate [2]  
KR2RML [3]     
RMLProcessor [4]   
DataLift [5]  
RDF-Gen       
RDF-Gen Architecture
• RDF-Gen consists of the following components:
• Data Connectors
• Triple Generator
• Link Discovery
RDF-Gen (Data Connectors)
• Configuration:
• Connector type for the given data source (file, stream, database, remote SPARQL endpoint, etc.)
• Data source URI (local or remote)
• Data-source-specific attributes, e.g. user credentials for database sources
• Tasks:
• Clean data on the fly, dropping outliers according to rules given in the configuration
• Apply mappings between XML nodes found in separate XML files (XML data sources only)
• Output:
• Iterate through the data source records and output a uniform vector of values
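The Data Connector behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RDF-Gen implementation: the function name csv_connector, the rule valid_latitude, the field names and the sample records are all hypothetical, and a CSV source stands in for any connector type.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, List, Optional

# Hypothetical cleaning rule: returns True when a record should be kept.
Rule = Callable[[Dict[str, Optional[str]]], bool]


def csv_connector(source: Iterable[str], fields: List[str],
                  rules: List[Rule]) -> Iterator[List[Optional[str]]]:
    """Iterate over CSV records and yield one uniform vector per kept record."""
    for record in csv.DictReader(source):
        # Drop outliers on the fly: skip any record that violates a rule.
        if not all(rule(record) for rule in rules):
            continue
        # Uniform output: same field order for every record, None if empty or missing.
        yield [record.get(name) or None for name in fields]


def valid_latitude(record: Dict[str, Optional[str]]) -> bool:
    """Hypothetical rule: keep records whose 'lat' parses and lies in [-90, 90]."""
    try:
        return -90.0 <= float(record["lat"]) <= 90.0
    except (KeyError, TypeError, ValueError):
        return False


# Made-up sample data; the second record has an out-of-range latitude.
SAMPLE = "id,lat,lon\nA1,37.94,23.65\nA2,137.00,23.70\nA3,38.01,\n"

vectors = list(csv_connector(io.StringIO(SAMPLE), ["id", "lat", "lon"], [valid_latitude]))
for vector in vectors:
    print(vector)
```

Because the connector is a generator, records are handed over one at a time, which suits both archival files and unbounded streams.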
RDF-Gen (Triple Generator)
• Configuration:
• a vector of variables V,
• an RDF graph template G conforming to the given ontology schema, and
• a set of functions (available to all instances of the Triple Generator)
• Tasks:
• consumes the vector of values provided by the Data Connectors,
• generates triples by binding variables to their corresponding values using the graph template,
• evaluates the pre-compiled functions (if any appear in the template) with the bound values as arguments, and
• enables linking of resources from different data sources w.r.t. their values (i.e. common functions
that construct URIs will generate the same URIs for the same values processed by different RDF-Gen
instances on different data sources)
• Output:
• A set of triples corresponding to the consumed vector of values
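A minimal Python sketch of the Triple Generator idea follows: a vector of values is bound to the variables V, and each term of the graph template G is either a constant, a variable, or a function call over variables. The template encoding, the function names aircraftURI and asDouble, the ex: vocabulary and the example.org namespace are assumptions made for illustration; the actual RDF-Gen template syntax is not reproduced here.

```python
from typing import Callable, Dict, List, Sequence, Tuple, Union

# A template term is a constant, a "?variable", or a (function_name, "?arg", ...) tuple.
Term = Union[str, Tuple[str, ...]]
Triple = Tuple[str, str, str]


def make_uri(kind: str, value: str) -> str:
    """Deterministic URI construction: equal values always give equal URIs."""
    return f"<http://example.org/{kind}/{value.strip().replace(' ', '_')}>"


# Hypothetical function set shared by all generator instances.
FUNCTIONS: Dict[str, Callable[..., str]] = {
    "aircraftURI": lambda v: make_uri("aircraft", v),
    "asDouble": lambda v: f'"{float(v)}"^^xsd:double',
}

VARIABLES: List[str] = ["?id", "?lat", "?lon"]

TEMPLATE: List[Tuple[Term, Term, Term]] = [
    (("aircraftURI", "?id"), "rdf:type", "ex:Aircraft"),
    (("aircraftURI", "?id"), "ex:latitude", ("asDouble", "?lat")),
    (("aircraftURI", "?id"), "ex:longitude", ("asDouble", "?lon")),
]


def resolve(term: Term, bindings: Dict[str, str]) -> str:
    """Replace a template term by its value under the current bindings."""
    if isinstance(term, tuple):
        name, *args = term
        return FUNCTIONS[name](*(bindings[arg] for arg in args))
    if term.startswith("?"):
        return bindings[term]
    return term


def generate_triples(values: Sequence[str]) -> List[Triple]:
    """Bind one vector of values to VARIABLES and instantiate TEMPLATE."""
    bindings = dict(zip(VARIABLES, values))
    return [tuple(resolve(term, bindings) for term in pattern) for pattern in TEMPLATE]


triples = generate_triples(["A1", "37.94", "23.65"])
for triple in triples:
    print(" ".join(triple), ".")
```

Since make_uri depends only on its input, a second generator instance reading a different source but seeing the same identifier produces the same subject URI, which is the value-based linking effect described in the tasks above.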
RDF-Gen (Link Discovery)
• Configuration:
• URIs of the streams S to be consumed (i.e. this or remote RDF-Gen instances, local or
remote archived RDF triples)
• Link(s) L to be discovered (as specified in the given ontology schema), under the
conditions C given in the configuration
• Data organizing method M to be applied
• Tasks:
• Organize resources in S according to M, and evaluate conditions C for each pair of
candidates in S, to discover links in L
• Output:
• A set of triples reporting the linked resources
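The Link Discovery steps can be illustrated with a small Python sketch. Everything specific here is an assumption chosen for the example: the organizing method M is an equi-grid over latitude and longitude, the condition C is a haversine distance threshold, the link L is a hypothetical ex:nearTo property, and the constants CELL_DEG and MAX_KM are arbitrary. The slides do not state which methods or conditions RDF-Gen ships with.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Resource = Tuple[str, float, float]  # (URI, latitude, longitude)

CELL_DEG = 0.1  # hypothetical grid cell size in degrees (method M)
MAX_KM = 5.0    # hypothetical distance threshold (condition C)
# Assumption: a cell is at least MAX_KM wide, which holds for 0.1 degrees
# below roughly 63 degrees of latitude, so neighbours suffice as candidates.


def cell_of(lat: float, lon: float) -> Tuple[int, int]:
    """Map a position to its grid cell."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def discover_links(resources: List[Resource]) -> List[Tuple[str, str, str]]:
    """Organize resources in a grid, then test only pairs from nearby cells."""
    grid: Dict[Tuple[int, int], List[Resource]] = defaultdict(list)
    for res in resources:
        grid[cell_of(res[1], res[2])].append(res)

    links = set()
    for (cx, cy), bucket in grid.items():
        # Candidates come from the same cell and its eight neighbours.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for a in bucket:
                for b in grid.get((cx + dx, cy + dy), []):
                    # a[0] < b[0] reports each unordered pair once.
                    if a[0] < b[0] and haversine_km(a[1], a[2], b[1], b[2]) <= MAX_KM:
                        links.add((a[0], "ex:nearTo", b[0]))
    return sorted(links)


# Made-up positions: A and B are about 1.4 km apart, C is far from both.
links = discover_links([
    ("ex:A", 37.94, 23.65),
    ("ex:B", 37.95, 23.66),
    ("ex:C", 38.50, 24.50),
])
print(links)
```

Organizing resources first keeps the number of evaluated pairs well below the all-pairs count, which matters when the inputs are streams.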
Evaluation
• Several data sets have been evaluated in the datAcron project.
• We present evaluation results for three different data sets, for typical or
large volumes of data varying between 100 and 1,000,000 entries:
• An artificial data set of Persons, generated by GenerateData.com, mapping 8
properties
• A real-life archival data set of aircraft, mapping 9 properties
• A streaming data set of aircraft surveillance data, mapping 5 properties
• We compare RDF-Gen to the state-of-the-art tools RML and SPARQL-Generate.
Configurations and executables used for the experiments are currently available at:
https://github.com/datAcron-project/RDF-Gen/
Micro-average throughput:
the number of records processed per second, computed as the ratio of the
total number of records to the total processing time
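As a small worked example of this definition, the Python sketch below computes the micro-average over a few runs and, for contrast, the macro-average (the mean of the per-run throughputs). The run sizes and timings are made-up numbers for illustration only, not measurements from the evaluation.

```python
from typing import List, Tuple

Run = Tuple[int, float]  # (number of records, processing time in seconds)


def micro_average_throughput(runs: List[Run]) -> float:
    """Total number of records divided by total processing time."""
    total_records = sum(records for records, _ in runs)
    total_seconds = sum(seconds for _, seconds in runs)
    return total_records / total_seconds


def macro_average_throughput(runs: List[Run]) -> float:
    """Mean of the per-run throughputs, shown only for contrast."""
    return sum(records / seconds for records, seconds in runs) / len(runs)


# Hypothetical runs, not taken from the experiments.
runs = [(100, 0.5), (10_000, 4.0), (100_000, 20.0)]

micro = micro_average_throughput(runs)
macro = macro_average_throughput(runs)
print(f"micro-average: {micro:.1f} records/s")
print(f"macro-average: {macro:.1f} records/s")
```

The micro-average weights every record equally, so large runs dominate it, whereas the macro-average weights every run equally.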
Figure: Processing time for surveillance data sets varying from 100 to 100,000 records
Figure: Processing time for surveillance data sets varying from 100,000 to 1,000,000 records
Conclusions
• This work proposes a new approach to generating RDF knowledge graphs from
multiple heterogeneous streaming and archival data sources in a uniform,
efficient, and scalable way.
• By separating the Data Connector from the Triple Generator, the RDF-Gen
approach outperforms the state-of-the-art tools RML and SPARQL-Generate
in terms of throughput, scalability, and usability.
• This is achieved by implementing data access and close-to-the-source data
processing facilities in the Data Connectors, which provide data record by
record to the Triple Generators; these use graph templates as a generic way to
map data to RDF.
• RDF-Gen needs no further knowledge of a specific vocabulary, and it can be
used by anyone who can write simple SPARQL queries.
• It requires no underlying SPARQL engine, and it inherently supports
distribution of processing and the exploitation of streaming data sources.
Outlook
• Future work includes (but is not limited to) employing or extending RDF-Gen to:
• Implement a stream mashup, combining streams into a single stream of RDF triples
• Link the contents of live open streams and serve real-time Linked Open Data to the
web
• Implement generic templates for commonly used Content Management Systems,
to allow client-side conversion of web content. RDF-Gen has been successfully used
as a web site crawler (non-CMS content) on https://doc8643.com to retrieve
aircraft model specifications as RDF triples and integrate them with our aircraft
data sets
• Introduce fully automated construction of the mappings/templates (i.e. one that
provides a set of suggestions for data-to-vocabulary mappings, variables, and
bindings to data)
Thank you!
For documentation, tutorials and source code please visit:
http://datacron-project.eu/
Acknowledgment
This work is supported by project datAcron, which has received funding from the European
Union’s Horizon 2020 research and innovation programme under grant agreement No 687591
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives
4.0 International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc-nd/4.0/
(c) AI-Group/UNIVERSITY OF PIRAEUS RESEARCH CENTER (UPRC)
Questions, comments, and suggestions to
gsant@unipi.gr

 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 

RDF-Gen: Generating RDF from streaming and archival data

  • 1. RDF-Gen: Generating RDF from Streaming and Archival Data Georgios M. Santipantakis, Konstantinos I. Kotis, George A. Vouros, Christos Doulkeridis Department of Digital Systems University of Piraeus, Greece
  • 2. Contents • Problem Definition • Related work • RDF-Gen • Experimental Results • Outlook
  • 3. Problem Definition
      Given a set of data sources, archival or streaming, in various formats, we want a framework capable of generating ontology-annotated RDF graphs with high throughput and low latency. We mainly focus on the following objectives:
      • O1 Inherently support the RDF generation of both streaming and archival datasets.
      • O2 Provide facilities for close-to-source data processing tasks, e.g. for data cleansing, data manipulation and conversion, and generation of URIs.
      • O3 Support close-to-source link discovery functionality.
      • O4 Demonstrate computational efficiency in terms of high throughput and low data-generation latency.
      • O5 Demonstrate the scalability which is necessary for the transformation of big data.
      • O6 Demonstrate extensibility, in the sense that (i) it can integrate custom data processing and manipulation functions, and (ii) it can be instantiated to new data formats.
      • O7 Support reusability of solutions across data sources of the same domain.
  • 4. Related work
      1. Anastasia Dimou, Miel Vander Sande, Pieter Colpaert, Ruben Verborgh, Erik Mannens, and Rik Van de Walle. 2014. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In Proceedings of the 7th Workshop on Linked Data on the Web (CEUR Workshop Proceedings).
      2. Maxime Lefrançois, Antoine Zimmermann, and Noorani Bakerally. 2017. A SPARQL Extension for Generating RDF from Heterogeneous Formats. In Proceedings of ESWC 2017.
      3. Jason Slepicka, Chengye Yin, Pedro Szekely, and Craig Knoblock. 2015. KR2RML: An Alternative Interpretation of R2RML for Heterogenous Sources. In Proceedings of the 6th International Workshop on Consuming Linked Data (COLD 2015).
      4. Ademar Crotti Junior, Christophe Debruyne, and Declan O’Sullivan. 2016. Incorporating Functions in Mappings to Facilitate the Uplift of CSV Files into RDF. In The Semantic Web - ESWC 2016 Satellite Events.
      5. François Scharffe, Ghislain Atemezing, Raphaël Troncy, Fabien Gandon, Serena Villata, Bénédicte Bucher, Fayçal Hamdi, Laurent Bihanic, Gabriel Képéklian, Franck Cotton, Jérôme Euzenat, Zhengjie Fan, Pierre-Yves Vandenbussche, and Bernard Vatant. 2012. Enabling linked data publication with the Datalift platform. In Proceedings of AAAI 2012, 26th Conference on Artificial Intelligence.
  • 5. Related work
      Number of objectives (O1 to O7, as defined on slide 3) marked as supported for each tool. The per-objective columns of the original comparison table are not recoverable from this transcript; only the number of marks per row survives:
      • RML [1]: 5 of 7
      • SPARQL-Generate [2]: 2 of 7
      • KR2RML [3]: 5 of 7
      • RMLProcessor [4]: 3 of 7
      • DataLift [5]: 2 of 7
      • RDF-Gen: 7 of 7
  • 6. RDF-Gen Architecture
      RDF-Gen consists of the following components:
      • Data Connectors
      • Triple Generator
      • Link Discovery
  • 7. RDF-Gen (Data Connectors)
      Configuration:
      • Connector type for the given data source (file, stream, database, remote SPARQL endpoint, etc.)
      • Data source URI (local or remote)
      • Data-source-specific attributes, e.g. user credentials for database sources
      Tasks:
      • Apply data cleaning, dropping outliers on the fly (according to rules given in the configuration)
      • Employ mappings between XML nodes found in separate XML files (applies only to XML data sources)
      Output:
      • Iterate through the data source records and output a uniform vector of values
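The connector behaviour described on this slide can be illustrated with a minimal sketch. It is not RDF-Gen's actual implementation: the function `csv_connector`, the rule `plausible_altitude`, the field names, and the sample records are all hypothetical, chosen only to show a file connector that filters records by configured rules and emits a uniform vector of values per record.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List

# Hypothetical rule type: returns True when a record should be kept.
Rule = Callable[[Dict[str, str]], bool]


def csv_connector(source: io.TextIOBase, fields: List[str],
                  rules: List[Rule]) -> Iterator[List[str]]:
    """Iterate over CSV records, drop those failing any cleaning rule,
    and yield one uniform vector of values (in `fields` order) per record."""
    for record in csv.DictReader(source):
        if all(rule(record) for rule in rules):
            yield [record.get(name) or "" for name in fields]


def plausible_altitude(record: Dict[str, str]) -> bool:
    """Assumed cleaning rule: keep records whose altitude is a number in range."""
    try:
        return 0 <= float(record["altitude"]) <= 60000
    except (KeyError, TypeError, ValueError):
        return False


# Made-up sample data standing in for an archival file.
sample = io.StringIO(
    "icao,altitude,lat,lon\n"
    "A1,35000,37.9,23.7\n"
    "B2,-999999,38.0,23.8\n"
    "C3,12000,40.6,22.9\n"
)
vectors = list(csv_connector(sample, ["icao", "lat", "lon", "altitude"],
                             [plausible_altitude]))
for vector in vectors:
    print(vector)
```

Because the connector is a generator, records are handed downstream one at a time, which is the same shape a streaming source would need.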
  • 10. RDF-Gen (Triple Generator)
      Configuration:
      • a vector of variables V,
      • an RDF graph template G conforming to the given Ontology Scheme, and
      • a set of functions (available to all instances of the Triple Generator)
      Tasks:
      • consumes the vector of values provided by the Data Connectors,
      • generates triples simply by binding variables to their corresponding values using the graph template,
      • evaluates the pre-compiled functions (if any appear in the template) with the bound values as arguments, and
      • enables linking of resources from different data sources with respect to their values (i.e. common functions constructing URIs will generate the same URIs for the same values processed by different RDF-Gen instances on different data sources)
      Output:
      • A set of triples corresponding to the consumed vector of values
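A minimal sketch of the variable-binding idea on this slide follows. The template representation, the function names `makeURI` and `asLiteral`, the `ex:` prefix, and the example.org namespace are assumptions made for illustration; RDF-Gen's real template syntax is not shown in this transcript.

```python
from typing import Callable, Dict, List, Tuple, Union
from urllib.parse import quote

# A template term is a constant, a variable name ("?x"),
# or a function call written as (function_name, [argument terms]).
Term = Union[str, Tuple[str, List[str]]]


def make_uri(prefix: str, value: str) -> str:
    """Deterministic URI construction: equal values give equal URIs,
    whichever generator instance or data source they come from."""
    return f"<{prefix}{quote(value, safe='')}>"


# Hypothetical function registry shared by all generator instances.
FUNCTIONS: Dict[str, Callable[..., str]] = {
    "makeURI": make_uri,
    "asLiteral": lambda value: f'"{value}"',
}


def resolve(term: Term, bindings: Dict[str, str]) -> str:
    """Replace a variable by its bound value, or evaluate a function call."""
    if isinstance(term, tuple):
        name, args = term
        return FUNCTIONS[name](*[resolve(arg, bindings) for arg in args])
    if term.startswith("?"):
        return bindings[term]
    return term


def generate_triples(variables: List[str], template: List[Tuple[Term, Term, Term]],
                     values: List[str]) -> List[Tuple[str, str, str]]:
    """Bind the vector of values to the variables V and instantiate template G."""
    bindings = dict(zip(variables, values))
    return [(resolve(s, bindings), resolve(p, bindings), resolve(o, bindings))
            for s, p, o in template]


V = ["?icao", "?altitude"]
SUBJECT = ("makeURI", ["http://example.org/aircraft/", "?icao"])
G = [
    (SUBJECT, "rdf:type", "ex:Aircraft"),
    (SUBJECT, "ex:altitude", ("asLiteral", ["?altitude"])),
]
triples = generate_triples(V, G, ["A1", "35000"])
for triple in triples:
    print(" ".join(triple), ".")
```

Since `make_uri` depends only on its arguments, two independent instances that process the value `A1` produce the same subject URI, which is the linking property the slide describes.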
  • 15. RDF-Gen (Link Discovery)
      Configuration:
      • URIs of the streams S to be consumed (i.e. this or remote RDF-Gen instances, local or remote archived RDF triples)
      • Link(s) L to be discovered (as specified in the given Ontology Scheme), under conditions C given in the configuration
      • Data organizing method M to be applied
      Tasks:
      • Organize the resources in S according to M, and evaluate the conditions C for each pair of candidates in S, to discover links in L
      Output:
      • A set of triples reporting the linked resources
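The roles of S, M, C, and L can be sketched as follows. This is an illustration under stated assumptions, not the RDF-Gen link discovery component: a uniform grid is assumed as the organizing method M, a distance threshold as the condition C, and `ex:nearTo` as the link L; the resource URIs and coordinates are made up.

```python
from collections import defaultdict
from math import floor, hypot
from typing import Callable, List, Tuple

Resource = Tuple[str, float, float]  # (URI, x, y)


def discover_links(resources: List[Resource], cell_size: float,
                   condition: Callable[[Resource, Resource], bool],
                   link: str) -> List[Tuple[str, str, str]]:
    """Organize resources in a grid (assumed method M), evaluate the condition C
    only on pairs from the same or adjacent cells, and report link L as triples."""
    grid = defaultdict(list)
    for resource in resources:
        _, x, y = resource
        grid[(floor(x / cell_size), floor(y / cell_size))].append(resource)

    triples = set()
    for (cx, cy), members in grid.items():
        # Candidates come from this cell and its eight neighbours.
        candidates = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                candidates.extend(grid.get((cx + dx, cy + dy), []))
        for a in members:
            for b in candidates:
                # Order by URI so each unordered pair is reported once.
                if a[0] < b[0] and condition(a, b):
                    triples.add((a[0], link, b[0]))
    return sorted(triples)


def within_one_unit(a: Resource, b: Resource) -> bool:
    """Assumed condition C: Euclidean distance of at most 1.0."""
    return hypot(a[1] - b[1], a[2] - b[2]) <= 1.0


S = [("ex:a", 0.2, 0.2), ("ex:b", 0.9, 0.4), ("ex:c", 1.1, 0.3), ("ex:d", 5.0, 5.0)]
links = discover_links(S, cell_size=1.0, condition=within_one_unit, link="ex:nearTo")
for triple in links:
    print(" ".join(triple), ".")
```

With the cell size equal to the distance threshold, any two resources that satisfy the condition fall in the same or adjacent cells, so the grid prunes comparisons without missing links.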
  • 16. Evaluation
      • Several data sets have been evaluated in the datAcron project.
      • We present evaluation results for three different data sets, for typical or large volumes of data varying between 100 and 1,000,000 entries:
          • An artificial data set of Persons, generated by GenerateData.com, mapping 8 properties
          • A real-life archival data set of aircraft, mapping 9 properties
          • An aircraft surveillance streaming data set, mapping 5 properties
      • We compare RDF-Gen to the state-of-the-art tools RML and SPARQL-Generate.
      Configurations and executables used for the experiments are currently available at: https://github.com/datAcron-project/RDF-Gen/
  • 17. Evaluation
      Micro-average throughput: the number of records processed per second, computed as the ratio of the total number of records to the total processing time.
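As a small worked example of this definition, the sketch below computes the micro-average over several runs and, for contrast only, the macro-average (the mean of per-run throughputs), which the slide does not use. The run sizes and timings are invented for illustration and are not results from the evaluation.

```python
from typing import List, Tuple

Run = Tuple[int, float]  # (records processed, processing time in seconds)


def micro_average_throughput(runs: List[Run]) -> float:
    """Total number of records divided by total processing time."""
    total_records = sum(records for records, _ in runs)
    total_seconds = sum(seconds for _, seconds in runs)
    return total_records / total_seconds


def macro_average_throughput(runs: List[Run]) -> float:
    """Mean of the per-run throughputs (shown only for contrast)."""
    return sum(records / seconds for records, seconds in runs) / len(runs)


# Illustrative numbers only.
runs = [(100, 0.5), (1000, 2.0), (10000, 10.0)]
micro = micro_average_throughput(runs)  # 11100 records / 12.5 s
macro = macro_average_throughput(runs)  # (200 + 500 + 1000) / 3
print(f"micro-average: {micro:.1f} records/s")
print(f"macro-average: {macro:.1f} records/s")
```

The micro-average weights every record equally, so large runs dominate the figure, whereas the macro-average weights every run equally.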
  • 18. Evaluation
      Processing time for surveillance data sets varying from 100 to 100,000 records
  • 19. Evaluation
      Processing time for surveillance data sets varying from 100,000 to 1,000,000 records
  • 20. Conclusions
      • This work proposes a new approach towards generating RDF knowledge graphs from multiple heterogeneous streaming and archival data sources, in a uniform, efficient and scalable way.
      • By separating the Data Connector from the Triple Generator, the RDF-Gen approach outperforms the state-of-the-art tools RML and SPARQL-Generate in terms of throughput, scalability and usability.
      • This is achieved by implementing data access and close-to-the-source data processing facilities in the Data Connectors, which provide data record by record to the Triple Generators; these use graph templates as a generic way to map data to RDF.
      • RDF-Gen needs no further knowledge of a specific vocabulary, and it can be used by anyone who can write simple SPARQL queries.
      • It requires no underlying SPARQL engine, and it inherently supports distribution of processing and the exploitation of streaming data sources.
  • 21. Outlook
      Future work includes (but is not limited to) employing or extending RDF-Gen to:
      • Implement a stream mashup, combining streams into a single stream of RDF triples
      • Link the contents of live open streams, and serve real-time Linked Open Data to the web
      • Implement generic templates for commonly used Content Management Systems, to allow client-side conversion of web content. RDF-Gen has been successfully used as a web site crawler (non-CMS content) on https://doc8643.com to retrieve aircraft model specifications as RDF triples and integrate them with our aircraft data sets
      • Introduce a fully automated construction of the mappings/templates (i.e. one which will provide a set of suggestions for data-to-vocabulary mappings, variables, and bindings to data)
  • 22. Thank you!
      For documentation, tutorials and source code please visit: http://datacron-project.eu/
      Acknowledgment: This work is supported by project datAcron, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 687591.
      This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/
      (c) AI-Group/UNIVERSITY OF PIRAEUS RESEARCH CENTER (UPRC)
      Questions, comments, and suggestions to gsant@unipi.gr