SlideShare a Scribd company logo
Towards efficient 
processing of 
RDF data streams 
Alejandro Llaves 
Javier D. Fernández 
Oscar Corcho 
Ontology Engineering Group 
Universidad Politécnica de Madrid 
Madrid, Spain 
allaves@fi.upm.es 
OrdRing workshop - ISWC 2014 
Riva del Garda 
October 20th, 2014
Efficient?? Scalable? 
http://blog.mikiobraun.de/
Outline 
 Introduction 
 Background: Storm and Lambda Architecture 
 Efficient processing of queries over RDF streams 
 Architecture overview 
 Storm-based operators for querying RDF streams 
 Adaptive query processing for data streams 
 RDF stream compression 
 Conclusions & future work 
 Open questions
Introduction 
 Origins of Linked Stream Data 
 Extracting information from data streams is complex: 
heterogeneity, rate of generation, volume, provenance,… 
 Challenges 
 C1. Efficient processing of user queries over RDF streams 
 C2. Continuous transmission of data increases latency 
 C3. Integration of historical and real-time data with background 
knowledge 
Source: http://webnotations.com
Background 
Storm - http://storm.incubator.apache.org/ 
 Distributed system for real-time processing of streams 
 Why Storm? 
 Simple processing model (parallelization) 
 Open source community backing the project 
 Used by relevant companies, e.g. Twitter. 
Lambda Architecture (Marz 2013) 
 Batch layer: stores ALL the incoming data in an immutable 
master dataset and pre-computes batch views on historic data. 
 Serving layer: indexes views on the master dataset. 
 Real-time processing layer: requests data views depending on 
incoming queries.
Efficient processing of RDF streams 
Goal: to develop a stream processing engine capable of 
adapting to variable conditions, such as changing rates of 
input data, failure of processing nodes, or distribution of 
workload, while serving complex continuous queries. 
Methodology 
 State of the art of (RDF) stream processing 
 Evaluate how to parallelize SPARQLStream queries 
 Implementation of RDF query operators 
 Optimize parallelization for common queries 
 Design self-adaptive strategies that allow the engine to 
react in front of changes
Architecture overview
Storm-based operators for querying RDF streams 
 Triple2Graph operator 
 Time Window operator 
 Simple Join operator 
 Projection operator 
Simple Join Stream<Set<Tuple>> 
(join attribute) 
Stream<Set<Tuple>, Set<Tuple>> 
Windowing Stream<Set<Graph>> 
(window size, 
emission time) 
Stream<Graph, t> 
Project Stream<Tuple<o1, o2,..., on>> 
(input, output) 
Stream<Tuple<i1, i2,..., in>> 
Triple2Graph Stream<Graph, t> 
(graph starter) 
Stream<s, p, o>
Storm-based operators for querying RDF streams 
Project 
Project 
Project 
Project 
Storm topology example (4 nodes) 
SELECT ?obs.value ?sensors.location 
FROM NAMED STREAM <obs> [60 SEC TO NOW] 
FROM NAMED STREAM <sensors> [60 SEC TO NOW] 
WHERE obs.sensorId = sensors.id ; 
Simple Join 
Simple Join 
Simple Join 
Simple Join 
Windowing 
<t0, t2...> 
Windowing 
<t1, t3...> 
Windowing 
<t0, t2...> 
Windowing 
<t1, t3...> 
SPOUT 
<obs> 
Triple2Graph 
Output 
SPOUT 
<sensors> Triple2Graph 
t0 
t1 
t2 
t3
RDF stream compression (1/2) 
Efficient RDF Interchange (ERI) format 
 Based on Efficient XML Interchange (EXI) 
 Main assumption: RDF streams have regular structure and 
are redundant 
 Information encoded at 2 levels 
 Structural dictionary 
 Presets (values) 
 Example: SSN observations
Where can we apply RDF stream compression?
Conclusions and future work 
Conclusions 
 We have addressed challenges C1 (scalability) and C2 (transmission) 
 Catalogue of Storm-based operators to parallelize query processing over RDF 
streams. 
 New format for RDF stream compression called ERI. 
 Challenge C3 (integration) involves storage of historical data and the 
deployment of batch and serving layers OR the migration to a more 
general system, e.g. Apache Spark. 
Future work 
 Finish the implementation of RDF query operators 
 Test the parallelization of a set of common queries → SRBench 
 Adaptive strategies based on Adaptive Query Processing 
 Evaluation → Benchmarking: comparison to CQELS Cloud 
 Integration of ERI into our engine
Open questions 
 How does the order of tuple arrival affect the 
parallelization of join processing tasks? 
 Are the spatial (or spatio-temporal) properties of a 
tuple a dimension to have into account for ordering? 
In such case, what influence does it have on 
reasoning tasks? And on parallelization tasks? 
 How does the out-of-order tuples affect the 
processing of streams? In case of discarding out-of-order 
tuples, how to communicate this in the 
results?
Thanks! 
The research leading to this results has received funding from the 
EU's Seventh Framework Programme (FP7/2007-2013) under 
grant agreement no. 257641, PlanetData network of excellence, 
from Ministerio de Economía y Competitividad (Spain) under the 
project “4V: Volumen, Velocidad, Variedad y Validez en la Gestión 
Innovadora de Datos” (TIN2013-46238-C4-2-R), and has been 
supported by an AWS in Education Research Grant award. 
Alejandro Llaves 
allaves@fi.upm.es
Adaptive Query Processing (AQP) for data streams 
 Traditional databases include a query optimizer that 
designs the execution plan based on the data 
statistics. 
 AQP (Deshpande 2007) techniques allow adjusting 
the query execution plan to varying conditions of 
the data input, the incoming queries, and the 
system.
RDF stream compression (2/2) 
Evaluation 
 Datasets: streaming, statistical, and general static. 
 Compression ratio, compression time, and parsing 
throughput (transmission + decompression) 
 Comparison to other formats, such as N-Triples, Turtle, 
RDSZ, HDT, with different configurations of ERI w.r.t. 
transmitted data block (1K – 4K) and the presence of 
dictionary. 
 Conclusion: ERI produces state-of-the-art compression for 
RDF streams and excels for regularly-structured static RDF 
datasets. ERI compression ratios remain competitive in 
general datasets and the time overheads for ERI 
processing are relatively low. 
http://dataweb.infor.uva.es/wp-content/uploads/2014/07/iswc14.pdf

More Related Content

What's hot

Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Willy Marroquin (WillyDevNET)
 
RDF Stream Processing: Let's React
RDF Stream Processing: Let's ReactRDF Stream Processing: Let's React
RDF Stream Processing: Let's React
Jean-Paul Calbimonte
 
RDF Stream Processing Tutorial: RSP implementations
RDF Stream Processing Tutorial: RSP implementationsRDF Stream Processing Tutorial: RSP implementations
RDF Stream Processing Tutorial: RSP implementations
Jean-Paul Calbimonte
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
Revolution Analytics
 
RSP4J: An API for RDF Stream Processing
RSP4J: An API for RDF Stream ProcessingRSP4J: An API for RDF Stream Processing
RSP4J: An API for RDF Stream Processing
Riccardo Tommasini
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Databricks
 
Giraph++: From "Think Like a Vertex" to "Think Like a Graph"
Giraph++: From "Think Like a Vertex" to "Think Like a Graph"Giraph++: From "Think Like a Vertex" to "Think Like a Graph"
Giraph++: From "Think Like a Vertex" to "Think Like a Graph"
Yuanyuan Tian
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
National Institute of Informatics
 
final_copy_camera_ready_paper (7)
final_copy_camera_ready_paper (7)final_copy_camera_ready_paper (7)
final_copy_camera_ready_paper (7)Ankit Rathi
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
Sigmoid
 
An Introduction to Spark with Scala
An Introduction to Spark with ScalaAn Introduction to Spark with Scala
An Introduction to Spark with Scala
Chetan Khatri
 
Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics?
Revolution Analytics
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
Carol McDonald
 
RDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of SemanticsRDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of Semantics
Jean-Paul Calbimonte
 
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARKBig Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Matt Stubbs
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Till Blume
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Revolution Analytics
 
Summary of the Stream Reasoning workshop at ISWC 2016
Summary of the Stream Reasoning workshop at ISWC 2016Summary of the Stream Reasoning workshop at ISWC 2016
Summary of the Stream Reasoning workshop at ISWC 2016
Daniele Dell'Aglio
 
AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Intro
jeykottalam
 
A Hierarchical approach towards Efficient and Expressive Stream Reasoning
A Hierarchical approach towards Efficient and Expressive Stream ReasoningA Hierarchical approach towards Efficient and Expressive Stream Reasoning
A Hierarchical approach towards Efficient and Expressive Stream Reasoning
Riccardo Tommasini
 

What's hot (20)

Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
 
RDF Stream Processing: Let's React
RDF Stream Processing: Let's ReactRDF Stream Processing: Let's React
RDF Stream Processing: Let's React
 
RDF Stream Processing Tutorial: RSP implementations
RDF Stream Processing Tutorial: RSP implementationsRDF Stream Processing Tutorial: RSP implementations
RDF Stream Processing Tutorial: RSP implementations
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
RSP4J: An API for RDF Stream Processing
RSP4J: An API for RDF Stream ProcessingRSP4J: An API for RDF Stream Processing
RSP4J: An API for RDF Stream Processing
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Giraph++: From "Think Like a Vertex" to "Think Like a Graph"
Giraph++: From "Think Like a Vertex" to "Think Like a Graph"Giraph++: From "Think Like a Vertex" to "Think Like a Graph"
Giraph++: From "Think Like a Vertex" to "Think Like a Graph"
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
 
final_copy_camera_ready_paper (7)
final_copy_camera_ready_paper (7)final_copy_camera_ready_paper (7)
final_copy_camera_ready_paper (7)
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
 
An Introduction to Spark with Scala
An Introduction to Spark with ScalaAn Introduction to Spark with Scala
An Introduction to Spark with Scala
 
Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics?
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
RDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of SemanticsRDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of Semantics
 
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARKBig Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
 
Summary of the Stream Reasoning workshop at ISWC 2016
Summary of the Stream Reasoning workshop at ISWC 2016Summary of the Stream Reasoning workshop at ISWC 2016
Summary of the Stream Reasoning workshop at ISWC 2016
 
AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Intro
 
A Hierarchical approach towards Efficient and Expressive Stream Reasoning
A Hierarchical approach towards Efficient and Expressive Stream ReasoningA Hierarchical approach towards Efficient and Expressive Stream Reasoning
A Hierarchical approach towards Efficient and Expressive Stream Reasoning
 

Similar to Towards efficient processing of RDF data streams

Transient and persistent RDF views over relational databases in the context o...
Transient and persistent RDF views over relational databases in the context o...Transient and persistent RDF views over relational databases in the context o...
Transient and persistent RDF views over relational databases in the context o...
Nikolaos Konstantinou
 
Web Spa
Web SpaWeb Spa
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
Chetan Khatri
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Debraj GuhaThakurta
 
Scientific
Scientific Scientific
Scientific
marpierc
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
Gezim Sejdiu
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
Chetan Khatri
 
Heuristic based query optimisation for rsp(rdf stream processing) engines
Heuristic based query optimisation for rsp(rdf stream processing) enginesHeuristic based query optimisation for rsp(rdf stream processing) engines
Heuristic based query optimisation for rsp(rdf stream processing) engines
William Aruga
 
My thesis
My thesisMy thesis
My thesis
William Aruga
 
Reactive Stream Processing for Data-centric Publish/Subscribe
Reactive Stream Processing for Data-centric Publish/SubscribeReactive Stream Processing for Data-centric Publish/Subscribe
Reactive Stream Processing for Data-centric Publish/Subscribe
Sumant Tambe
 
What we do to improve scalability in our RDF processing system
What we do to improve scalability in our RDF processing systemWhat we do to improve scalability in our RDF processing system
What we do to improve scalability in our RDF processing system
Alejandro Llaves
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
 
Overview of the SPARQL-Generate language and latest developments
Overview of the SPARQL-Generate language and latest developmentsOverview of the SPARQL-Generate language and latest developments
Overview of the SPARQL-Generate language and latest developments
Maxime Lefrançois
 
Database Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wiDatabase Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wi
OllieShoresna
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream Processing
PlanetData Network of Excellence
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
Hektor Jacynycz García
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using spark
Mila, Université de Montréal
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 

Similar to Towards efficient processing of RDF data streams (20)

Transient and persistent RDF views over relational databases in the context o...
Transient and persistent RDF views over relational databases in the context o...Transient and persistent RDF views over relational databases in the context o...
Transient and persistent RDF views over relational databases in the context o...
 
Web Spa
Web SpaWeb Spa
Web Spa
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Scientific
Scientific Scientific
Scientific
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Heuristic based query optimisation for rsp(rdf stream processing) engines
Heuristic based query optimisation for rsp(rdf stream processing) enginesHeuristic based query optimisation for rsp(rdf stream processing) engines
Heuristic based query optimisation for rsp(rdf stream processing) engines
 
My thesis
My thesisMy thesis
My thesis
 
Reactive Stream Processing for Data-centric Publish/Subscribe
Reactive Stream Processing for Data-centric Publish/SubscribeReactive Stream Processing for Data-centric Publish/Subscribe
Reactive Stream Processing for Data-centric Publish/Subscribe
 
What we do to improve scalability in our RDF processing system
What we do to improve scalability in our RDF processing systemWhat we do to improve scalability in our RDF processing system
What we do to improve scalability in our RDF processing system
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Overview of the SPARQL-Generate language and latest developments
Overview of the SPARQL-Generate language and latest developmentsOverview of the SPARQL-Generate language and latest developments
Overview of the SPARQL-Generate language and latest developments
 
Database Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wiDatabase Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wi
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream Processing
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using spark
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 

Recently uploaded

一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Enterprise Wired
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
AnirbanRoy608946
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 

Recently uploaded (20)

一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 

Towards efficient processing of RDF data streams

  • 1. Towards efficient processing of RDF data streams Alejandro Llaves Javier D. Fernández Oscar Corcho Ontology Engineering Group Universidad Politécnica de Madrid Madrid, Spain allaves@fi.upm.es OrdRing workshop - ISWC 2014 Riva del Garda October 20th, 2014
  • 3. Outline  Introduction  Background: Storm and Lambda Architecture  Efficient processing of queries over RDF streams  Architecture overview  Storm-based operators for querying RDF streams  Adaptive query processing for data streams  RDF stream compression  Conclusions & future work  Open questions
  • 4. Introduction  Origins of Linked Stream Data  Extracting information from data streams is complex: heterogeneity, rate of generation, volume, provenance,…  Challenges  C1. Efficient processing of user queries over RDF streams  C2. Continuous transmission of data increases latency  C3. Integration of historical and real-time data with background knowledge Source: http://webnotations.com
  • 5. Background Storm - http://storm.incubator.apache.org/  Distributed system for real-time processing of streams  Why Storm?  Simple processing model (parallelization)  Open source community backing the project  Used by relevant companies, e.g. Twitter. Lambda Architecture (Marz 2013)  Batch layer: stores ALL the incoming data in an immutable master dataset and pre-computes batch views on historic data.  Serving layer: indexes views on the master dataset.  Real-time processing layer: requests data views depending on incoming queries.
  • 6. Efficient processing of RDF streams Goal: to develop a stream processing engine capable of adapting to variable conditions, such as changing rates of input data, failure of processing nodes, or distribution of workload, while serving complex continuous queries. Methodology  State of the art of (RDF) stream processing  Evaluate how to parallelize SPARQLStream queries  Implementation of RDF query operators  Optimize parallelization for common queries  Design self-adaptive strategies that allow the engine to react in front of changes
  • 8. Storm-based operators for querying RDF streams  Triple2Graph operator  Time Window operator  Simple Join operator  Projection operator Simple Join Stream<Set<Tuple>> (join attribute) Stream<Set<Tuple>, Set<Tuple>> Windowing Stream<Set<Graph>> (window size, emission time) Stream<Graph, t> Project Stream<Tuple<o1, o2,..., on>> (input, output) Stream<Tuple<i1, i2,..., in>> Triple2Graph Stream<Graph, t> (graph starter) Stream<s, p, o>
  • 9. Storm-based operators for querying RDF streams Project Project Project Project Storm topology example (4 nodes) SELECT ?obs.value ?sensors.location FROM NAMED STREAM <obs> [60 SEC TO NOW] FROM NAMED STREAM <sensors> [60 SEC TO NOW] WHERE obs.sensorId = sensors.id ; Simple Join Simple Join Simple Join Simple Join Windowing <t0, t2...> Windowing <t1, t3...> Windowing <t0, t2...> Windowing <t1, t3...> SPOUT <obs> Triple2Graph Output SPOUT <sensors> Triple2Graph t0 t1 t2 t3
  • 10. RDF stream compression (1/2) Efficient RDF Interchange (ERI) format  Based on Efficient XML Interchange (EXI)  Main assumption: RDF streams have regular structure and are redundant  Information encoded at 2 levels  Structural dictionary  Presets (values)  Example: SSN observations
  • 11. Where can we apply RDF stream compression?
  • 12. Conclusions and future work Conclusions  We have addressed challenges C1 (scalability) and C2 (transmission)  Catalogue of Storm-based operators to parallelize query processing over RDF streams.  New format for RDF stream compression called ERI.  Challenge C3 (integration) involves storage of historical data and the deployment of batch and serving layers OR the migration to a more general system, e.g. Apache Spark. Future work  Finish the implementation of RDF query operators  Test the parallelization of a set of common queries → SRBench  Adaptive strategies based on Adaptive Query Processing  Evaluation → Benchmarking: comparison to CQELS Cloud  Integration of ERI into our engine
  • 13. Open questions  How does the order of tuple arrival affect the parallelization of join processing tasks?  Are the spatial (or spatio-temporal) properties of a tuple a dimension to have into account for ordering? In such case, what influence does it have on reasoning tasks? And on parallelization tasks?  How does the out-of-order tuples affect the processing of streams? In case of discarding out-of-order tuples, how to communicate this in the results?
  • 14. Thanks! The research leading to this results has received funding from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 257641, PlanetData network of excellence, from Ministerio de Economía y Competitividad (Spain) under the project “4V: Volumen, Velocidad, Variedad y Validez en la Gestión Innovadora de Datos” (TIN2013-46238-C4-2-R), and has been supported by an AWS in Education Research Grant award. Alejandro Llaves allaves@fi.upm.es
  • 15. Adaptive Query Processing (AQP) for data streams  Traditional databases include a query optimizer that designs the execution plan based on the data statistics.  AQP (Deshpande 2007) techniques allow adjusting the query execution plan to varying conditions of the data input, the incoming queries, and the system.
  • 16. RDF stream compression (2/2) Evaluation  Datasets: streaming, statistical, and general static.  Compression ratio, compression time, and parsing throughput (transmission + decompression)  Comparison to other formats, such as N-Triples, Turtle, RDSZ, HDT, with different configurations of ERI w.r.t. transmitted data block (1K – 4K) and the presence of dictionary.  Conclusion: ERI produces state-of-the-art compression for RDF streams and excels for regularly-structured static RDF datasets. ERI compression ratios remain competitive in general datasets and the time overheads for ERI processing are relatively low. http://dataweb.infor.uva.es/wp-content/uploads/2014/07/iswc14.pdf