On the need for a W3C community group on RDF Stream Processing


by Oscar Corcho

@ISWC2013 Workshop on Ordering and Reasoning, Sydney, 22/10/2013

Transcript

  • 1. On the need for a W3C community group on RDF Stream Processing ISWC2013 Workshop on Ordering and Reasoning, Sydney, 22/10/2013 Oscar Corcho ocorcho@fi.upm.es, ocorcho@localidata.com @ocorcho http://www.slideshare.net/ocorcho/
  • 2. Disclaimer… This presentation expresses my view, which is not necessarily that of the rest of the group (although I hope that it is similar)
  • 3. Acknowledgements
      • All those that I have “stolen” slides, material and ideas from
          • Emanuele Della Valle
          • Daniele Dell’Aglio
          • Marco Balduini
          • Jean Paul Calbimonte
          • And many others who have already started contributing…
  • 4. Why set up a community group? Heterogeneity:
      • In RDF Stream models (timestamps, events, time intervals, triple-based, graph-based, …)
      • In RDF Stream query languages (windows, stream selection, CEP-based operators, …)
      • In implementations (RDF native, query rewriting, continuous query registration, scalability, static vs streaming data, …)
      • In operational semantics (tick, window content, report)
  • 5. You may think that we do not like heterogeneity…
  • 6. But at least I love it…
      • However, we need to tell people what to expect from each system, and smooth out differences when they are not crucial…
  • 7. The solution…
      • Let’s create a W3C community group…
          • To better understand those differences
          • The requirements on which we are based
          • And explain to others…
          • And maybe get some “recommendation” out
  • 8. The W3C RDF Stream Processing Comm. Group
      • http://www.w3.org/community/rsp/
  • 9. W3C RSP Community Group mission “The mission of the RDF Stream Processing Community Group (RSP) is to define a common model for producing, transmitting and continuously querying RDF Streams. This includes extensions to both RDF and SPARQL for representing streaming data, as well as their semantics. Moreover this work envisions an ecosystem of streaming and static RDF data sources whose data can be combined through standard models, languages and protocols. Complementary to related work in the area of databases, this Community Group looks at the dynamic properties of graph-based data, i.e., graphs that are produced over time and which may change their shape and data over time.”
  • 10. Use cases
      • We have started collecting them
      • And I hope that by the end of my talk you will consider contributing some more…
  • 11. A template to describe use cases (I)
      • Streaming Information
          • Type: Environmental data: temperatures, pressures, salinity, acidity, fluid velocities, etc.
          • Nature:
              • Relational stream: yes
              • Text stream: no
          • Origin: Data is produced by sensors in oil wells and on oil and gas platform equipment. Each oil platform has an average of 400.000.
          • Frequency of update:
              • From sub-second to minutes
              • In triples/minute: [10000-10] t/min
          • Quality: It varies, due to instrument/sensor issues
          • Management / access
              • Technology in use: Dedicated (relational and proprietary) stores
              • Problems: The ability of users to access data from different sources is limited by an insufficient description of the context
              • Means of improvement: Add context (metadata) to the data so it becomes meaningful, and use reasoning techniques to process that metadata
  • 12. A template to describe use cases (II)
      • [optional] Static Information required to interpret the streaming information
          • Type: Topology of the sensor network, position of each sensor, the descriptions of the oil platform
          • Origin: Oil and gas production operations
          • Dimension:
              • 100s of MB as PostGIS dump
              • In triples: 10^8
          • Quality: Good
          • Management / access
              • Technology in use: RDBMS, proprietary technologies
              • Available Ontologies and Vocabularies: Reference Semantic Model (RSM), based on ISO 15926
  • 13. A tale of four heterogeneities
  • 14. Heterogeneity #1: Representing RDF Streams
  • 15. What is an RDF stream?
      • Several possibilities:
          • An RDF stream is an infinite sequence of timestamped events (triples or graphs), where timestamps are nondecreasing: … <event_i, t_i> <event_(i+1), t_(i+1)> <event_(i+2), t_(i+2)> …
          • An RDF stream is an infinite sequence of triple occurrences <<s,p,o>, t_α, t_ω>, where <s,p,o> is an RDF triple and t_α and t_ω are the start and end of the interval
      • How are timestamps assigned?
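The two stream definitions on this slide can be sketched as plain data types. This is a minimal illustration and not the model of any particular system: the class names, the `is_nondecreasing` helper, and the sample timestamps are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class TimestampedTriple:
    """Point-in-time model: one timestamp per triple, <event_i, t_i>."""
    triple: Triple
    t: float


@dataclass(frozen=True)
class TripleOccurrence:
    """Interval model: <<s,p,o>, t_alpha, t_omega>."""
    triple: Triple
    t_alpha: float  # start of the interval
    t_omega: float  # end of the interval

    def duration(self) -> float:
        return self.t_omega - self.t_alpha


def is_nondecreasing(stream: Iterable[TimestampedTriple]) -> bool:
    """Check the ordering constraint of the first definition."""
    previous = float("-inf")
    for item in stream:
        if item.t < previous:
            return False
        previous = item.t
    return True


# Hypothetical sample data; the timestamps are illustrative only.
point_stream = [
    TimestampedTriple((":alice", ":isWith", ":bob"), 1),
    TimestampedTriple((":alice", ":isWith", ":carl"), 3),
    TimestampedTriple((":bob", ":isWith", ":diana"), 3),
]
meeting = TripleOccurrence((":alice", ":isWith", ":bob"), 1, 3)
```

Note that equal timestamps are allowed under "nondecreasing", so two events at time 3 do not violate the constraint. Only the interval model can answer duration questions directly, via `meeting.duration()`.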
  • 16. Some examples…
      • What would be the best/possible RDF stream representation for the following types of problems?
          • Does Alice meet Bob before Carl?
          • Who does Carl meet first?
          • How many people has Alice met in the last 5m?
          • Does Diana meet Bob and then Carl within 5m?
          • Which are the meetings that last less than 5m?
          • Which are the meetings with conflicts?
      • [Figures: events e1 to e4 carrying the triples :alice :isWith :bob, :alice :isWith :carl, :bob :isWith :diana and :diana :isWith :carl, drawn in different stream representations, one of them on a time axis t with ticks at 1, 3, 6 and 9]
  • 17. Data types for semantic streams - Summary
      • Multiple notions of RDF stream proposed
          • Ordered sequence (implicit timestamp)
          • One timestamp per triple (point-in-time semantics)
          • Two timestamps per triple (interval-based semantics)
      • Comparison between existing approaches

          System | Data item | Time model | # of timestamps
          INSTANS | triple | Implicit | 0
          C-SPARQL | triple | Point in time | 1
          SPARQLstream | triple | Point in time | 1
          CQELS | triple | Point in time | 1
          Sparkwave | triple | Point in time | 1
          Streaming Linked Data | RDF graph | Point in time | 1
          ETALIS | triple | Interval | 2

      • More investigation is required to agree on an RDF stream model
  • 18. Heterogeneity #2: RDF Stream processors
  • 19. Existing RDF Stream Processing systems
      • C-SPARQL: RDF Store + Stream processor
          • Combined architecture
          • [Diagram: C-SPARQL query → translator → RDF Store (static) and Stream processor (streaming) → continuous results]
      • CQELS: Implemented from scratch. Focus on performance
          • Native + adaptive joins for static data and streaming data
          • [Diagram: CQELS query → Native RSP → continuous results]
      • CQELS-Cloud: Reusing Storm
          • Paper presentation on Thursday
          • [Diagram: CQELS query → Storm topology → continuous results]
  • 20. Existing RSP systems
      • EP-SPARQL: Complex-event detection
          • SEQ, EQUALS operators
          • [Diagram: EP-SPARQL query → translator → Prolog engine → continuous results]
      • SPARQLStream: Ontology-based stream query answering
          • Virtual RDF views, using R2RML mappings
          • SPARQL stream queries over the original data streams
          • [Diagram: SPARQLStream query → rewriter (with R2RML mappings) → DSMS/CEP → continuous results]
      • Instans: RETE-based evaluation
  • 21. Query languages for semantic streams - Summary
      • Different architectural choices
          • It is not clear when each choice is best for which type of use case
      • Wrappers over existing systems
          • C-SPARQL, ETALIS, SPARQLstream, CQELS-Cloud
          • Better reliability and maintainability?
      • Native implementations
          • CQELS, Streaming Linked Data, INSTANS
          • Better scalability: optimizations that are not possible in other systems
      • Different operational semantics
          • See later
  • 22. Heterogeneity #3: Querying RDF Streams
  • 23. Querying data streams (from CQL to SPARQL-X)
      • Streams: infinite unbounded bag … <s,τ> …
      • Relations: finite bag; R(t) is a mapping T → R
      • stream-to-relation (S2R): window operators (RDF Streams → RDF)
      • relation-to-relation (R2R): SPARQL operators (RDF → RDF)
      • relation-to-stream (R2S): R2S operators (RDF → RDF Streams)
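The S2R → R2R → R2S pipeline on this slide can be sketched as three small functions. This is a hedged illustration only: the function names, the sample stream, the window bounds `(now - width, now]`, and the evaluation ticks are assumptions made here, and a plain Python function stands in for a SPARQL query.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
StreamItem = Tuple[Triple, int]  # (triple, timestamp in minutes)

# Hypothetical stream; the assignment of triples to timestamps is assumed.
STREAM: List[StreamItem] = [
    ((":alice", ":isWith", ":bob"), 1),
    ((":alice", ":isWith", ":carl"), 3),
    ((":bob", ":isWith", ":diana"), 6),
    ((":diana", ":isWith", ":carl"), 9),
]


def s2r_time_window(stream: List[StreamItem], now: int, width: int) -> Set[Triple]:
    """S2R: turn the unbounded stream into a finite relation for (now - width, now]."""
    return {triple for triple, t in stream if now - width < t <= now}


def r2r_met_by(relation: Set[Triple], person: str) -> Set[str]:
    """R2R: an ordinary query over a finite relation (stand-in for SPARQL)."""
    return {o for s, p, o in relation if s == person and p == ":isWith"}


def r2s_timestamp(answers: Set[str], now: int) -> List[Tuple[str, int]]:
    """R2S: put the answers back on a stream by attaching the evaluation time."""
    return [(answer, now) for answer in sorted(answers)]


def continuous_query(stream, person, width, ticks):
    """Evaluate the same query at each tick over a sliding time window."""
    output = []
    for now in ticks:
        window = s2r_time_window(stream, now, width)
        output.extend(r2s_timestamp(r2r_met_by(window, person), now))
    return output


# "Who has Alice met in the last 5m?", evaluated at minutes 3, 6 and 9.
results = continuous_query(STREAM, ":alice", 5, [3, 6, 9])
```

Only the S2R step knows about time. The R2R step sees a finite set of triples, which is why existing SPARQL operators can be reused there.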
  • 24. Output: relation
      • Case 1: the output is a set of timestamped mappings
      • Queries:
          • SELECT ?a ?b … FROM …. WHERE ….
          • CONSTRUCT {?a :prop ?b } FROM …. WHERE ….
      • [Diagram: queries → RSP → bindings (a … ?b …) and triples (<… :prop …>), each set timestamped at t1, t3, t5 and t7]
  • 25. Output: stream
      • Case 2: the output is a stream
      • R2S operators
          • Query: CONSTRUCT RSTREAM {?a :prop ?b } FROM …. WHERE ….
          • [Diagram: query → RSP → stream … <… :prop …> [t1] <… :prop …> [t1] <… :prop …> [t3] <… :prop …> [t5] <… :prop …> [t7] …]
          • ISTREAM: stream out data in the last step that wasn’t in the previous step
          • DSTREAM: stream out data in the previous step that isn’t in the last step
          • RSTREAM: stream out all data in the last step
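The three R2S operators can be read as set operations over the results of two consecutive evaluation steps. The sketch below follows that reading. The function names and the sample result sets are assumptions for illustration and are not taken from any engine.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def rstream(previous: Set[Triple], current: Set[Triple]) -> Set[Triple]:
    """RSTREAM: everything in the last step."""
    return set(current)


def istream(previous: Set[Triple], current: Set[Triple]) -> Set[Triple]:
    """ISTREAM: what is in the last step but was not in the previous one."""
    return current - previous


def dstream(previous: Set[Triple], current: Set[Triple]) -> Set[Triple]:
    """DSTREAM: what was in the previous step but is not in the last one."""
    return previous - current


# Hypothetical results of two consecutive evaluations of the same query.
step_1 = {(":a", ":prop", ":b"), (":c", ":prop", ":d")}
step_2 = {(":c", ":prop", ":d"), (":e", ":prop", ":f")}

inserted = istream(step_1, step_2)  # {(":e", ":prop", ":f")}
deleted = dstream(step_1, step_2)   # {(":a", ":prop", ":b")}
everything = rstream(step_1, step_2)
```

Seen this way, the "Ins only" and "Batch only" labels used in system comparisons would correspond to supporting only `istream` or only `rstream`, assuming those labels map onto these operators.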
  • 26. Other operators
      • Sequence operators and CEP world
      • [Figure: stream S with events e1 to e4 on a time axis with ticks at 1, 3, 6 and 9, illustrating “Sequence” and “Simultaneous”]
          • SEQ: joins e_(ti,tf) and e’_(ti’,tf’) if e’ occurs after e
          • EQUALS: joins e_(ti,tf) and e’_(ti’,tf’) if they occur simultaneously
          • OPTIONALSEQ, OPTIONALEQUALS: Optional join variants
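A rough sketch of SEQ and EQUALS as joins over interval-stamped events follows. The exact conditions are assumptions made here: "occurs after" is read as "starts strictly after the other event ends", and "simultaneously" as "same start and same end". The `Event` class and the sample intervals are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    name: str
    t_i: float  # start of the event
    t_f: float  # end of the event


def seq(left: List[Event], right: List[Event]) -> List[Tuple[Event, Event]]:
    """SEQ: pair e with e' when e' starts after e has ended (assumed reading)."""
    return [(e, f) for e in left for f in right if e.t_f < f.t_i]


def equals(left: List[Event], right: List[Event]) -> List[Tuple[Event, Event]]:
    """EQUALS: pair e with e' when both cover the same interval (assumed reading)."""
    return [(e, f) for e in left for f in right if e.t_i == f.t_i and e.t_f == f.t_f]


# Hypothetical events; the intervals are illustrative only.
left_events = [Event("e1", 1, 2), Event("e2", 3, 5)]
right_events = [Event("e3", 3, 5), Event("e4", 6, 9)]

seq_pairs = [(e.name, f.name) for e, f in seq(left_events, right_events)]
equals_pairs = [(e.name, f.name) for e, f in equals(left_events, right_events)]
```

With point-in-time timestamps (`t_i == t_f`) the same functions still apply, but questions about overlapping or conflicting events need true intervals.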
  • 27. Query languages for semantic streams - Summary
      • Comparison between existing approaches

          System | S2R | R2R | Time-aware | R2S
          INSTANS | Based on time events | SPARQL update | Based on time events | Ins only
          C-SPARQL Engine | Logical and triple-based | SPARQL 1.1 query | timestamp function | Batch only
          SPARQLstream | Logical and triple-based | SPARQL 1.1 query | no | Ins, batch, del
          CQELS | Logical and triple-based | SPARQL 1.1 query | no | Ins only
          Sparkwave | Logical | SPARQL 1.0 | no | Ins only
          Streaming Linked Data | Logical and graph-based | SPARQL 1.1 | no | Batch only
          ETALIS | no | SPARQL 1.0 | SEQ, PAR, AND, OR, DURING, STARTS, NOT, EQUALS, MEETS, FINISHES | Ins only

      • Is it time to converge on a standard?
  • 28. Query languages for semantic streams - Issues
      • Different syntax for S2R operator
      • Semantics of query languages is similar, but not identical
      • Lack of R2S operator in some cases
      • Different support for time-aware operators
  • 29. Classification of existing systems
  • 30. Heterogeneity #4: Operational Semantics
  • 31. Operational Semantics
      • Where are both alice and bob in the last 5s?
      • [Figure: stream S with items S1 to S4 at times 1, 3, 6 and 9 on a time axis t, carrying the triples :alice :isIn :hall, :bob :isIn :hall, :alice :isIn :kitchen and :bob :isIn :kitchen]
      • System 1: :hall [5], :kitchen [10]
      • System 2: :hall [3], :kitchen [9]
      • Both correct?
      • ISWC 2013 evaluation track for "On Correctness in RDF stream processor benchmarking" by Daniele Dell’Aglio, Jean-Paul Calbimonte, Marco Balduini, Oscar Corcho and Emanuele Della Valle
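One way the two answers on this slide could both arise is from different window and report policies. The sketch below is an assumption-laden illustration, not a description of the systems compared on the slide: the order of the four observations, the tumbling-window policy, and the report-on-change policy are all hypothetical choices that happen to reproduce the two outputs.

```python
from typing import Dict, List, Set, Tuple

Observation = Tuple[str, str, int]  # (person, place, timestamp in seconds)

# Assumed order of the four items S1 to S4 at times 1, 3, 6, 9.
STREAM: List[Observation] = [
    ("alice", "hall", 1),
    ("bob", "hall", 3),
    ("alice", "kitchen", 6),
    ("bob", "kitchen", 9),
]


def shared_places(window: List[Observation]) -> Set[str]:
    """Places where both alice and bob are observed within the window."""
    seen: Dict[str, Set[str]] = {}
    for person, place, _ in window:
        seen.setdefault(place, set()).add(person)
    return {place for place, people in seen.items() if {"alice", "bob"} <= people}


def report_on_window_close(stream, width: int, end: int) -> List[Tuple[str, int]]:
    """Policy A: tumbling windows [close - width, close), reported when they close."""
    results = []
    for close in range(width, end + 1, width):
        window = [o for o in stream if close - width <= o[2] < close]
        results += [(place, close) for place in sorted(shared_places(window))]
    return results


def report_on_content_change(stream, width: int) -> List[Tuple[str, int]]:
    """Policy B: sliding window (t - width, t], re-evaluated on every arrival;
    only answers that were not already present are reported."""
    results = []
    previous: Set[str] = set()
    for _, _, now in stream:
        window = [o for o in stream if now - width < o[2] <= now]
        current = shared_places(window)
        results += [(place, now) for place in sorted(current - previous)]
        previous = current
    return results


policy_a = report_on_window_close(STREAM, 5, 10)
policy_b = report_on_content_change(STREAM, 5)
```

Under these assumptions both policies find the same places but attach different timestamps, so neither output is wrong on its own. They can only be compared once the tick, window content and report choices are made explicit.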
  • 32. Conclusions…
  • 33. Next steps in the community group…
      • Agree on an RDF model?
          • Metamodel?
          • Timestamps in graphs?
          • Timestamp intervals
          • Compatibility with normal (static) RDF
      • Additional operators for SPARQL?
          • Windows (not only time based?)
          • CEP operators
          • Semantics
      • Go Web
          • Volatile URIs
          • Serialization: terse, compact
          • Protocols: HTTP, Websockets?
  • 34. On the need for a W3C community group on RDF Stream Processing