Brief report about the contents of the Stream Reasoning workshop at SIWC 2016. Additional info about the event is available at: http://streamreasoning.org/events/sr2016
Triplewave: a step towards RDF Stream Processing on the Web - Daniele Dell'Aglio
The slides of my talk at the INSIGHT Centre for Data Analytics (NUI Galway), where I presented TripleWave (http://streamreasoning.github.io/TripleWave/), an open-source framework to create and publish streams of RDF data.
Heaven: A Framework for Systematic Comparative Research Approach for RSP Engines - Riccardo Tommasini
Benchmarks like LSBench, SRBench, CSRBench and, more recently, CityBench satisfy the growing need for shared datasets, ontologies and queries to evaluate window-based RDF Stream Processing (RSP) engines. However, no clear winner emerges out of the evaluation. In this paper, we claim that the RSP community needs to adopt a Systematic Comparative Research Approach (SCRA) if it wants to move a step forward. To this end, we propose a framework that enables SCRA for window-based RSP engines. The contributions of this paper are: (i) the requirements to satisfy for tools that aim at enabling SCRA; (ii) the architecture of a facility to design and execute experiments guaranteeing repeatability, reproducibility and comparability; (iii) Heaven, a proof-of-concept implementation of such an architecture that we released as open source; (iv) two RSP engine implementations, also open source, that we propose as baselines for the comparative research (i.e., they can serve as terms of comparison in future works). We prove Heaven's effectiveness using the baselines by: (i) showing that top-down hypothesis verification is not straightforward even in controlled conditions and (ii) providing examples of bottom-up comparative analysis.
A Hierarchical approach towards Efficient and Expressive Stream Reasoning - Riccardo Tommasini
Abstract. Many approaches have been proposed for Stream Reasoning (SR). Some of them combine information flow processing (IFP) techniques and semantic technologies to make sense in real time of noisy, vast and heterogeneous data streams that come from complex domains. More recent works have shown the presence of a trade-off between throughput and reasoning expressiveness. Indeed, systems with IFP-like performance are not really expressive (e.g. up to an RDFS subset) and vice versa. For static data, Information Integration (II) systems have already approached this problem. The idea consists in spreading the reasoning complexity over different layers of a hierarchical architecture and treating it where it is easier to do so. Is it possible to realize expressive and efficient stream reasoning (E2SR) by defining a hierarchical approach that adapts II techniques to the streaming scenario? In this paper, I discuss my plan towards E2SR, the intuition of adapting Information Integration techniques to the streaming scenario, and the need of the Stream Reasoning community for comparative analysis to support its technological progress.
Knowledge Discovery tools using Linked Data techniques - Presentation for the Linked Data 4 Knowledge Discovery Workshop at the ECML/PKDD 2015 conference - http://events.kmi.open.ac.uk/ld4kd2015/
Information-Rich Programming in F# with Semantic Data - Steffen Staab
Programming with rich data frequently implies that one needs to search for, understand, integrate and program with new data - with each of these steps constituting a major obstacle to successful data use.
In this talk we will explain and demonstrate how our approach, LITEQ - Language Integrated Types, Extensions and Queries for RDF Graphs, which is realized as part of the F# / Visual Studio environment, supports the software developer. Using the extended IDE the developer may now:
a. explore new, previously unseen data sources, which are either natively in RDF or mapped into RDF;
b. use the exploration of schemata and data in order to construct types and objects in the F# environment;
c. automatically map between data and programming language objects in order to make them persistent in the data source;
d. have extended typing functionality added to the F# environment, resulting from the exploration of the data source and its mapping into F#.
Core to this approach is the novel node path query language, NPQL, which allows for interactive, intuitive exploration of data schemata and data proper, as well as for the mapping and definition of types, object collections and individual objects. Beyond the existing type provider mechanism for F#, our approach also allows for property-based navigation and runtime querying for data objects.
Save queries as annotations: a method for the digital preservation of queries on a Hebrew Text database with linguistic information in it. These queries form the data for interpretations by biblical scholars. Sharing those queries as Open Annotations enables researchers to communicate their (intermediate) results.
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX - rhatr
Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.
Consider social graphs, or relationships of customers to each other and products they purchase, as two of the most common examples. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing will increase dramatically, and the kind of algorithms you would typically want to run would change as well.
While your Hadoop batch-oriented approach with MapReduce may work reasonably well, for scalable graph processing you have to embrace an in-memory, explorative, and iterative approach. One of the best ways to tame this complexity is known as the Bulk Synchronous Parallel (BSP) approach. Its two most widely used implementations are available as Hadoop ecosystem projects: Apache Giraph (used at Facebook) and GraphX (part of the Apache Spark project).
In this talk we will focus on practical advice: how to get up and running with Apache Giraph and GraphX, how to start analyzing simple datasets with built-in algorithms, and finally how to implement your own graph processing applications using the APIs provided by the projects. We will then compare and contrast the two, and try to lay out some principles of when to use one vs. the other.
Introduction to SparkR with RStudio - Mr. Pragith (Sigmoid)
R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. The RStudio IDE is a powerful and productive user interface for R. It's free and open source, and available on Windows, Mac, and Linux.
The aim of the EU FP7 Large-Scale Integrating Project LarKC is to develop the Large Knowledge Collider (LarKC, for short, pronounced “lark”), a platform for massive distributed incomplete reasoning that will remove the scalability barriers of currently existing reasoning systems for the Semantic Web. The LarKC platform is available at larkc.sourceforge.net. This talk is part of a tutorial for early users of the LarKC platform, and describes the data model used within LarKC.
First impressions of SparkR: our own machine learning algorithm - InfoFarm
In June 2015, SparkR was first integrated into Apache Spark. At InfoFarm we strive to stay on top of new technologies, hence we have tried it out and implemented a few machine learning algorithms as well.
LDQL: A Query Language for the Web of Linked Data - Olaf Hartig
I used this slideset to present our research paper at the 14th Int. Semantic Web Conference (ISWC 2015). Find a preprint of the paper here:
http://olafhartig.de/files/HartigPerez_ISWC2015_Preprint.pdf
This presentation will describe how to go beyond a "Hello world" stream application and build a real-time data-driven product. We will present architectural patterns, go through tradeoffs and considerations when deciding on technology and implementation strategy, and describe how to put the pieces together. We will also cover necessary practical pieces for building real products: testing streaming applications, and how to evolve products over time.
Presented at highloadstrategy.com 2016 by Øyvind Løkling (Schibsted Products & Technology), joint work with Lars Albertsson (independent, www.mapflat.com).
Presentation given* at the 13th International Semantic Web Conference (ISWC), in which we present a compressed format to represent RDF data streams. See the original article at: http://dataweb.infor.uva.es/wp-content/uploads/2014/07/iswc14.pdf
* Presented by Alejandro Llaves (http://www.slideshare.net/allaves)
Overview of the SPARQL-Generate language and latest developments - Maxime Lefrançois
SPARQL-Generate is an extension of SPARQL 1.1 for querying not only RDF datasets but also documents in arbitrary formats. The solution bindings can then be used to output RDF (SPARQL-Generate) or text (SPARQL-Template).
Anyone familiar with SPARQL can easily learn SPARQL-Generate; learning SPARQL-Generate helps you learn SPARQL.
The open-source implementation (Apache 2 license) is based on Apache Jena and can be used to execute transformations from a combination of RDF and any kind of documents in XML, JSON, CSV, HTML, GeoJSON, CBOR, or streams of messages using WebSocket or MQTT (easily extensible). A hedged example query is sketched right after the list of recent changes below.
Recent extensions and improvements include:
- heavy refactoring to support parallelization
- more expressive iterators and functions
- simple generation of RDF lists
- support for aggregates
- generation of HDT (thanks to Ana for the use case)
- partial implementation of STTL for the generation of text (https://ns.inria.fr/sparql-template/)
- partial implementation of LDScript (http://ns.inria.fr/sparql-extension/)
- integration of all these types of rules to decouple or compose queries, e.g.:
  - call a SPARQL-Generate query in the SPARQL FROM clause
  - plug a SPARQL-Generate or a SPARQL-Template query to the output of a SPARQL-Select function
- a Sublime Text package for local development
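For readers who have never seen the language, here is a minimal, hypothetical SPARQL-Generate query; the source URL, JSONPath expressions and ex: vocabulary are invented for illustration, while iter:JSONPath and fun:JSONPath are actual SPARQL-Generate functions. It iterates over a JSON document and emits one triple per reading:

```sparql
PREFIX iter: <http://w3id.org/sparql-generate/iter/>
PREFIX fun:  <http://w3id.org/sparql-generate/fn/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX ex:   <http://example.org/ns#>   # invented vocabulary

GENERATE {
  ?sensor ex:temperature ?temp .        # one output triple per reading
}
SOURCE <http://example.org/readings.json> AS ?source   # invented source document
ITERATOR iter:JSONPath(?source, "$.readings[*]") AS ?reading
WHERE {
  BIND (fun:JSONPath(?reading, "$.id") AS ?id)
  BIND (IRI(CONCAT("http://example.org/sensor/", ?id)) AS ?sensor)
  BIND (xsd:decimal(fun:JSONPath(?reading, "$.temp")) AS ?temp)
}
```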
The millions of people that use Spotify each day generate a lot of data, roughly a few terabytes per day. What does it take to handle datasets of that scale, and what can be done with it? I will briefly cover how Spotify uses data to provide a better music listening experience, and to strengthen their business. Most of the talk will be spent on our data processing architecture, and how we leverage state-of-the-art data processing and storage tools, such as Hadoop, Cassandra, Kafka, Storm, Hive, and Crunch. Lastly, I'll present observations and thoughts on innovation in the data processing (aka Big Data) field.
Presentation on RDF Stream Processing models given at the SR4LD tutorial (ISWC 2013) -- updated version at: http://www.slideshare.net/dellaglio/rsp2014-01rspmodelsss
Unified Big Data Processing with Apache Spark - C4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
The aim of the EU FP7 Large-Scale Integrating Project LarKC is to develop the Large Knowledge Collider (LarKC, for short, pronounced “lark”), a platform for massive distributed incomplete reasoning that will remove the scalability barriers of currently existing reasoning systems for the Semantic Web. The LarKC platform is available at larkc.sourceforge.net. This talk is part of a tutorial for early users of the LarKC platform, and introduces the platform and the project in general.
Towards efficient processing of RDF data streams - Alejandro Llaves
Presentation of a short paper submitted to the OrdRing workshop, held at ISWC 2014 - http://streamreasoning.org/events/ordring2014.
In recent years, there has been an increase in the amount of real-time data generated. Sensors attached to things are transforming how we interact with our environment. Extracting meaningful information from these streams of data is essential for some application areas and requires processing systems that scale to varying conditions in data sources, complex queries, and system failures. This paper describes ongoing research on the development of a scalable RDF streaming engine.
Towards efficient processing of RDF data streams - Alejandro Llaves
In recent years, there has been an increase in the amount of real-time data generated. Sensors attached to things are transforming how we interact with our environment. Extracting meaningful information from these streams of data is essential for some application areas and requires processing systems that scale to varying conditions in data sources, complex queries, and system failures. This paper describes ongoing research on the development of a scalable RDF streaming engine.
Presented at OrdRing workshop, International Semantic Web Conference 2014.
http://streamreasoning.org/events/ordring2014
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo... - Guido Schmutz
Independent of the source of data, the integration and analysis of event streams is becoming more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably; they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. In this session we compare two popular streaming analytics solutions: Spark Streaming and Kafka Streams.
Spark is a fast and general engine for large-scale data processing that has been designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala.
Kafka Streams is the stream processing solution that is part of Kafka. It is provided as a Java library and can therefore be easily integrated with any Java application.
Streaming Day: an overview of Stream Reasoning
Logical reasoning in real time on multiple, heterogeneous, gigantic and inevitably noisy data streams in order to support the decision process of extremely large numbers of concurrent users.
-- S. Ceri, E. Della Valle, F. van Harmelen and H. Stuckenschmidt, 2010
An exploration of a possible pipeline for RDF datasets from Timbuctoo instances to the digital archive EASY.
- Get, verify, ingest, archive and disseminate (linked) data and metadata.
- What are the implications for an archive: serving linked data over (longer periods of) time
- Practical stuff.
Mining and Managing Large-scale Linked Open Data - MOVING Project
Linked Open Data (LOD) is about publishing and interlinking data of different origin and purpose on the web. The Resource Description Framework (RDF) is used to describe data on the LOD cloud. In contrast to relational databases, RDF does not provide a fixed, pre-defined schema. Rather, RDF allows for flexibly modeling the data schema by attaching RDF types and properties to the entities. Our schema-level index called SchemEX allows for searching in large-scale RDF graph data. The index can be efficiently computed with reasonable accuracy over large-scale data sets with billions of RDF triples, the smallest information unit on the LOD cloud. SchemEX is highly needed as the size of the LOD cloud quickly increases. Due to the evolution of the LOD cloud, one observes frequent changes of the data. We show that also the data schema changes in terms of combinations of RDF types and properties. As changes cannot capture the dynamics of the LOD cloud, current work includes temporal clustering and finding periodicities in entity dynamics over large-scale snapshots of the LOD cloud with about 100 million triples per week for more than three years.
Mining and Managing Large-scale Linked Open Data - Ansgar Scherp
Linked Open Data (LOD) is about publishing and interlinking data of different origin and purpose on the web. The Resource Description Framework (RDF) is used to describe data on the LOD cloud. In contrast to relational databases, RDF does not provide a fixed, pre-defined schema. Rather, RDF allows for flexibly modeling the data schema by attaching RDF types and properties to the entities. Our schema-level index called SchemEX allows for searching in large-scale RDF graph data. The index can be efficiently computed with reasonable accuracy over large-scale data sets with billions of RDF triples, the smallest information unit on the LOD cloud. SchemEX is highly needed as the size of the LOD cloud quickly increases. Due to the evolution of the LOD cloud, one observes frequent changes of the data. We show that also the data schema changes in terms of combinations of RDF types and properties. As changes cannot capture the dynamics of the LOD cloud, current work includes temporal clustering and finding periodicities in entity dynamics over large-scale snapshots of the LOD cloud with about 100 million triples per week for more than three years.
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach... - NETWAYS
How to store billions of time series points and access them within a few milliseconds? Chronix!
Chronix is a young but mature open source project that allows one, for example, to store about 15 GB (CSV) of time series in 238 MB, with average query times of 21 ms. Chronix is built on top of Apache Solr, a bulletproof, distributed NoSQL database with impressive search capabilities. In this code-intense session we show how Chronix achieves its efficiency in both respects: by means of ideal chunking, by selecting the best compression technique, by enhancing the stored data with (pre-computed) attributes, and by specialized query functions.
Similar to OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Processing (20)
Organisational Interoperability in Practice at Universidad Politécnica de Madrid - Oscar Corcho
Presentation on the EOSC Interoperability Framework in relation to Organisational Interoperability, and how it can be applied to a Research Performing Organisation such as UPM.
Open Data (and Software, and other Research Artefacts) - A proper management - Oscar Corcho
Presentation at the event "Let's do it together: How to implement Open Science Practices in Research Projects" (29/11/2019), organised by Universidad Politécnica de Madrid, where we discuss the need to take into account not only open access and open research data, but also all the other artefacts that result from our research processes.
Adiós a los ficheros, hola a los grafos de conocimientos estadísticos - Oscar Corcho
This presentation was given in the context of the Conference on dissemination, accessibility and reuse of official statistics and cartography (http://www.juntadeandalucia.es/institutodeestadisticaycartografia/blog/2019/11/jornada-plan/), organised by the Institute of Statistics and Cartography of Andalusia.
Ontology Engineering at Scale for Open City Data Sharing - Oscar Corcho
Seminar at the School of Informatics, The University of Edinburgh.
In this talk we will present how we are applying ontology engineering principles and tools to the development of a set of shared vocabularies across municipalities in Spain, so that they can start homogenising the generation and publication of open data that may be useful for their own internal reuse as well as for third parties who want to develop applications reusing open data once and deploy them for all municipalities. We will discuss the main challenges for ontology engineering that arise in this setting, as well as present the work that we have done to integrate ontology development tools into the common software development infrastructure used by those who are not experts in Ontology Engineering.
Situación de las iniciativas de Open Data internacionales (y algunas recomen... - Oscar Corcho
Presentation on international and national Open Data initiatives, given in the context of the Universidad de Extremadura summer course "BigData y Machine Learning junto a fuentes de datos abiertos para especializar el sector agroganadero", on 25/09/2018.
General presentation, in Spanish, on light pollution from the STARS4ALL project (www.stars4all.eu). Produced by the project consortium, with special thanks to Lucía García (@shekda) for creating the first version in English, and to Miquel Serra-Ricart for the initial translation.
Towards Reproducible Science: a few building blocks from my personal experience - Oscar Corcho
Invited keynote given at the Second International Workshop on Semantics for BioDiversity (http://fusion.cs.uni-jena.de/s4biodiv2017/), held in conjunction with ISWC2017 (https://iswc2017.semanticweb.org/)
Publishing Linked Statistical Data: Aragón, a case study - Oscar Corcho
Presentation at the Semstats2017 workshop (http://semstats.org/2017/) for the paper "Publishing Linked Statistical Data: Aragón, a Case Study", by Oscar Corcho, Idafen Santana-Pérez, Hugo Lafuente, David Portolés, César Cano, Alfredo Peris, José María Subero.
An initial analysis of topic-based similarity among scientific documents base... - Oscar Corcho
Presentation given at the SemSci2017 workshop (https://semsci.github.io/semSci2017/), for the paper "An Initial Analysis of Topic-based Similarity among Scientific Documents Based on their Rhetorical Discourse Parts" http://ceur-ws.org/Vol-1931/paper-03.pdf
Introductory talk on the usage of Linked Data for official statistics, given at the ESS (Linked) Open Data Workshop 2017, in Malta, January 2017.
In this introductory talk we will discuss the main foundations for the application of Linked Data principles into official statistics. We will briefly introduce what Linked Data is, as well as the main principles, languages and technologies behind it (URIs, RDF, SPARQL). We will also discuss about the different formats in which data can be made available on the Web (e.g., RDF Turtle, JSON-LD, CSV on the Web). We will then move into providing a detailed presentation, with step by step examples based on existing Linked Statistical Data sources, of the W3C recommendation RDF DataCube, which is the basis for the dissemination of statistical data as Linked Data. Finally, we will provide some examples of applications, and the opportunities that this approach offers for the development of the proofs of concepts selected by Eurostat and to be discussed during the meeting.
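As a flavour of what this looks like in practice, here is a minimal, hedged SPARQL sketch over a hypothetical RDF Data Cube dataset; the qb: namespace is the real W3C Data Cube vocabulary, while the ex: dataset, dimension and measure IRIs are invented for illustration:

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX ex: <http://example.org/statistics#>   # invented vocabulary

# Population per reference area from a hypothetical Data Cube dataset
SELECT ?area ?population
WHERE {
  ?obs a qb:Observation ;
       qb:dataSet ex:populationDataset ;   # invented dataset IRI
       ex:refArea ?area ;                  # invented dimension property
       ex:population ?population .         # invented measure property
}
```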
Aplicando los principios de Linked Data en AEMET - Oscar Corcho
Presentation given in one of the panels of the open data conference organised by AEMET on 13 December 2016, about applying Linked Data principles to the AEMET REST API.
Ojo Al Data 100 - Call for sharing session at IODC 2016 - Oscar Corcho
This is the presentation of the #ojoaldata100 initiative (http://ojoaldata100.okfn.es) for the selection of 100 datasets that every city should be publishing in their open data portal. This presentation was used in a call for sharing session at the 4th International Open Data Conference (IODC2016).
Educando sobre datos abiertos: desde el colegio a la universidad - Oscar Corcho
Presentation given at panel 3 of the Aporta 2016 event, one of the pre-events of the open data week in Madrid. Given on 3 October 2016.
http://datos.gob.es/encuentro-aporta?q=node/654503
Generación de datos estadísticos enlazados del Instituto Aragonés de Estadística - Oscar Corcho
In this presentation we show the work carried out to generate and publish linked data from the local statistics data of the Aragonese Institute of Statistics (Instituto Aragonés de Estadística).
Presentación de la red de excelencia de Open Data y Smart Cities - Oscar Corcho
General presentation of the network of excellence on Open Data and Smart Cities (http://www.opencitydata.es), given at Medialab-Prado on 18 February 2016.
Why do they call it Linked Data when they want to say...? - Oscar Corcho
The four Linked Data publishing principles established in 2006 seem to be quite clear and well understood by people inside and outside the core Linked Data and Semantic Web community. However, not only when discussing with outsiders about the goodness of Linked Data but also when reviewing papers for the COLD workshop series, I find myself, on many occasions, going back to the principles in order to see whether some approach for Web data publication and consumption is actually Linked Data or not. In this talk we will review some of the current approaches that we have for publishing data on the Web, and we will reflect on why it is sometimes so difficult to reach an agreement on what we understand by Linked Data. Furthermore, we will take the opportunity to describe yet another approach that we have been working on recently at the Center for Open Middleware, a joint technology center between Banco Santander and Universidad Politécnica de Madrid, in order to facilitate Linked Data consumption.
Linked Statistical Data: does it actually pay off? - Oscar Corcho
Invited keynote at the ISWC2015 Workshop on Semantics and Statistics (SemStats 2015). http://semstats.github.io/2015/
The release of the W3C RDF Data Cube recommendation was a significant milestone towards improving the maturity of the area of Linked Statistical Data. Many Data Cube-based datasets have been released since then, and tools for the generation and exploitation of such datasets have also appeared. While the benefits of using RDF Data Cube and generating Linked Data in this area seem clear, there are still many challenges associated with the generation and exploitation of such data. In this talk we will reflect on them, based on our experience in generating and exploiting this type of data, and hopefully provoke some discussion about what the next steps should be.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! - SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
How to Get CNIC Information System with Paksim Ga.pptx - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect personal devices and information.
A tale of scale & speed: How the US Navy is enabling software delivery from l... - sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI - Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
GridMate - End to end testing is a critical piece to ensure quality and avoid... - ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Processing
1. On the need for a W3C community group on RDF Stream Processing
ISWC2013 Workshop on Ordering and Reasoning, Sydney, 22/10/2013
Oscar Corcho
ocorcho@fi.upm.es, ocorcho@localidata.com
@ocorcho
http://www.slideshare.net/ocorcho/
2. Disclaimer…
This presentation expresses my view but not necessarily the one from the rest of the group (although I hope that it is similar)
3. Acknowledgements
• All those that I have “stolen” slides, material and ideas from:
  • Emanuele Della Valle
  • Daniele Dell’Aglio
  • Marco Balduini
  • Jean Paul Calbimonte
  • And many others who have already started contributing…
4. Why setting up a community group?
Heterogeneity:
• In RDF Stream models (timestamps, events, time intervals, triple-based, graph-based, …)
• In RDF Stream query languages (windows, stream selection, CEP-based operators, …)
• In implementations (RDF native, query rewriting, continuous query registration, scalability, static vs streaming data, …)
• In operational semantics (tick, window content, report)
5. You may think that we do not like heterogeneity…
6. But at least I love it…
• However, we need to tell people what to expect with each system, and smooth differences when they are not crucial…
7. The solution…
• Let’s create a W3C community group…
  • To understand better those differences
  • The requirements on which we are based
  • And explain to others
  • …
  • And maybe get some “recommendation” out
8. The W3C RDF Stream Processing Comm. Group
• http://www.w3.org/community/rsp/
9. W3C RSP Community Group mission
“The mission of the RDF Stream Processing Community Group (RSP) is to define a common model for producing, transmitting and continuously querying RDF Streams. This includes extensions to both RDF and SPARQL for representing streaming data, as well as their semantics. Moreover this work envisions an ecosystem of streaming and static RDF data sources whose data can be combined through standard models, languages and protocols. Complementary to related work in the area of databases, this Community Group looks at the dynamic properties of graph-based data, i.e., graphs that are produced over time and which may change their shape and data over time.”
10. Use cases
• We have started collecting them
• And I hope that by the end of my talk you will consider contributing some more…
11. A template to describe use cases (I)
• Streaming Information
  • Type: Environmental data: temperatures, pressures, salinity, acidity, fluid velocities, etc.
  • Nature:
    • Relational stream: yes
    • Text stream: no
  • Origin: Data is produced by sensors in oil wells and on oil and gas platform equipment. Each oil platform has an average of 400,000.
  • Frequency of update:
    • from sub-second to minutes
    • In triples/minute: [10000-10] t/min
  • Quality: It varies, due to instrument/sensor issues
• Management / access
  • Technology in use: Dedicated (relational and proprietary) stores
  • Problems: The ability of users to access data from different sources is limited by an insufficient description of the context
  • Means of improvement: Add context (metadata) to the data so it becomes meaningful, and use reasoning techniques to process that metadata
12. A template to describe use cases (II)
• [optional] Static Information required to interpret the streaming information
  • Type: Topology of the sensor network, position of each sensor, the descriptions of the oil platform
  • Origin: Oil and gas production operations
  • Dimension:
    • 100s of MB as PostGIS dump
    • In triples: 10^8
  • Quality: Good
  • Management / access
    • Technology in use: RDBMS, proprietary technologies
    • Available Ontologies and Vocabularies: Reference Semantic Model (RSM), based on ISO 15926
13. A tale of four heterogeneities
ISWC2013 Workshop on Ordering and Reasoning, Sydney, 22/10/2013
Oscar Corcho
ocorcho@fi.upm.es, ocorcho@localidata.com
@ocorcho
http://www.slideshare.net/ocorcho/
15. What is an RDF stream?
• Several possibilities:
  • An RDF stream is an infinite sequence of timestamped events (triples or graphs), where timestamps are non-decreasing:
    … <event_i, t_i> <event_i+1, t_i+1> <event_i+2, t_i+2> …
  • An RDF stream is an infinite sequence of triple occurrences <<s,p,o>, tα, tω>, where <s,p,o> is an RDF triple and tα and tω are the start and end of the interval
• How are timestamps assigned?
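A possible formalisation of the two notions sketched on this slide (the notation is mine, not from the slides):

```latex
% Point-based: a stream S is an unbounded sequence of timestamped events,
% with non-decreasing timestamps
S = \langle (e_1, t_1), (e_2, t_2), \ldots \rangle, \qquad t_i \le t_{i+1}

% Interval-based: each triple occurrence carries a validity interval
S = \langle (\langle s_i, p_i, o_i \rangle,\, t_{\alpha_i},\, t_{\omega_i}) \rangle_{i \ge 1},
    \qquad t_{\alpha_i} \le t_{\omega_i}
```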
16. Some examples…
• What would be the best/possible RDF stream representation for the following types of problems?
• [Figure: a stream on timeline t with events e1–e4 (:alice :isWith :bob, :alice :isWith :carl, :bob :isWith :diana, :diana :isWith :carl) at times 1, 3, 6 and 9]
• Does Alice meet Bob before Carl?
• Who does Carl meet first?
• How many people has Alice met in the last 5m?
• Does Diana meet Bob and then Carl within 5m?
• Which are the meetings that last less than 5m?
• Which are the meetings with conflicts?
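One of these questions can already be phrased in existing languages; for instance, a hedged C-SPARQL-style sketch (invented stream IRI and vocabulary) for “How many people has Alice met in the last 5m?”:

```sparql
REGISTER QUERY AliceMeetings AS
PREFIX : <http://example.org/ns#>   # invented vocabulary
SELECT (COUNT(DISTINCT ?person) AS ?met)
FROM STREAM <http://example.org/meetings> [RANGE 5m STEP 1m]   # invented stream IRI
WHERE {
  # triples observed in the last 5 minutes of the stream
  :alice :isWith ?person .
}
```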
17. Data types for semantic streams - Summary
• Multiple notions of RDF stream proposed:
  • Ordered sequence (implicit timestamp)
  • One timestamp per triple (point-in-time semantics)
  • Two timestamps per triple (interval-based semantics)
• Comparison between existing approaches:

System                | Data item | Time model    | # of timestamps
INSTANS               | triple    | Implicit      | 0
C-SPARQL              | triple    | Point in time | 1
SPARQLstream          | triple    | Point in time | 1
CQELS                 | triple    | Point in time | 1
Sparkwave             | triple    | Point in time | 1
Streaming Linked Data | RDF graph | Point in time | 1
ETALIS                | triple    | Interval      | 2

• More investigation is required to agree on an RDF stream model
19. Existing RDF Stream Processing systems
• C-SPARQL: RDF Store + Stream processor
  • Combined architecture: the C-SPARQL query is translated; static data is handled by an RDF store and streaming data by a stream processor, which together produce continuous results
• CQELS: Implemented from scratch. Focus on performance
  • Native RSP engine with adaptive joins for static data and streaming data; the CQELS query directly produces continuous results
• CQELS-Cloud: Reusing Storm
  • The CQELS query is compiled into a Storm topology that produces continuous results
  • Paper presentation on Thursday
20. Existing RSP systems
• EP-SPARQL: Complex-event detection
  • SEQ, EQUALS operators
  • The EP-SPARQL query is translated for a Prolog engine, which produces continuous results
• SPARQLStream: Ontology-based stream query answering
  • Virtual RDF views, using R2RML mappings
  • SPARQL stream queries over the original data streams: the query is rewritten for a DSMS/CEP using the R2RML mappings, producing continuous results
• Instans: RETE-based evaluation
21. Query languages for semantic streams - Summary
• Different architectural choices
  • It is not clear when each choice is best for which type of use case
• Wrappers over existing systems
  • C-SPARQL, ETALIS, SPARQLstream, CQELS-Cloud
  • Better reliability and maintainability?
• Native implementations
  • CQELS, Streaming Linked Data, INSTANS
  • Better scalability: optimizations that are not possible in other systems
• Different operational semantics
  • See later
23. Querying data streams (from CQL to SPARQL-X)
• [Figure: the CQL operator classes. A stream is an infinite, unbounded bag of timestamped elements … <s,τ> …; a relation R(t) is a finite bag given by a mapping T → R. Stream-to-relation (S2R) operators, i.e. windows, turn streams into relations; relation-to-relation (R2R) operators transform relations into relations; relation-to-stream (R2S) operators turn relations back into a stream <s1> <s2> <s3>]
• In the RDF setting: S2R window operators over RDF streams, SPARQL operators as R2R, and R2S operators producing RDF
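To make the three operator families concrete, a hedged C-SPARQL-like sketch (invented stream IRI and vocabulary) with each stage labelled:

```sparql
PREFIX : <http://example.org/ns#>   # invented vocabulary
CONSTRUCT RSTREAM { ?room :occupiedBy ?person }                  # R2S: stream out all answers at each step
FROM STREAM <http://example.org/presence> [RANGE 10s STEP 1s]    # S2R: sliding window over the stream
WHERE {
  ?person :isIn ?room .   # R2R: ordinary SPARQL evaluation over the window content
}
```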
24. Output: relation
• Case 1: the output is a set of timestamped mappings
  • SELECT queries (SELECT ?a ?b … FROM … WHERE …) make the RSP engine produce bindings: ?a … ?b … [t1], ?a … ?b … [t3], ?a … ?b … [t5], ?a … ?b … [t7], …
  • CONSTRUCT queries (CONSTRUCT { ?a :prop ?b } FROM … WHERE …) make it produce triples: <… :prop …> [t1], <… :prop …> [t3], <… :prop …> [t5], <… :prop …> [t7], …
25. Output: stream
• Case 2: the output is a stream, obtained through R2S operators
  • e.g. CONSTRUCT RSTREAM { ?a :prop ?b } FROM … WHERE … makes the RSP engine produce the stream … <… :prop …> [t1], <… :prop …> [t1], <… :prop …> [t3], <… :prop …> [t5], <… :prop …> [t7] …
• ISTREAM: stream out data in the last step that wasn’t on the previous step
• DSTREAM: stream out data in the previous step that isn’t in the last step
• RSTREAM: stream out all data in the last step
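A tiny worked example of the three R2S operators on made-up window contents (shown entirely as comments):

```sparql
# Window content at step t1: { A, B }
# Window content at step t2: { B, C }
#
# ISTREAM at t2: { C }      -- new in the last step, absent in the previous one
# DSTREAM at t2: { A }      -- present in the previous step, gone in the last one
# RSTREAM at t2: { B, C }   -- everything in the last step
```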
26. Other operators
• Sequence operators and the CEP world
• [Figure: a stream S with events e1, e2, e3, e4 at times 1, 3, 6, 9, illustrating sequential and simultaneous occurrence]
• SEQ: joins e(ti,tf) and e'(ti',tf') if e' occurs after e
• EQUALS: joins e(ti,tf) and e'(ti',tf') if they occur simultaneously
• OPTIONALSEQ, OPTIONALEQUALS: optional join variants
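To make the sequence operators concrete, a hedged EP-SPARQL-style sketch (the vocabulary is invented and the exact grammar varies across ETALIS/EP-SPARQL versions) that detects Diana meeting Bob and then Carl:

```sparql
PREFIX : <http://example.org/ns#>   # invented vocabulary
SELECT *
WHERE {
  { :diana :isWith :bob }    # earlier event
  SEQ
  { :diana :isWith :carl }   # joined only if it occurs after the first event
}
```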
27. Query languages for semantic streams - Summary
• Comparison between existing approaches:

System                | S2R                      | R2R              | Time-aware                     | R2S
INSTANS               | Based on time events     | SPARQL update    | Based on time events           | Ins only
C-SPARQL Engine       | Logical and triple-based | SPARQL 1.1 query | timestamp function             | Batch only
SPARQLstream          | Logical and triple-based | SPARQL 1.1 query | no                             | Ins, batch, del
CQELS                 | Logical and triple-based | SPARQL 1.1 query | no                             | Ins only
Sparkwave             | Logical                  | SPARQL 1.0       | no                             | Ins only
Streaming Linked Data | Logical and graph-based  | SPARQL 1.1       | no                             | Batch only
ETALIS                | no                       | SPARQL 1.0       | SEQ, PAR, AND, OR, DURING,     | Ins only
                      |                          |                  | STARTS, NOT, EQUALS, MEETS,    |
                      |                          |                  | FINISHES                       |

• Is it time to converge on a standard?
28. Query languages for semantic streams - Issues
• Different syntax for the S2R operator
• Semantics of query languages is similar, but not identical
• Lack of an R2S operator in some cases
• Different support for time-aware operators
31. Operational Semantics
• Where are both alice and bob in the last 5s?
• [Figure: a stream S with four elements S1–S4 at times 1, 3, 6, 9 on timeline t: :alice :isIn :hall, :bob :isIn :hall, :alice :isIn :kitchen, :bob :isIn :kitchen]
• System 1 answers: :hall [5], :kitchen [10]
• System 2 answers: :hall [3], :kitchen [9]
• Both correct?
• See “On Correctness in RDF stream processor benchmarking” by Daniele Dell’Aglio, Jean-Paul Calbimonte, Marco Balduini, Oscar Corcho and Emanuele Della Valle, in the ISWC 2013 evaluation track
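The query on this slide could be written roughly as follows (a hedged C-SPARQL-style sketch with an invented stream IRI and vocabulary; different engines would report the answers at different instants, which is exactly the point of the slide):

```sparql
PREFIX : <http://example.org/ns#>   # invented vocabulary
SELECT ?room
FROM STREAM <http://example.org/locations> [RANGE 5s STEP 1s]   # invented stream IRI
WHERE {
  # both alice and bob observed in the same room within the last 5 seconds
  :alice :isIn ?room .
  :bob   :isIn ?room .
}
```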
33. Next steps in the community group…
• Agree on an RDF model?
  • Metamodel?
  • Timestamps in graphs?
  • Timestamp intervals
  • Compatibility with normal (static) RDF
• Additional operators for SPARQL?
  • Windows (not only time-based?)
  • CEP operators
  • Semantics
• Go Web
  • Volatile URIs
  • Serialization: terse, compact
  • Protocols: HTTP, WebSockets?
34. On the need for a W3C community group on RDF Stream Processing
ISWC2013 Workshop on Ordering and Reasoning, Sydney, 22/10/2013
Oscar Corcho
ocorcho@fi.upm.es, ocorcho@localidata.com
@ocorcho
http://www.slideshare.net/ocorcho/