Polar Domain Discovery with Sparkler - EarthCube (Karanjeet Singh)
Polar Deep Insights with Domain Discovery and Sparkler (Spark Crawler). Presented at EarthCube All Hands Meeting 2017! #ECAHM2017 #USCDataScience #IRDS
This document discusses RDF stream processing and the role of semantics. It begins by outlining common sources of streaming data on the internet of things. It then discusses challenges of querying streaming data and existing approaches like CQL. Existing RDF stream processing systems are classified based on their query capabilities and use of time windows and reasoning. The role of linked data principles and HTTP URIs for representing streaming sensor data is discussed. Finally, requirements for reactive stream processing systems are outlined, including keeping data moving, integrating stored and streaming data, and responding instantaneously. The document argues that building relevant RDF stream processing systems requires going beyond existing requirements to address data heterogeneity, stream reasoning, and optimization.
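Window-based continuous querying, which the summary attributes to CQL-style approaches, can be illustrated without any RSP engine. Below is a minimal Python sketch of a time-based sliding window over timestamped triples; the triple layout and window length are illustrative assumptions, not details from the document.

```python
from collections import deque
import time

class SlidingWindow:
    """Keep only the triples that arrived within the last `width` seconds."""

    def __init__(self, width_seconds: float):
        self.width = width_seconds
        self.buffer = deque()  # (triple, timestamp) pairs, oldest first

    def push(self, triple, ts=None):
        ts = time.time() if ts is None else ts
        self.buffer.append((triple, ts))
        self._evict(ts)

    def _evict(self, now):
        # Drop elements that have fallen out of the window.
        while self.buffer and now - self.buffer[0][1] > self.width:
            self.buffer.popleft()

    def match(self, s=None, p=None, o=None):
        """Continuous-query-style pattern matching over the window contents."""
        for (subj, pred, obj), _ in self.buffer:
            if s in (None, subj) and p in (None, pred) and o in (None, obj):
                yield (subj, pred, obj)

# Example: report the sensor readings seen in the last 10 seconds.
w = SlidingWindow(10.0)
w.push((":sensor1", ":hasReading", "21.5"))
print(list(w.match(p=":hasReading")))
```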
Fostering Serendipity through Big Linked Data (Muhammad Saleem)
This document discusses fostering serendipity through linking large biomedical datasets. It linked over 30 billion triples from The Cancer Genome Atlas (TCGA) and over 23 million publications from PubMed. It developed an architecture called TopFed to continuously integrate new data through parallel querying. TopFed was evaluated against the FedX system and shown to have significantly better performance, with query runtimes over 75 times faster for some queries. A visualization interface was also created to explore the linked data.
This document discusses query rewriting in RDF stream processing. It presents StreamQR, a system that incorporates query rewriting techniques with an RDF stream processor (RSP) to answer queries over ontologies in streams. StreamQR rewrites queries using an ontology and registers the rewritten queries with an RSP. It achieves throughput comparable to no rewriting even for queries with many rewritings. StreamQR performance is evaluated under different workloads and compared to an approach using incremental reasoning. Query rewriting allows efficient query answering over ontologies in RSPs.
Many Task Applications for Grids and Supercomputers (Ian Foster)
The document discusses how new supercomputing applications are increasingly focused on "logistical" issues like executing many communication-intensive tasks over large shared datasets, rather than "heroic" computations of a single task. It argues that new programming models and tools are needed to efficiently manage large numbers of tasks, complex data dependencies, and failures at extreme scales of petascale and exascale computers. Examples of applications that could benefit include parameter studies, ensemble simulations, data analysis, and scientific workflows involving millions of tasks.
Achieving time effective federated information from scalable rdf data using s... (తేజ దండిభట్ల)
This document discusses achieving time-effective federated information from scalable RDF data using SPARQL queries. It aims to quickly retrieve federated data from heterogeneous databases, represented as a single RDF data file, via SPARQL queries exposed as a global web service. Key points include integrating data from different sources into RDF format, using SPARQL queries to access the federated RDF data, and analyzing response times for queries on large RDF datasets.
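To make the access pattern concrete, the following sketch issues a SPARQL query from Python with the SPARQLWrapper library; the public DBpedia endpoint and the query are stand-ins for the paper's own federated setup, not its actual code.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Any SPARQL endpoint exposing the integrated RDF data would work here;
# DBpedia is used only as a publicly reachable example.
endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Berlin> rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["label"]["value"])
```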
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67... (GigaScience, BGI Hong Kong)
Lisa Johnson's talk at the #ICG13 GigaScience Prize Track: Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. Shenzhen, 26th October 2018
The document discusses querying live linked data from millions of diverse data sources on the web. It presents different approaches for source selection when querying over dynamic linked data, including using indexes, data summaries, and direct execution. Evaluation of the approaches shows that combining querying of static RDF stores and the live web through source selection dynamics can improve query time and return fresher results.
MapReduce can effectively scale three large-scale MSR studies to clusters with more machines. A software evolution study using J-REX saw a 9x speedup on an 18-machine cluster, and log analysis using JACK saw a 6x speedup. Code clone detection using CCFinder, which previously took 58 hours, likewise completed in a fraction of that time on an 18-machine cluster. Two main challenges of migrating MSR studies to MapReduce are the locality of the analysis (local, semi-local, or global) and the granularity of the analysis (fine-grained or coarse-grained). Other challenges include locating a suitable cluster, managing large amounts of data during analysis, and recovering from errors.
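For readers unfamiliar with the programming model, here is a framework-free Python rendering of the map/shuffle/reduce pattern applied to a toy log-analysis task, in the spirit of the JACK study; the log format is invented.

```python
from itertools import groupby
from operator import itemgetter

logs = [
    "2009-01-01 ERROR NullPointerException",
    "2009-01-01 INFO build ok",
    "2009-01-02 ERROR OutOfMemoryError",
]

# Map: emit (key, value) pairs, here one count per error line.
def mapper(line):
    date, level, _ = line.split(" ", 2)
    if level == "ERROR":
        yield (date, 1)

# Shuffle: group intermediate pairs by key (a real framework does this for you,
# distributed across machines).
pairs = sorted(p for line in logs for p in mapper(line))
grouped = groupby(pairs, key=itemgetter(0))

# Reduce: aggregate the values for each key.
for date, group in grouped:
    print(date, sum(v for _, v in group))
# -> 2009-01-01 1
#    2009-01-02 1
```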
Mining and Untangling Change Genealogies (PhD Defense Talk) (Kim Herzig)
The document discusses mining software repositories to analyze code history and detect patterns. It describes representing code changes as change operations like adding or removing method definitions. These are used to build change genealogies modeling dependencies between changes. Change genealogies can be model checked using CTL to extract rules describing likely cause-effect chains of changes. These rules are evaluated on projects to predict with over 60% precision which future changes may occur based on current changes. The approach ensures predictions are based on structural dependencies between changes.
This document provides an overview of RDF stream processing and existing RDF stream processing engines. It discusses RDF streams and how sensor data can be represented as RDF streams. It also summarizes some existing RDF stream processing query languages and systems, including C-SPARQL, and the features they support like continuous execution, operators, and time-based windows. The document is intended as a tutorial for developers on working with RDF stream processing.
Lightning fast genomics with Spark, Adam and Scala (Andy Petrella)
This document discusses using Apache Spark and ADAM to perform scalable genomic analysis. It provides an overview of genomics and challenges with existing approaches. ADAM uses Apache Spark and Parquet to efficiently store and query large genomic datasets. The document demonstrates clustering genomic data from the 1000 Genomes Project to predict populations, showing ADAM and Spark can handle large genomic workloads. It concludes these tools provide scalable genomic data processing but future work is needed to implement more advanced algorithms.
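The population-clustering demo could be sketched in PySpark roughly as follows; the Parquet path and feature columns are hypothetical placeholders for ADAM-formatted 1000 Genomes data, not the talk's actual code.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("genomes-clustering").getOrCreate()

# Hypothetical ADAM/Parquet export: one row per sample, numeric genotype features.
df = spark.read.parquet("genotypes.parquet")

features = VectorAssembler(
    inputCols=["feat_1", "feat_2", "feat_3"],  # placeholder feature columns
    outputCol="features",
).transform(df)

# Cluster samples; with informative variants, clusters tend to track populations.
model = KMeans(k=3, seed=42, featuresCol="features").fit(features)
model.transform(features).select("sample_id", "prediction").show()
```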
We are living in the world of “Big Data”. “Big Data” is mainly expressed with three Vs – Volume, Velocity and Variety. The presentation will discuss how Big Data impacts us and how SAS programmers can use their SAS skills in a Big Data environment.
The presentation will introduce Big Data storage solutions – Hadoop and NoSQL. For Hadoop, it will discuss two major capabilities - the Hadoop Distributed File System (HDFS) and Map/Reduce (parallel computing in Hadoop). It will show how SAS can work with Hadoop using the HDFS LIBNAME and FILENAME engines, SAS/ACCESS to Hadoop (Hive), and SAS Grid Manager with Hadoop YARN. It will also introduce the concepts of NoSQL databases for a big data solution.
The presentation will also introduce how SAS can work with a variety of data formats, especially XML and JSON. It will show a use case of converting XML documents to SAS datasets using the LIBNAME XMLV2 statement with an XMLMAP. It will also introduce REST APIs for extracting data over the internet and will demonstrate how SAS PROC HTTP can move data through a REST API.
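PROC HTTP itself is SAS, but the request-and-parse flow it implements is easy to preview in Python; this sketch shows the same pattern of pulling JSON from a REST API and flattening it into rows (the URL and field names are placeholders).

```python
import json
import urllib.request

# Placeholder REST endpoint; PROC HTTP would GET the same URL from SAS.
url = "https://example.org/api/readings"

with urllib.request.urlopen(url) as response:
    payload = json.load(response)

# Flatten the JSON records into tabular rows, the analogue of a SAS dataset.
rows = [(rec["id"], rec["value"]) for rec in payload["records"]]
for row in rows:
    print(row)
```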
This document provides a summary of the BigData ecosystem. It lists various distributed filesystems, NoSQL databases, data models, distributed programming frameworks, data ingestion tools, scheduling tools, system development tools, service programming tools, and machine learning tools that are part of the BigData ecosystem. It also defines the size of bytes, kilobytes, megabytes, gigabytes, terabytes, petabytes, and exabytes. Some related links on open data, NoSQL databases, traditional databases vs NoSQL, and the role of SQL in big data are also included.
Opportunities for X-Ray science in future computing architectures (Ian Foster)
The world of computing continues to evolve rapidly. In just the past 10 years, we have seen the emergence of petascale supercomputing, cloud computing that provides on-demand computing and storage with considerable economies of scale, software-as-a-service methods that permit outsourcing of complex processes, and grid computing that enables federation of resources across institutional boundaries. These trends show no signs of slowing down: the next 10 years will surely see exascale, new cloud offerings, and terabit networks. In this talk I review several of these developments and discuss their potential implications for X-ray science and X-ray facilities.
This document discusses using RESTdesc to enable automated composition of sensor web APIs. RESTdesc can be used to describe the functionality of web APIs for sensors like temperature, location, and pressure sensors. These descriptions are modeled as rules that can be chained together using semantic web reasoning. The author has tested this approach and found that RESTdesc composition scales well, with chains of over 500 APIs completing in under 2 seconds. This allows for automated composition of sensor web APIs to answer complex queries.
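The composition idea (each API description is a rule from preconditions to postconditions, and a reasoner chains rules toward a goal) can be sketched with a tiny forward-chaining loop in Python; the sensor APIs below are invented examples, and this is plain Python rather than RESTdesc's N3 syntax.

```python
# Each "API description" is modeled as a rule: given facts, it produces new facts.
RULES = [
    ("gps_api",     {"device_id"},   {"location"}),
    ("weather_api", {"location"},    {"temperature", "pressure"}),
    ("comfort_api", {"temperature"}, {"comfort_index"}),
]

def compose(goal, facts):
    """Forward-chain over the rules until the goal fact is derivable."""
    plan, facts = [], set(facts)
    changed = True
    while goal not in facts and changed:
        changed = False
        for name, pre, post in RULES:
            if pre <= facts and not post <= facts:
                plan.append(name)   # record the API call in the chain
                facts |= post       # its outputs become available facts
                changed = True
    return plan if goal in facts else None

# Which API calls answer "what is the comfort index for this device?"
print(compose("comfort_index", {"device_id"}))
# -> ['gps_api', 'weather_api', 'comfort_api']
```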
The Materials Project is an open initiative that makes calculated materials property data publicly available to accelerate materials innovation. It has calculated properties for over 30,000 materials using over 10 million CPU hours. The project provides a Python library and API to access and analyze materials data, as well as a workflow manager to run calculations on supercomputers. It aims to calculate all known inorganic materials and establish collaborations to develop new materials design tools.
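For context, access via the project's Python library looks roughly like this, assuming the legacy pymatgen MPRester client; mp-149 is a real material ID (silicon), while the API key is a placeholder.

```python
from pymatgen.ext.matproj import MPRester

# Requires a (free) Materials Project API key.
with MPRester("YOUR_API_KEY") as mpr:
    # Fetch the computed crystal structure for silicon (mp-149).
    structure = mpr.get_structure_by_material_id("mp-149")
    print(structure.composition.reduced_formula)
    print(structure.lattice)
```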
This document summarizes Jean-Paul Calbimonte's presentation on connecting stream reasoners on the web. It discusses representing data streams as RDF and using RDF stream processing systems. Key points include:
- RDF streams can be represented as sequences of timestamped RDF graphs.
- The W3C RSP community group is working to standardize RDF stream models and query languages.
- Producing RDF streams involves mapping live data sources to RDF and adding timestamps.
- Consuming RDF streams involves discovering stream metadata and endpoints to access the streams.
- Systems like TripleWave demonstrate approaches for spreading RDF streams on the web; a minimal sketch of producing one such timestamped stream element follows.
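The sketch below builds one stream element with rdflib: an RDF graph carrying its own generation timestamp. The sensor vocabulary is invented for illustration.

```python
from datetime import datetime, timezone
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

EX = Namespace("http://example.org/stream#")

def stream_element(sensor: str, value: float) -> Graph:
    """Build one stream element: an RDF graph plus a generation timestamp."""
    g = Graph()
    now = datetime.now(timezone.utc)
    obs = URIRef(f"http://example.org/obs/{sensor}/{now.timestamp()}")
    g.add((obs, EX.sensor, EX[sensor]))
    g.add((obs, EX.value, Literal(value, datatype=XSD.double)))
    # The timestamp that makes this graph a *stream* element.
    g.add((obs, EX.generatedAt, Literal(now.isoformat(), datatype=XSD.dateTime)))
    return g

print(stream_element("sensor1", 21.5).serialize(format="turtle"))
```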
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership-scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
The document summarizes a system for integrating crop data and meteorological data using a standardized data exchange framework. The system uses a metadata database and broker service called MetBroker to provide consistent access to heterogeneous weather databases. Crop data from different sources can be uploaded and integrated into a central database. The system then allows users to query the integrated crop and weather data and analyze relationships to support applications like crop modeling.
Accelerating Science with Cloud Technologies in the ABoVE Science Cloud (Globus)
This document summarizes the use of the ABoVE Science Cloud (ASC) to support research for the Arctic-Boreal Vulnerability Experiment (ABoVE). The ASC provides researchers with large datasets, computing resources, and tools to process and analyze remote sensing and model data related to Alaska and northern Canada. Several examples are given of projects using the ASC, including analyzing satellite imagery to map forest structure, tracking surface water changes over time, characterizing fire history, and modeling future forest composition under climate change. The ASC aims to facilitate collaboration by allowing scientists to access common datasets and run computationally-intensive processes in the cloud without having to directly transfer large amounts of data.
Triplewave: a step towards RDF Stream Processing on the Web (Daniele Dell'Aglio)
The slides of my talk at INSIGHT Centre for Data Analytics (in NUI Galway) where I presented TripleWave (http://streamreasoning.github.io/TripleWave/), an open-source framework to create and publish streams of RDF data.
The document summarizes changes at The HDF Group, including new staff members and their roles. It also outlines recent and upcoming HDF software releases, new operating system and compiler support, tools for HDF and netCDF interoperability, ongoing research projects involving parallel I/O and analysis, and potential projects of interest involving scientific domains like plasma physics, particle accelerators, and digital twin technology.
Deep learning is finding applications in science such as predicting material properties. DLHub is being developed to facilitate sharing of deep learning models, data, and code for science. It will collect, publish, serve, and enable retraining of models on new data. This will help address challenges of applying deep learning to science like accessing relevant resources and integrating models into workflows. The goal is to deliver deep learning capabilities to thousands of scientists through software for managing data, models and workflows.
The PRP is a partnership of more than 50 institutions, led by researchers at UC San Diego and UC Berkeley, and includes the National Science Foundation, Department of Energy, and multiple research universities in the US and around the world. The PRP builds on the optical backbone of Pacific Wave, a joint project of CENIC and the Pacific Northwest GigaPOP (PNWGP), to create a seamless research platform that encourages collaboration on a broad range of data-intensive fields and projects.
A Data Ecosystem to Support Machine Learning in Materials Science (Globus)
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Ben Blaiszik from University of Chicago and Argonne National Laboratory Data Science and Learning Division.
The document summarizes an open genomic data project called OpenFlyData that links and integrates gene expression data from multiple sources using semantic web technologies. It describes how RDF and SPARQL are used to query linked data from sources like FlyBase, BDGP and FlyTED. It also discusses applications built on top of the linked data as well as performance and challenges of the system.
This document discusses various approaches for building applications that consume linked data from multiple datasets on the web. It describes characteristics of linked data applications and generic applications like linked data browsers and search engines. It also covers domain-specific applications, faceted browsers, SPARQL endpoints, and techniques for accessing and querying linked data including follow-up queries, querying local caches, crawling data, federated query processing, and on-the-fly dereferencing of URIs. The advantages and disadvantages of each technique are discussed.
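Of these techniques, on-the-fly dereferencing is the simplest to show: fetch the URI with an RDF Accept header and parse the result. The sketch below uses a real DBpedia URI but is a generic pattern, not code from the document.

```python
import urllib.request
from rdflib import Graph

uri = "http://dbpedia.org/resource/Berlin"

# Content negotiation: ask the server for Turtle instead of HTML.
request = urllib.request.Request(uri, headers={"Accept": "text/turtle"})
with urllib.request.urlopen(request) as response:
    data = response.read()

g = Graph()
g.parse(data=data, format="turtle")
print(f"{len(g)} triples dereferenced from {uri}")
```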
Producing, publishing and consuming linked data - CSHALS 2013 (François Belleau)
This document discusses lessons learned from the Bio2RDF project for producing, publishing, and consuming linked data. It outlines three key lessons: 1) How to efficiently produce RDF using existing ETL tools like Talend to transform data formats into RDF triples; 2) How to publish linked data by designing URI patterns, offering SPARQL endpoints and associated tools, and registering data in public registries; 3) How to consume SPARQL endpoints by building semantic mashups using workflows to integrate data from multiple endpoints and then querying the mashup to answer questions.
Vinod Chachra discussed improving discovery systems through post-processing harvested data. He outlined key players like data providers, service providers, and users. The harvesting, enrichment, and indexing processes were described. Facets, knowledge bases, and branding were discussed as ways to enhance discovery. Chachra concluded that progress has been made but more work is needed, and data and service providers should collaborate on standards.
The document describes an approach and system called WIMU that indexes URIs and linked data sources to enable finding relevant RDF data sources for a given URI. WIMU indexes over 4 billion URIs and 668 thousand datasets. It ranks datasets based on the number of literals associated with a URI to determine where that URI is defined. The system was experimentally found to have high precision and provides a web interface and API for querying URI locations. Future work includes integrating WIMU with the LinkLion link discovery system.
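The ranking heuristic (prefer the dataset in which a URI occurs with the most literals, on the assumption that this is where the URI is defined) reduces to a counting exercise. The toy index below is invented to show the idea, not WIMU's actual data structures.

```python
# dataset -> number of triples where the URI appears with a literal object.
literal_counts = {
    "http://bio2rdf.org/sparql": 42,
    "http://dbpedia.org/sparql":  7,
    "http://example.org/sparql":  0,
}

def rank_datasets(counts: dict) -> list:
    """Rank candidate datasets by literal count, highest first (most likely
    the defining source), dropping datasets with no literals at all."""
    return sorted(
        (d for d, c in counts.items() if c > 0),
        key=counts.get,
        reverse=True,
    )

print(rank_datasets(literal_counts))
# -> ['http://bio2rdf.org/sparql', 'http://dbpedia.org/sparql']
```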
How Representative Is a SPARQL Benchmark? An Analysis of RDF Triplestore Benc... (Muhammad Saleem)
Triplestores are data management systems for storing and querying RDF data. Over recent years, various benchmarks have been proposed to assess the performance of triplestores across different performance measures. However, choosing the most suitable benchmark for evaluating triplestores in practical settings is not a trivial task. This is because triplestores experience varying workloads when deployed in real applications. We address the problem of determining an appropriate benchmark for a given real-life workload by providing a fine-grained comparative analysis of existing triplestore benchmarks. In particular, we analyze the data and queries provided with the existing triplestore benchmarks in addition to several real-world datasets. Furthermore, we measure the correlation between the query execution time and various SPARQL query features and rank those features based on their significance levels. Our experiments reveal several interesting insights about the design of such benchmarks. With this fine-grained evaluation, we aim to support the design and implementation of more diverse benchmarks. Application developers can use our result to analyze their data and queries and choose a data management system.
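The correlation analysis described here can be reproduced in miniature with scipy: compute a rank correlation between each query feature and runtime, then order the features by correlation strength. The values below are invented toy inputs that exist only to make the snippet run.

```python
from scipy.stats import spearmanr

# Invented toy measurements: one entry per benchmark query.
features = {
    "triple_patterns": [1, 3, 5, 8, 12],
    "join_vertices":   [0, 1, 2, 4, 6],
    "result_size":     [10, 5, 80, 40, 200],
}
runtime_ms = [12, 30, 75, 160, 400]

# Rank features by the strength of their correlation with execution time.
ranked = sorted(
    ((name, spearmanr(vals, runtime_ms).correlation)
     for name, vals in features.items()),
    key=lambda kv: abs(kv[1]),
    reverse=True,
)
for name, rho in ranked:
    print(f"{name}: rho = {rho:.2f}")
```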
This document discusses Bio2RDF, a project that converts life science databases into RDF and makes them accessible via SPARQL endpoints. It provides background on the need for data integration, describes how Bio2RDF was implemented including the conversion process and architecture, and outlines future goals like adding more datasets and developing new services.
Sustainable queryable access to Linked Data (Ruben Verborgh)
This document discusses sustainable queryable access to Linked Data through the use of Triple Pattern Fragments (TPF). TPFs provide a low-cost interface that allows clients to query datasets through triple patterns. Intelligent clients can execute SPARQL queries over TPFs by breaking queries into triple patterns and aggregating the results. TPFs also enable federated querying across multiple datasets by treating them uniformly as fragments that can be retrieved. The document demonstrates federated querying over DBpedia, VIAF, and Harvard Library datasets using TPF interfaces.
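A TPF interface is essentially HTTP with subject/predicate/object parameters, so a client request can be sketched directly; the endpoint URL below is a placeholder for whichever TPF server hosts the dataset.

```python
import urllib.parse
import urllib.request
from rdflib import Graph

# Placeholder TPF endpoint; real servers expose the same parameter scheme.
fragment_url = "https://example.org/dataset?" + urllib.parse.urlencode({
    "predicate": "http://www.w3.org/2000/01/rdf-schema#label",
    "object": '"Berlin"@en',
})

request = urllib.request.Request(fragment_url, headers={"Accept": "text/turtle"})
with urllib.request.urlopen(request) as response:
    g = Graph().parse(data=response.read(), format="turtle")

# The response mixes data triples with hydra paging/count metadata;
# a smart client uses the counts to plan SPARQL join orders.
print(f"fragment contains {len(g)} triples")
```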
Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z... (CIARD Movement)
Presentation delivered at the Agricultural Data Interoperability Interest Group -- Research Data Alliance (RDA) 4th Plenary Meeting -- Amsterdam, September 2014
This document discusses how semantic web technologies like RDF and SPARQL can help navigate complex bioinformatics databases. It describes a three step method for building a semantic mashup: 1) transform data from sources into RDF, 2) load the RDF into a triplestore, and 3) explore and query the dataset. As an example, it details how Bio2RDF transformed various database cross-reference resources into RDF and loaded them into Virtuoso to answer questions about namespace usage.
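Steps 2 and 3 of that method, loading the RDF into a store and querying it, look like this with rdflib standing in for Virtuoso; the tiny dataset is invented, and a production setup would load a real triplestore instead.

```python
from rdflib import Graph

# Step 2 (miniature): load transformed RDF into a (here: in-memory) store.
g = Graph()
g.parse(data="""
    @prefix ex: <http://example.org/ns#> .
    ex:record1 ex:xref ex:uniprot_P01308 .
    ex:record2 ex:xref ex:pdb_1MSO .
""", format="turtle")

# Step 3: explore the dataset with SPARQL, e.g., count cross-references.
results = g.query("""
    PREFIX ex: <http://example.org/ns#>
    SELECT (COUNT(?x) AS ?n) WHERE { ?s ex:xref ?x . }
""")
for row in results:
    print(row.n)
```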
Finding knowledge, data and answers on the Semantic Web (ebiquity)
Web search engines like Google have made us all smarter by providing ready access to the world's knowledge whenever we need to look up a fact, learn about a topic or evaluate opinions. The W3C's Semantic Web effort aims to make such knowledge more accessible to computer programs by publishing it in machine understandable form.
As the volume of Semantic Web data grows, software agents will need their own search engines to help them find the relevant and trustworthy knowledge they need to perform their tasks. We will discuss the general issues underlying the indexing and retrieval of RDF-based information and describe Swoogle, a crawler-based search engine whose index contains information on over a million RDF documents.
We will illustrate its use in several Semantic Web related research projects at UMBC, including a distributed platform for constructing end-to-end use cases that demonstrate the semantic web’s utility for integrating scientific data. We describe ELVIS (the Ecosystem Location Visualization and Information System), a suite of tools for constructing food webs for a given location, and Triple Shop, a SPARQL query interface which searches the Semantic Web for data relevant to a given query. ELVIS functionality is exposed as a collection of web services, and all input and output data is expressed in OWL, thereby enabling its integration with Triple Shop and other semantic web resources.
Doctoral Examination at the Karlsruhe Institute of Technology (08.07.2016) (Dr.-Ing. Thomas Hartmann)
In this thesis, a validation framework is introduced that enables RDF-based constraint languages to be executed consistently on RDF data and constraints of any type to be formulated. The framework reduces the representation of constraints to the absolute minimum, is based on formal logics, consists of a small lightweight vocabulary, ensures consistency of validation results, and enables constraint transformations for each constraint type across RDF-based constraint languages.
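As one concrete instance of executing a constraint language on RDF data (the thesis treats many such languages generically), a SHACL check via the pyshacl library looks like this; the shape and data are invented examples, not material from the thesis.

```python
from rdflib import Graph
from pyshacl import validate

data = Graph().parse(data="""
    @prefix ex: <http://example.org/ns#> .
    ex:alice a ex:Person .
""", format="turtle")

shapes = Graph().parse(data="""
    @prefix sh: <http://www.w3.org/ns/shacl#> .
    @prefix ex: <http://example.org/ns#> .
    ex:PersonShape a sh:NodeShape ;
        sh:targetClass ex:Person ;
        sh:property [ sh:path ex:name ; sh:minCount 1 ] .
""", format="turtle")

# Every ex:Person must have at least one ex:name, so this data fails.
conforms, report_graph, report_text = validate(data, shacl_graph=shapes)
print(conforms)      # False
print(report_text)   # human-readable violation report
```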
The document discusses requirements and approaches for RDF stream processing (RSP). RSP aims to process continuous RDF streams to address scenarios like sensor data and social media; it involves querying streaming data, integrating streams with static data, and handling issues like imperfections. The document reviews existing RSP systems and languages, actor-based approaches, and the eight requirements for real-time stream processing, including keeping data moving, generating predictable outcomes, and responding instantaneously.
- The document discusses OpenFlyData, a project that integrates biological data from multiple sources using Semantic Web technologies like RDF and SPARQL. It describes applications that allow searching gene expression data across databases.
- Key challenges addressed are that biological data is scattered across sites and integration requires mapping heterogeneous identifiers. The architecture uses a SPARQL endpoint and mappings to expose data from sources like FlyBase, BDGP and FlyAtlas.
- Performance testing showed good query times for real-time user interaction, though some queries took seconds and text matching had issues without custom solutions. Future work aims to add sources and develop more applications.
FlyWeb is a project that integrates biological data from multiple sources using Semantic Web technologies. It allows users to search for gene expression images, sequences, publications and other data about genes. Key points include:
- FlyWeb integrates data from sources like FlyBase, BDGP and FlyTED about Drosophila genes, linking gene names, expressions images, sequences and publications.
- It uses Semantic Web tools to create a unified application, accessing data through SPARQL queries to different SPARQL endpoints for each source.
- Challenges include mapping different gene name vocabularies and improving the performance of case-insensitive text searches in SPARQL. Future work aims to add more data sources and develop further applications.
This document describes an RDF query reformulation algorithm. The algorithm takes as input a conjunctive RDF query and RDF Schema, and outputs a union of equivalent conjunctive queries. The code is written in Java and located on a source code repository. It contains several packages for reasoning, rules, signatures, and utilities. Libraries used include Jena and Xerces. Papers related to the algorithm are cited. Future work includes adding more robust tests of reformulated query content.
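The core rewriting step can be sketched independently of the Java code base: a triple pattern over a class expands into a union over its RDFS subclasses. The class hierarchy below is an invented toy schema.

```python
# rdfs:subClassOf edges: child -> parent (invented toy schema).
SUBCLASS_OF = {"ex:Student": "ex:Person", "ex:Professor": "ex:Person"}

def subclasses_of(cls):
    """All classes whose instances are also instances of `cls` (reflexive)."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

def reformulate(var, cls):
    """Rewrite `?var a cls` into a union of equivalent conjunctive queries."""
    return [f"{var} a {c} ." for c in sorted(subclasses_of(cls))]

# A query for ex:Person instances becomes a union of three queries.
print(reformulate("?x", "ex:Person"))
# -> ['?x a ex:Person .', '?x a ex:Professor .', '?x a ex:Student .']
```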
Using Architectures for Semantic Interoperability to Create Journal Clubs for... (James Powell)
This document describes a system for creating digital journal clubs for emergency response. The system harvests and semantically maps bibliographic metadata from various sources to expose focused collections. It augments the metadata with information on author relationships, georeferences and concepts. Tools enable exploration of collections through visualizations and maps. Social features allow users to tag, comment and collaborate, stored as semantic triples to enable interoperability. The system aims to provide responders with timely access to vetted information and collaboration tools to help address emergency situations.
Re-using Media on the Web: Media fragment re-mixing and playout (MediaMixerCommunity)
A number of novel application ideas will be introduced based on the media fragment creation, specification and rights management technologies. Semantic search and retrieval allows us to organize sets of fragments by topical or conceptual relevance. These fragment sets can then be played out in a non-linear fashion to create a new media re-mix. We look at a server-client implementation supporting Media Fragments, before allowing the participants to take the sets of media they have selected and create their own re-mix.
Materials science experiments involve complex data that are often very heterogeneous and challenging to reproduce. These challenges were observed, for example, in a previous study on harnessing lightweight design potentials via the Materials Data Space, in which data from materials science and engineering experiments were generated using linked open data principles, e.g., the Resource Description Framework (RDF) as the standard model for data interchange on the Web. However, detailed knowledge of the query language SPARQL is necessary to query the data, and it was noticed that domain experts in materials science lack this knowledge. With this work, we aim to develop NaturalMSEQueries, an approach that lets materials science domain experts query the data with natural-language expressions, e.g., in English, instead of SPARQL queries. This will significantly improve the usability of Semantic Web approaches in materials science and lower the adoption threshold of these methods for domain experts. We plan to evaluate our approach with varying amounts of data from different sources, and to compare against synthetic data to assess the quality of our implementation.
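A first approximation of such natural-language querying is template-based: match an English pattern and fill a SPARQL skeleton. The property names below are invented placeholders for a materials science vocabulary, not part of NaturalMSEQueries.

```python
import re

# One invented template: "show all <class> with <property> above <value>"
PATTERN = re.compile(r"show all (\w+) with ([\w\s]+) above (\d+(?:\.\d+)?)")

def to_sparql(question: str) -> str:
    m = PATTERN.fullmatch(question.strip().lower())
    if not m:
        raise ValueError("no matching template")
    cls, prop, value = m.groups()
    prop = prop.strip().replace(" ", "_")
    return (
        f"PREFIX mse: <http://example.org/mse#>\n"
        f"SELECT ?s WHERE {{ ?s a mse:{cls} ; mse:{prop} ?v . "
        f"FILTER (?v > {value}) }}"
    )

print(to_sparql("show all steels with tensile strength above 500"))
```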
The document summarizes research at the intersection of materials science and engineering (MSE) and semantic web technologies (SWT). A literature review identified 20 key papers using SWT in MSE and found that ontologies and the conversion of tabular data to RDF were the most common applications. The document also presents several projects at the Federal Institute for Materials Research and Testing that apply SWT to MSE challenges such as visualizing methods, natural-language queries, and accelerating materials discovery. Overall, it aims to illustrate SWT's impact on MSE and identify open challenges at their intersection.
Andre Valdestilhas, Tommaso Soru, and Axel-Cyrille Ngonga Ngomo propose CEDAL, a time-efficient approach for detecting erroneous links in large-scale link repositories. CEDAL uses union-find and graph partitioning to scale to millions of links in O(m log n) time, improving over the state of the art, which has O(n^2) complexity. Experiments show CEDAL outperforms existing approaches and is able to parallelize processing across CPU and GPU cores. The authors conclude that CEDAL provides an efficient way to maintain link consistency in large knowledge bases.
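Union-find is the workhorse of this kind of analysis: every equivalence link merges two clusters, and the resulting clusters can be checked for anomalies. The sketch below uses one common error criterion for illustration (two distinct resources from the same dataset landing in a single equivalence class); it is not CEDAL's actual code.

```python
from collections import defaultdict

class UnionFind:
    """Near-linear disjoint sets via path compression and union by size."""
    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        self.size.setdefault(x, 1)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

# Invented toy links; the prefixes stand for the source datasets.
links = [("dbp:Berlin", "wd:Q64"), ("dbp:Berlin", "dbp:Berlin_Germany")]
uf = UnionFind()
for a, b in links:
    uf.union(a, b)

# Group resources into equivalence classes and flag classes containing
# two resources from the same dataset, one common symptom of a bad link.
clusters = defaultdict(list)
for node in list(uf.parent):
    clusters[uf.find(node)].append(node)
for members in clusters.values():
    prefixes = [m.split(":")[0] for m in members]
    if len(prefixes) != len(set(prefixes)):
        print("suspicious cluster:", sorted(members))
```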
Presently, a growing number of publications in the Machine Learning and Data Mining communities contribute improved algorithms and methods to their respective fields. However, when it comes to publishing and sharing scientific results, we still face problems in searching and ranking these methods. Scouring the internet for state-of-the-art information about a specific context, such as Named Entity Recognition (NER), is often a time-consuming task. Moreover, this process can lead to an incomplete investigation, either because search engines return incomplete information or because keywords are not properly defined. To bridge this gap, we present WASOTA, a web portal specifically designed to share and readily present metadata about the state of the art in a specific domain, making the search for this information easier.
The document proves that the Most Frequent K Characters (MFKC) approach for measuring string similarity is both correct and complete. It does this by showing that the output of MFKC (A) is equal to the set of all string pairs with a similarity score above the threshold (A*). MFKC uses three filters (R1, R2, R3) to iteratively reduce the set of string pairs. It is shown that no pair discarded by the filters has a similarity above the threshold, proving completeness. Correctness follows from the definition of the final output A matching the definition of A*.
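To ground the proof summary, here is one common formulation of the MFKC measure itself: take each string's K most frequent characters and sum the combined frequencies of the characters both lists share. Treat this as an illustrative reading of the measure rather than the paper's exact definition, which may differ in details such as an additional limit parameter.

```python
from collections import Counter

def mfkc_similarity(s1: str, s2: str, k: int = 2) -> int:
    """Most Frequent K Characters: sum, over characters appearing in the
    top-k frequency lists of both strings, of their combined frequencies."""
    top1 = dict(Counter(s1).most_common(k))
    top2 = dict(Counter(s2).most_common(k))
    return sum(top1[ch] + top2[ch] for ch in top1.keys() & top2.keys())

# A threshold theta then splits pairs into similar and dissimilar,
# which is exactly the set A* that the proof reasons about.
print(mfkc_similarity("research", "seeking"))  # -> 4 (both lists contain 'e')
print(mfkc_similarity("night", "nacht"))       # -> 2 (both lists contain 'n')
```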
1) The document proposes reducing identifier heterogeneity in knowledge bases by developing a GUI that allows users to evaluate links between entities and suggest new links.
2) It presents a workflow involving importing data from multiple sources, normalizing identifiers, and allowing users to rate the quality and suggest improvements to links between entities.
3) Usability testing of the GUI indicated a high level of usability, and results from link ratings could be used in future work to further improve interlinking between knowledge bases.
1) The document proposes reducing identifier heterogeneity in knowledge bases by developing a GUI that allows users to evaluate links between entities and suggest new links.
2) It presents a workflow involving importing data from multiple sources, normalizing identifiers, and allowing users to rate the quality and suggest improvements to links between entities.
3) Results showed that 10.35% of links were transitive or redirects, and usability testing of the GUI indicated a high level of usability with an average SUS score of 82.
Emotion-oriented computing: Possible uses and resources (André Valdestilhas)
This article discusses the concepts of using digital television, affective computing and computer vision. The proposal combines techniques such as capturing facial expressions through a video camera and using accelerometers in ball and touch holograms to provide a certain level of interactivity with the viewer. Some example uses of the proposal are described, such as audience control and background content.
This document discusses using semiotic profiles to design graphical user interfaces for social media data on mobile phones. It begins by outlining the challenges of limited screen size on mobile devices. It then introduces semiotic profiles based on icons, indexes, and symbols to provide an intuitive interface. The document proposes that a semiotic profile can help organize large amounts of social media data on mobile phones. Future work is needed to analyze applications, assess mobility and usability, and develop prototypes using this approach.
Emotion-oriented computing: Possible uses and applications (André Valdestilhas)
This article discusses the concepts of using digital television, affective computing and computer vision. The proposal combines techniques such as capturing facial expressions through a video camera and using accelerometers in ball and touch holograms to provide a certain level of interactivity with the viewer. Some uses of the proposal are described, such as audience control and background content. The article highlights numerous benefits of the presented ideas, which can be applied in a broad context, for example for blind players of video games.
A study on the location of context-aware services for Television ... (André Valdestilhas)
The document discusses three approaches to providing context awareness in mobile digital television: Ginga-NCL, PlaceLab and ContexTV. Ginga-NCL enables context-aware applications on portable devices. PlaceLab uses Wi-Fi and Bluetooth signals to estimate the user's location. ContexTV uses wireless communication between devices and a server to deliver content personalized to the user's context.
Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend... (Suzanne Lagerweij)
This is a workshop about communication and collaboration. We will experience how we can analyze the reasons for resistance to change (exercise 1) and practice how to improve our conversation style and be more in control and effective in the way we communicate (exercise 2).
This session will use Dave Gray’s Empathy Mapping, Argyris’ Ladder of Inference and The Four Rs from Agile Conversations (Squirrel and Fredrick).
Abstract:
Let’s talk about powerful conversations! We all know how to lead a constructive conversation, right? Then why is it so difficult to have those conversations with people at work, especially those in powerful positions that show resistance to change?
Learning to control and direct conversations takes understanding and practice.
We can combine our innate empathy with our analytical skills to gain a deeper understanding of complex situations at work. Join this session to learn how to prepare for difficult conversations and how to improve our agile conversations in order to be more influential without power. We will use Dave Gray’s Empathy Mapping, Argyris’ Ladder of Inference and The Four Rs from Agile Conversations (Squirrel and Fredrick).
In the session you will experience how preparing and reflecting on your conversation can help you be more influential at work. You will learn how to communicate more effectively with the people needed to achieve positive change. You will leave with a self-revised version of a difficult conversation and a practical model to use when you get back to work.
Come learn more on how to become a real influencer!
This presentation by OECD, OECD Secretariat, was made during the discussion “The Intersection between Competition and Data Privacy” held at the 143rd meeting of the OECD Competition Committee on 13 June 2024. More papers and presentations on the topic can be found at oe.cd/ibcdp.
This presentation was uploaded with the author’s consent.
This presentation by Thibault Schrepel, Associate Professor of Law at Vrije Universiteit Amsterdam University, was made during the discussion “Artificial Intelligence, Data and Competition” held at the 143rd meeting of the OECD Competition Committee on 12 June 2024. More papers and presentations on the topic can be found at oe.cd/aicomp.
This presentation was uploaded with the author’s consent.
Career goals.pptx and their importance in real life (artemacademy2)
Career goals serve as a roadmap for individuals, guiding them toward achieving long-term professional aspirations and personal fulfillment. Establishing clear career goals enables professionals to focus their efforts on developing specific skills, gaining relevant experience, and making strategic decisions that align with their desired career trajectory. By setting both short-term and long-term objectives, individuals can systematically track their progress, make necessary adjustments, and stay motivated. Short-term goals often include acquiring new qualifications, mastering particular competencies, or securing a specific role, while long-term goals might encompass reaching executive positions, becoming industry experts, or launching entrepreneurial ventures.
Moreover, having well-defined career goals fosters a sense of purpose and direction, enhancing job satisfaction and overall productivity. It encourages continuous learning and adaptation, as professionals remain attuned to industry trends and evolving job market demands. Career goals also facilitate better time management and resource allocation, as individuals prioritize tasks and opportunities that advance their professional growth. In addition, articulating career goals can aid in networking and mentorship, as it allows individuals to communicate their aspirations clearly to potential mentors, colleagues, and employers, thereby opening doors to valuable guidance and support. Ultimately, career goals are integral to personal and professional development, driving individuals toward sustained success and fulfillment in their chosen fields.
This presentation by OECD, OECD Secretariat, was made during the discussion “Competition and Regulation in Professions and Occupations” held at the 77th meeting of the OECD Working Party No. 2 on Competition and Regulation on 10 June 2024. More papers and presentations on the topic can be found at oe.cd/crps.
This presentation was uploaded with the author’s consent.
Why Psychological Safety Matters for Software Teams - ACE 2024 - Ben Linders.pdf (Ben Linders)
Psychological safety in teams is important; team members must feel safe and able to communicate and collaborate effectively to deliver value. It’s also necessary to build long-lasting teams since things will happen and relationships will be strained.
But, how safe is a team? How can we determine if there are any factors that make the team unsafe or have an impact on the team’s culture?
In this mini-workshop, we’ll play games for psychological safety and team culture utilizing a deck of coaching cards, The Psychological Safety Cards. We will learn how to use gamification to gain a better understanding of what’s going on in teams. Individuals share what they have learned from working in teams, what has impacted the team’s safety and culture, and what has led to positive change.
Different game formats will be played in groups in parallel. Examples are an ice-breaker to get people talking about psychological safety, a constellation where people take positions about aspects of psychological safety in their team or organization, and collaborative card games where people work together to create an environment that fosters psychological safety.
This presentation by the OECD Secretariat was made during the discussion “Pro-competitive Industrial Policy” held at the 143rd meeting of the OECD Competition Committee on 12 June 2024. More papers and presentations on the topic can be found at oe.cd/pcip.
This presentation by the OECD Secretariat was made during the discussion “Artificial Intelligence, Data and Competition” held at the 143rd meeting of the OECD Competition Committee on 12 June 2024. More papers and presentations on the topic can be found at oe.cd/aicomp.
This presentation by Nathaniel Lane, Associate Professor in Economics at Oxford University, was made during the discussion “Pro-competitive Industrial Policy” held at the 143rd meeting of the OECD Competition Committee on 12 June 2024. More papers and presentations on the topic can be found at oe.cd/pcip.
The importance of sustainable and efficient computational practices in artificial intelligence (AI) and deep learning has become increasingly critical. This webinar focuses on the intersection of sustainability and AI, highlighting the significance of energy-efficient deep learning, innovative randomization techniques in neural networks, the potential of reservoir computing, and the cutting-edge realm of neuromorphic computing. It aims to connect theoretical knowledge with practical applications and to show how these innovative approaches can lead to more robust, efficient, and environmentally conscious AI systems.
Webinar Speaker: Prof. Claudio Gallicchio, Assistant Professor, University of Pisa
Claudio Gallicchio is an Assistant Professor at the Department of Computer Science of the University of Pisa, Italy. His research involves merging concepts from Deep Learning, Dynamical Systems, and Randomized Neural Systems, and he has co-authored over 100 scientific publications on the subject. He is the founder of the IEEE CIS Task Force on Reservoir Computing, and the co-founder and chair of the IEEE Task Force on Randomization-based Neural Networks and Learning Systems. He is an associate editor of IEEE Transactions on Neural Networks and Learning Systems (TNNLS).
This presentation by Yong Lim, Professor of Economic Law at Seoul National University School of Law, was made during the discussion “Artificial Intelligence, Data and Competition” held at the 143rd meeting of the OECD Competition Committee on 12 June 2024. More papers and presentations on the topic can be found at oe.cd/aicomp.
This presentation by Katharine Kemp, Associate Professor at the Faculty of Law & Justice at UNSW Sydney, was made during the discussion “The Intersection between Competition and Data Privacy” held at the 143rd meeting of the OECD Competition Committee on 13 June 2024. More papers and presentations on the topic can be found at oe.cd/ibcdp.
This presentation by Tim Capel, Director of the UK Information Commissioner’s Office Legal Service, was made during the discussion “The Intersection between Competition and Data Privacy” held at the 143rd meeting of the OECD Competition Committee on 13 June 2024. More papers and presentations on the topic can be found at oe.cd/ibcdp.
XP 2024 presentation: A New Look to Leadership - samililja
Presentation slides from the XP 2024 conference in Bolzano, Italy. The slides describe a new view of leadership and combine it with anthro-complexity (aka Cynefin).
This presentation by Professor Alex Robson, Deputy Chair of Australia’s Productivity Commission, was made during the discussion “Competition and Regulation in Professions and Occupations” held at the 77th meeting of the OECD Working Party No. 2 on Competition and Regulation on 10 June 2024. More papers and presentations on the topic can be found at oe.cd/crps.
More Complete Resultset Retrieval from Large Heterogeneous RDF Sources
1. More Complete Resultset Retrieval from Large Heterogeneous RDF Sources
Andre Valdestilhas, Tommaso Soru, Muhammad Saleem
AKSW Group, University of Leipzig, Germany
November 24, 2019
4. Motivation
Where to find RDF datasets?
- 9,960 raw RDF datasets
- 658,206 datasets (HDT files) from LOD Laundromat
- 559 SPARQL endpoints
- Different formats (serializations)
Which dataset? Querying them all means querying more than 221 billion triples (>5 terabytes).
5. Example
Where to find RDF datasets?
Authors that have a paper of type poster/demo in the proceedings of ISWC 2008 (query from FedBench).
6. Example
Where to find RDF datasets?
For that same FedBench query (authors with a poster/demo paper in the ISWC 2008 proceedings), 4 HDT datasets (Semantic Web Dog Food, from the LOD Laundromat datasets) contain data that can answer the query.
7. Motivation
Approaches:
- Multiple SPARQL endpoints (+), but 90% of the datasets are only available as dump files (-).
- Dereferenceable URIs (+), but 43% of the URIs are non-dereferenceable (-).
- WIMU (Where is my URI?): provides data from non-dereferenceable URIs (+), but supports no SPARQL querying (-).
8. The approach
A hybrid SPARQL query engine that:
- collects data from multiple SPARQL endpoints,
- collects data from RDF dumps, including HDT files, and
- uses link traversal, obtaining data from non-dereferenceable URIs via WIMU (Where is my URI?, http://wimu.aksw.org/),
resulting in more complete results. Experiments were run with 3 state-of-the-art SPARQL query benchmarks: LargeRDFBench, FedBench and FEASIBLE. A minimal code sketch of the union idea follows below.
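To make the hybrid idea concrete, here is a minimal sketch in Python, assuming the SPARQLWrapper and rdflib packages. The endpoint URL, dump file and query are placeholders, and this illustrates only the union-of-sources idea, not the authors' actual wimuQ implementation (HDT files would additionally need a dedicated reader such as pyHDT).

from rdflib import Graph
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = "SELECT ?p ?o WHERE { <http://uri.com> ?p ?o }"

def query_endpoint(endpoint_url, query):
    # Run the query against a remote SPARQL endpoint; return (p, o) pairs.
    sw = SPARQLWrapper(endpoint_url)
    sw.setQuery(query)
    sw.setReturnFormat(JSON)
    bindings = sw.query().convert()["results"]["bindings"]
    return {(b["p"]["value"], b["o"]["value"]) for b in bindings}

def query_dump(dump_path, query):
    # Run the same query against a local RDF dump loaded into an in-memory graph.
    g = Graph()
    g.parse(dump_path, format="nt")  # plain N-Triples here; HDT needs its own reader
    return {(str(row.p), str(row.o)) for row in g.query(query)}

# The union of the per-processor result sets gives a more complete resultset.
results = query_endpoint("https://dbpedia.org/sparql", QUERY) | query_dump("dump.nt", QUERY)
for p, o in sorted(results):
    print(p, o)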
9. The approach
[Architecture diagram] Given a query such as SELECT ?p ?o WHERE { <http://uri.com> ?p ?o }, the wimuQ query execution engine first extracts the URIs from the query, then sends them to WIMU for source filtering, which identifies the relevant sources: SPARQL endpoints, HDT files and RDF dumps (e.g. dump.bz2, file.rdf). The query is then executed by four processors in parallel (the data-dumps, traversal-based, SPARQL-a-lot and SPARQL-endpoint query processors), and the union of their results is returned as a single set of triples of the form <subject><predicate><object>. A rough sketch of the URI-extraction step follows below.
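As a rough illustration of that first step, the sketch below pulls every IRI written in angle brackets out of the query text. The regex is an assumption made for illustration; a real engine would walk the parsed query algebra instead.

import re

def extract_uris(sparql_query):
    # Collect every absolute IRI that appears in angle brackets in the query.
    return re.findall(r"<(https?://[^>\s]+)>", sparql_query)

# The extracted URIs are then handed to WIMU for source filtering.
print(extract_uris("SELECT ?p ?o WHERE { <http://uri.com> ?p ?o }"))  # ['http://uri.com']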
10. The approach
The source selection: identify the relevant datasets for each extracted URI via WIMU. A hedged sketch of such a lookup follows below.
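A hedged sketch of the WIMU lookup, assuming Python with the requests package; the /Find path and the JSON response shape below are assumptions for illustration only, so consult http://wimu.aksw.org/ for the actual API.

import requests

def wimu_datasets(uri):
    # Hypothetical REST call: ask WIMU which datasets mention the given URI.
    resp = requests.get("http://wimu.aksw.org/Find", params={"uri": uri}, timeout=30)
    resp.raise_for_status()
    # Assumed response shape: a JSON list of {"dataset": ...} entries.
    return [entry["dataset"] for entry in resp.json()]

for dataset in wimu_datasets("http://uri.com"):
    print(dataset)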
11. Evaluation
Hypothesis Identify automatically relevant sources from heterogeneous RDF
data, even with non-dereferenceable URIs, can improve the
resultset retrieval
Metrics Coverage and runtime
Approaches FedX (endpoints), SQUIN (Traversal-based), SPARQL-a-lot and
WIMU(dumps)
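The two metrics are straightforward; a small sketch, assuming per-query result counts and runtimes collected from the benchmark runs (all names here are illustrative):

def coverage(result_counts):
    # Fraction of benchmark queries that returned at least one result.
    return sum(1 for n in result_counts if n > 0) / len(result_counts)

def average_runtime(runtimes_seconds):
    # Mean runtime over all query executions, in seconds.
    return sum(runtimes_seconds) / len(runtimes_seconds)

# e.g. 415 queries, each executed 5 times, with runtimes averaged per query first.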
12. Evaluation
Experimental setup
Datasets: 221.7 billion triples (>5 terabytes)
Queries: 415 queries from FedBench, LargeRDFBench and FEASIBLE; each query executed 5 times
Hardware: 200 GB HD, 8 GB RAM, 2.70 GHz single-core processor
13. Evaluation
Coverage: overall, 76% of the queries returned results (zero results were caused by non-public endpoints/data or non-indexed data).
[Chart: average number of results on a log scale for the query groups CD, LS, LD, Simple, Comp, Large, Chs, DBpedia and SWDF of FedBench, LargeRDFBench and FEASIBLE, comparing EndPoints (FedX), SPARQL-a-lot, LinkTraversal (SQUIN), wimuDumps and wimuQ.]
Best coverage per benchmark: FedBench 55% (endpoints), LargeRDFBench 81% (wimuDumps), FEASIBLE 98% (wimuDumps).
Observation: combining these query processing engines yields more complete resultset retrieval.
14. Evaluation
Number of datasets: discovering more datasets does not imply more results.
15. Evaluation
Runtime: the total average is 17 minutes across the 3 benchmarks (wimuDumps 2 min, Endpoints 13 min, SPARQL-a-lot 58 s, LinkTraversal-SQUIN 36 s).
[Chart: average runtime in minutes on a log scale for the same query groups and engines as above.]
Interestingly, wimuQ obtains 91% of its results from wimuDumps and only 7% from SPARQL endpoints. A possible reason is that SPARQL endpoint federation is split among multiple endpoints, so the network and the number of intermediate results influence the runtime.
16. Conclusion & Future work
Conclusion
A hybrid SPARQL query processing engine that executes SPARQL queries over a large amount of heterogeneous RDF data.
Evaluated on real-world datasets using state-of-the-art federated and non-federated query benchmarks (FedBench, LargeRDFBench and FEASIBLE).
We present the first federated SPARQL query processing engine that executes SPARQL queries over a total of 221.7 billion triples.
Future work
Add more URIs to the WIMU index and use Triple Pattern Fragments.
A large-scale approach to study the relations and similarity among the datasets.
17. That's all, folks!
Thanks! Questions?
GitHub repository: https://github.com/firmao/wimuT
Prototype: https://w3id.org/wimuq/
Contact: valdestilhas@informatik.uni-leipzig.de
Special thanks to my PhD advisor, Prof. Dr. rer. nat. Thomas Riechert