This document describes a half-day tutorial on adaptive semantic data management techniques for federations of endpoints. The tutorial covers background concepts, adaptive query processing techniques, RDF engines for federations, and includes hands-on exercises. It aims to address challenges in querying distributed semantic data given unpredictable conditions, lack of statistics, and changing characteristics of remote endpoints.
Exploration and Production Data Management training
Easwaran Kanason
In collaboration with CGG DMS, this four-day training course presents the fundamental principles of effective management of E & P data. The course is delivered as a series of modules over four days.
Each module will address a separate topic of E & P Data Management and also highlight the links between the various aspects considered. The course is led by a data management specialist who is also a qualified trainer, and is taught using modern training techniques, combining informative discussions and practical exercises.
By the end of this course, delegates will have learned:
The value of effective E & P Data Management and the sources and formats of E & P data
To recognize the Lifecycle Stages of E & P Data starting with Data receipt
Accurate cataloguing, recognition of significant metadata and compliance with standards
Best practices for physical data handling and digital Data Management
To appreciate key technology solutions for E & P Data Management
Importance of Data Security and Data Quality
How to create operational workflows and procedures, and business continuity
To manage data archives effectively, including retention and destruction strategies
Awareness of legislative data retention requirements
The tutorial will be presented on May 27, 2012 at the 9th Extended Semantic Web Conference (ESWC 2012).
Short description of the tutorial:
The tutorial describes the traditional optimize-then-execute paradigm implemented in existing RDF engines and its main drawbacks when a large volume of data must be accessed remotely. As a solution to overcome the limitations of current query processing approaches, we will present existing adaptive query processing techniques defined in the context of database management systems and their applicability to the Semantic Web. We will also describe current solutions proposed in the context of the Semantic Web to access remote data. The target audience includes researchers and practitioners who develop or use query engines to consume Linked and Big Data through SPARQL endpoints. Participants will learn the limitations of existing RDF query engines and how current techniques can be extended to access remote data from Linked Data sets and to hide delays caused by unpredictable data transfers and dataset availability. A hands-on session will allow attendees to evaluate the performance and robustness of existing approaches.
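To give a flavor of the adaptive techniques the tutorial covers, here is a minimal sketch (not from the tutorial materials) of a symmetric hash join, a classic non-blocking operator from the adaptive query processing literature. Unlike a blocking hash join, it builds hash tables on both inputs and probes as tuples arrive, so answers are produced incrementally even when one endpoint responds slowly:

```python
from collections import defaultdict

def symmetric_hash_join(left, right, key):
    """Join two tuple streams on `key`, emitting results as tuples arrive.

    `left` and `right` are iterables of dicts; in a federated setting they
    would be result streams coming from two remote SPARQL endpoints.
    """
    left_table = defaultdict(list)   # hash table for tuples seen on the left
    right_table = defaultdict(list)  # hash table for tuples seen on the right
    left_iter, right_iter = iter(left), iter(right)
    exhausted = {"left": False, "right": False}

    while not (exhausted["left"] and exhausted["right"]):
        # Alternate between inputs: insert into own table, probe the other.
        for name, it, own, other, flip in (
            ("left", left_iter, left_table, right_table, False),
            ("right", right_iter, right_table, left_table, True),
        ):
            if exhausted[name]:
                continue
            try:
                tup = next(it)
            except StopIteration:
                exhausted[name] = True
                continue
            own[tup[key]].append(tup)
            for match in other[tup[key]]:
                a, b = (match, tup) if flip else (tup, match)
                yield {**a, **b}
```

Because neither input blocks the other, a slow or stalled endpoint only delays the answers that depend on its tuples, which is exactly the delay-hiding behavior the tutorial discusses.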
A talk at the RPI-NSF Workshop on Multiscale Modeling of Complex Data, September 12, 2011, Troy, NY, USA.
We have made much progress over the past decade toward effectively harnessing the collective power of IT resources distributed across the globe. In fields such as high-energy physics, astronomy, and climate, thousands benefit daily from tools that manage and analyze large quantities of data produced and consumed by large collaborative teams. But we now face a far greater challenge: exploding data volumes and powerful simulation tools mean that far more--ultimately most?--researchers will soon require capabilities not so different from those used by these big-science teams. How is the general population of researchers and institutions to meet these needs? Must every lab be filled with computers loaded with sophisticated software, and every researcher become an information technology (IT) specialist? Can we possibly afford to equip our labs in this way, and where would we find the experts to operate them?
Consumers and businesses face similar challenges, and industry has responded by moving IT out of homes and offices to so-called cloud providers (e.g., GMail, Google Docs, Salesforce), slashing costs and complexity. I suggest that by similarly moving research IT out of the lab, we can realize comparable economies of scale and reductions in complexity. More importantly, we can free researchers from the burden of managing IT, giving them back their time to focus on research and empowering them to go beyond the scope of what was previously possible.
I describe work we are doing at the Computation Institute to realize this approach, focusing initially on research data lifecycle management. I present promising results obtained to date and suggest a path towards large-scale delivery of these capabilities.
Presentation by IBM's Martin Nally and Mike O'Rourke from Innovate 2010, including discussion of tool integration challenges and IBM's two-pronged approach to addressing them (OSLC, open linked lifecycle data; Jazz, an ALM services platform).
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
DataWorks Summit
Standard Bank is a leading South African bank with a vision to be the leading financial services organization in and for Africa. We will share our vision, greatest challenges, and most valuable lessons learned on our journey towards enterprise adoption of a big data strategy.
This includes our implementation of: a multi-tenant enterprise data lake, a real-time streaming capability, appropriate data management and governance principles, a data science workbench, and a process for model productionisation to support data science teams across the Group and across Africa and Europe.
Speakers
Zakeera Mahomen, Standard Bank, Big Data Practice Lead
Kristel Sampson, Standard Bank, Platform Lead
Nowadays software systems are essential to the environment of most organizations, and their maintenance is a key point in supporting business dynamics. Thus, reverse engineering legacy systems for knowledge reuse has become a major concern in the software industry. This article, based on a survey of reverse engineering tools, discusses a set of functional and nonfunctional requirements for an effective reverse engineering tool, and observes that current tools only partly support these requirements. In addition, we define new requirements, based on our group’s experience and industry feedback, and present the architecture and implementation of LIFT: a Legacy InFormation retrieval Tool, developed based on these demands. Furthermore, we discuss the compliance of LIFT with the defined requirements. Finally, we applied LIFT in a reverse engineering project on a 210 KLOC NATURAL/ADABAS system of a financial institution and analyzed its effectiveness and scalability, comparing the data with previous similar projects performed by the same institution.
“Semantic Technologies for Smart Services”
diannepatricia
Rudi Studer, Full Professor in Applied Informatics at the Karlsruhe Institute of Technology (KIT), Institute AIFB, presentation “Semantic Technologies for Smart Services” as part of the Cognitive Systems Institute Speaker Series, December 15, 2016.
The title of this talk is a crass attempt to be catchy and topical, by referring to the recent victory of Watson in Jeopardy.
My point (perhaps confusingly) is not that new computer capabilities are a bad thing. On the contrary, these capabilities represent a tremendous opportunity for science. The challenge that I speak to is how we leverage these capabilities without computers and computation overwhelming the research community in terms of both human and financial resources. The solution, I suggest, is to get computation out of the lab—to outsource it to third party providers.
Abstract follows:
We have made much progress over the past decade toward effective distributed cyberinfrastructure. In big-science fields such as high energy physics, astronomy, and climate, thousands benefit daily from tools that enable the distributed management and analysis of vast quantities of data. But we now face a far greater challenge. Exploding data volumes and new research methodologies mean that many more--ultimately most?--researchers will soon require similar capabilities. How can we possibly supply information technology (IT) at this scale, given constrained budgets? Must every lab become filled with computers, and every researcher an IT specialist?
I propose that the answer is to take a leaf from industry, which is slashing both the costs and complexity of consumer and business IT by moving it out of homes and offices to so-called cloud providers. I suggest that by similarly moving research IT out of the lab, we can realize comparable economies of scale and reductions in complexity, empowering investigators with new capabilities and freeing them to focus on their research.
I describe work we are doing to realize this approach, focusing initially on research data lifecycle management. I present promising results obtained to date, and suggest a path towards large-scale delivery of these capabilities. I also suggest that these developments are part of a larger "revolution in scientific affairs," as profound in its implications as the much-discussed "revolution in military affairs" resulting from more capable, low-cost IT. I conclude with some thoughts on how researchers, educators, and institutions may want to prepare for this revolution.
Recommender systems support the decision making processes of customers with personalized suggestions. These widely used systems influence the daily life of almost everyone across domains like ecommerce, social media, and entertainment. However, the efficient generation of relevant recommendations in large-scale systems is a very complex task. In order to provide personalization, engines and algorithms need to capture users’ varying tastes and find mostly nonlinear dependencies between them and a multitude of items. Enormous data sparsity and ambitious real-time requirements further complicate this challenge. At the same time, deep learning has been proven to solve complex tasks like object or speech recognition where traditional machine learning failed or showed mediocre performance.
Explore a use case for vehicle recommendations at mobile.de, Germany’s biggest online vehicle market. Marcel shares a novel regularization technique for the optimization criterion and evaluates it against various baselines. To achieve high scalability, he combines this method with strategies for efficient candidate generation based on user and item embeddings—providing a holistic solution for candidate generation and ranking.
The proposed approach outperforms collaborative filtering and hybrid collaborative-content-based filtering by 73% and 143%, respectively, for MAP@5. It also scales well to millions of items and users, returning recommendations in tens of milliseconds.
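The MAP@5 figures reported above can be made concrete with the standard definition of mean average precision at k. This is a generic textbook sketch, not the code used at mobile.de:

```python
def average_precision_at_k(recommended, relevant, k=5):
    """AP@k for one user: precision at each rank where a relevant item
    appears, averaged over min(len(relevant), k)."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_k(all_recommended, all_relevant, k=5):
    """MAP@k: the mean of AP@k over all users."""
    pairs = list(zip(all_recommended, all_relevant))
    return sum(average_precision_at_k(r, rel, k) for r, rel in pairs) / len(pairs)
```

For example, recommending ["a", "b", "c"] when {"a", "c"} are relevant yields AP@5 = (1/1 + 2/3) / 2 = 5/6, rewarding engines that rank relevant items early.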
Event: O'Reilly Artificial Intelligence Conference, New York, April 18, 2019
Speaker: Marcel Kurovski, inovex GmbH
More tech talks: inovex.de/vortraege
More tech articles: inovex.de/blog
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
Maribel Acosta Deibe
Summary of crowdsourcing studies to assess the quality of knowledge graphs and complete missing values. Results focus on findings over the DBpedia knowledge graph (https://wiki.dbpedia.org/).
Related publications:
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Auer, S., & Lehmann, J. Crowdsourcing Linked Data Quality Assessment. In International Semantic Web Conference (pp. 260-276), 2013.
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Flöck, F., & Lehmann, J. Detecting Linked Data Quality issues via Crowdsourcing: A DBpedia Study. Semantic Web Journal, 9(3), 303-335, 2018.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: A hybrid SPARQL engine to enhance query answers via crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture (p. 11). 2015. Best Student Paper Award.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. Enhancing answer completeness of SPARQL queries via crowdsourcing. Journal of Web Semantics, 45, 41-62, 2017.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: An engine for enhancing answer completeness of SPARQL queries via crowdsourcing. Companion Volume of the Web Conference (pp. 501-505). 2018.
More Related Content
Similar to Adaptive Semantic Data Management Techniques for Federations of Endpoints
HARE: An Engine for Enhancing Answer Completeness of SPARQL Queries via Crowdsourcing
Maribel Acosta Deibe
We propose HARE, a SPARQL query engine that encompasses human-machine query processing to augment the completeness of query answers. We empirically assessed the effectiveness of HARE on 50 SPARQL queries over DBpedia. Experimental results clearly show that our solution accurately enhances answer completeness.
This work was presented at The Web Conference 2018, Journal Track. ACM Open ToC Service: https://dl.acm.org/authorize?N655127
Reference:
Maribel Acosta, Elena Simperl, Fabian Flöck, and Maria-Esther Vidal. 2017. Enhancing answer completeness of SPARQL queries via crowdsourcing. Web Semantics: Science, Services and Agents on the World Wide Web (2017). https://doi.org/10.1016/j.websem.2017.07.001
Diefficiency Metrics: Measuring the Continuous Efficiency of Query Processing...
Maribel Acosta Deibe
During empirical evaluations of query processing techniques, metrics like execution time, time for the first answer, and throughput are usually reported. Albeit informative, these metrics are unable to quantify and evaluate the efficiency of a query engine over a certain time period – or diefficiency – thus hampering the distinction of cutting-edge engines able to exhibit high performance gradually. We tackle this issue and devise two experimental metrics named dief@t and dief@k, which allow for measuring the diefficiency during an elapsed time period t or while k answers are produced, respectively. The dief@t and dief@k measurement methods rely on computing the area under the curve of answer traces, thus capturing the answer concentration over a time interval. We report experimental results of evaluating the behavior of a generic SPARQL query engine using both metrics. Observed results suggest that dief@t and dief@k are able to measure the performance of SPARQL query engines based on both the amount of answers produced by an engine and the time required to generate these answers.
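As a rough illustration of the measurement method described in the abstract (the authors' reference implementation may differ), dief@t can be computed as the area under the step curve of cumulative answers over time:

```python
def dief_at_t(answer_timestamps, t):
    """Area under the answer trace (cumulative answers vs. time) on [0, t].

    `answer_timestamps` is a sorted list of times (in seconds) at which each
    answer was produced. More answers earlier yields a larger area, so a
    higher dief@t indicates better continuous behavior.
    """
    area = 0.0
    count = 0   # answers produced so far
    prev = 0.0  # start of the current flat segment of the step curve
    for ts in answer_timestamps:
        if ts > t:
            break
        area += count * (ts - prev)  # flat segment at the current count
        count += 1
        prev = ts
    area += count * (t - prev)  # tail segment up to time t
    return area
```

For instance, an engine producing answers at t = 1, 2, 3 seconds has dief@t = 6.0 at t = 4, while an engine producing all three answers at t = 3 would score only 3.0 over the same interval.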
HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing
Maribel Acosta Deibe
Best Student Paper Award at the 8th International Conference on Knowledge Capture (K-CAP 2015).
http://tinyurl.com/hare-paper
Abstract:
Due to the semi-structured nature of RDF data, missing values affect the answer completeness of queries posed against RDF. To overcome this limitation, we present HARE, a novel hybrid query processing engine that brings together machine and human computation to execute SPARQL queries. We propose a model that exploits the characteristics of RDF in order to estimate the completeness of portions of a data set. The completeness model, complemented by crowd knowledge, is used by the HARE query engine to decide on the fly which parts of a query should be executed against the data set or via crowd computing. To evaluate HARE, we created and executed a collection of 50 SPARQL queries against the DBpedia data set. Experimental results clearly show that our solution accurately enhances answer completeness.
(The HARE logo is based on artwork by icons8, https://icons8.com/.)
Semantic Data Management in Graph Databases: ESWC 2014 Tutorial
Maribel Acosta Deibe
In this tutorial we present the basics of graph database frameworks and their applicability to semantic data management. The tutorial targets any conference attendee interested in learning about the limited graph-based capabilities of existing RDF engines, existing graph database techniques, and extensions to RDF data management approaches that provide efficient graph-based access to linked data.
The tutorial describes existing approaches to model graph databases and different techniques implemented in RDF and Database engines including their main drawbacks when a large volume of interconnected data needs to be traversed.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open-source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. As a step in this direction, experiments were conducted implementing PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads), whereas the hybrid approach runs certain primitives (e.g., sumAt, multiply) in sequential mode.
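The primitives the notes parallelize can be seen in a plain sequential sketch of one PageRank scheme (illustrative only, in Python rather than the notes' OpenMP/CUDA; the graph and damping value here are ours):

```python
# Sequential sketch of the PageRank primitives: "multiply" scales each
# vertex's contribution by the damping factor, and the per-target
# accumulation is the "sum" (reduce) primitive.

def pagerank(adj, d=0.85, iters=50):
    """adj: dict vertex -> list of out-neighbors."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1.0 - d) / n for v in adj}
        for v, outs in adj.items():
            if outs:
                share = d * rank[v] / len(outs)   # multiply primitive
                for u in outs:
                    nxt[u] += share               # sum (reduce) primitive
            else:
                # Dead end: distribute this vertex's rank evenly.
                for u in adj:
                    nxt[u] += d * rank[v] / n
        rank = nxt
    return rank
```

In the OpenMP setting, the per-vertex loop is what gets distributed across threads; in the hybrid approach some of these primitives stay sequential.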
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfEnterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, are often run over Compressed Sparse Row (CSR), an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Adaptive Semantic Data Management Techniques for Federations of Endpoints
1. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Maribel Acosta1,2, Cosmin Basca3, Gabriela Montoya1
Edna Ruckhaus1, Maria-Esther Vidal1
1 Universidad Simón Bolívar, Venezuela
{macosta,gmontoya,ruckhaus,mvidal}@ldc.usb.ve
2 Institute AIFB, Karlsruhe Institute of Technology, Germany
Maribel.Acosta@aifb.uni-karlsruhe.de
3 Department of Informatics, University of Zurich, Switzerland
basca@ifi.uzh.ch
http://www.ldc.usb.ve/~mvidal/TUTORIAL2012
May 28th, 2012
Introduction
Motivation
Cloud of Linked Data: a large number of datasets.
Data is accessed locally or remotely by RDF engines and SPARQL endpoints.
RDF engines rely for their performance on locally stored physical access and storage structures, but loading the data and their links is not always feasible.
The optimize-then-execute paradigm is usually followed, but it suffers from:
lack of statistics,
unpredictable conditions of remote queries,
changing characteristics.
Techniques are not adaptable to unpredictable data transfers or data availability.
Motivating Example
Clinical trials and drugs used for Breast, Colorectal, Ovarian, and Lung Cancer.
PREFIX linkedct: <http://data.linkedct.org/resource/linkedct/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia: <http://dbpedia.org/property/>
PREFIX elements: <http://purl.org/dc/elements/1.1/>
SELECT DISTINCT ?fn1 ?fn3 ?fn5 ?fn7 ?I ?DB ?ID1 ?ID2 ?L
WHERE {
  ?A1 foaf:page ?fn1. ?A1 linkedct:condition ?A2.
  ?A2 linkedct:condition_name "Colorectal Cancer".
  ?A1 linkedct:intervention ?I.
  ?A3 foaf:page ?fn3. ?A3 linkedct:condition ?A4.
  ?A4 linkedct:condition_name "Breast Cancer".
  ?A3 linkedct:intervention ?I.
  ?A5 foaf:page ?fn5. ?A5 linkedct:condition ?A6.
  ?A6 linkedct:condition_name "Ovarian Cancer".
  ?A5 linkedct:intervention ?I.
  ?A7 foaf:page ?fn7. ?A7 linkedct:condition ?A8.
  ?A8 linkedct:condition_name "Lung Cancer".
  ?A7 linkedct:intervention ?I.
  ?I linkedct:intervention_type "Drug".
  ?I rdf:seeAlso ?DB. ?WK foaf:primaryTopic ?DB.
  ?WK dbpedia:revisionId ?ID1.
  ?WK dbpedia:pageId ?ID2. ?WK elements:language ?L.
}
The query is decomposed into a star-shaped plan: four star-shaped sub-queries against the LinkedCT endpoint, one per condition ("Colorectal Cancer", "Breast Cancer", "Ovarian Cancer", "Lung Cancer"), each of the form

select ?fn1 ?I ?DB
?A1 foaf:page ?fn1.
?A1 linkedct:condition ?A2.
?A2 linkedct:condition_name "Colorectal Cancer".
?A1 linkedct:intervention ?I.
?I rdf:seeAlso ?DB

joined pairwise, plus one sub-query against the DBpedia endpoint that consumes the bindings t(?DB) produced by the join:

select ?ID1 ?ID2 ?L
?WK foaf:primaryTopic t(?DB).
?WK dbpedia:revisionId ?ID1.
?WK dbpedia:pageId ?ID2.
?WK elements:language ?L.

Endpoints may time out before producing any answer.
The query is too complex and needs to be decomposed into simple sub-queries.
Adaptive query processing techniques need to be implemented to incrementally produce answers.
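How a federated engine might combine the binding sets returned by two such sub-queries can be sketched as a hash join on their shared variables (an illustrative sketch of the general idea, not the tutorial's engine; the variable and value names are hypothetical):

```python
# Sketch: joining the solution mappings (bindings) returned by two
# sub-queries on the variables they share, as a federated engine does when
# combining per-endpoint results.

def join_bindings(left, right):
    """left, right: lists of dicts mapping SPARQL variables to values."""
    if not left or not right:
        return []
    shared = set(left[0]) & set(right[0])
    # Build a hash table on the shared variables of the left side...
    table = {}
    for mu in left:
        table.setdefault(tuple(mu[v] for v in sorted(shared)), []).append(mu)
    # ...then probe it with each right-side binding and merge on a match.
    out = []
    for nu in right:
        for mu in table.get(tuple(nu[v] for v in sorted(shared)), []):
            out.append({**mu, **nu})
    return out
```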
Agenda
1 Basic Concepts and Background (20 minutes)
2 Adaptive Query Processing Techniques (30 minutes)
3 RDF Engines for Federations of Endpoints (30 minutes)
4 Questions and Discussion (5 minutes)
5 Coffee Break (30 minutes)
6 Hands-on (90 minutes)
Explanation of existing Benchmarks (10 minutes)
Evaluation (60 minutes)
Discussion of the Observed Results (15 minutes)
7 Summary and Closing (5 minutes)
Background
Introduction: Traditional Data Management Systems [11, 16]
A Database Management System (DBMS) is software designed to:
consistently store and organize data,
provide physical data structures to store large databases,
stage large datasets between main memory and secondary storage,
protect data from inconsistencies,
provide secure and concurrent access to the data,
provide a query language,
support crash recovery,
enforce security and access control.
The Optimize-Then-Execute Paradigm
Traditional Architectures
Query processing is divided into three major steps:
statistics generation,
query optimization,
query execution.
[Figure: a Statistics Generator feeds collected statistics into the Database Catalog, which the Query Optimizer consults before handing the plan to the Query Execution Engine; run statistics flow back into the catalog.]
The Optimize-then-Execute Paradigm
Traditional query processing techniques:
Parse a declarative query.
Generate an intermediate representation of the query.
Produce an efficient logical and physical plan that minimizes disk I/O access.
Execute the query plan without making runtime decisions.
Query Optimizer Architecture[11, 16, 22]
[Figure: query optimizer architecture. The Rewriter feeds the Planner, which draws on the Algebraic Space, the Method-Structure Space, the Cost Model, and the Size-Distribution Estimator, all backed by the Dictionary/Catalog.]
Rewriter
Applies algebraic transformations to the query to produce a more efficient equivalent query; some transformations are flattening out queries, using views, etc. Nested queries are decomposed into query blocks.
Planner
Traverses the space of possible execution plans following a particular search strategy, e.g., dynamic programming or randomized algorithms. Query blocks are optimized one at a time. All available access methods are considered for each relation, and all the ways to join the relations one at a time; permutations of relations and different join methods are considered.
Algebraic Space
Set of algebraic rules that restrict the space of plan transformations and guide the planner into the space of efficient logical plans.
Method-Structure Space
Set of rules that restrict the space of physical implementations of each logical plan and guide the planner into the space of efficient physical plans.
Cost Model
Set of arithmetic rules or statistical techniques to estimate the execution cost and cardinality of logical and physical plans.
Size-Distribution Estimator
Set of arithmetic rules or statistical techniques to estimate the cardinality of the dataset to be accessed as well as its distributions, and the selectivity of a query selection condition.
Dictionary
Set of statistics that characterize the dataset, e.g., the number of different values of an attribute, or of different subjects, properties, or values; distributions; etc. The stored statistics depend on the type of size-distribution estimator.
Join Ordering: Search Space
[Figure: three join trees over EMP, DEPT, ACCOUNT, and BANK with join predicates dno=dno, ano=ano, and bno=bno: a left-linear plan, a right-linear plan, and a bushy plan.]
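The size of the left-linear search space is easy to see in code (an illustrative sketch: each left-deep order is one permutation of the relations, and each would still be combined with the available join methods and costed):

```python
# Sketch: the planner's left-deep (left-linear) search space is exactly the
# set of permutations of the relations, so n relations give n! join orders.
from itertools import permutations

def left_deep_orders(relations):
    """Enumerate every left-deep join order over the given relations."""
    return list(permutations(relations))

orders = left_deep_orders(["EMP", "DEPT", "ACCOUNT", "BANK"])
# 4 relations -> 4! = 24 left-deep join orders to cost and compare.
```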
Index Nested Loop Join
Indices exist on the join attributes of the inner relation of the join.
Cost depends on the type of index and on the clustering. To find matching tuples: clustered index, 1 I/O; unclustered, up to 1 I/O per matching tuple.
[Example: index nested loop join by SSN between a student table (SSN, NAME: 230 John Smith; 120 Mary Williams; 100 Jose Gomez; 99 Peter Wigle) and an enrollment table (SSN, COURSE, GRADE) with an index on SSN.]
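A minimal sketch of an index nested loop join, with a Python dict standing in for the index on the inner relation's join attribute (the rows are a subset of the slide's example tables):

```python
# Sketch of an index nested loop join: for each outer tuple, probe the
# index on the inner relation's join attribute (here SSN).

def index_nested_loop_join(outer, inner_index):
    """outer: list of (ssn, name); inner_index: ssn -> list of (course, grade)."""
    for ssn, name in outer:
        for course, grade in inner_index.get(ssn, []):  # index lookup
            yield (ssn, name, course, grade)

students = [(230, "John Smith"), (120, "Mary Williams"), (99, "Peter Wigle")]
index = {230: [("CS6019", "B")], 120: [("CS6019", "B"), ("CS6020", "B")]}
rows = list(index_nested_loop_join(students, index))
# Peter Wigle (99) has no matching enrollment, so he produces no row.
```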
Sort Merge Join
Sorts the input tables on the join attributes, scans the sorted tables, and merges on the join attributes. Cost is optimal in case the input tables are already sorted.
[Example: sort-merge join by SSN between the student table (230 John Smith; 120 Mary Williams; 100 Jose Gomez; 99 Peter Wigle) and the enrollment table (SSN, COURSE, GRADE). Both tables are first sorted by SSN; merging then pairs each enrollment row with the matching student name. SSN 99 (Peter Wigle) has no enrollments and produces no output.]
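The sort and merge steps can be sketched as follows (an illustrative sketch; it handles duplicate keys on the inner input, as in the slide's enrollment table):

```python
# Sketch of a sort-merge join: sort both inputs on the join key (SSN),
# then merge, pairing up rows with equal keys.

def sort_merge_join(left, right):
    """left: list of (ssn, name); right: list of (ssn, course, grade)."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            # Emit every right row sharing this key with the current left row.
            k = j
            while k < len(right) and right[k][0] == left[i][0]:
                out.append((left[i][0], left[i][1], right[k][1], right[k][2]))
                k += 1
            i += 1
    return out
```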
Hash Join
A two-phase physical operator:
Partition: the input tables are partitioned using a hash function h1; only tuples in the same partitions of the two tables are joined.
Probing: each partition of the outer table is loaded into a hash table using a hash function h2; the corresponding partition of the inner table is scanned to search for matches.
[Figure: the input tables flow through the partition phase into partitions, then through the probing phase via hash tables to the joined tuples of A Join B.]
Example Hash Join
[Figure: partition phase. The original relations are read from disk and hashed by h1 into B-1 partitions, which are written back to disk. Probing phase: each partition of one input is loaded into a hash table via h2 and probed with the matching partition of the other input, producing the join results.]
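The two phases can be sketched as follows (an illustrative in-memory sketch; Python's built-in hashing stands in for h1 and h2, and real engines spill the partitions to disk):

```python
# Sketch of a two-phase (partitioned) hash join: h1 partitions both inputs;
# then each outer partition is built into a hash table and probed with the
# matching inner partition.

def partitioned_hash_join(outer, inner, key, partitions=4):
    h1 = lambda row: hash(row[key]) % partitions   # partition function
    buckets_o = [[] for _ in range(partitions)]
    buckets_i = [[] for _ in range(partitions)]
    for row in outer:                              # partition phase
        buckets_o[h1(row)].append(row)
    for row in inner:
        buckets_i[h1(row)].append(row)
    out = []
    for po, pi in zip(buckets_o, buckets_i):       # probing phase
        table = {}
        for row in po:                             # build with h2 (dict hashing)
            table.setdefault(row[key], []).append(row)
        for row in pi:                             # probe
            for match in table.get(row[key], []):
                out.append({**match, **row})
    return out
```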
Execution Engine: Blocking Operators
Require more than one pass over the input to create the output, and hold intermediate state to compute the output.
Some blocking operators:
Project: must sort and then merge the input.
Sort Merge Join: must sort the inputs and then merge them.
Hash Join: must partition the inputs, temporarily store the input in a hash table, and probe the hashed data many times.
[Figure: input data flows through a blocking operator, which materializes intermediate structures.]
Example: Blocking Operators
[Figure: a plan over tables A, B, C, and D. A Merge Join over A and B requires two SORTs; its output feeds a Hash Join with C, whose hash table on C.a is a materialization point; an Index Nested Loop Join with D completes the plan. The SORTs and the hash table are the blocking operators and materialization points.]
Execution Engine: Pipelined Plans
Executes all operators in the plan in parallel.
Operators, or iterators, consume tuples from their children and output produced tuples to their parents.
Tuples flow from the leaf nodes towards the root of the tree.
Iterators have the following functions:
Open prepares an operator for producing data.
Next produces an item.
Close performs final house-keeping.
Examples of Iterator Functions [11]
Scan: Open opens the input; Next calls next on the input; Close closes the input.
Select: Open opens the input; Next calls next on the input until an item qualifies; Close closes the input.
Hash Join: Open allocates the hash directory, opens the left (partition) input, builds the hash table by calling next on the partition input, closes the partition input, and opens the right (probe) input; Next calls next on the probe input until a match is found; Close closes the probe input and deallocates the hash directory.
Merge Join: Open opens both inputs; Next gets the next item from the input with the smaller key until a match is found; Close closes both inputs.
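The open/next/close interface can be sketched with a Scan and a Select iterator (an illustrative sketch over Python lists, not a real engine's operators):

```python
# Sketch of the iterator (open/next/close) model: Select pulls from its
# child Scan one item at a time, so tuples flow without materialization.

class Scan:
    def __init__(self, rows): self.rows = rows
    def open(self): self.pos = 0
    def next(self):
        if self.pos >= len(self.rows):
            return None                      # end of input
        row = self.rows[self.pos]
        self.pos += 1
        return row
    def close(self): self.rows = None

class Select:
    def __init__(self, child, pred): self.child, self.pred = child, pred
    def open(self): self.child.open()
    def next(self):
        # Call next on the input until an item qualifies.
        while (row := self.child.next()) is not None:
            if self.pred(row):
                return row
        return None
    def close(self): self.child.close()

plan = Select(Scan([1, 2, 3, 4, 5]), lambda x: x % 2 == 0)
plan.open()
evens = []
while (row := plan.next()) is not None:
    evens.append(row)
plan.close()
```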
Limitations of the Optimize-Then-Execute Paradigm
Three major steps:
Off-line statistics generation to collect cardinalities, value frequencies, and costs to access data.
Cost- and heuristic-based query optimization to identify "good-enough" plans, under assumptions about the different values of the attributes, the availability of the data, the cardinality of operators, and the behavior of data sources.
Query execution: pipelined operators are processed as output is produced by their sub-operators; materialized operators are completely executed before firing their parents for processing (blocked iterator model).
Summary: The Optimize-Then-Execute Paradigm
Relies on knowledge about the cost and cardinality of the accessed datasets.
Effectively achieves the generation of query execution plans where the size of intermediate results is minimized.
Does not scale due to lack of statistics and unpredictable data transfers or data availability.
The Adaptive Query Processing Paradigm
Adaptive Query Processing [4, 9, 18]
Adaptivity is needed due to:
Misestimated or missing statistics.
Unexpected correlations.
Unpredictable costs.
Dynamically changing data, workloads, and source availability.
Changing rates at which tuples arrive from the sources: initial delays, slow delivery, bursty arrivals.
Iterative environments.
Adaptive systems are able to change their behavior by learning the behavior of data providers: they receive information from the environment, use up-to-date information to change their behavior, and keep iterating over time to adapt their behavior to the environment conditions.
Adaptive Query Processing
[Figure: the typical optimize-then-execute architecture, with a Statistics Generator feeding collected statistics into the Database Catalog, which the Query Optimizer consults before handing the plan to the Query Execution Engine; run statistics flow back into the catalog.]
Agenda
Types of Adaptation
Adaptivity in Relational Databases
Adaptivity in the Web of Data
Types of Adaptive-Based Solutions
Adaptation level:
Source selection: searching strategies to select the best sources for answering a query according to real-time source conditions.
Query execution: strategies to adapt query processing to environment conditions.
Granularity of the adaptation:
Fine-grained: adaptation of small processes, e.g., on a per-tuple or per-source basis.
Coarse-grained: attempted for large processes, e.g., several queries are executed under certain assumptions about the conditions of the environment.
Adaptive Query Processing Strategies
Plan-based systems adapt to environment conditions by extending the optimize-then-execute paradigm.
Adaptive join processing (intra-operator): adaptivity is performed at the tuple level, and query operators are able to adapt to environment changes, even in the context of a fixed query plan.
Non-pipelined query execution (inter-operator re-optimization): sub-queries against remote data sources are re-scheduled based on uncertainties in the execution cost of the remote access, the size of the materialized intermediate results, and unexpected delays.
Inter-Operator Re-optimization [4, 17, 21, 27]
Ingress [30] interleaves sub-plan generation with plan execution; for each table in a query, the smallest tables are selected first until plans for all the tuples and tables are considered.
Dynamic Query Planning [7] defines off-line a set of constraints that optimal plans must meet; at execution time, plans that meet the constraints are generated based on dataset conditions.
DEC RDB [2] implements competition, an adaptive strategy that generates and runs multiple plans in parallel for a while; after a short period of time it selects the most promising ones and discards the rest.
Query Scrambling [27, 28] starts with a plan and may change it on the fly, depending on data transfer delays or misestimations; operators can be ordered differently, or tables not directly combined can be joined by a new operator during re-optimization.
Routing techniques avoid traditional optimizers and generate plans for each tuple during runtime by sending tuples through the pool of eligible operators; Eddy [3, 25], M-Join [29], and STAIRs [8] follow this adaptive strategy. Fine-grained adaptivity is achieved.
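Eddy-style per-tuple routing can be sketched as follows (a toy illustration of the idea, not the published Eddy operator): each tuple is routed through the operator that currently looks most selective, so operator ordering adapts as statistics accumulate.

```python
# Toy sketch of eddy-style routing over a pool of filter operators: the
# routing order is recomputed per tuple from observed selectivities, so
# the most selective (lowest pass-rate) filter is tried first.

def eddy(tuples, filters):
    seen = {i: 1 for i in range(len(filters))}     # tuples routed to op i
    passed = {i: 1 for i in range(len(filters))}   # tuples that survived op i
    out = []
    for t in tuples:
        # Route through the currently most selective operator first.
        order = sorted(range(len(filters)), key=lambda i: passed[i] / seen[i])
        ok = True
        for i in order:
            seen[i] += 1
            if filters[i](t):
                passed[i] += 1
            else:
                ok = False
                break                              # drop the tuple early
        if ok:
            out.append(t)
    return out
```

Because the filters here are pure predicates, the output is the same for any routing order; what adaptivity changes is how much work is spent per dropped tuple.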
Inter-Operator Re-optimization [4, 17, 21, 27]
Detect and correct situations where optimizers select sub-optimal plans that could be improved if accurate statistics were considered.
Existing approaches extend statistics tracking with a log of the estimates considered by the optimizer to generate the query plan, and of the system conditions during query execution.
Physical operators are implemented to track statistics and fire re-optimizations.
Re-optimization must ensure that a new plan will not duplicate or miss outputs, and must reuse previous computations.
It should avoid thrashing, i.e., spending most of the time and resources monitoring operators.
Query Scrambling [27, 28]
Query scrambling is a query execution method that reacts to data arrival delays. First, a plan is generated; during query execution the method reacts to misestimations.
Existing operators can be rescheduled, retaining one part of the plan and scheduling another.
Query operators are divided into runnable and blocked.
Maximal runnable sub-trees are identified, i.e., sub-trees where all the operators are runnable.
Executed sub-trees are materialized until the delayed sources become available and the original plan can be resumed.
The query optimizer is invoked to identify non-delayed sub-trees that can be materialized first.
New operators can be introduced to join tables that were not directly joined in the original plan.
49. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
The Adaptive Query Processing Paradigm
Query Scrambling[27, 28]
Query Scrambling is an execution query
method that reacts to data arrival delays.
First, a plan is generated, and reacts during
query execution to misestimations.
Existing operators can be rescheduled,
retaining one part of the plan, and scheduling
another.
Query operators are divided into runnable and
blocked.
Maximal runnable sub-trees are identified, i.e.,
sub-tree where all the operators are runnable.
Executed sub-trees are materialized until the
delayed sources become available and the
original plan can resumed.
Query optimizer is invoked to identify
non-delayed sub-trees that can be
materialized first.
New operators can be introduced to join
tables that were not directly joined in the
original plan.
A B C D E F
1
3
5
4
2
Source A is delayed.
50. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
The Adaptive Query Processing Paradigm
Query Scrambling[27, 28]
[Plan diagram: sub-tree 4 is materialized.]
51. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
The Adaptive Query Processing Paradigm
Query Scrambling[27, 28]
[Plan diagram: a new operator (6) is created between node 3 and D, E, and F.]
52. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
The Adaptive Query Processing Paradigm
Query Scrambling[27, 28]
[Plan diagram: the new operator is executed.]
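The rescheduling step just illustrated can be sketched in a few lines. This is a hypothetical toy model, not the algorithm of [27, 28]: the plan shape, node names, and functions below are all illustrative. It shows how operators over delayed sources are classified as blocked, and how the maximal runnable sub-trees are collected for early materialization.

```python
# Toy sketch of Query Scrambling's classification step (illustrative
# names and tree shape): a sub-tree is runnable if none of its leaf
# sources is delayed; maximal runnable sub-trees can be materialized
# while the delayed sources are awaited.

class Node:
    def __init__(self, name, children=(), source=None):
        self.name = name
        self.children = list(children)
        self.source = source          # leaf nodes reference a data source

def runnable(node, delayed):
    """A sub-tree is runnable if none of its leaves is a delayed source."""
    if node.source is not None:
        return node.source not in delayed
    return all(runnable(c, delayed) for c in node.children)

def maximal_runnable_subtrees(node, delayed):
    """Collect sub-trees that are runnable and whose parent is not."""
    if runnable(node, delayed):
        return [node]
    trees = []
    for child in node.children:
        trees.extend(maximal_runnable_subtrees(child, delayed))
    return trees

# An illustrative plan over sources A..F, with source A delayed.
leaves = {s: Node(s, source=s) for s in "ABCDEF"}
j1 = Node("1", [leaves["A"], leaves["B"]])
j3 = Node("3", [j1, leaves["C"]])
j4 = Node("4", [leaves["D"], leaves["E"]])
j2 = Node("2", [j4, leaves["F"]])
root = Node("5", [j3, j2])

ready = maximal_runnable_subtrees(root, delayed={"A"})
print([n.name for n in ready])
# ['B', 'C', '2']
```

The sub-tree rooted at operator 2 (sources D, E, F) contains no delayed source, so it can be materialized first, as on the slides above.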
53. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
The Adaptive Query Processing Paradigm
Intra-Operator Techniques
Adapt plans to environment changes, even in the presence of a fixed plan.
Results are produced incrementally as they become available.
Symmetric Hash Join [9] and XJoin [26] overcome the blocking limitation of Hash Join and hide delays, producing answers even if both input data sources become blocked.
Fine-grained adaptivity is achieved at each operator.
54. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
The Adaptive Query Processing Paradigm
Example: Blocking Operators
[Plan diagram: a pipeline of Hash Join, Merge Join (with SORT inputs), and Index Nested Loop Join over inputs A, B, C, and D; the C.a hash table and the sorts are materialization points of blocking operators.]
55. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
The Adaptive Query Processing Paradigm
Symmetric Hash Join[9]
[Diagram: tuples from tables A and B are hashed into partitions during the partition phase; the probing phase then probes the hash tables to produce the joined tuples of A Join B.]
A traditional Hash Join blocks the probing phase until the partition phase has completely finished.
56. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
The Adaptive Query Processing Paradigm
Symmetric Hash Join[9]
[Diagram: each input (A and B) maintains its own hash table; arriving tuples are partitioned into their own table and probed against the other, producing joined tuples continuously.]
Symmetric Hash Join builds hash tables on both inputs; when a tuple arrives, it is inserted into its own hash table and probed against the other table.
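The operator just described can be sketched in a few lines of Python. This is a minimal single-key sketch, not a full implementation: the interleaved list stands in for two streaming inputs, and memory management (as in XJoin) is omitted.

```python
# Minimal sketch of a Symmetric Hash Join [9] on one join key: every
# arriving tuple is inserted into its own hash table and immediately
# probed against the opposite one, so joined tuples are produced
# without waiting for either input to finish.

from collections import defaultdict

def symmetric_hash_join(arrivals):
    """arrivals: iterable of (side, key, value) with side in {'A', 'B'}."""
    tables = {"A": defaultdict(list), "B": defaultdict(list)}
    for side, key, value in arrivals:
        other = "B" if side == "A" else "A"
        tables[side][key].append(value)          # insert into own table
        for match in tables[other][key]:         # probe the other table
            left, right = (value, match) if side == "A" else (match, value)
            yield key, left, right               # emit as soon as it matches

# Interleaved arrivals simulate two remote sources sending data.
stream = [("A", 1, "a1"), ("B", 1, "b1"), ("B", 2, "b2"), ("A", 2, "a2")]
print(list(symmetric_hash_join(stream)))
# [(1, 'a1', 'b1'), (2, 'a2', 'b2')]
```

Note how the first answer appears after only two arrivals; a traditional Hash Join would emit nothing until the build input completed.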
57. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptivity in the Web of Data
58. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptivity in the Web of Data
Adaptive Query Processing for the Web of Data
Adaptivity during Source Selection relies on source or endpoint descriptions and on statistics that describe the data stored at the endpoints.
Adaptivity at the Query Execution Level: approaches implement intra- and inter-operator query processing techniques to adapt execution to inaccurate statistics, unpredictable network conditions, query complexity, dynamic data, and changing workloads.
59. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptivity in the Web of Data
Evolution of Adaptive Approaches in the Web of Data
Summary
Coarse-Grained Granularity:
Plan-based techniques, e.g., ARQ, DARQ.
Adaptive source-selection techniques, e.g., FedX, SPLENDID.
Fine-Grained Granularity:
Adaptive query processing techniques, e.g., ANAPSID, Avalanche, Hartig et al. 09, Ladwig and Tran 10.
60. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptivity in the Web of Data
Adaptivity during Source Selection
Coarse-Grained Approaches
Harth et al. [12]: a QTree-based index structure stores data source statistics collected in a pre-processing stage; these statistics are used for source ranking and to determine the best sources to answer a join query.
Li and Heflin [20]: a tree structure supports the integration of data from multiple heterogeneous sources; nodes are annotated with source join selectivities and used to select the corresponding datasets.
Finer granularity is achieved whenever the statistics are re-computed frequently.
61. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptivity in the Web of Data
Fine-Grained Approaches to Traverse the Web of Data
Hartig et al. [13, 14, 15]: source selection and link traversal are interleaved at query execution time. A non-blocking iterator model is used to traverse relevant links.
The approach relies on an asynchronous pipeline of iterators executed in a heuristically determined order, e.g., the most selective iterators are executed first [14].
The query engine adapts the execution to source availability by detecting on the fly whenever a dereferenced dataset stops responding; iterators can be cancelled, and other requests are submitted to live datasets.
Fine-grained adaptivity is achieved during source selection and link traversal.
62. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptivity in the Web of Data
Combined-Granularity Approaches to Traverse the Web of Data
Ladwig and Tran [19]:
Coarse-grained granularity: aggregated indexes keep data distributions and are used for source selection.
Fine-grained granularity: Symmetric Hash Joins incrementally produce answers even when sources become blocked.
Adaptivity is managed at different levels and granularities during link traversal, but these approaches are not tailored to access data from federations of endpoints.
63. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
SPARQL Endpoints
SPARQL endpoints implement the SPARQL protocol and enable users to query particular datasets or to dereference Linked Data.
They were developed for very lightweight use.
Executions against several endpoints can be unsuccessful because of endpoint timeouts, unpredictable data transfers, and varying endpoint availability.
64. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
SPLENDID: SPARQL Endpoint Federation Exploiting VoID Descriptions [10]
Statistical information stored in VoID descriptions is used to perform source selection in a first stage.
Sources are contacted during query execution to refine the preliminary source selection.
Pattern grouping is performed for sets of triple patterns that share the same unique endpoint, and for owl:sameAs triples that share an unbound subject variable with other triple patterns.
Cost-based join ordering using dynamic programming determines the best plan, giving preference to bushy tree plans.
Bind joins are used to reduce network overhead.
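The bind-join idea can be sketched as follows. This is an illustrative mock-up, not SPLENDID's implementation: the endpoints are in-memory functions, and a real engine would rewrite the SPARQL subquery with the bindings (e.g., via VALUES) and send it over HTTP.

```python
# Illustrative bind-join sketch: instead of shipping whole tables, the
# bindings produced by the left side are sent along with the subquery
# to the right endpoint, which returns only matching rows.

def bind_join(left_results, right_endpoint, join_var):
    """Probe the right endpoint once with the block of left bindings."""
    bindings = [row[join_var] for row in left_results]
    right_rows = right_endpoint(join_var, bindings)   # one bound request
    index = {}
    for row in right_rows:
        index.setdefault(row[join_var], []).append(row)
    for left in left_results:
        for right in index.get(left[join_var], []):
            yield {**left, **right}

# Mocked endpoint: returns rows only for the requested drug bindings.
def mock_chebi_endpoint(var, values):
    data = {"d1": {"drug": "d1", "image": "img1"},
            "d3": {"drug": "d3", "image": "img3"}}
    return [data[v] for v in values if v in data]

left = [{"drug": "d1", "name": "aspirin"}, {"drug": "d2", "name": "x"}]
print(list(bind_join(left, mock_chebi_endpoint, "drug")))
# [{'drug': 'd1', 'name': 'aspirin', 'image': 'img1'}]
```

The network saving comes from the single bound request: only rows joining with the left bindings cross the network, rather than the endpoint's full relation.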
65. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
SPLENDID
Example: drugs and their components’ URL and image
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX purl:<http://purl.org/dc/elements/1.1/>
PREFIX bio2rdf:<http://bio2rdf.org/ns/bio2rdf#>
SELECT $drug $keggUrl $chebiImage WHERE {
$drug rdf:type drugbank:drugs .
$drug drugbank:keggCompoundId $keggDrug .
$drug drugbank:genericName $drugBankName .
$keggDrug bio2rdf:url $keggUrl .
$chebiDrug purl:title $drugBankName .
$chebiDrug bio2rdf:image $chebiImage .
}
70. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
SPLENDID
Execution plan [diagram]
71. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
Adaptivity at Endpoint Selection Level
Fine-Grained Granularity
FedX [24]:
No knowledge about mappings or statistics is maintained.
Endpoints are consulted on the fly to determine whether a predicate can be answered.
A cache is used to record endpoint predicates.
Queries are divided into exclusive groups, i.e., triple patterns that can be executed by only one endpoint.
Groups are ordered heuristically.
Fine-grained granularity at the source selection level, while coarse-grained granularity during plan generation.
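Exclusive grouping can be sketched directly from a source-selection result. The names below are illustrative, not FedX's API: given a mapping from triple patterns to their selected endpoints, patterns whose only endpoint coincides form one group that can be shipped as a single subquery.

```python
# Sketch of FedX-style exclusive grouping (illustrative names): triple
# patterns answerable by exactly one endpoint are grouped per endpoint
# so each group is sent to that endpoint as one subquery.

from collections import defaultdict

def exclusive_groups(selection):
    """selection: dict mapping triple pattern -> set of endpoints."""
    groups = defaultdict(list)
    rest = []
    for pattern, endpoints in selection.items():
        if len(endpoints) == 1:
            groups[next(iter(endpoints))].append(pattern)
        else:
            rest.append(pattern)       # handled pattern-by-pattern
    return dict(groups), rest

selection = {
    "?drug drugbank:genericName ?name": {"drugbank"},
    "?drug drugbank:keggCompoundId ?kegg": {"drugbank"},
    "?kegg bio2rdf:url ?url": {"kegg", "chebi"},
}
groups, rest = exclusive_groups(selection)
print(groups)
print(rest)
```

Here both DrugBank patterns land in one exclusive group, so only one request goes to that endpoint; the multi-endpoint pattern stays outside the groups.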
72. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
FedX
Example: drugs and their components’ URL and image
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX purl: <http://purl.org/dc/elements/1.1/>
PREFIX bio2rdf: <http://bio2rdf.org/ns/bio2rdf#>
SELECT $drug $keggUrl $chebiImage WHERE {
$drug rdf:type drugbank:drugs .
$drug drugbank:keggCompoundId $keggDrug .
$drug drugbank:genericName $drugBankName .
$keggDrug bio2rdf:url $keggUrl .
$chebiDrug purl:title $drugBankName .
$chebiDrug bio2rdf:image $chebiImage .
}
77. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
Execution plan [diagram]
78. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
Adaptivity at Query Execution Level
Coarse-Grained Granularity
SPARQL-DQP [6] and ARQ 1 exploit information in SPARQL 1.1 queries to decide where the subqueries of triple patterns will be executed. Both rely on statistics or heuristics during query optimization, thus reaching coarse-grained adaptation at the plan level.
In FedX [24], no adaptation is performed during query processing.
Finer granularity is achieved whenever the statistics are re-computed frequently.
1
http://jena.sourceforge.net/ARQ
79. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
Adaptivity at Query Execution Level
Fine-Grained Granularity
AVALANCHE [5]:
An inter-operator solution that heuristically selects the most promising plans on the fly.
Statistics about cardinalities and data distribution are considered to identify potentially good plans.
A competition strategy: the top-k plans are executed in parallel, and the output of a query corresponds to the answers produced by the most promising plan(s) in a given period of time.
Fine-grained granularity: the decision to completely execute the top-k plan(s) is taken on the fly, based on the current conditions of the environment.
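The competition strategy can be sketched with a thread pool and a time budget. Everything here is a hypothetical mock-up: the plans, delays, and budget are illustrative stand-ins, not Avalanche's actual scheduler or cost model.

```python
# Hedged sketch of a top-k plan competition: the k most promising plans
# run in parallel, and the answers of plans finishing within the time
# budget (one of the stopping criteria) are returned.

import concurrent.futures
import time

def run_competition(plans, k, budget):
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=k) as pool:
        futures = [pool.submit(plan) for plan in plans[:k]]
        # Stopping criterion: a timeout on the whole competition.
        done, _ = concurrent.futures.wait(futures, timeout=budget)
        for future in done:
            results.extend(future.result())
        # (shutdown still waits for stragglers here; a real engine
        # would cancel the losing plans instead.)
    return results

def fast_plan():          # hypothetical plan over responsive endpoints
    time.sleep(0.01)
    return ["answer-1"]

def slow_plan():          # hypothetical plan stalled by a slow endpoint
    time.sleep(1.0)
    return ["answer-2"]

print(run_competition([fast_plan, slow_plan], k=2, budget=0.3))
# ['answer-1']
```

Only the plan that beats the budget contributes answers, mirroring the "most promising plan(s) in a given period of time" behavior described above.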
80. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
Adaptivity at Query Execution Level
Fine-Grained Granularity
Avalanche – execution model
SELECT ?name ?title WHERE {
?paper dblp:has-title ?title .
?paper dblp:has-author ?author .
?author dblp:has-name ?name .
}
[Diagram: given a query, Avalanche finds endpoints through an endpoints directory or search engine, requests statistics, and performs a distributed join across SPARQL endpoints on the WWW.]
Borrows from: P2P systems and the Web.
No completeness guarantees; the first K results are obtained iteratively (at plan level), given stopping criteria: timeout, relative saturation, or a LIMIT modifier.
81. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
Adaptivity at Query Execution Level
Fine-Grained Granularity
Avalanche – architecture
[Architecture diagram:
1] Source Discovery phase: an AVALANCHE endpoints search engine, e.g., http://void.rkbexplorer.com/.
2] Statistics Gathering phase: Query Parser, Statistics Requester, and Source Selector.
3] Query Planning and Execution phase: Plan Generator, Plans Queue, parallel Executor/Materializer pairs, Finished Plans Queue, Results Queue, and Query Stopper.]
Optimization is concomitant with query execution.
Scheduling is dynamic between plans (highlighted); results come in out of order.
Plan space: |Hosts| ∗ |SubQueries| ∗ PlanDepth.
Multi-path depth-first heuristic search based on a parametric cost model.
82. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
Adaptivity at Query Execution Level
Fine-Grained Granularity
Avalanche – plan execution
[Plan-execution diagram over Linked Movie Database and DBpedia endpoints:
SubQuery 1 (s1): ?sameFilm dbpedia:hasPhotoCollection ?photoCollection . ?sameFilm dbpedia:studio "Producers Circle" .
SubQuery 2 (s2): ?film dc:title ?title . ?film movie:actor ?actor . ?film owl:sameAs ?sameFilm .
SubQuery 3 (s3): ?actor a foaf:Person . ?actor movie:actor_name ?name .
Host A: 1) Join(s1, s2); 2) R1 = Execute(s1); 3) Send(R1); 10) Update(s2, s1); 12) Send(R2); 13) R1 = Filter(R1, R2).
Host B: 4) FR2 = ExecuteFilter(R1); 5) Join(s2, s3); 6) Send(FR2); 11) R2 = Filter(FR2).
Host C: 7) FR3 = ExecuteFilter(R3); 8) Update(s3, s2); 9) Send(FR3).
Join variables: ?sameFilm and ?actor.]
Left-linear execution strategy; not adaptive.
83. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
Adaptivity at Query Execution Level
Fine-Grained Granularity
Avalanche – plan execution
[Same plan-execution diagram as on the previous slide.]
Left-linear execution strategy; not adaptive; could benefit from adaptive execution . . .
84. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
ANAPSID: An Adaptive Query Processing Engine for
Federation of Endpoints [1]
ANAPSID – Architecture
[Architecture diagram: a SPARQL 1.1 query enters the ANAPSID mediator; the Query Planner (Query Decomposer and Query Optimizer, guided by schema alignments) produces an adaptive plan of SPARQL 1.0 queries that the Adaptive Query Engine executes against the endpoints.]
Decomposer: rewrites the query into sub-queries.
Optimizer: generates an optimal bushy tree execution plan.
Query Engine: implements inter- and intra-operator techniques to perform fine-grained adaptivity.
85. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
ANAPSID: An Adaptive Query Processing Engine for
Federation of Endpoints [1]
Adaptivity is performed at both the source selection and query execution levels.
Endpoint descriptions are used to select endpoints that can answer a triple pattern; triple patterns that can be executed by the same endpoint are grouped together; endpoints may be contacted on the fly to decide whether they can answer a triple pattern.
Bushy-tree plans are generated to reduce the size of intermediate results and the number of Cartesian products.
Non-blocking operators efficiently retrieve data from endpoints and keep producing new answers even when endpoints become blocked; agJoin, agUnion, and agOptional provide adaptive implementations of SPARQL logical operators.
Fine-grained granularity during query execution, and medium granularity during endpoint selection.
86. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
ANAPSID Adaptive Join Operator
The operators implement main-memory replacement policies to flush previously computed matches to secondary memory, ensuring no duplicates are generated.
Each operator maintains a data structure called a Resource Join Tuple (RJT) that records, for each instantiation R of the join variable(s), the tuples that have already matched: (R, {TR1, TR2, ..., TRn}).
87. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Adaptive Query Processing for Federations of Endpoints
ANAPSID Adaptive Join Operator
The Adaptive Join Operator works in three stages:
The first stage is performed while at least one endpoint sends data. If main memory becomes full, Resource Join Tuples are flushed to secondary memory.
The second stage is performed whenever both endpoints are blocked: Resource Join Tuples in main memory are joined with Resource Join Tuples in secondary memory.
The third stage is fired when both endpoints finish sending data.
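The first stage of the RJT bookkeeping can be sketched as follows. This is a simplified toy in the spirit of ANAPSID's agJoin, not its implementation: flushing to secondary memory and the blocked/terminated stages are omitted, and all names are illustrative.

```python
# Toy sketch of RJT-based joining (first stage only): an RJT maps a
# join-variable instantiation R to its already-matched tuples,
# (R, {TR1, ..., TRn}).

def rjt_insert_and_probe(rjts, side, resource, incoming):
    """While an endpoint sends data: insert the incoming tuple into its
    side's RJT and probe the opposite side's RJT for matches."""
    own, other = rjts[side], rjts[1 - side]
    own.setdefault(resource, []).append(incoming)
    return [(t, incoming) if side == 1 else (incoming, t)
            for t in other.get(resource, [])]

rjts = ({}, {})   # one RJT table per input endpoint
answers = []
# Tuples arrive interleaved from two endpoints sharing one join variable.
answers += rjt_insert_and_probe(rjts, 0, "drug1", ("drug1", "aspirin"))
answers += rjt_insert_and_probe(rjts, 1, "drug1", ("drug1", "img1"))
print(answers)
# [(('drug1', 'aspirin'), ('drug1', 'img1'))]
```

Because probing happens on every arrival, answers are produced as soon as both sides have seen a resource, regardless of which endpoint delivered it first.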
88. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Hands-On Section
FedBench Datasets and Queries
Datasets (endpoint name, #triples):
NY Times (News): 314k
LinkedMDB (Movies): 6.14M
Jamendo (Music): 1.04M
Geonames (Geography): 7.98M
SW Dog Food (SW): 84k
KEGG (Chemicals): 10.9M
Drugbank (Drugs): 517k
ChEBI (Compounds): 4.77M
SP2B-10M (Bibliographic): 10M
DBpedia subset: Infobox Types 5.49M, Infobox Properties 10.80M, Titles 7.33M, Article Categories 10.91M, Images 3.88M, SKOS Categories 2.24M, Other 2.45M
Queries:
Cross Domain: CD1-CD7.
Linked Data: LD1-LD11.
Life Science: LS1-LS7.
Complex Queries: C1-C5.
89. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Hands-On Section
Delays Configuration
Message Size: 16KB
No Delay: perfect network.
Gamma 1 : fast network with a gamma distribution (α = 1, β = 0.3) of
response latency resulting in an average latency of 0.3 seconds.
Gamma 2: medium fast network with a gamma distribution (α = 3, β = 1.0)
of response latency resulting in an average latency of 3 seconds.
Gamma 3: slow network with a gamma distribution (α = 3, β = 1.5) of
response latency resulting in an average latency of 4.5 seconds per message.
[Plot: probability density functions of Gamma 1 (α = 1, β = 0.3), Gamma 2 (α = 3, β = 1.0), and Gamma 3 (α = 3, β = 1.5).]
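The delay configurations above can be reproduced with the standard library. This is a sketch of the injection idea, not the hands-on harness itself: each sample mean approaches the gamma mean α · β (0.3 s, 3 s, and 4.5 s respectively).

```python
# Sketch of gamma-distributed per-message latencies matching the three
# configurations above (mean of a gamma distribution is alpha * beta).

import random

def sample_delays(alpha, beta, n=50_000, seed=42):
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, beta) for _ in range(n)]

for name, alpha, beta in [("Gamma 1", 1, 0.3),
                          ("Gamma 2", 3, 1.0),
                          ("Gamma 3", 3, 1.5)]:
    delays = sample_delays(alpha, beta)
    print(name, "mean latency:", round(sum(delays) / len(delays), 2), "s")
```

In the hands-on harness, one such sample would be drawn before delivering each 16KB message to simulate the network.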
90. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Hands-On Section
Evaluation Parameters
Measures: execution time and result completeness.
Query: number of patterns, instantiations, result size.
Data: data distribution, data correlations, dataset size.
Network: number of endpoints, latency, data transfer distribution.
91. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Hands-On Section
Evaluation
Objective: study the adaptive capability of the different engines under network variation.
Measurements:
Time to first tuple: elapsed time between the start of plan execution and the output of the first answer.
Time to all tuples: elapsed time between the start of plan execution and the output of the complete answer.
Number of results: number of tuples produced for each query.
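The two timing measurements can be taken by wrapping an engine's result iterator. The engine below is a hypothetical mock generator with artificial per-answer delays, used only to show the measurement itself.

```python
# Sketch of measuring time to first tuple and time to all tuples by
# wrapping a (mocked) result iterator.

import time

def measure(result_iter):
    """Return (time to first tuple, time to all tuples, #results)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in result_iter:
        count += 1
        if first is None:
            first = time.perf_counter() - start
    total = time.perf_counter() - start
    return first, total, count

def mock_engine(n=3, delay=0.02):   # hypothetical query engine
    for i in range(n):
        time.sleep(delay)
        yield {"tuple": i}

first, total, n = measure(mock_engine())
print(n, first <= total)
# 3 True
```

Engines with non-blocking operators are expected to shrink the time to first tuple even when the time to all tuples is similar.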
92. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
Summary and Future Directions
93. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
References
[1] Maribel Acosta, Maria-Esther Vidal, Tomas Lampo, Julio Castillo, and Edna Ruckhaus. ANAPSID: An adaptive query processing engine for SPARQL endpoints. In Proceedings of the International Semantic Web Conference (ISWC), 2011.
[2] Gennady Antoshenkov and Mohamed Ziauddin. Query processing and optimization in Oracle Rdb. VLDB J., 5(4):229–237, 1996.
[3] Ron Avnur and Joseph M. Hellerstein. Eddies: Continuously
adaptive query processing. In SIGMOD Conference, pages
261–272, 2000.
[4] Shivnath Babu and Pedro Bizarro. Adaptive query processing
in the looking glass. In CIDR, pages 238–249, 2005.
[5] Cosmin Basca and Abraham Bernstein. Avalanche: Putting
the Spirit of the Web back into Semantic Web Querying. In
SSWS2010 Workshop, Shanghai, China, 2010.
94. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
References
[6] Carlos Buil-Aranda, Marcelo Arenas, and Oscar Corcho. Semantics and optimization of the SPARQL 1.1 federation extension. In ESWC (2), pages 1–15, 2011.
[7] Richard L. Cole and Goetz Graefe. Optimization of dynamic
query evaluation plans. In SIGMOD Conference, pages
150–160, 1994.
[8] Amol Deshpande and Joseph M. Hellerstein. Lifting the
burden of history from adaptive query processing. In VLDB,
pages 948–959, 2004.
[9] Amol Deshpande, Zachary G. Ives, and Vijayshankar Raman.
Adaptive query processing. Foundations and Trends in
Databases, 1(1):1–140, 2007.
[10] Olaf Görlitz and Steffen Staab. SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. In Proceedings of the 2nd International Workshop on Consuming Linked Data, Bonn, Germany, 2011.
95. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
References
[11] Goetz Graefe. Query evaluation techniques for large
databases. ACM Comput. Surv., 25(2):73–170, 1993.
[12] Andreas Harth, Katja Hose, Marcel Karnstedt, Axel Polleres, Kai-Uwe Sattler, and Jürgen Umbrich. Data summaries for on-demand queries over linked data. In WWW, 2010.
[13] Olaf Hartig. How caching improves efficiency and result
completeness for querying linked data. In the 4th Linked Data
on the Web (LDOW) Workshop at the World Wide Web
Conference (WWW), 2011.
[14] Olaf Hartig. Zero-knowledge query planning for an iterator
implementation of link traversal based query execution. In
Extended Semantic Web Conference, 2011.
[15] Olaf Hartig, Christian Bizer, and Johann Christoph Freytag. Executing SPARQL queries over the Web of Linked Data. In International Semantic Web Conference, pages 293–309, 2009.
96. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
References
[16] Yannis E. Ioannidis. Query optimization. ACM Comput.
Surv., 28(1):121–123, 1996.
[17] Navin Kabra and David J. DeWitt. Efficient mid-query
re-optimization of sub-optimal query execution plans. In
SIGMOD Conference, pages 106–117, 1998.
[18] Kamlesh Laddhad and S. Sudarshan. Adaptive query
processing. Technical Report 05329014, Kanwal Rekhi School
of Information Technology, Indian Institute of Technology,
Bombay, Mumbai, 2006.
[19] Gunter Ladwig and Thanh Tran. Linked data query processing
strategies. In International Semantic Web Conference
(ISWC), 2010.
[20] Yingjie Li and Jeff Heflin. Using reformulation trees to
optimize queries over distributed heterogeneous sources. In
ISWC, pages 502–517, 2010.
97. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
References
[21] Volker Markl, Vijayshankar Raman, David E. Simmen, Guy M.
Lohman, and Hamid Pirahesh. Robust query processing
through progressive optimization. In SIGMOD Conference,
pages 659–670, 2004.
[22] Priti Mishra and Margaret H. Eich. Join processing in
relational databases. ACM Comput. Surv., 24(1):63–113,
1992.
[23] Bastian Quilitz and Ulf Leser. Querying distributed RDF data sources with SPARQL. In ESWC, pages 524–538, 2008.
[24] Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael Schmidt. FedX: Optimization techniques for federated query processing on linked data. In International Semantic Web Conference (1), pages 601–616, 2011.
[25] Feng Tian and David J. DeWitt. Tuple routing strategies for
distributed eddies. In VLDB, pages 333–344, 2003.
98. Adaptive Semantic Data Management Techniques for Federations of Endpoints- Half-Day Tutorial at ESWC 2012
References
[26] Tolga Urhan and Michael J. Franklin. XJoin: A reactively-scheduled pipelined join operator. IEEE Data Eng. Bull., 23(2):27–33, 2000.
[27] Tolga Urhan and Michael J. Franklin. Dynamic pipeline
scheduling for improving interactive query performance. In
VLDB, pages 501–510, 2001.
[28] Tolga Urhan, Michael J. Franklin, and Laurent Amsaleg. Cost-based query scrambling for initial delays. In SIGMOD Conference, pages 130–141, 1998.
[29] Stratis Viglas, Jeffrey F. Naughton, and Josef Burger.
Maximizing the output rate of multi-way join queries over
streaming information sources. In VLDB, pages 285–296,
2003.
[30] Eugene Wong and Karel Youssefi. Decomposition: a strategy for query processing. ACM Trans. on Database Systems, 1(3), 1976.