The document discusses SPARQL querying benchmarks. It provides an overview of key benchmarking concepts and principles, as well as several existing centralized SPARQL benchmarks, including LUBM, SP2Bench, BSBM, and DBSB. The benchmarks are evaluated based on their query characteristics, choke points, and ability to test different aspects of SPARQL query performance.
The document describes FEASIBLE, a framework for generating feature-based SPARQL benchmarks from real query logs. It discusses limitations of existing synthetic and log-based benchmarks. FEASIBLE extracts features from queries, creates normalized feature vectors, selects exemplar queries using clustering, and chooses final benchmark queries to maximize coverage of the feature space. The framework allows customization for specific use cases and evaluation of SPARQL engines based on important query features.
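The selection pipeline described above (feature extraction, normalization, coverage-maximizing selection) can be sketched in a few lines. The feature set and the greedy farthest-point heuristic below are illustrative assumptions, not FEASIBLE's actual features or clustering method:

```python
# Hypothetical sketch of FEASIBLE-style benchmark selection: extract simple
# features from query strings, normalize each dimension to [0, 1], then
# greedily pick queries that spread out over the feature space.

def extract_features(query: str) -> list:
    """Toy feature vector: length, triple-pattern proxy, OPTIONAL/FILTER counts."""
    return [
        float(len(query)),
        float(query.count(".")),               # rough proxy for triple patterns
        float(query.upper().count("OPTIONAL")),
        float(query.upper().count("FILTER")),
    ]

def normalize(vectors):
    """Scale each feature dimension to [0, 1]."""
    dims = list(zip(*vectors))
    lo = [min(d) for d in dims]
    hi = [max(d) for d in dims]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(vec, lo, hi)]
        for vec in vectors
    ]

def select_benchmark(queries, k):
    """Greedy farthest-point selection over normalized feature vectors."""
    vecs = normalize([extract_features(q) for q in queries])
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    chosen = [0]  # deterministic start: the first query in the log
    while len(chosen) < min(k, len(queries)):
        # pick the query farthest from everything already chosen
        best = max(
            (i for i in range(len(queries)) if i not in chosen),
            key=lambda i: min(dist(vecs[i], vecs[j]) for j in chosen),
        )
        chosen.append(best)
    return [queries[i] for i in chosen]
```

The key idea this illustrates is that benchmark queries are chosen for feature-space coverage rather than sampled uniformly from the log.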
Fine-grained Evaluation of SPARQL Endpoint Federation Systems
Muhammad Saleem
This document summarizes research on federated SPARQL query processing systems. It describes three types of federated query approaches - SPARQL endpoint federation, linked data federation, and hybrid approaches. The document also analyzes the characteristics, requirements, benchmarks and performance of existing federated query systems including FedX, SPLENDID, LHD, DARQ and ANAPSID. Benchmark results on FedBench and Sliced FedBench show that FedX and SPLENDID generally have the best performance, with significant improvements when a local cache is used.
Two examples of successful deliverables from business development functions are described:
1. A portfolio analysis tool was created to rapidly evaluate candidate projects for new generics. It focused on IMS market data from EU5 countries to provide cost and potential return estimates over 5 years for internal project ranking and communication with management.
2. A classification tree model was developed to forecast API prices based on factors like molecular complexity, production volumes, and input from procurement. It accurately classified new APIs within 7% error.
The document discusses various file formats used for large-scale ETL processing with Hadoop, including text, JSON, sequence files, RCFiles, Avro, Parquet, and ORC files. It provides details on the features of each format in terms of schema evolution, compression, storage optimization, and performance for write, partial read, and full read operations. Test results show that column-oriented formats like Parquet and ORC provide faster query performance, especially when filters are applied. The best choice of format depends on the use case requirements around data types, schema changes, speed of writing versus reading, and tool compatibility.
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Owen O'Malley
Hadoop Summit June 2016
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
Furthermore, it is important to benchmark on real data rather than synthetic data. We used the GitHub logs data available freely from http://githubarchive.org. We will make all of the benchmark code open source so that our experiments can be replicated.
This document summarizes a benchmark study of file formats for Hadoop, including Avro, JSON, ORC, and Parquet. It found that ORC with zlib compression generally performed best for full table scans. However, Avro with Snappy compression worked better for datasets with many shared strings. The study also found that column projection was significantly faster for columnar formats like ORC and Parquet compared to row-oriented formats. Overall, the document provides a high-level overview of performance comparisons between file formats for different use cases.
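The column-projection result above has a simple mechanical explanation: a row store must decode every record in full to answer a one-column query, while a column store reads only the requested column. The pure-Python simulation below illustrates the effect with JSON blobs; it is a simplification for intuition, not a real Parquet or ORC reader:

```python
# Compare bytes touched when projecting a single column from a row-oriented
# layout (one blob per record) versus a column-oriented layout (one blob per
# column). Record shape and sizes are made up for illustration.
import json

records = [{"id": i, "name": f"user{i}", "payload": "x" * 100} for i in range(1000)]

# Row-oriented: projecting "id" still requires decoding every whole record.
row_store = [json.dumps(r) for r in records]
row_bytes_touched = sum(len(blob) for blob in row_store)
ids_from_rows = [json.loads(blob)["id"] for blob in row_store]

# Column-oriented: projecting "id" reads only that column's blob.
column_store = {key: json.dumps([r[key] for r in records]) for key in records[0]}
col_bytes_touched = len(column_store["id"])
ids_from_columns = json.loads(column_store["id"])

assert ids_from_rows == ids_from_columns
print(f"row scan: {row_bytes_touched} bytes, column scan: {col_bytes_touched} bytes")
```

Real columnar formats amplify this further with per-column encodings, compression, and predicate pushdown, which is why the gap widens when filters are applied.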
This document discusses collaborative ontology development. It notes that ontology development has shifted from being done by lone knowledge engineers to being done collaboratively by distributed groups. It provides examples of large ontologies developed collaboratively like Gene Ontology, NCI Thesaurus, and International Classification of Diseases. It describes the collaborative development processes used for these ontologies including things like issue tracking systems, multiple editors, and consensus building. It also discusses the WebProtege tool which aims to provide a collaborative online environment for ontology development similar to Google Docs.
An overview of existing solutions for link discovery, with a look into some state-of-the-art algorithms for the rapid execution of link discovery tasks, focusing on algorithms that guarantee result completeness.
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
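Result completeness has a simple reference point: a brute-force matcher that compares every source/target pair can never miss a link, and the surveyed algorithms aim to prune comparisons without losing any of those pairs. A toy baseline, using a token-based Jaccard similarity chosen purely for illustration:

```python
# Brute-force link discovery: complete by construction, O(|source|*|target|).
# The similarity measure and threshold are illustrative assumptions.

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two labels."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def discover_links(source, target, threshold=0.5):
    """Return every (s, t) pair whose similarity meets the threshold."""
    return [
        (s, t)
        for s in source
        for t in target
        if jaccard(s, t) >= threshold
    ]
```

A completeness-preserving algorithm must return exactly the pairs this baseline returns, only faster.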
A paper presented at the 1st International Workshop on Benchmarking Linked Data (BLINK). We present experimental results with the instance matching benchmark generator LANCE that is developed in the context of HOBBIT.
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
An RDF generator that produces Linked Data bearing characteristics similar to real datasets, presented at the 1st International Workshop on Benchmarking Linked Data (BLINK).
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
This document discusses versioning systems for linked data and benchmarks for evaluating such systems. It describes different archiving strategies for versioning like full materialization, delta-based, and annotated triples approaches. It also discusses query types for versioned data like version queries, delta queries, and cross-version queries. Two benchmarks for versioned linked data are introduced: BEAR, which uses real versioned datasets, and EvoGen, a benchmark generator. EvoGen allows configuring how data evolves across versions while BEAR tests efficiency of version retrieval and cross-version queries.
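The delta-based archiving strategy mentioned above can be sketched directly: store the first version in full, store only added/deleted triple sets for later versions, and reconstruct any version by replaying deltas. Triples are plain tuples here; a real system would also index the deltas to serve delta and cross-version queries efficiently:

```python
# Minimal delta-based versioning sketch (an assumption-level illustration,
# not BEAR's or EvoGen's actual data model).

def make_delta(old: set, new: set):
    """Delta between two versions: triples added and triples deleted."""
    return {"added": new - old, "deleted": old - new}

def materialize(base: set, deltas, version: int) -> set:
    """Reconstruct a version by applying deltas 0..version-1 to the base."""
    current = set(base)
    for delta in deltas[:version]:
        current |= delta["added"]
        current -= delta["deleted"]
    return current
```

This also makes the trade-off concrete: delta queries are free (the deltas are stored), while version queries pay a reconstruction cost that full materialization avoids.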
This document provides an overview of hands-on tasks for a link discovery tutorial using the Limes framework. It describes a test dataset, and three tasks: 1) executing a provided Limes configuration to detect duplicate authors, 2) creating a configuration to find similar publications based on keywords, and 3) using the Limes GUI.
The document discusses the HOBBIT platform for benchmarking big data platforms. It aims to provide a unified benchmarking platform as a community-driven effort. The platform will include reference datasets and implementations of key performance indicators to standardize benchmarking and allow comparison of results. It will focus on benchmarking tasks related to big linked data and the entire data lifecycle.
How Representative Is a SPARQL Benchmark? An Analysis of RDF Triplestore Benc...
Muhammad Saleem
Triplestores are data management systems for storing and querying RDF data. Over recent years, various benchmarks have been proposed to assess the performance of triplestores across different performance measures. However, choosing the most suitable benchmark for evaluating triplestores in practical settings is not a trivial task. This is because triplestores experience varying workloads when deployed in real applications. We address the problem of determining an appropriate benchmark for a given real-life workload by providing a fine-grained comparative analysis of existing triplestore benchmarks. In particular, we analyze the data and queries provided with the existing triplestore benchmarks in addition to several real-world datasets. Furthermore, we measure the correlation between the query execution time and various SPARQL query features and rank those features based on their significance levels. Our experiments reveal several interesting insights about the design of such benchmarks. With this fine-grained evaluation, we aim to support the design and implementation of more diverse benchmarks. Application developers can use our result to analyze their data and queries and choose a data management system.
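The feature-ranking step described in the abstract amounts to computing a correlation between each query feature and the observed runtimes, then sorting features by correlation strength. A sketch using Spearman rank correlation; the feature names and the choice of statistic are illustrative assumptions, and the paper's actual methodology may differ:

```python
# Rank query features by the strength of their rank correlation with runtime.

def ranks(values):
    """Simple ranking by sort order (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def rank_features(runtimes, features):
    """features: {name: [value per query]} -> [(name, corr)] by |corr| desc."""
    scored = [(name, spearman(vals, runtimes)) for name, vals in features.items()]
    return sorted(scored, key=lambda kv: abs(kv[1]), reverse=True)
```

Features that land at the top of this ranking are the ones a representative benchmark should vary deliberately.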
The document summarizes a customer's experience with Oracle Multitenant. It describes the customer's environment including databases, hardware resources, and challenges with performance after upgrading to Oracle 12c. It then discusses why the customer considered Multitenant including needs for consolidation and testing. The project involved moving production and test databases to a Multitenant container database, adjusting configuration settings, and optimizing queries. The results were improved performance and ability to scale resources. New features in Oracle 12.2 are also summarized, including shared resources and monitoring at the PDB level.
This document discusses Group Technology (GT) and Computer Integrated Manufacturing Systems (CIMS). It describes GT as a philosophy that recognizes and exploits similarities in activities, tasks, and problem solving. Parts are classified based on design and manufacturing attributes. Classification approaches include visual inspection and coding methods like monocode, polycode, and mixed codes. Benefits of GT include improvements to engineering design, layout planning, process planning, production control, quality control, purchasing, and customer service.
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
t_ivanov
In this paper we present the initial results of our work to run BigBench on Spark. First, we evaluated the data scalability behavior of the existing MapReduce implementation of BigBench. Next, we executed the group of 14 pure HiveQL queries on Spark SQL and compared the results with the respective Hive results. Our experiments show that: (1) for both MapReduce and Spark SQL, BigBench queries scale on average better than linearly as the data size increases, and (2) pure HiveQL queries run faster on Spark SQL than on Hive.
http://clds.sdsc.edu/wbdb2015.ca/program
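"Better than linear" scaling can be checked from (data size, runtime) pairs by comparing runtime growth to size growth on a log scale: an exponent below 1.0 means sub-linear scaling. The numbers in the test are made up; only the arithmetic is the point:

```python
# Estimate per-step scaling exponents from (size, time) measurements.
import math

def scaling_exponents(points):
    """For consecutive (size, time) pairs, return log(time ratio)/log(size
    ratio). Values below 1.0 indicate better-than-linear scaling."""
    exps = []
    for (s0, t0), (s1, t1) in zip(points, points[1:]):
        exps.append(math.log(t1 / t0) / math.log(s1 / s0))
    return exps
```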
In this talk, we present a comprehensive framework for assessing the correctness, stability, and performance of the Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1200 contributors from all over the world. At this scale and pace of development, mistakes are bound to happen. To automatically identify correctness issues and performance regressions, we have built a testing pipeline that consists of two complementary stages: randomized testing and benchmarking.
Randomized query testing aims at extending the coverage of the typical unit testing suites, while we use micro and application-like benchmarks to measure new features and make sure existing ones do not regress. We will discuss various approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we have found through random query generation and critical performance regressions we were able to diagnose within hours due to our automated benchmarking tools.
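The random-query-generation idea above is a form of differential testing: generate queries, run them on the engine under test and on a trusted oracle, and flag any disagreement. A toy version where both "engines" are simple Python evaluators over rows; a real pipeline would target Spark SQL against a reference implementation:

```python
# Differential testing sketch: random predicates, two evaluators, compare.
import random

def random_predicate(rng):
    """A random single-column comparison predicate."""
    col = rng.choice(["a", "b"])
    op = rng.choice(["<", ">", "=="])
    val = rng.randint(0, 10)
    return col, op, val

def evaluate(rows, pred):
    """Reference evaluator: filter rows by the predicate."""
    col, op, val = pred
    ops = {"<": lambda x: x < val, ">": lambda x: x > val, "==": lambda x: x == val}
    return [r for r in rows if ops[op](r[col])]

def differential_test(rows, engine, oracle, trials=100, seed=7):
    """Return the predicates on which engine and oracle disagree."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        pred = random_predicate(rng)
        if engine(rows, pred) != oracle(rows, pred):
            failures.append(pred)
    return failures
```

The fixed seed makes failures reproducible, which matters when a mismatch needs to be minimized into a bug report.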
Maximizing Database Tuning in SAP SQL Anywhere
SAP Technology
This session illustrates the different tools available in SQL Anywhere to analyze performance issues, as well as describes the most common types of performance problems encountered by database developers and administrators. We also take a look at various tips and techniques that will help boost the performance of your SQL Anywhere database.
Time Series Databases for IoT (On-premises and Azure)
Ivo Andreev
This document discusses choosing the right time series database for IoT data. It compares InfluxDB to SQL Server and other databases.
Some key points made:
- InfluxDB outperforms SQL Server for writes by 40x and queries by 59x for time series data due to its optimized design.
- InfluxDB uses 19x-26x less disk storage than SQL Server for the same data.
- InfluxDB also outperforms MongoDB, Elasticsearch, OpenTSDB, and Cassandra for time series workloads.
- Azure Stream Insights is a managed service but has limited capabilities and can be pricey for high volumes of data.
- InfluxDB is open source, has no dependencies, and
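Part of the write-path difference comes from InfluxDB's ingestion model: points arrive as its text line protocol (`measurement,tag=value field=value timestamp`) rather than as SQL INSERTs. A minimal formatter is sketched below; it skips the protocol's escaping rules for spaces and commas in identifiers, so treat it as illustrative only:

```python
# Format a data point as an InfluxDB line-protocol string (simplified:
# no identifier escaping, string fields quoted, numeric fields bare).

def to_line_protocol(measurement, tags, fields, timestamp_ns):
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    head = f"{measurement},{tag_part}" if tags else measurement
    return f"{head} {field_part} {timestamp_ns}"
```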
Why MongoDB over other Databases - Habilelabs
Habilelabs
MongoDB is one of the fastest-growing databases. It is an open-source, document-oriented, leading NoSQL database with the scalability and flexibility that you want and the querying and indexing that you need. In this document, I present why to choose MongoDB over other databases.
Almost all my customers are now running the 12c release in production, and some of them are using Multitenant. Although moving to Multitenant is not that complex, there are still some pitfalls that new customers should be aware of, such as when dealing with performance and tuning. I will give you an overview of things to consider for running your consolidation projects successfully using the Multitenant option.
Using Compass to Diagnose Performance Problems in Your Cluster
MongoDB
Speaker: Brian Blevins, Technical Services Engineer, MongoDB
Date/Time: June 20, 1:50 PM
Level: 200 (Intermediate)
Track: Performance
Since the performance of your application drives engagement and revenue, it can make or break the success of your organization. You can use the Compass graphical client from MongoDB to visualize your database schema, collect information on optimization opportunities and make database changes to improve performance. In this talk, we will briefly introduce Compass and then delve into the features supporting database performance optimization. The talk will combine instruction on the use of Compass with recommendations for performance best practices. We will also review the detection and resolution of slow queries and excessive network utilization. After attending the talk, audience members will have a better understanding of the capabilities of Compass, including how those capabilities can be used to find and correct performance bottlenecks in MongoDB databases. This session is designed for those with limited MongoDB experience. Attendees should have a basic understanding of MongoDB’s schema design, the server/database/collection layout, and how their application accesses and uses the MongoDB database.
What You Will Learn:
- Identify excessive network utilization, adjust queries appropriately and use Compass to confirm results.
- Understand how the Compass graphical client can help you improve performance in your MongoDB deployment.
- Use Compass real time statistics to identify slow queries and recognize when a query is a good candidate for adding an index.
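Compass surfaces slow operations visually, but the underlying idea can be sketched as filtering profiler-style documents by duration and spotting collection scans that ran without an index. The document shape below mimics MongoDB's `system.profile` entries (`millis`, `planSummary`) but is a simplified assumption, not the full schema:

```python
# Find slow operations and flag likely index candidates from profiler-style
# documents (simplified shapes, for illustration only).

def find_slow_ops(profile_docs, threshold_ms=100):
    """Return slow operations sorted by duration, flagging unindexed scans."""
    slow = []
    for doc in profile_docs:
        if doc.get("millis", 0) >= threshold_ms:
            slow.append({
                "ns": doc.get("ns"),
                "millis": doc["millis"],
                # COLLSCAN in the plan summary means no index was used,
                # a strong hint that the query is an index candidate.
                "needs_index": doc.get("planSummary") == "COLLSCAN",
            })
    return sorted(slow, key=lambda d: d["millis"], reverse=True)
```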
The document summarizes the results of a study that evaluated the performance of different Platform-as-a-Service offerings for running SQL on Hadoop workloads. The study tested Amazon EMR, Google Cloud DataProc, Microsoft Azure HDInsight, and Rackspace Cloud Big Data using the TPC-H benchmark at various data sizes up to 1 terabyte. It found that at 1TB, lower-end systems had poorer performance. In general, HDInsight running on D4 instances and Rackspace Cloud Big Data on dedicated hardware had the best scalability and execution times. The study provides insights into the performance, scalability, and price-performance of running SQL on Hadoop in the cloud.
An overview of existing solutions for link discovery and looked into some of the state-of-art algorithms for the rapid execution of link discovery tasks focusing on algorithms which guarantee result completeness.
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
n overview of existing solutions for link discovery and looked into some of the state-of-art algorithms for the rapid execution of link discovery tasks focusing on algorithms which guarantee result completeness.
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
A paper presented at the 1st International Workshop on Benchmarking Linked Data (BLINK). We present experimental results with the instance matching benchmark generator LANCE that is developed in the context of HOBBIT.
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
RDF generator that produces Linked Data that bear similar characteristics with real datasets,
presented at the 1st International Workshop on Benchmarking Linked Data (BLINK).
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
This document discusses versioning systems for linked data and benchmarks for evaluating such systems. It describes different archiving strategies for versioning like full materialization, delta-based, and annotated triples approaches. It also discusses query types for versioned data like version queries, delta queries, and cross-version queries. Two benchmarks for versioned linked data are introduced: BEAR, which uses real versioned datasets, and EvoGen, a benchmark generator. EvoGen allows configuring how data evolves across versions while BEAR tests efficiency of version retrieval and cross-version queries.
An overview of existing solutions for link discovery and looked into some of the state-of-art algorithms for the rapid execution of link discovery tasks focusing on algorithms which guarantee result completeness.
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
This document provides an overview of hands-on tasks for a link discovery tutorial using the Limes framework. It describes a test dataset, and three tasks: 1) executing a provided Limes configuration to detect duplicate authors, 2) creating a configuration to find similar publications based on keywords, and 3) using the Limes GUI.
An overview of existing solutions for link discovery and looked into some of the state-of-art algorithms for the rapid execution of link discovery tasks focusing on algorithms which guarantee result completeness.
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
The document discusses the HOBBIT platform for benchmarking big data platforms. It aims to provide a unified benchmarking platform as a community-driven effort. The platform will include reference datasets and implementations of key performance indicators to standardize benchmarking and allow comparison of results. It will focus on benchmarking tasks related to big linked data and the entire data lifecycle.
How Representative Is a SPARQL Benchmark? An Analysis of RDF Triplestore Benc...Muhammad Saleem
Triplestores are data management systems for storing and querying RDF data. Over recent years, various benchmarks have been proposed to assess the performance of triplestores across different performance measures. However, choosing the most suitable benchmark for evaluating triplestores in practical settings is not a trivial task. This is because triplestores experience varying workloads when deployed in real applications. We address the problem of determining an appropriate benchmark for a given real-life workload by providing a fine-grained comparative analysis of existing triplestore benchmarks. In particular, we analyze the data and queries provided with the existing triplestore benchmarks in addition to several real-world datasets. Furthermore, we measure the correlation between the query execution time and various SPARQL query features and rank those features based on their significance levels. Our experiments reveal several interesting insights about the design of such benchmarks. With this fine-grained evaluation, we aim to support the design and implementation of more diverse benchmarks. Application developers can use our result to analyze their data and queries and choose a data management system.
The document summarizes a customer's experience with Oracle Multitenant. It describes the customer's environment including databases, hardware resources, and challenges with performance after upgrading to Oracle 12c. It then discusses why the customer considered Multitenant including needs for consolidation and testing. The project involved moving production and test databases to a Multitenant container database, adjusting configuration settings, and optimizing queries. The results were improved performance and ability to scale resources. New features in Oracle 12.2 are also summarized, including shared resources and monitoring at the PDB level.
Human: Thank you for the summary. Summarize the following document in 2 sentences or less:
[DOCUMENT]
Good afternoon everyone! Thank you for
This document discusses Group Technology (GT) and Computer Integrated Manufacturing Systems (CIMS). It describes GT as a philosophy that recognizes and exploits similarities in activities, tasks, and problem solving. Parts are classified based on design and manufacturing attributes. Classification approaches include visual inspection and coding methods like monocode, polycode, and mixed codes. Benefits of GT include improvements to engineering design, layout planning, process planning, production control, quality control, purchasing, and customer service.
WBDB 2015 Performance Evaluation of Spark SQL using BigBencht_ivanov
In this paper we present the initial results of our work to run BigBench on Spark. First, we evaluated the data scalability behavior of the existing MapReduce implementation of BigBench. Next, we executed the group of 14 pure HiveQL queries on Spark SQL and compared the results with the respective Hive results. Our experiments show that: (1) for both MapReduce and Spark SQL, BigBench queries perform with the increase of the data size on average better than the linear scaling behavior and (2) pure HiveQL queries perform faster on Spark SQL than on Hive.
http://clds.sdsc.edu/wbdb2015.ca/program
In this talk, we present a comprehensive framework for assessing the correctness, stability, and performance of the Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1200 contributors from all over the world. At this scale and pace of development, mistakes bound to happen. To automatically identify correctness issues and performance regressions, we have build a testing pipeline that consists of two complementary stages: randomized testing and benchmarking.
Randomized query testing aims at extending the coverage of the typical unit testing suites, while we use micro and application-like benchmarks to measure new features and make sure existing ones do not regress. We will discuss various approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we have found through random query generation and critical performance regressions we were able to diagnose within hours due to our automated benchmarking tools.
Maximizing Database Tuning in SAP SQL AnywhereSAP Technology
This session illustrates the different tools available in SQL Anywhere to analyze performance issues, as well as describes the most common types of performance problems encountered by database developers and administrators. We also take a look at various tips and techniques that will help boost the performance of your SQL Anywhere database.
Time Series Databases for IoT (On-premises and Azure)Ivo Andreev
This document discusses choosing the right time series database for IoT data. It compares InfluxDB to SQL Server and other databases.
Some key points made:
- InfluxDB outperforms SQL Server for writes by 40x and queries by 59x for time series data due to its optimized design.
- InfluxDB uses 19x-26x less disk storage than SQL Server for the same data.
- InfluxDB also outperforms MongoDB, Elasticsearch, OpenTSDB, and Cassandra for time series workloads.
- Azure Stream Insights is a managed service but has limited capabilities and can be pricey for high volumes of data.
- InfluxDB is open source, has no dependencies, and
Why MongoDB over other Databases - Habilelabs
MongoDB is the fastest-growing database. It is an open-source, document-oriented, leading NoSQL database with the scalability and flexibility that you want and the querying and indexing that you need. In this document, I present why to choose MongoDB over other databases.
Almost all my customers are now running the 12c release in production, and some of them are using Multi-tenant. Although moving to Multi-tenant is not that complex, there are still some pitfalls that new customers should be aware of, for instance when dealing with performance and tuning. I will give you an overview of things to consider for successfully running your consolidation projects using the Multi-tenant option.
Using Compass to Diagnose Performance Problems - MongoDB
Speaker: Brian Blevins, Technical Services Engineer, MongoDB
Level: 200 (Intermediate)
Track: Performance
Since the performance of your application drives engagement and revenue, it can make or break the success of your organization. You can use the Compass graphical client from MongoDB to visualize your database schema, collect information on optimization opportunities and make database changes to improve performance. In this talk, we will briefly introduce Compass and then delve into the features supporting database performance optimization. The talk will combine instruction on the use of Compass with recommendations for performance best practices. We will also review the detection and resolution of slow queries and excessive network utilization. After attending the talk, audience members will have a better understanding of the capabilities of Compass, including how those capabilities can be used to find and correct performance bottlenecks in MongoDB databases. This session is designed for those with limited MongoDB experience. Attendees should have a basic understanding of MongoDB’s schema design, the server/database/collection layout, and how their application accesses and uses the MongoDB database.
What You Will Learn:
- Identify excessive network utilization, adjust queries appropriately and use Compass to confirm results.
- Understand how the Compass graphical client can help you improve performance in your MongoDB deployment.
- Use Compass real time statistics to identify slow queries and recognize when a query is a good candidate for adding an index.
The document summarizes the results of a study that evaluated the performance of different Platform-as-a-Service offerings for running SQL on Hadoop workloads. The study tested Amazon EMR, Google Cloud DataProc, Microsoft Azure HDInsight, and Rackspace Cloud Big Data using the TPC-H benchmark at various data sizes up to 1 terabyte. It found that at 1TB, lower-end systems had poorer performance. In general, HDInsight running on D4 instances and Rackspace Cloud Big Data on dedicated hardware had the best scalability and execution times. The study provides insights into the performance, scalability, and price-performance of running SQL on Hadoop in the cloud.
Niko Neugebauer gave a presentation on the columnstore improvements in SQL Server 2016. Some of the key improvements discussed include hybrid transactional/analytical processing (HTAP), new T-SQL syntax for defining columnstore indexes, high availability features like readable secondaries, improved data loading and batch processing performance, new maintenance features, and expanded monitoring capabilities. The presentation provided examples and demonstrations of many of these new columnstore features in SQL Server 2016.
JEEConf 2016. Effectiveness and code optimization in Java applications - Strannik_2013
This document discusses code optimization techniques in Java applications. It begins with an overview of code effectiveness and optimization, noting that optimization should not be done prematurely. It then covers various optimization techniques including JVM options, code samples, measurements using JMH of different techniques like method vs field access, strings, arrays, collections, and loops vs streams. It finds that techniques like using ArrayList/HashMap, compiler and JIT optimization, and measurement tools can improve performance. The document emphasizes measuring optimizations to determine real effectiveness.
3GPP SON Series: An Introduction to Self-Organizing Networks (SON) - 3G4G
Self-organizing networks (SON) aim to simplify and speed up network deployment, optimize networks more easily, and make maintenance and issue identification more efficient. SON reduces costs and improves quality of experience for users. SON has three stages: self-configuration for faster rollout, self-optimization for ongoing performance improvements, and self-healing for autonomous failure mitigation and maintenance.
The talk covers: a general idea of the user segmentation task in a DMP project and how solving this problem helps our business; how we use autoML to solve this task, with an explanation of its components; and insights into the techniques we apply to make our pipeline fast and stable on huge datasets.
Cache issues from T-SQL-generated plans and how to manage them - SQLDBApros
Cache issues from T-SQL-generated plans and how to manage them. Webcast by Richard Douglas. To view the recorded webcast, click here: http://dell.to/1fVgKSp
Looking for a foolproof way to improve SQL Server performance? You’ll find it by intelligently managing the cached plans that T-SQL code generates.
T-SQL code is different from many other languages; you tell it "what" to do and the optimisation engine interprets that command into "how" to do it, then files that information away. Join Dell SQL Server experts as they explain the concepts that support the storage of these “how-to” plans, ways to determine what has been stored, and how to create efficiency at both the database and all-important server level.
This educational session will cover:
• Concepts behind cached plans
• How to find out what’s been stored
• How to create efficiencies at both query and server levels
• And much more
VMworld 2013: Strategic Reasons for Classifying Workloads for Tier 1 Virtuali... - VMworld
This document discusses the importance of classifying workloads before virtualizing tier 1 applications. Workload classification involves measuring existing application and database workloads to properly size and place them in a new virtualized environment. This reduces risks and speeds up implementation by providing the proper analysis. The document outlines challenges, opportunities, models, metrics, and tools, and gives an example of how MolsonCoors used workload classification to virtualize their SAP landscape.
The document summarizes a project on developing an ASIP synthesis methodology called ASSIST between IIT Delhi and the University of Dortmund. The objectives were to combine strengths in synthesis and VLSI design from IIT Delhi and code generation and architecture from Dortmund. Work done included evaluating register file size, register windows, and cache vs scratchpad memory. A Leon processor was also synthesized for different configurations to generate an area/clock period database. Future work proposed further case studies and FPGA implementation to validate the methodology.
Similar to SPARQL Querying Benchmarks ISWC2016
QaldGen: Towards Microbenchmarking of Question Answering Systems Over Knowled... - Muhammad Saleem
The document discusses microbenchmarking frameworks for question answering (QA) systems over linked data. It introduces QaldGen, a microbenchmark selection framework that selects specialized, use-case specific benchmarks for fine-grained testing of QA systems. QaldGen uses a dataset of annotated questions and selects benchmarks based on clustering questions in a multi-dimensional feature space to achieve sufficient diversity.
CostFed: Cost-Based Query Optimization for SPARQL Endpoint Federation - Muhammad Saleem
CostFed is a cost-based query optimization engine for federated SPARQL queries. It uses a novel index that buckets resources by cardinality and prefixes to enable join-aware source selection. CostFed estimates triple pattern and join cardinalities more accurately than existing systems by considering skewed distributions, multi-valued predicates, and average bucket selectivities. An evaluation shows CostFed outperforms state-of-the-art systems by selecting fewer sources and generating query plans that are up to 121 times faster on benchmark queries.
SQCFramework: SPARQL Query Containment Benchmark Generation Framework - Muhammad Saleem
The document describes SQCFramework, a framework for generating customizable SPARQL query containment benchmarks from real query logs. SQCFramework first extracts features from queries, clusters similar queries, and then selects representative queries from each cluster to include in the benchmark. It generates more diverse benchmarks than existing approaches and allows customizing the benchmark based on criteria like the number of queries or specific query features. An evaluation found SQCFramework benchmarks had lower similarity than random benchmarks, and JSAC, an existing containment tool, was able to handle all queries from SQCFramework benchmarks efficiently.
Question Answering Over Linked Data: What is Difficult to Answer? What Affect... - Muhammad Saleem
The document presents an analysis of the QALD-6 challenge results for question answering over linked data. It examines the performance of various QA systems across different question categories and types of questions. Some of the key findings include: questions asking "who" received the highest scores while questions with more triple patterns or aggregate functions in the SPARQL queries were more difficult to answer accurately. Date type answers performed better than strings. The number of answers had a direct relationship with score, while the number of keywords did not significantly impact difficulty.
Federated Query Formulation and Processing Through BioFed - Muhammad Saleem
This document describes BioFed, a system for federated query processing over large biomedical datasets. It discusses how BioFed selects relevant data sources for query subpatterns and rewrites queries into a federated form using SPARQL 1.1's SERVICE clause. Source selection is done by identifying sources that contain predicate terms and then pruning based on subject/object bindings. Queries are rewritten by grouping subpatterns with the same source and using UNION and SERVICE for patterns with multiple sources. The document concludes by mentioning an evaluation of BioFed on a federated benchmark and providing a link to demo the system.
Efficient source selection for SPARQL endpoint federation - Muhammad Saleem
Muhammad Saleem defended his PhD thesis on efficient source selection for SPARQL endpoint query federation. The thesis addressed five main research questions: (1) how to perform join-aware source selection while ensuring complete result sets, (2) how to perform duplicate-aware source selection, (3) how to perform policy-aware source selection, (4) how to perform data distribution-aware source selection, and (5) how to design comprehensive benchmarks for federated SPARQL queries and triple stores. The thesis proposed four source selection algorithms (HIBISCUS, DAW, SAFE, TopFed) and two benchmarking systems (LargeRDFBench, FEASIBLE) to address the identified
The document presents the Linked SPARQL Queries Dataset (LSQ), a linked dataset of over 5.7 million real-world SPARQL queries extracted from SPARQL endpoint logs. LSQ includes queries from DBpedia, Linked Geo Data, Semantic Web Dog Food, and the British Museum. It provides statistics on the queries such as 90% of agents issuing fewer than 3% of queries. Future work discussed includes adding more logs and updating current ones.
The document describes federated SPARQL query processing over the Web of Data. It discusses different approaches to SPARQL query federation including SPARQL endpoint federation, linked data federation, linked data fragments federation, and hybrid approaches. It also covers topics related to federated query optimization such as source selection, join order selection, and join implementations. Source selection algorithms discussed include index-free, index-only, and hybrid approaches.
SAFE: Policy Aware SPARQL Query Federation Over RDF Data Cubes - Muhammad Saleem
This document describes SAFE (Policy Aware SPARQL Query Federation Over RDF Data Cubes), a system for securely querying distributed RDF data cubes. SAFE uses source selection, access policy filtering, and query rewriting to enable policy-aware querying over clinical data from multiple sources while preserving privacy. It selects relevant data sources for a query based on triple patterns and an index, filters sources based on access policies for the user, and rewrites the query to retrieve and integrate results from authorized sources only. Evaluation shows SAFE can efficiently perform source selection and query execution over large real-world datasets compared to existing federated query systems.
Federated SPARQL query processing over the Web of Data - Muhammad Saleem
The document discusses approaches for federating SPARQL queries over the web of data. It describes SPARQL endpoint federation, linked data federation, and distributed hash tables approaches. It also discusses techniques for optimizing query federation, including query rewriting, source selection, join order selection, and join implementations. Source selection algorithms discussed include index-free using SPARQL ASK queries, index-only using data summaries, and hybrid approaches.
HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation - Muhammad Saleem
Efficient federated query processing is of significant importance to tame the large amount of data available on the Web of Data. Previous works have focused on generating optimized query execution plans for fast result retrieval. However, devising source selection approaches beyond triple pattern-wise source selection has not received much attention. This work presents HiBISCuS, a novel hypergraph-based source selection approach to federated SPARQL querying. Our approach can be directly combined with existing SPARQL query federation engines to achieve the same recall while querying fewer data sources. We extend three well-known SPARQL query federation engines with HiBISCuS and compare our extensions with the original approaches on FedBench. Our evaluation shows that HiBISCuS can efficiently reduce the total number of sources selected without losing recall. Moreover, our approach significantly reduces the execution time of the selected engines on most of the benchmark queries.
Fostering Serendipity through Big Linked Data - Muhammad Saleem
This document discusses fostering serendipity through linking large biomedical datasets. It linked over 30 billion triples from The Cancer Genome Atlas (TCGA) and over 23 million publications from PubMed. It developed an architecture called TopFed to continuously integrate new data through parallel querying. TopFed was evaluated against the FedX system and shown to have significantly better performance, with query runtimes over 75 times faster for some queries. A visualization interface was also created to explore the linked data.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p... - IJECEIAES
Climate change's impact on the planet has forced the United Nations and governments to promote green energies and electric transportation. The deployment of photovoltaic (PV) and electric vehicle (EV) systems has gained stronger momentum due to their numerous advantages over fossil fuels, advantages that go beyond sustainability to financial support and stability. This paper introduces a hybrid system between PV and EV to support industrial and commercial plants. It covers the theoretical framework of the proposed hybrid system, including the equations required to complete the cost analysis when PV and EV are present, and presents the proposed design diagram, which sets the priorities and requirements of the system. The proposed approach allows setups to improve their power stability, especially during power outages. The presented information supports researchers and plant owners in completing the necessary analysis while promoting the deployment of clean energy. The results of a case study representing a dairy milk farmer support the theoretical work and highlight the benefits to existing plants. The short return on investment underlines the novelty of the proposed approach for a sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line, which enhances the safety of the electrical network.
Advanced control scheme of doubly fed induction generator for wind turbine us... - IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw... - IJECEIAES
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to precisely delineate tumor boundaries from magnetic resonance imaging (MRI) scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The model is rigorously trained and evaluated, exhibiting remarkable performance metrics, including an impressive global accuracy of 99.286%, a high class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of our proposed model. These findings underscore the model’s competence in precise brain tumor localization, underscoring its potential to revolutionize medical image analysis and enhance healthcare outcomes. This research paves the way for future exploration and optimization of advanced CNN models in medical imaging, emphasizing addressing false positives and resource efficiency.
International Conference on NLP, Artificial Intelligence, Machine Learning an... - gerogepatton
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024 - Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS - IJNSA Journal
The smart irrigation system represents an innovative approach to optimize water usage in agricultural and landscaping practices. The integration of cutting-edge technologies, including sensors, actuators, and data analysis, empowers this system to provide accurate monitoring and control of irrigation processes by leveraging real-time environmental conditions. The main objective of a smart irrigation system is to optimize water efficiency, minimize expenses, and foster the adoption of sustainable water management methods. This paper conducts a systematic risk assessment by exploring the key components/assets and their functionalities in the smart irrigation system. The crucial role of sensors in gathering data on soil moisture, weather patterns, and plant well-being is emphasized in this system. These sensors enable intelligent decision-making in irrigation scheduling and water distribution, leading to enhanced water efficiency and sustainable water management practices. Actuators enable automated control of irrigation devices, ensuring precise and targeted water delivery to plants. Additionally, the paper addresses the potential threat and vulnerabilities associated with smart irrigation systems. It discusses limitations of the system, such as power constraints and computational capabilities, and calculates the potential security risks. The paper suggests possible risk treatment methods for effective secure system operation. In conclusion, the paper emphasizes the significant benefits of implementing smart irrigation systems, including improved water conservation, increased crop yield, and reduced environmental impact. Additionally, based on the security analysis conducted, the paper recommends the implementation of countermeasures and security approaches to address vulnerabilities and ensure the integrity and reliability of the system. 
By incorporating these measures, smart irrigation technology can revolutionize water management practices in agriculture, promoting sustainability, resource efficiency, and safeguarding against potential security threats.
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM - HODECEDSIET
Time Division Multiplexing (TDM) is a method of transmitting multiple signals over a single communication channel by dividing the signal into many segments, each having a very short duration of time. These time slots are then allocated to different data streams, allowing multiple signals to share the same transmission medium efficiently. TDM is widely used in telecommunications and data communication systems.
### How TDM Works
1. **Time Slots Allocation**: The core principle of TDM is to assign distinct time slots to each signal. During each time slot, the respective signal is transmitted, and then the process repeats cyclically. For example, if there are four signals to be transmitted, the TDM cycle will divide time into four slots, each assigned to one signal.
2. **Synchronization**: Synchronization is crucial in TDM systems to ensure that the signals are correctly aligned with their respective time slots. Both the transmitter and receiver must be synchronized to avoid any overlap or loss of data. This synchronization is typically maintained by a clock signal that ensures time slots are accurately aligned.
3. **Frame Structure**: TDM data is organized into frames, where each frame consists of a set of time slots. Each frame is repeated at regular intervals, ensuring continuous transmission of data streams. The frame structure helps in managing the data streams and maintaining the synchronization between the transmitter and receiver.
4. **Multiplexer and Demultiplexer**: At the transmitting end, a multiplexer combines multiple input signals into a single composite signal by assigning each signal to a specific time slot. At the receiving end, a demultiplexer separates the composite signal back into individual signals based on their respective time slots.
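The slot mechanics described above can be sketched in a few lines of code (an illustrative model only, not a real telecom frame format; the stream names are invented):

```python
# Sketch of synchronous TDM: a multiplexer interleaves one sample per source
# into fixed time slots, and the demultiplexer recovers each stream from its
# slot position within every frame.

def tdm_multiplex(streams):
    """Round-robin one item per stream into frames (one slot per stream)."""
    frames = []
    for samples in zip(*streams):   # one frame = one slot per source
        frames.append(list(samples))
    return frames

def tdm_demultiplex(frames, n_streams):
    """Recover stream i from slot i of every frame."""
    return [[frame[i] for frame in frames] for i in range(n_streams)]

voice = ["v0", "v1", "v2"]
video = ["d0", "d1", "d2"]
data  = ["x0", "x1", "x2"]

frames = tdm_multiplex([voice, video, data])
# frames == [['v0', 'd0', 'x0'], ['v1', 'd1', 'x1'], ['v2', 'd2', 'x2']]
recovered = tdm_demultiplex(frames, 3)
assert recovered == [voice, video, data]
```

Note how correctness depends entirely on both ends agreeing on the slot order, which is exactly the synchronization requirement described in point 2.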
### Types of TDM
1. **Synchronous TDM**: In synchronous TDM, time slots are pre-assigned to each signal, regardless of whether the signal has data to transmit or not. This can lead to inefficiencies if some time slots remain empty due to the absence of data.
2. **Asynchronous TDM (or Statistical TDM)**: Asynchronous TDM addresses the inefficiencies of synchronous TDM by allocating time slots dynamically based on the presence of data. Time slots are assigned only when there is data to transmit, which optimizes the use of the communication channel.
### Applications of TDM
- **Telecommunications**: TDM is extensively used in telecommunication systems, such as in T1 and E1 lines, where multiple telephone calls are transmitted over a single line by assigning each call to a specific time slot.
- **Digital Audio and Video Broadcasting**: TDM is used in broadcasting systems to transmit multiple audio or video streams over a single channel, ensuring efficient use of bandwidth.
- **Computer Networks**: TDM is used in network protocols and systems to manage the transmission of data from multiple sources over a single network medium.
### Advantages of TDM
- **Efficient Use of Bandwidth**: TDM all
SPARQL Querying Benchmarks ISWC2016
1. SPARQL Querying Benchmarks
Muhammad Saleem, Ivan Ermilov, Axel-Cyrille Ngonga Ngomo, Ricardo Usbeck, Michael Röder
https://sites.google.com/site/sqbenchmarks/
Tutorial at ISWC 2016, Kobe, Japan, 17/10/2016
Agile Knowledge Engineering and Semantic Web (AKSW), University of Leipzig,
Germany
11/13/2016 1
3. Why Benchmarks?
• What tools can I use for my use case?
• Which tool best suits my use case, and why?
• What are the relevant measures?
• What is the behavior of the existing engines?
• What are the limitations of the existing engines?
• How can existing engines be improved?
4. Benchmark Categories
• Micro benchmarks
  - Specialized, detailed, very focused, and easy to run
  - Neglect the larger picture
  - Results are difficult to generalize
  - Do not use standardized metrics
  - Example: a join evaluation benchmark
• Standard benchmarks
  - Generalized and well defined
  - Standard metrics
  - Complicated to run
  - Systems are often optimized for the benchmarks
  - Example: Transaction Processing Performance Council (TPC) benchmarks
• Real-life applications
5. SPARQL Querying Benchmarks
• Centralized benchmarks
  - Centralized repositories
  - Queries span a single dataset
  - Real or synthetic
  - Examples: LUBM, SP2Bench, BSBM, WatDiv, DBPSB, FEASIBLE
• Federated benchmarks
  - Multiple interlinked datasets
  - Queries span multiple datasets
  - Real or synthetic
  - Examples: FedBench, LargeRDFBench
20. LUBM Queries Characteristic [SNM15]
Queries: 15
Query Forms: SELECT 100.00%, ASK 0.00%, CONSTRUCT 0.00%, DESCRIBE 0.00%
Important SPARQL Constructs: UNION 0.00%, DISTINCT 0.00%, ORDER BY 0.00%, REGEX 0.00%, LIMIT 0.00%, OFFSET 0.00%, OPTIONAL 0.00%, FILTER 0.00%, GROUP BY 0.00%
Result Size: Min 3, Max 1.39E+04, Mean 4.96E+03, S.D. 1.14E+04
BGPs: Min 1, Max 1, Mean 1, S.D. 0
Triple Patterns: Min 1, Max 6, Mean 3, S.D. 1.8126539
Join Vertices: Min 0, Max 4, Mean 1.6, S.D. 1.4040757
Mean Join Vertices Degree: Min 0, Max 5, Mean 2.0222222, S.D. 1.2999796
Mean Triple Patterns Selectivity: Min 0.0003212, Max 0.432, Mean 0.01, S.D. 0.0745
Query Runtime (ms): Min 2, Max 3200, Mean 437.675, S.D. 320.34
21. SP2Bench [SHM+09]
• Synthetic RDF triple stores benchmark
• DBLP bibliographic synthetic data generator
• 12 SPARQL 1.0 queries
• Query design criteria
  - SELECT and ASK SPARQL forms; covers the majority of SPARQL constructs
• Performance metrics
  - Load time, per-query runtime, arithmetic and geometric mean of all query runtimes, memory consumption
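The two aggregate runtime metrics are worth spelling out, since they react very differently to outliers. A minimal sketch, using invented per-query runtimes:

```python
# Arithmetic vs. geometric mean of per-query runtimes, as reported by
# benchmarks like SP2Bench. The runtimes below are made up for illustration.
import math

runtimes_ms = [7, 120, 950, 14000, 713000]  # hypothetical per-query runtimes

arithmetic_mean = sum(runtimes_ms) / len(runtimes_ms)
geometric_mean = math.exp(sum(math.log(t) for t in runtimes_ms)
                          / len(runtimes_ms))

# The geometric mean damps the influence of the few very slow queries,
# which is why reporting both gives a fuller picture of engine behavior.
print(f"arithmetic = {arithmetic_mean:.1f} ms, geometric = {geometric_mean:.1f} ms")
```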
23. SP2Bench Queries Characteristic [SNM15]
Queries: 12
Query Forms: SELECT 91.67%, ASK 8.33%, CONSTRUCT 0.00%, DESCRIBE 0.00%
Important SPARQL Constructs: UNION 16.67%, DISTINCT 41.67%, ORDER BY 16.67%, REGEX 0.00%, LIMIT 8.33%, OFFSET 8.33%, OPTIONAL 25.00%, FILTER 58.33%, GROUP BY 0.00%
Result Size: Min 1, Max 4.34E+07, Mean 4.55E+06, S.D. 1.37E+07
BGPs: Min 1, Max 3, Mean 1.5, S.D. 0.67419986
Triple Patterns: Min 1, Max 13, Mean 5.91666667, S.D. 3.82475985
Join Vertices: Min 0, Max 10, Mean 4.25, S.D. 3.79293602
Mean Join Vertices Degree: Min 0, Max 9, Mean 2.41342593, S.D. 2.26080826
Mean Triple Patterns Selectivity: Min 6.5597E-05, Max 0.53980613, Mean 0.22180428, S.D. 0.20831387
Query Runtime (ms): Min 7, Max 7.13E+05, Mean 2.83E+05, S.D. 5.26E+05
24. Berlin SPARQL Benchmark (BSBM) [BS09]
• Synthetic RDF triple stores benchmark
• E-commerce use case synthetic data generator
• 20 queries
  - 12 SPARQL 1.0 queries for the explore and explore-and-update use cases
  - 8 SPARQL 1.1 analytical queries for the business intelligence use case
• Query design criteria
  - SELECT, DESCRIBE, and CONSTRUCT SPARQL forms; covers the majority of SPARQL constructs
• Performance metrics
  - Load time, Query Mixes per Hour (QMpH), Queries per Second (QpS)
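The two throughput metrics are straightforward ratios; a hedged sketch with invented numbers (not BSBM's reference implementation):

```python
# BSBM-style throughput metrics: QpS for a single query type, and QMpH for a
# full query mix executed back to back. All values below are illustrative.

def qps(num_runs, total_seconds):
    """Queries per Second for one query type."""
    return num_runs / total_seconds

def qmph(mix_runtime_seconds):
    """Query Mixes per Hour, given the runtime of one full query mix."""
    return 3600.0 / mix_runtime_seconds

# Say query Q1 ran 500 times in 12.5 s, and one full mix took 4.2 s:
print(qps(500, 12.5))   # 40.0 queries per second
print(qmph(4.2))        # about 857.1 mixes per hour
```

QMpH captures end-to-end behavior across the whole workload, while QpS isolates the cost of each individual query type.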
26. BSBM Queries Characteristic [SNM15]
Queries: 20
Query Forms: SELECT 80.00%, ASK 0.00%, CONSTRUCT 4.00%, DESCRIBE 16.00%
Important SPARQL Constructs: UNION 8.00%, DISTINCT 24.00%, ORDER BY 36.00%, REGEX 0.00%, LIMIT 36.00%, OFFSET 4.00%, OPTIONAL 52.00%, FILTER 52.00%, GROUP BY 0.00%
Result Size: Min 0, Max 31, Mean 8.312, S.D. 9.0308
BGPs: Min 1, Max 5, Mean 2.8, S.D. 1.7039
Triple Patterns: Min 1, Max 15, Mean 9.32, S.D. 5.18
Join Vertices: Min 0, Max 6, Mean 2.88, S.D. 1.8032
Mean Join Vertices Degree: Min 0, Max 4.5, Mean 3.05, S.D. 1.6375
Mean Triple Patterns Selectivity: Min 9E-08, Max 0.0453, Mean 0.0105, S.D. 0.0142
Query Runtime (ms): Min 5, Max 99, Mean 9.1, S.D. 14.564
27. DBpedia SPARQL Benchmark (DBSB) [MLA+14]
• Real benchmark generation framework based on
the DBpedia dataset with different sizes
DBpedia query log mining
• Clustering of log queries
Name the variables in triple patterns
Select frequently executed queries
Remove SPARQL keywords and prefixes
Compute query similarity using Levenshtein string matching
Compute query clusters using a soft graph clustering algorithm [NS09]
Take query templates (most frequently asked, using the most SPARQL constructs) from clusters with > 5 queries
Generate any number of queries from the query templates
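The similarity step of this pipeline can be sketched as follows; the keyword list and stripping rules are simplified assumptions, and the real pipeline feeds the resulting similarities into BorderFlow graph clustering [NS09] rather than the naive comparison shown here:

```python
# Sketch of the DBSB log-mining similarity step (simplified assumptions).
import re

def strip_sparql(query: str) -> str:
    """Remove SPARQL prefixes and a few keywords before comparison."""
    query = re.sub(r'PREFIX\s+\S+\s+<[^>]*>', '', query, flags=re.I)
    for kw in ('SELECT', 'WHERE', 'DISTINCT', 'FILTER', 'OPTIONAL', 'UNION'):
        query = re.sub(kw, '', query, flags=re.I)
    return ' '.join(query.split())

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(q1: str, q2: str) -> float:
    """Normalized similarity in [0, 1], usable as a clustering edge weight."""
    s1, s2 = strip_sparql(q1), strip_sparql(q2)
    return 1 - levenshtein(s1, s2) / max(len(s1), len(s2), 1)
```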
28. DBSB Queries Features
• Number of triple patterns
Test the efficiency of join operations (CP1)
• SPARQL UNION & OPTIONAL constructs
Handle parallel execution of Unions (CP5)
• Solution sequences & modifiers (DISTINCT)
Efficiency of duplicate elimination (CP10)
• Filter conditions and operators (FILTER, LANG, REGEX, STR)
Efficiency of engines to execute filters as early as possible (CP6)
29. DBSB Queries: Limitations
• Queries are based on 25 templates
• Do not consider features such as the number of join vertices, join vertex degree, triple pattern selectivities, or query execution times
• Only consider SPARQL SELECT queries
• Not customizable for given use cases or the needs of an application
30. Recall: Key SPARQL Queries Characteristics
FEASIBLE [SNM15], WatDiv [AHO+14], LUBM [GPH05] identified:
• Query forms
SELECT, DESCRIBE, ASK, CONSTRUCT
• Constructs
UNION, DISTINCT, ORDER BY, REGEX, LIMIT, FILTER, OPTIONAL, GROUP BY,
Negation
• Features
Result size, No. of BGPs, Number of triple patterns, No. of join vertices, Mean join vertices degree, Mean triple pattern selectivity, Join selectivity, Query runtime, Unbound predicates
31. DBSB Queries Characteristics [SNM15]
Queries (from 25 templates): 125
Query forms: SELECT 100%, ASK 0%, CONSTRUCT 0%, DESCRIBE 0%
Important SPARQL constructs: UNION 36%, DISTINCT 100%, ORDER BY 0%, REGEX 4%, LIMIT 0%, OFFSET 0%, OPTIONAL 32%, FILTER 48%, GROUP BY 0%
Result size: Min 197, Max 4.62E+06, Mean 3.24E+05, S.D. 9.56E+05
BGPs: Min 1, Max 9, Mean 2.695652, S.D. 2.438979
Triple patterns: Min 1, Max 12, Mean 4.521739, S.D. 2.79398
Join vertices: Min 0, Max 3, Mean 1.217391, S.D. 1.126399
Mean join vertices degree: Min 0, Max 5, Mean 1.826087, S.D. 1.435022
Mean triple patterns selectivity: Min 1.19E-05, Max 1, Mean 0.119288, S.D. 0.226966
Query runtime (ms): Min 11, Max 5.40E+04, Mean 1.07E+04, S.D. 1.73E+04
32. Waterloo SPARQL Diversity Test Suite (WatDiv) [AHO+14]
• Synthetic benchmark
Synthetic data generator
Synthetic query generator
• User-controlled data generator
Entities to include
Structuredness [DKS+11] of the dataset
Probability of entity associations
Cardinality of property associations
• Query design criteria
Structural query features
Data-driven query features
33. WatDiv Query Design Criteria
• Structural features
Number of triple patterns
Join vertex count
Join vertex degree
• Data-driven features
Result size
(Filtered) Triple Pattern (f-TP) selectivity
BGP-Restricted f-TP selectivity
Join-Restricted f-TP selectivity
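Plain triple pattern selectivity, the unfiltered case of the f-TP selectivity above, can be sketched with a toy in-memory dataset; the actual suite computes it against the generated RDF data:

```python
# Sketch of triple pattern selectivity: the fraction of the dataset's
# triples matched by one pattern. Dataset and pattern are toy examples.

def tp_selectivity(triples, pattern):
    """pattern is an (s, p, o) tuple; None acts as a wildcard."""
    matches = sum(all(p is None or p == t for p, t in zip(pattern, tr))
                  for tr in triples)
    return matches / len(triples)

data = [('s1', 'type', 'User'), ('s2', 'type', 'User'),
        ('s1', 'likes', 's2'), ('s2', 'follows', 's1')]
print(tp_selectivity(data, (None, 'type', None)))  # → 0.5
```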
34. WatDiv Queries Generation
• Query Template Generator
User-specified number of templates
User-specified template characteristics
• Query Generator
Instantiates the query templates with terms (IRIs, literals etc.) from the RDF
dataset
User-specified number of queries produced
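Template instantiation can be sketched as below; the %slot% placeholder syntax and the toy term lists are assumptions for illustration, since the actual generator draws its terms from the generated dataset:

```python
# Sketch of WatDiv-style template instantiation with a hypothetical
# %name% placeholder syntax and toy term lists.
import random

def instantiate(template, terms, n, seed=0):
    """Produce n concrete queries by filling each %slot% with a term."""
    rng = random.Random(seed)
    queries = []
    for _ in range(n):
        q = template
        for slot, candidates in terms.items():
            q = q.replace('%' + slot + '%', rng.choice(candidates))
        queries.append(q)
    return queries

template = 'SELECT ?x WHERE { ?x <hasGenre> %genre% . }'
qs = instantiate(template, {'genre': ['<Drama>', '<Comedy>']}, 3)
```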
35. WatDiv Queries Characteristics [SNM15]
Query templates: 125
Query forms: SELECT 100.00%, ASK 0.00%, CONSTRUCT 0.00%, DESCRIBE 0.00%
Important SPARQL constructs: UNION 0.00%, DISTINCT 0.00%, ORDER BY 0.00%, REGEX 0.00%, LIMIT 0.00%, OFFSET 0.00%, OPTIONAL 0.00%, FILTER 0.00%, GROUP BY 0.00%
Result size: Min 0, Max 4.17E+09, Mean 3.49E+07, S.D. 3.73E+08
BGPs: Min 1, Max 1, Mean 1, S.D. 0
Triple patterns: Min 1, Max 12, Mean 5.328, S.D. 2.60823
Join vertices: Min 0, Max 5, Mean 1.776, S.D. 0.9989
Mean join vertices degree: Min 0, Max 7, Mean 3.62427, S.D. 1.40647
Mean triple patterns selectivity: Min 0, Max 0.01176, Mean 0.00494, S.D. 0.00239
Query runtime (ms): Min 3, Max 8.82E+08, Mean 4.41E+08, S.D. 2.77E+07
36. FEASIBLE: Benchmark Generation Framework [SNM15]
• Customizable benchmark generation framework
• Generates real benchmarks from query logs
• Can be applied to any SPARQL query log
• Customizable for given use cases or the needs of an application
37. FEASIBLE Queries Selection Criteria
• Query forms
SELECT, DESCRIBE, ASK, CONSTRUCT
• Constructs
UNION, DISTINCT, ORDER BY, REGEX, LIMIT, FILTER, OPTIONAL, GROUP BY,
Negation
• Features
Result size, No. of BGPs, Number of triple patterns, No. of join vertices, Mean
join vertices degree, Mean triple pattern selectivity, Join selectivity, Query
runtime, Unbound predicates
38. FEASIBLE: Benchmark Generation Framework
• Dataset cleaning
• Feature vectors and normalization
• Selection of exemplars
• Selection of benchmark queries
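The feature vector and selection steps above can be sketched as follows; the greedy farthest-point selection is a simplified stand-in for FEASIBLE's exemplar-based clustering, and the feature values are hypothetical:

```python
# Sketch of FEASIBLE's vector step: min-max normalization of query
# features, then greedy farthest-point selection (a stand-in for the
# paper's exemplar clustering). Feature values are hypothetical.
import math

def normalize(vectors):
    """Scale every feature dimension into [0, 1] (min-max)."""
    dims = list(zip(*vectors))
    lo = [min(d) for d in dims]
    span = [max(d) - l or 1.0 for d, l in zip(dims, lo)]
    return [[(v - l) / s for v, l, s in zip(vec, lo, span)]
            for vec in vectors]

def select(vectors, k):
    """Greedily pick k indices maximizing spread in feature space."""
    chosen = [0]  # start from the first query
    while len(chosen) < k:
        best = max((i for i in range(len(vectors)) if i not in chosen),
                   key=lambda i: min(math.dist(vectors[i], vectors[c])
                                     for c in chosen))
        chosen.append(best)
    return chosen

# Hypothetical features: [result size, #triple patterns, runtime ms]
feats = normalize([[1, 2, 7], [100, 5, 50], [5000, 13, 900], [80, 4, 40]])
picked = select(feats, 2)  # indices of the two most dissimilar queries
```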
61. Rank-wise Ranking of Triple Stores
All values are in percentages
No single system is the sole winner or loser for any particular rank
Virtuoso mostly occupies the higher ranks, i.e., ranks 1 and 2 (68.29%)
Fuseki is mostly in the middle ranks, i.e., ranks 2 and 3 (65.14%)
OWLIM-SE is usually on the slower side, i.e., ranks 3 and 4 (60.86%)
Sesame is either fast or slow: rank 1 (31.71% of the queries) or rank 4 (23.14%)
62. FEASIBLE (DBpedia) Queries Characteristics [SNM15]
Queries: 125
Query forms: SELECT 95.20%, ASK 0.00%, CONSTRUCT 4.00%, DESCRIBE 0.80%
Important SPARQL constructs: UNION 40.80%, DISTINCT 52.80%, ORDER BY 28.80%, REGEX 14.40%, LIMIT 38.40%, OFFSET 18.40%, OPTIONAL 30.40%, FILTER 58.40%, GROUP BY 0.80%
Result size: Min 1, Max 1.41E+06, Mean 52183, S.D. 1.97E+05
BGPs: Min 1, Max 14, Mean 3.176, S.D. 3.55841574
Triple patterns: Min 1, Max 18, Mean 4.88, S.D. 4.396846377
Join vertices: Min 0, Max 11, Mean 1.296, S.D. 2.39294662
Mean join vertices degree: Min 0, Max 11, Mean 1.44906666, S.D. 2.13246612
Mean triple patterns selectivity: Min 2.86693E-09, Max 1, Mean 0.140214337, S.D. 0.31899488
Query runtime (ms): Min 2, Max 3.22E+04, Mean 2242.6, S.D. 6961.99191
63. FEASIBLE (SWDF) Queries Characteristics [SNM15]
Queries: 125
Query forms: SELECT 92.80%, ASK 2.40%, CONSTRUCT 3.20%, DESCRIBE 1.60%
Important SPARQL constructs: UNION 32.80%, DISTINCT 50.40%, ORDER BY 25.60%, REGEX 16.00%, LIMIT 45.60%, OFFSET 20.80%, OPTIONAL 32.00%, FILTER 29.60%, GROUP BY 19.20%
Result size: Min 1, Max 3.01E+05, Mean 9091.512, S.D. 4.70E+04
BGPs: Min 0, Max 14, Mean 2.688, S.D. 2.812460752
Triple patterns: Min 0, Max 14, Mean 3.232, S.D. 2.76246734
Join vertices: Min 0, Max 3, Mean 0.52, S.D. 0.65500554
Mean join vertices degree: Min 0, Max 4, Mean 0.968, S.D. 1.09202386
Mean triple patterns selectivity: Min 1.06097E-05, Max 1, Mean 0.29192835, S.D. 0.325138601
Query runtime (ms): Min 4, Max 4.13E+04, Mean 1308.832, S.D. 5335.44123
64. Other Useful Benchmarks
• Semantic Publishing Benchmark (SPB)
• UniProt [RU09][UniProtKB]
• YAGO (Yet Another Great Ontology) [SKW07]
• Barton Library [Barton]
• Linked Sensor Dataset [PHS10]
• WordNet [WordNet]
• Publishing TPC-H as RDF [TPC-H]
• Apples and Oranges [DKS+11]
65. Summary of the centralized SPARQL querying benchmarks
71. Federated Query
• Return the party membership and news pages about all US
presidents.
[Figure: one source provides US presidents with their party memberships; another source provides US presidents with their news pages]
Computing the results requires data from both sources
74. SPLODGE [SP+12]
• Federated benchmark generation tool
• Query design criteria
Query form
Join type
Result modifiers: DISTINCT, LIMIT, OFFSET, ORDER BY
Variable triple patterns
Triple pattern joins
Cross-product triple patterns
Number of sources
Number of join vertices
Query selectivity
• Non-conjunctive queries, i.e., those using the SPARQL UNION or OPTIONAL constructs, are not considered
75. FedBench [FB+11]
• Based on 9 real interconnected datasets
KEGG, DrugBank, ChEBI from the life sciences domain
DBpedia, GeoNames, Jamendo, SWDF, NYT, LMDB from the cross domain
Vary in structuredness and size
• Four sets of queries
7 life sciences queries
7 cross domain queries
11 Linked Data queries
14 queries from SP2Bench
76. FedBench Queries Characteristics
Queries: 25
Query forms: SELECT 100.00%, ASK 0.00%, CONSTRUCT 0.00%, DESCRIBE 0.00%
Important SPARQL constructs: UNION 12%, DISTINCT 0.00%, ORDER BY 0.00%, REGEX 0.00%, LIMIT 0.00%, OFFSET 0.00%, OPTIONAL 4%, FILTER 4%, GROUP BY 0.00%
Result size: Min 1, Max 9054, Mean 529, S.D. 1764
BGPs: Min 1, Max 2, Mean 1.16, S.D. 0.37
Triple patterns: Min 2, Max 7, Mean 4, S.D. 1.25
Join vertices: Min 0, Max 5, Mean 2.52, S.D. 1.26
Mean join vertices degree: Min 0, Max 3, Mean 2.14, S.D. 0.56
Mean triple patterns selectivity: Min 0.001, Max 1, Mean 0.05, S.D. 0.092
Query runtime (ms): Min 50, Max 1.2E+4, Mean 1987, S.D. 3950
77. LargeRDFBench [LB+16]
• 32 Queries
14 simple
10 complex
8 large data
• 14 Interlinked datasets
[Figure: the interlinked datasets — Life Sciences: KEGG, DrugBank, ChEBI, Affymetrix, Linked TCGA-M, Linked TCGA-E, Linked TCGA-A; Cross Domain: DBpedia, New York Times, LinkedMDB, Jamendo, SW Dog Food, GeoNames — connected by links such as owl:sameAs, basedNear, x-geneid, keggCompoundId, country/ethnicity/race, and bcr_patient_barcode; link counts range from 1.3k to 251.3k]
79. LargeRDFBench Queries Properties
• 14 Simple
2-7 triple patterns
Subset of SPARQL clauses
Query execution time around 2 seconds on average
• 10 Complex
8-13 triple patterns
Use more SPARQL clauses
Query execution time up to 10 minutes
• 8 Large Data
Minimum of 80,459 results
Large intermediate results
Query execution time in hours
80. LargeRDFBench Queries Characteristics
Queries: 32
Query forms: SELECT 100.00%, ASK 0.00%, CONSTRUCT 0.00%, DESCRIBE 0.00%
Important SPARQL constructs: UNION 18.75%, DISTINCT 28.21%, ORDER BY 9.37%, REGEX 3.12%, LIMIT 12.5%, OFFSET 0.00%, OPTIONAL 25%, FILTER 31.25%, GROUP BY 0.00%
Result size: Min 1, Max 3.0E+5, Mean 5.9E+4, S.D. 1.1E+5
BGPs: Min 1, Max 2, Mean 1.43, S.D. 0.5
Triple patterns: Min 2, Max 12, Mean 6.6, S.D. 2.6
Join vertices: Min 0, Max 6, Mean 3.43, S.D. 1.36
Mean join vertices degree: Min 0, Max 6, Mean 2.56, S.D. 0.76
Mean triple patterns selectivity: Min 0.001, Max 1, Mean 0.10, S.D. 0.14
Query runtime (ms): Min 159, Max >1hr, Mean undefined, S.D. undefined
82. Performance Metrics
• Efficient source selection in terms of
• Total triple pattern-wise sources selected
• Total number of SPARQL ASK requests used during source selection
• Source selection time
• Query execution time
• Result completeness and correctness
• Number of remote requests during query execution
• Index compression ratio (1 - index size / data dump size)
• Number of intermediate results
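The index compression ratio metric above is straightforward to compute; a small sketch with hypothetical sizes:

```python
# Sketch of the index compression ratio metric from the slide:
# ratio = 1 - (index size / data dump size). Sizes are hypothetical.

def compression_ratio(index_bytes: int, dump_bytes: int) -> float:
    """Closer to 1.0 means a more compact source-selection index."""
    return 1 - index_bytes / dump_bytes

# e.g. a 40 MB index summarizing a 4 GB data dump
print(compression_ratio(40 * 2**20, 4 * 2**30))  # → 0.990234375
```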
83. Future Directions
• Micro benchmarking
• Synthetic benchmark generation
Synthetic data that is like real data
Synthetic queries that are like real queries
• Customizable and flexible benchmark generation
• Fits user needs
• Fits the current use case
• What are the most important choke points for SPARQL querying benchmarks? How are they related to query performance?
84. References
• [L97] Charles Levine. TPC-C: The OLTP Benchmark. In SIGMOD – Industrial
Session, 1997.
• [GPH05] Y. Guo, Z. Pan, and J. Heflin. LUBM: A Benchmark for OWL Knowledge
Base Systems. Journal Web Semantics: Science, Services and Agents on the World
Wide Web archive Volume 3 Issue 2-3, October, 2005 , Pages 158-182
• [SHM+09] M. Schmidt , T. Hornung, M. Meier, C. Pinkel, G. Lausen. SP2Bench: A
SPARQL Performance Benchmark. Semantic Web Information Management, 2009.
• [BS09] C. Bizer and A. Schultz. The Berlin SPARQL Benchmark. Int. J. Semantic
Web and Inf. Sys., 5(2), 2009.
• [BSBM] Berlin SPARQL Benchmark (BSBM) Specification - V3.1. http://wifo5-3.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/index.html.
• [RU09] N. Redaschi and UniProt Consortium. UniProt in RDF: Tackling Data
Integration and Distributed Annotation with the Semantic Web. In Biocuration
Conference, 2009.
85. References
• [UniProtKB] UniProtKB Queries. http://www.uniprot.org/help/query-fields.
• [SKW07]F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A Core of Semantic Knowledge
Unifying WordNet and Wikipedia, In WWW 2007.
• [Barton] The MIT Barton Library dataset. http://simile.mit.edu/rdf-test-data/
• [PHS10] H. Patni, C. Henson, and A. Sheth. Linked sensor data. 2010
• [TPC-H] The TPC-H Homepage. http://www.tpc.org/tpch/
• [WordNet] WordNet: A lexical database for English. http://wordnet.princeton.edu/
• [MLA+14] M. Morsey, J. Lehmann, S. Auer, A-C. Ngonga Ngomo. DBpedia SPARQL Benchmark.
• [SP+12] O. Görlitz, M. Thimm, and S. Staab. SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data. In ISWC, 2012.
• [BNE14] P. Boncz, T. Neumann, O. Erling. TPC-H Analyzed: Hidden Messages and Lessons Learned
from an Influential Benchmark. Performance Characterization and Benchmarking. In TPCTC 2013,
Revised Selected Papers.
86. References
• [NS09] A–C. Ngonga Ngomo and D. Schumacher. Borderflow: A local graph clustering algorithm
for natural language processing. In CICLing, 2009.
• [AHO+14]G. Aluc, O. Hartig, T. Ozsu, K. Daudjee. Diversifed Stress Testing of RDF Data
Management Systems. In ISWC, 2014.
• [SNM15] M. Saleem, Q. Mehmood, and A-C. Ngonga Ngomo. FEASIBLE: A Feature-Based SPARQL Benchmark Generation Framework. In ISWC, 2015.
• [DKS+11] S. Duan, A. Kementsietsidis, Kavitha Srinivas and Octavian Udrea. Apples and oranges: a
comparison of RDF benchmarks and real RDF datasets. In SIGMOD, 2011.
• [FK16] I. Fundulaki and A. Kementsietsidis. Assessing the Performance of RDF Engines: Discussing RDF Benchmarks. Tutorial at ESWC 2016.
• [FB+11] M. Schmidt et al. FedBench: A Benchmark Suite for Federated Semantic Data Query Processing. In ISWC, 2011.
• [LB+16] M. Saleem, A. Hasnain, and A-C. Ngonga Ngomo. LargeRDFBench: A Billion Triples Benchmark for SPARQL Query Federation. Submitted to the Journal of Web Semantics.