This document discusses query optimization in Apache Tajo. It describes how Tajo generates logical, distributed, and local query plans and optimizes them using rule-based and cost-based techniques. Some key optimization techniques include pushing down filters, selecting join algorithms, and optimizing data partitioning progressively during query execution based on intermediate statistics. The document provides examples of query planning and optimization in Tajo.
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine (DataWorks Summit)
This document discusses query optimization and just-in-time (JIT)-based vectorized execution in Apache Tajo. It outlines Tajo's query optimization techniques, including join order optimization and progressive optimization. It also describes Tajo's new JIT-based vectorized query execution engine, which improves performance by using vectorized processing, unsafe memory structures for vectors, and JIT compilation of vectorization primitives. The speaker is a director of research at Gruter who contributes to Apache Tajo and Apache Giraph.
2. About Me
● Jihoon Son (@jihoonson)
○ Tajo project co-founder
○ Committer and PMC member of Apache Tajo
○ Research engineer at Gruter
3. Outline
● Introduction to Tajo
● Query processing in Tajo
○ Query plans in Tajo
○ Query processing example
● Query optimization in Tajo
○ Introduction to query optimization
○ Query optimization techniques in Tajo
4. What is Tajo?
● Apache Top-level Project
○ Data warehouse system
■ Efficient processing of analytic queries
■ ANSI-SQL compliant
○ Scalable and rapid query execution with its own engine
■ Distributed query processing
■ Fault tolerance
○ Beyond SQL-on-Hadoop
■ Support for various types of storage
● HDFS, S3, HBase, RDBMS, ...
5. Highlighted Features
● Support for long-running batch queries as well as interactive ad-hoc queries
○ Fast query processing
■ Optimized scan performance
● 120 MB/sec per physical disk (SATA)
○ Reliability
■ Fault tolerance
■ No single point of failure with HA support
6. Highlighted Features
● Support of various kinds of data sources
○ HDFS, Amazon S3, Google Cloud Storage, HBase, RDBMS, ...
● Mature SQL support
○ Various kinds of join support
○ Window function support
○ Cost-based query optimization
● Integration with other systems
○ Notebooks like Zeppelin
○ BI tools
7. Recent Release: 0.11
● Feature highlights
○ Query federation
○ JDBC-based storage support
○ Self-describing data formats support
○ Multi-query support
○ More stable and efficient join execution
○ Index support
○ Python UDF/UDAF support
8. Architecture Overview
[Architecture diagram] The Tajo Master hosts the Catalog Server, which manages metadata backed by a DBMS or HCatalog. Each Tajo Worker runs a Query Master, a Query Executor, and a Storage Service on top of the underlying storage. Clients (TSQL, WebUI, a JDBC client, or the REST API) submit a query to the Tajo Master, which allocates the query to a Query Master; the Query Master sends tasks to the workers and monitors them.
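Since JDBC is one of the client interfaces in the diagram, here is a hedged example of submitting a query to Tajo over JDBC; the driver class and URL scheme follow the Tajo documentation, while the host, port, and database names are placeholders to adapt to your deployment.

```java
import java.sql.*;

// Submits a query to the Tajo Master through the JDBC interface.
// Driver class and URL scheme as documented for Tajo; the host,
// port, and database here are placeholder values.
public class TajoJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.tajo.jdbc.TajoDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:tajo://127.0.0.1:26002/default");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT item.brand, sum(price) FROM sales, item "
                 + "WHERE sales.item_key = item.item_key GROUP BY item.brand")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```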
9. Query Execution Steps
● Initializing a query execution
[Diagram] ① The Tajo Client submits a query to the Tajo Master. ② The Tajo Master assigns the query to a Query Master on one of the Tajo Workers. ③ That Query Master builds a query execution plan.
10. Query Execution Steps
● Executing a query
[Diagram] ④ The Query Master sends tasks to the Query Executors on the Tajo Workers and monitors them. ⑤ The Query Executors read and process data through their Storage Service. ⑥ The Query Executors send status and progress back to the Query Master.
11. Query Execution Steps
● Finalizing the query execution
[Diagram] ⑦ The Query Executors store the result on storage. ⑧ The Tajo Master is notified that query execution is completed. ⑨ The result location is sent to the Tajo Client. ⑩ The Tajo Client reads the result from storage.
13. Query Execution Plan
● Given a user query, a query execution plan is an ordered set of steps to execute the query
○ Example
■ Read data from storage, then join on some join keys, and finally aggregate with some aggregation keys
● In Tajo, there are three kinds of query plans
○ The Query Master generates a logical query plan and a distributed query plan
○ The Query Executor of each Tajo Worker generates a local query plan
14. Query Planning Steps in Tajo
< Diagram: planning pipeline >
SQL → SQL Analyzer → Algebraic Expression → Logical Planner → Logical Query Plan → Global Planner → Distributed Query Plan → Physical Planner → Local Query Plan
● The SQL Analyzer, Logical Planner, and Global Planner run in the Query Master
● The distributed query plan is distributed to Tajo workers, where the Physical Planner of each Query Executor generates a local query plan
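To make the pipeline concrete, here is a minimal sketch in Python (illustrative only: all names are invented, the logical plan is hard-coded to the running sales/item example, and Tajo's real planner is written in Java):

def analyze(sql):
    # SQL text -> algebraic expression (here: just a token list)
    return sql.replace(",", " ").split()

def plan_logical(expr):
    # algebraic expression -> logical plan (tree of operators);
    # hard-coded to the running example for brevity
    return ("group_by", ("join", ("scan", "item"), ("scan", "sales")))

def plan_global(logical):
    # logical plan -> distributed plan: annotate shuffle boundaries
    return {"plan": logical, "shuffles": ["hash(item_key)", "range(brand)"]}

def plan_physical(distributed):
    # distributed plan -> local plan: pick concrete algorithms
    return {**distributed, "algorithms": {"join": "sort-merge", "group_by": "hash"}}

sql = "SELECT item.brand, sum(price) FROM sales, item WHERE sales.item_key = item.item_key GROUP BY item.brand"
local_plan = plan_physical(plan_global(plan_logical(analyze(sql))))
print(local_plan["algorithms"])  # {'join': 'sort-merge', 'group_by': 'hash'}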
15. Logical Query Plan
● A tree of relational algebra operators
● Example
< SQL >
SELECT item.brand, sum(price)
FROM sales, item
WHERE sales.item_key = item.item_key
GROUP BY item.brand;
< Logical query plan >
Group by (key: brand, func: sum(price))
  Join (key: item_key)
    Scan on item
    Scan on sales
16. Distributed Query Plan
● A plan with additional annotations for distributed execution
○ Data exchange (shuffle) keys, methods, ...
< Logical query plan >
Group by (key: brand, func: sum(price))
  Join (key: item_key)
    Scan on item
    Scan on sales
< Distributed query plan >
Group by (key: brand, func: sum(price)), input via range shuffle with brand
  Join (key: item_key)
    Scan on item, output via hash shuffle with item_key
    Scan on sales, output via hash shuffle with item_key
17. Local Query Plan
● A plan with additional annotations for local execution
○ In-memory algorithm, disk-based algorithm, …
< Distributed query plan > (as on the previous slide)
< Local query plan > (adds the chosen algorithms)
Group by (key: brand, func: sum(price)): hash aggregation
  Join (key: item_key): sort-merge join
    Scan on item, output via hash shuffle with item_key
    Scan on sales, output via hash shuffle with item_key
18. Query Processing in Tajo
● A query is executed as a sequence of stages
○ A stage is the minimum unit of execution and contains at least one operator
● Each stage is processed in parallel by the Query Executors of multiple Tajo workers
< Example >
Stage 2: Join (key: item_key)
Stage 1: Scan on item, Scan on sales
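To illustrate how a plan splits into stages, here is a toy sketch in Python (the tuple-based plan encoding and the "shuffle" marker are invented for this example, not Tajo's internal representation):

# plan as nested tuples; a "shuffle" node marks a stage boundary
plan = ("group_by",
        ("shuffle", "brand",
         ("join",
          ("shuffle", "item_key", ("scan", "item")),
          ("shuffle", "item_key", ("scan", "sales")))))

def split_stages(node, stages=None):
    # walk the tree; every subtree under a shuffle becomes its own stage
    if stages is None:
        stages = []
    children = [c for c in node[1:] if isinstance(c, tuple)]
    for child in children:
        split_stages(child, stages)
    if node[0] == "shuffle":
        stages.append(node[2][0])  # record the top operator of the child stage
    return stages

print(split_stages(plan))  # -> ['scan', 'scan', 'join'], plus the final group_by stage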
19. Query Processing Example
● SQL
SELECT item.brand, sum(price)
FROM sales, item
WHERE sales.item_key = item.item_key
GROUP BY item.brand;
● Logical query plan
Group by (key: brand, func: sum(price))
  Join (key: item_key)
    Scan on item
    Scan on sales
20. Query Processing Example
● Logical query plan (as above)
● Distributed query plan
Stage 3: Group by (key: brand, func: sum(price)), input via range shuffle with brand
Stage 2: Join (key: item_key), inputs via hash shuffle with item_key
Stage 1: Scan on item, Scan on sales
21. Query Processing Example
● Distributed query plan (as above)
● Distributed processing
< Diagram >
Stage 1: five workers run scan tasks over the input splits (item, item, sales, sales, sales)
22. Query Processing Example
● Distributed processing (cont'd)
< Diagram >
Stage 2: the scan outputs are shuffled to five workers, which run the join tasks
23. Query Processing Example
● Distributed processing (cont'd)
< Diagram >
Stage 3: the join outputs are shuffled to five workers, which run the group-by tasks and produce the final aggregation
25. Query Optimization
● User queries are usually not written with performance in mind
● The query optimizer attempts to determine the most efficient way to execute a user query
○ It considers the possible query plans and chooses the best one
26. Extreme Example
● Query
○ select * from t where name like 'tajo%' order by id;
● Possible plans
○ Naive plan: Scan → Sort → Filter
■ Filters tuples only after the sort
■ Large cost for the sort
○ Better plan: Scan with Filter → Sort
■ Filters tuples immediately during the scan
■ Small cost for the sort, and fewer operators overall
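To see why the better plan wins, here is a toy Python sketch (synthetic data; the real cost difference comes from sorting far fewer tuples):

import random, string

random.seed(0)
rows = [{"id": random.randrange(10**6),
         "name": random.choice(["tajo", "hive", "spark"]) + random.choice(string.ascii_lowercase)}
        for _ in range(100_000)]

# naive plan: sort everything (100,000 rows), then filter
naive = [r for r in sorted(rows, key=lambda r: r["id"]) if r["name"].startswith("tajo")]

# better plan: filter during the scan, then sort only the survivors (~1/3 of the rows)
better = sorted((r for r in rows if r["name"].startswith("tajo")), key=lambda r: r["id"])

assert naive == better  # same answer, much cheaper sort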
27. Two Kinds of Query Optimization
● Rule-based optimization
○ A set of predefined rules is used to choose a good plan
○ Usually, heuristic approaches are used
■ Ex) filters should be pushed down to the lower part of the query plan as much as possible
● Cost-based optimization
○ Enumerates the possible query plans and chooses the one with the lowest cost
○ The cost function plays an important role
● Tajo utilizes both types of optimization
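As a minimal sketch of the cost-based side (Python; the candidate plans, cardinalities, and cost function are all made up for illustration):

import math

# candidate plans as (name, [(operator, estimated input rows), ...])
candidates = [
    ("sort_then_filter", [("scan", 1_000_000), ("sort", 1_000_000), ("filter", 1_000_000)]),
    ("filter_then_sort", [("scan", 1_000_000), ("filter", 1_000_000), ("sort", 1_000)]),
]

def cost(ops):
    # toy cost model: sorts cost n*log2(n), everything else costs n
    return sum(n * math.log2(n) if op == "sort" else n for op, n in ops)

best = min(candidates, key=lambda c: cost(c[1]))
print(best[0])  # -> filter_then_sort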
28. Query Optimization in Tajo
● Difference from traditional query optimization
○ Unlike in traditional database systems, pre-collected statistics are not so important
■ Data may be added or updated by several systems, including Flume, Kafka, Tajo, …
■ Pre-collected statistics can be useful, but are not fully trustworthy
○ It is important to optimize query plans with minimal statistics
■ e.g., the volume of the input relations
29. Query Optimization in Tajo
● Tajo has two different approaches to query optimization
○ Static optimization
■ The traditional approach
■ Optimizes the plan during the query planning phase
○ Progressive optimization
■ Optimizes the plan based on intermediate statistics collected while executing the query
● A query plan can be optimized without pre-collected statistics
● Especially effective for queries that require multi-stage execution
30. Logical Query Plan Optimization
● Rule-based optimization
○ Access path rewrite rule
■ Chooses the access path to the data
■ An index scan has the highest priority if available
○ Distributivity rule
■ Reduces filters based on distributivity
○ Filter pushdown rule
■ Pushes filters as far down the plan as possible
○ In-subquery rewrite rule
■ Transforms subqueries in 'IN' filters into semi (or anti) joins
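Here is a toy illustration of the filter pushdown rule in Python (the tuple-based plan encoding and the name-prefix dispatch are simplifications invented for this sketch):

# plans as nested tuples: (operator, args..., children...)
plan = ("filter", "sales.price > 10",
        ("join", "item_key",
         ("scan", "item"),
         ("scan", "sales")))

def push_down_filter(node):
    # if a filter on top of a join references only one side,
    # move it onto that side (toy rule: dispatch on the table
    # name prefix of the predicate)
    if node[0] == "filter" and node[2][0] == "join":
        pred, (_, key, left, right) = node[1], node[2]
        side = pred.split(".")[0]
        if left[1] == side:
            return ("join", key, ("filter", pred, left), right)
        if right[1] == side:
            return ("join", key, left, ("filter", pred, right))
    return node

print(push_down_filter(plan))
# -> ('join', 'item_key', ('scan', 'item'), ('filter', 'sales.price > 10', ('scan', 'sales')))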
31. Logical Query Plan Optimization
● Rule-based optimization (cont'd)
○ Projection pushdown rule
■ Pushes projections as far down the plan as possible
● Cost-based optimization
○ Join order optimization
■ Finds the join order with the lowest cost
■ Greedy heuristic: orders relations from small to large
● Very effective in a single-machine environment
● Needs improvement for parallel computing environments
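A sketch of that greedy size-based heuristic (Python; the relation names and sizes are made up):

# relations with their estimated sizes (rows)
relations = {"item": 10_000, "sales": 5_000_000, "store": 200}

def greedy_join_order(sizes):
    # join from small relations to large ones: sort by size and
    # fold left, so cheap joins happen first and intermediate
    # results (hopefully) stay small
    order = sorted(sizes, key=sizes.get)
    plan = order[0]
    for rel in order[1:]:
        plan = f"({plan} JOIN {rel})"
    return plan

print(greedy_join_order(relations))  # -> ((store JOIN item) JOIN sales)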
32. Distributed Query Plan Optimization
● Rule-based optimization
○ Two-phase execution of operators
■ Operators that require data shuffling, such as aggregation, join, or sort, are executed in two phases
■ The first phase performs local computation to reduce the amount of shuffled data
■ The second phase produces the final result of the operation
33. Two-phase Execution Example
● Logical query plan
Sort
  Group by
    Scan
● Distributed query plan
Stage 3: Sort (second phase)
Stage 2: Group by (second phase), then local sort (first phase)
Stage 1: Scan, then local group by (first phase)
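A toy sketch of two-phase aggregation in Python: each node pre-aggregates its partition locally, so only small partial results cross the shuffle boundary (the data is invented):

from collections import Counter

# three partitions of (brand, price) tuples, as if scanned on three nodes
partitions = [
    [("a", 10), ("b", 5), ("a", 1)],
    [("b", 2), ("b", 3)],
    [("a", 7), ("c", 4)],
]

# phase 1 (local): sum per brand within each partition
partials = []
for part in partitions:
    local = Counter()
    for brand, price in part:
        local[brand] += price
    partials.append(local)

# shuffle + phase 2 (final): merge the small partial results by key
final = Counter()
for local in partials:
    final.update(local)  # Counter.update adds counts

print(dict(final))  # -> {'a': 18, 'b': 10, 'c': 4}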
34. Distributed Query Plan Optimization
● Distributed join algorithm selection
○ Two representative distributed join algorithms
■ A join cannot generally be performed within a single stage in distributed systems
● Tuples with the same join key may be distributed over cluster nodes
■ Repartition join
● Both input relations are shuffled on the join key columns
■ Broadcast join
● Small relations are broadcast to every node before the join
35. Example of Repartition Join
● select … from employee e, department d where e.DeptName = d.DeptName
< Diagram >
36. Example of Broadcast Join
● select … from employee e, department d where e.DeptName = d.DeptName
< Diagram >
37. Distributed Join Algorithm Selection
● Repartition join vs. broadcast join
○ Given a set of joins, some parts can be executed with broadcast join while the remaining parts are executed with repartition join
● Which parts will be executed with broadcast join?
○ Greedy heuristic: broadcast join is used as often as possible
■ The size of each input relation must be smaller than a predefined threshold
■ The total volume of the broadcast relations must not exceed a predefined threshold
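A sketch of that greedy selection in Python (the thresholds and relation sizes are made up; in Tajo this decision happens during global planning):

# estimated sizes of join inputs in bytes (illustrative)
relation_sizes = {"dim_date": 1 << 20, "item": 8 << 20, "store": 2 << 20, "sales": 50 << 30}

PER_RELATION_LIMIT = 10 << 20  # a relation must be under 10 MB to broadcast
TOTAL_LIMIT = 16 << 20         # total broadcast volume must stay under 16 MB

def pick_broadcast_relations(sizes):
    broadcast, total = [], 0
    for rel in sorted(sizes, key=sizes.get):  # greedily, smallest first
        if sizes[rel] < PER_RELATION_LIMIT and total + sizes[rel] <= TOTAL_LIMIT:
            broadcast.append(rel)
            total += sizes[rel]
    return broadcast  # everything else falls back to repartition join

print(pick_broadcast_relations(relation_sizes))  # -> ['dim_date', 'store', 'item']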
39. Local Query Plan Optimization
● Selects the best algorithm based on the current resource status
○ Aggregation
■ Hash aggregation, sort aggregation
○ Join
■ Hash join, sort-merge join
● Hash-based algorithms are basically used, spilling data to disk when it does not fit into memory
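A toy decision rule in Python (the memory estimate and the API are invented for illustration) for picking the local aggregation algorithm:

def choose_aggregation(est_hash_table_bytes, free_memory_bytes):
    # prefer hash aggregation while the hash table is expected to fit
    # in memory; otherwise fall back to sort-based aggregation, which
    # streams and spills gracefully
    if est_hash_table_bytes <= free_memory_bytes:
        return "hash aggregation"
    return "sort aggregation"

print(choose_aggregation(512 << 20, 2 << 30))  # fits   -> hash aggregation
print(choose_aggregation(8 << 30, 2 << 30))    # spills -> sort aggregation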
40. Progressive Optimization
● Data repartitioning
○ Some operators, like join and aggregation, require shuffling data by key
○ The number of result partitions of a shuffle should be carefully decided
■ The number of partitions determines the number of tasks in the next stage
● At the beginning of each stage, the number of partitions is decided based on the actual input size
41. Progressive Optimization Example
● Plan: Stage 1 (Scan on item, local group by) → Stage 2 (Group by, local sort) → Stage 3 (Sort)
● If the default task size is 1 GB:
○ Stage 1 reads item (100 GB), so it runs 100 tasks, and its shuffle is initially planned with 100 partitions
○ The observed output of Stage 1 is only 50 GB, so the shuffle is re-planned to 50 partitions and Stage 2 runs 50 tasks
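The partition-count rule in this example reduces to simple arithmetic; a sketch in Python with the slide's numbers:

import math

DEFAULT_TASK_SIZE_GB = 1

def num_partitions(input_size_gb):
    # one partition (and hence one downstream task) per task-sized
    # chunk of the stage's actual input
    return math.ceil(input_size_gb / DEFAULT_TASK_SIZE_GB)

print(num_partitions(100))  # planned from the 100 GB scan -> 100
print(num_partitions(50))   # re-planned from the 50 GB intermediate -> 50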
42. Future Work
● Adding more optimization methods
● Improving cost functions for more effective cost-based optimization
● Adding new approaches for progressive optimization
○ Runtime query rewriting
○ Integration with genetic algorithms
○ …
43. Get Involved!
● General
○ http://tajo.apache.org
● Getting Started
○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
○ http://tajo.apache.org/downloads.html
● Jira – Issue Tracker
○ https://issues.apache.org/jira/browse/TAJO
● Join the mailing list
○ dev-subscribe@tajo.apache.org
○ issues-subscribe@tajo.apache.org