This document describes Orca, a new query optimizer architecture developed by Pivotal for its data management products. Orca is designed to be modular and portable, allowing it to optimize queries for both massively parallel processing (MPP) databases and Hadoop systems. The key features of Orca include its use of a memo structure to represent the search space of query plans, a job scheduler to efficiently explore the search space in parallel, and an extensible framework for property enforcement during query optimization. Performance tests showed that Orca provided query speedups of 10x to 1000x over previous optimization systems.
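For readers new to this style of optimizer, the memo idea is easy to sketch. Below is a minimal, hypothetical illustration in Python (Orca itself is written in C++; nothing here is its actual code): logically equivalent plan alternatives are deduplicated into groups, and expressions point at child groups rather than concrete subplans, so alternatives share subtrees.

```python
# A minimal, hypothetical sketch of a Cascades-style memo (not Orca's actual code).
# A Group holds all logically equivalent expressions; a GroupExpression points
# to child *groups*, not concrete subplans, so alternatives share subtrees.

class GroupExpression:
    def __init__(self, op, child_groups):
        self.op = op                      # e.g. "Join", "Scan(t1)"
        self.child_groups = child_groups  # list of group ids

class Memo:
    def __init__(self):
        self.groups = []    # group id -> list of GroupExpressions
        self.dedup = {}     # (op, child group ids) -> group id

    def insert(self, op, child_groups):
        """Insert an expression, reusing an existing group if one matches."""
        key = (op, tuple(child_groups))
        if key in self.dedup:
            return self.dedup[key]
        gid = len(self.groups)
        self.groups.append([GroupExpression(op, child_groups)])
        self.dedup[key] = gid
        return gid

memo = Memo()
t1 = memo.insert("Scan(t1)", [])
t2 = memo.insert("Scan(t2)", [])
join = memo.insert("Join", [t1, t2])
# A commuted join alternative lands in the same group without
# duplicating the scan subtrees:
memo.groups[join].append(GroupExpression("Join", [t2, t1]))
```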
GPORCA is a newly open-sourced advanced query optimizer, developed as a subproject of the open source Greenplum Database project. GPORCA is the query optimizer used in commercial distributions of both Greenplum and HAWQ. In these distributions, GPORCA has achieved up to 1000x performance improvements across TPC-DS queries by focusing on three distinct areas: Dynamic Partition Elimination, Subquery Unnesting, and Common Table Expressions.
Now that GPORCA is open source, we are looking for collaborators to help us realize the ultimate dream for GPORCA - to work with any database.
The new breed of Big Data management systems has to process so much data that optimization mistakes are magnified in traditional optimizers. Furthermore, hand-coding and manually optimizing complex queries has proven to be hard.
In this session, Venkatesh will discuss:
- Overview of GPORCA
- How to add GPORCA to HAWQ with a build option
- How GPORCA could be made to work with any database
- Future vision for GPORCA and more immediate plans
- How to work with GPORCA, and how to contribute to GPORCA
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust... (PingCAP)
This paper proposes interleaving with coroutines for any type of index join. It showcases the proposal on SAP HANA by implementing binary search and CSB+-tree traversal for an instance of index join related to dictionary compression. Coroutine implementations not only perform similarly to prior interleaving techniques, but also resemble the original code closely, while supporting both interleaved and non-interleaved execution. Thus, this paper claims that coroutines make interleaving practical for use in real DBMS codebases.
Paper: http://www.vldb.org/pvldb/vol11/p230-psaropoulos.pdf
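The control-flow pattern behind the paper is easy to demonstrate with Python generators. The toy sketch below only illustrates how lookups are suspended and resumed by a scheduler; the paper's implementation uses C++ coroutines plus memory prefetch instructions to actually hide cache-miss latency.

```python
# Toy sketch of interleaved index lookups using Python generators.
# The paper uses C++ coroutines that suspend after issuing a memory
# prefetch; here each binary search simply yields at the point where
# a cache miss would occur, and a scheduler round-robins the searches.

from collections import deque

def binary_search(data, key):
    lo, hi = 0, len(data) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        yield               # suspension point: "prefetch data[mid], switch"
        if data[mid] == key:
            return
        elif data[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1

def run_interleaved(data, keys):
    """Advance each pending lookup by one step per scheduling round."""
    pending = deque(binary_search(data, k) for k in keys)
    while pending:
        coro = pending.popleft()
        try:
            next(coro)
            pending.append(coro)   # not done yet; requeue
        except StopIteration:
            pass                   # this lookup finished

run_interleaved(list(range(0, 1000, 3)), [9, 300, 601, 999])
```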
[Paper Reading] Orca: A Modular Query Optimizer Architecture for Big Data (PingCAP)
The performance of analytical query processing in data management systems depends primarily on the capabilities of the system's query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer.
In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development effort uniting state-of-the-art query optimization technology with our own original research, resulting in a modular and portable optimizer architecture.
In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.
GPORCA is the query optimizer used inside Greenplum Database, the first open source MPP solution based on PostgreSQL.
These slides were presented at PGConf Seattle 2017. They introduce the internals of GPORCA and provide OSS developers with the context to contribute back to the project.
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDK (Principled Technologies)
OpenJDK is an efficient foundation for distributed data processing and analytics using Apache Hadoop. In our testing of a Hortonworks HDP 2.0 distribution running on Red Hat Enterprise Linux 6.5, we found that Hadoop performance using OpenJDK was comparable to the performance using Oracle JDK. Comparable performance paired with automatic updates means that OpenJDK can benefit organizations using Red Hat Enterprise Linux-based Hadoop deployments.
Present and future of unified, portable, and efficient data processing with A... (DataWorks Summit)
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
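As a taste of the model, here is a minimal word-count-style pipeline written against the Beam Python SDK (assuming apache-beam is installed; the same pipeline can target Dataflow, Flink, or Spark by switching the runner):

```python
# Minimal Apache Beam pipeline (Python SDK). The same code runs on the
# local DirectRunner or, by changing pipeline options, on Dataflow,
# Flink, or Spark; that portability is the point of the Beam model.

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(["run any pipeline", "anywhere", "any pipeline"])
        | "Split" >> beam.FlatMap(str.split)
        | "PairWithOne" >> beam.Map(lambda w: (w, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```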
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl... (PingCAP)
Being one of the most complex components of a DBMS, query optimizers could benefit from adaptive policies that are learned systematically from the data and the query workload. This paper takes the approach used by Marcus et al. in Bao and adapts it to SCOPE, a big data system used internally at Microsoft. Along the way, multiple new challenges had to be solved. The paper also evaluates the efficacy of the approach on production workloads that include 150K daily jobs.
Paper:
https://dl.acm.org/doi/pdf/10.1145/3448016.3457568
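The core steering idea can be sketched in a few lines: treat each hint set (e.g., optimizer rule on/off flags) as a bandit arm, predict each arm's cost with a learned model, and mostly pick the predicted best while occasionally exploring. The toy below is illustrative only; Bao and its SCOPE adaptation use learned models over plan trees, and the hint-set names here are invented.

```python
# Toy sketch of Bao-style optimizer steering: choose among hint sets
# (arms) using a learned cost model plus epsilon-greedy exploration.
# The real systems use plan-tree neural models; everything below is
# illustrative, including the hint-set names.

import random

HINT_SETS = ["default", "no_nested_loop", "no_merge_join", "hash_only"]
history = {h: [] for h in HINT_SETS}   # observed runtimes per hint set

def predicted_cost(hint_set):
    """Stand-in for the learned value model: mean of observed runtimes."""
    runs = history[hint_set]
    return sum(runs) / len(runs) if runs else 0.0   # optimistic for unseen arms

def choose_hint_set(epsilon=0.1):
    if random.random() < epsilon:                   # explore
        return random.choice(HINT_SETS)
    return min(HINT_SETS, key=predicted_cost)       # exploit

def record(hint_set, runtime_seconds):
    history[hint_set].append(runtime_seconds)

# One optimize-execute-learn step:
arm = choose_hint_set()
record(arm, runtime_seconds=12.3)   # runtime would come from actually running the plan
```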
Arun C Murthy, Founder and Architect at Hortonworks Inc., talks about the upcoming Next Generation Apache Hadoop MapReduce framework at the Hadoop Summit, 2011.
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory (Databricks)
Data volumes are growing rapidly in the big data space, and more and more memory is consumed either in computation or in holding intermediate data for analytic jobs. For memory-intensive workloads, end users have to scale out the compute cluster or extend memory with storage like HDD or SSD to meet the requirements of their computing tasks. When scaling out the cluster, the extra cost of cluster management, operation, and maintenance increases the total cost if the extra CPU resources are not fully utilized. To address this shortcoming, Intel Optane DC persistent memory (Optane DCPM) breaks the traditional memory/storage hierarchy and scales up the computing server with higher-capacity persistent memory; it also brings higher bandwidth and lower latency than storage like SSD or HDD. Apache Spark is widely used for analytics like SQL and machine learning in cloud environments, where the low performance of remote data access is a typical bottleneck, especially for I/O-intensive queries. ML workloads are iterative, so I/O bandwidth is key to their end-to-end performance. In this talk, we will introduce how to accelerate Spark SQL with OAP (https://github.com/Intel-bigdata/OAP) to achieve an 8X SQL performance gain on the cloud, and how to use the RDD cache to improve K-means performance by 2.5X, both leveraging Intel Optane DCPM. We will also take a deep dive into how Optane DCPM enables these performance gains.
Speakers: Cheng Xu, Piotr Balcer
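Independent of the OAP extensions themselves, stock Spark already exposes the knob the RDD-cache experiment turns: the storage level of cached data. A plain PySpark illustration follows (OAP's own APIs are documented in its repository and are not shown here):

```python
# Plain PySpark illustration: choosing a cache storage level. With a
# large (persistent) memory tier, MEMORY_AND_DISK-style levels spill
# less, which is the effect the talk measures for K-means RDD caching.
# Only stock Spark APIs are used here, not the OAP extensions.

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

points = spark.range(0, 1_000_000).rdd.map(lambda r: (r.id % 100, r.id))
points.persist(StorageLevel.MEMORY_AND_DISK)   # cached for iterative reuse

# An iterative job (e.g., K-means) would now re-read 'points' from cache
# on every iteration instead of recomputing it.
print(points.count(), points.count())
spark.stop()
```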
This workshop will provide a hands-on introduction to Apache Spark and Apache Zeppelin in the cloud.
Format: A short introductory lecture on Apache Spark covering core modules (SQL, Streaming, MLlib, GraphX) followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick and short hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, Apache Ambari, and Apache Zeppelin. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data, and then issue some SQL queries (see the PySpark sketch after this entry).
Lab pre-requisites: Registrants must bring a laptop with a Chrome or Firefox web browser installed (with proxies disabled). Alternatively, they may download and install an HDP Sandbox as long as they have at least 16GB of RAM available (Note that the sandbox is over 10GB in size so we recommend downloading it before the crash course).
Speakers: Robert Hryniewicz
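As a rough preview of the lab's flow, here is a PySpark sketch with placeholder paths and table names (it assumes a Spark build with Hive support):

```python
# Rough sketch of the lab's flow in PySpark; the HDFS path and table
# names are placeholders. Requires a Spark build with Hive support.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-lab")
         .enableHiveSupport()
         .getOrCreate())

# 1) Read raw data (here, a CSV already placed in HDFS)
df = spark.read.option("header", True).csv("hdfs:///tmp/lab/input.csv")

# 2) Create a Hive table, stored as ORC
spark.sql("CREATE DATABASE IF NOT EXISTS lab_db")
df.write.format("orc").mode("overwrite").saveAsTable("lab_db.events")

# 3) Explore and transform with Spark SQL
spark.sql("""
    SELECT some_column, COUNT(*) AS cnt
    FROM lab_db.events
    GROUP BY some_column
    ORDER BY cnt DESC
""").show()
```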
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo... (Databricks)
Apache Spark is rapidly becoming the de facto framework for big data analytics. Spark’s built-in, large-scale Machine Learning Library (MLlib) uses traditional stochastic gradient descent (SGD) to solve standard ML algorithms. However, MLlib currently provides limited coverage of ML algorithms. Further, the convergence of the adopted SGD approach is heavily dictated by issues such as step-size selection and the conditioning of the problem, making it difficult for non-expert end users to adopt.
In this session, the speakers introduce a large-scale ML tool built on the Alternating Direction Method of Multipliers (ADMM) on Spark to solve a gamut of ML algorithms. The proposed approach decomposes most ML problems into smaller sub-problems suitable for distributed computation in Spark.
Learn how this toolkit provides a wider range of ML algorithms, better accuracy compared to MLlib, robust convergence criteria, and a simple Python API suitable for data scientists, making it easy for end users to develop advanced ML algorithms at scale without worrying about the underlying intricacies of the optimization solver. It’s a useful addition to the data scientist’s ML arsenal on Spark.
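For intuition, here is the classic single-machine ADMM loop for the lasso in NumPy. The session's toolkit distributes the x-update across Spark partitions; this sketch only shows the three-step structure of the method.

```python
# Single-machine ADMM for the lasso: min (1/2)||Ax - b||^2 + lam*||z||_1
# subject to x = z. A distributed version solves the x-update per
# partition and averages; this NumPy sketch shows the three-step loop.

import numpy as np

def soft_threshold(v, k):
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=100):
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA_rhoI = A.T @ A + rho * np.eye(n)   # factor reused by every x-update
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(AtA_rhoI, Atb + rho * (z - u))  # x-update
        z = soft_threshold(x + u, lam / rho)                # z-update (shrinkage)
        u = u + x - z                                       # dual update
    return z

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))
true_x = np.zeros(50); true_x[:5] = 1.0
b = A @ true_x + 0.01 * rng.normal(size=200)
print(np.round(admm_lasso(A, b)[:8], 2))   # first coefficients near 1, rest near 0
```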
A TPC Benchmark of Hive LLAP and Comparison with Presto (Yu Liu)
This is a TPC-H/DS benchmark of both Hive LLAP (Low Latency Analytical Processing) and Presto, comparing the two popular big data query engines.
The results show significant advantages for Hive LLAP in performance and stability.
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ... (Srivatsan Ramanujam)
Unstructured data is everywhere: in the form of posts, status updates, bloglets, or news feeds in social media, or in the form of customer interactions in call center CRM systems. While many organizations study and monitor social media for tracking brand value and targeting specific customer segments, in our experience blending unstructured data with structured data to supplement data science models has been far more effective than working with either independently.
In this talk we showcase an end-to-end topic and sentiment analysis pipeline we've built on the Pivotal Greenplum Database platform for Twitter feeds from GNIP, using open source tools like MADlib and PL/Python. We've used this pipeline to build regression models that predict commodity futures from tweets and to enhance churn models for telecom through topic and sentiment analysis of call center transcripts. All of this was possible because of the flexibility and extensibility of the platform we worked with.
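To give a flavor of the in-database approach, here is a toy word-list sentiment scorer. In Greenplum, a body like this would be wrapped in CREATE FUNCTION ... LANGUAGE plpythonu so it runs on every segment in parallel; the word lists are placeholders, not the lexicon used in the actual pipeline.

```python
# Toy sentiment scorer of the kind you might register in Greenplum via
# CREATE FUNCTION ... LANGUAGE plpythonu, so it executes in-database on
# every segment in parallel. The word lists are placeholders.

POSITIVE = {"good", "great", "up", "bullish", "win"}
NEGATIVE = {"bad", "poor", "down", "bearish", "loss"}

def sentiment_score(text):
    """Return a score in [-1, 1]: (pos - neg) / matched words."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched

print(sentiment_score("Corn futures up, great harvest expected"))   # 1.0
print(sentiment_score("bearish outlook, poor yields"))              # -1.0
```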
SQL and Machine Learning on Hadoop using HAWQ (pivotalny)
It is so widely accepted as to be almost rhetorical to say:
“Many enterprises have adopted HDFS as the foundational layer for their data lakes. HDFS provides the flexibility to store any kind of data, and more importantly it is infinitely scalable on commodity hardware.”
But the missing piece to date has been a low-latency query engine for HDFS.
At Pivotal, we cracked that problem, and the answer is HAWQ, which we intend to open source this year. During this event, we will present and demo HAWQ’s architecture, its powerful ANSI SQL features, and its ability to transcend traditional BI in the form of in-database analytics (or machine learning).
Exploiting machine learning to keep Hadoop clusters healthy (DataWorks Summit)
Oath has one of the largest Hadoop footprints, with tens of thousands of jobs run every day, so reliability and consistency are key. With 50k+ nodes, a considerable number of nodes will have disk, memory, network, or slowness issues at any time. Hosts with issues serving or running jobs can increase the run times of tight SLA-bound jobs exponentially and frustrate both users and the support team trying to debug them.
We are constantly working to develop systems that work in tandem with Hadoop to quickly identify and single out pressure points. Here we concentrate on disks: in our experience, disks are the biggest troublemakers and the most fragile components, especially high-density disks. Because of the huge scale and the monetary impact of slow-performing disks, we took on the challenge of building a system to predict worn-out disks and take them out before they become performance bottlenecks and hit jobs' SLAs. Now, the task is simple: look at the symptoms of hard drive failure and take the drives out, right? It is not so straightforward when we are talking about 200k+ disk drives. Just collecting such data periodically and reliably is a small challenge compared to analyzing such huge datasets and predicting bad disks. For each disk we have the reallocated sectors count, reported uncorrectable errors, command timeouts, and the uncorrectable sector count; on top of that, each hard disk model has its own interpretation of these statistics.
Speakers: Dheeraj Kapur, Principal Engineer, Oath; Swetha Banagiri
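In its simplest form, the modeling step reduces to supervised learning over SMART counters. A hedged scikit-learn sketch with invented feature values follows; a production system must additionally handle per-vendor SMART semantics and the extreme class imbalance between healthy and failing drives.

```python
# Minimal sketch: classify disks as at-risk from SMART counters with
# scikit-learn. Feature values are invented; a production system must
# handle per-vendor SMART semantics and heavy class imbalance
# (healthy disks vastly outnumber failing ones).

from sklearn.ensemble import RandomForestClassifier

# Features per disk: [reallocated_sectors, uncorrectable_errors,
#                     command_timeouts, pending_sectors]
X = [
    [0, 0, 0, 0],
    [2, 0, 1, 0],
    [150, 4, 20, 30],   # degraded drive
    [300, 9, 55, 80],   # failing drive
    [1, 0, 0, 0],
    [220, 6, 40, 60],
]
y = [0, 0, 1, 1, 0, 1]   # 1 = failed within the following weeks

clf = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                             random_state=0).fit(X, y)
print(clf.predict([[180, 5, 25, 40]]))     # likely flagged at-risk
print(clf.predict_proba([[0, 0, 0, 0]]))   # healthy disk
```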
This is a 101 introduction to GPORCA for open source developers. GPORCA is an open source query optimizer for SQL on MPP (massively parallel processing) database systems like Greenplum. You will find an overview of GPORCA, as well as how to debug it and how to contribute back to the OSS community.
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ... (DataWorks Summit)
Many organizations today have already migrated Hadoop workloads to cloud infrastructure, or they are actively planning such a migration. A common question in this scenario is "Which instance types should I use for my Hadoop cluster?" There are nuances to cloud infrastructure that require careful consideration when deciding which instance types to use. This session will show the results of a performance comparison of Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance types commonly used in Hadoop clusters. More importantly, we will discuss the relative cost of these instance types to demonstrate which AWS instances offer the best price-to-performance ratio on standard benchmarks. Attendees of this session will leave with a better understanding of the performance of AWS EC2 instance types when used for Hadoop workloads and be able to make more informed decisions about which instance types make the most sense for their needs.
Speakers
Michael Young, Senior Solutions Engineer, Hortonworks
Marcus Waineo, Principal Solutions Engineer, Hortonworks
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. Although Hive started primarily as a batch ingestion and reporting tool, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, micro-managed tables, and workload management are some noteworthy features.
I will deep-dive into some optimizations which promise to provide major performance gains. Support for ACID tables has also improved considerably. Although some of these features and enhancements are not novel and have existed for years in other DB systems, implementing them in Hive poses some unique challenges and yields lessons which are generally applicable in many other contexts. I will also provide a glimpse of what is expected to come in the near future.
Speaker: Ashutosh Chauhan, Engineering Manager, Hortonworks
Hadoop distributions can be a combination of 25+ open source projects. Enterprise adoptions involve various kinds of workloads and environments, with vectors like operating system, JDK, database, security, Ranger authorization, encryption, TDE, and so on. Ensuring quality for such a complex stack and its combinations can be overwhelming.
In this talk we will cover the technologies involved in automated validation of the stack. Our testing journey begins with the ingestion of commits from Apache and reaches the finish line as we GA the stack distribution. As we describe this journey, we will walk through how quality is established at various stages: commit, nightly testing, pre-prod, and readiness. We will go over the challenges we face as we cater to several releases (major, maintenance, and hot-fixes) all at the same time, how we tackled them with a YARN-on-YARN infrastructure, how we use test methodologies to bring efficiencies, and how log AI comes to the rescue. We will conclude the talk with a case study of an end-to-end workflow test.
Speaker
Sunitha Velpula, Director of Engineering Quality, Hortonworks
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud (Gluent)
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
In the end, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
Omid: scalable and highly available transaction processing for Apache Phoenix (DataWorks Summit)
Apache Phoenix is an OLTP and operational analytics engine for Hadoop. To ensure the correctness of operations, Phoenix requires a transaction processor that guarantees all data accesses satisfy the ACID properties. Traditionally, Apache Phoenix has used the Apache Tephra transaction processing technology. Recently, we introduced into Phoenix support for Apache Omid, an open source transaction processor for HBase that is used at Yahoo at large scale.
A single Omid instance sustains hundreds of thousands of transactions per second and provides high availability at zero cost for mainstream processing. Omid, as well as Tephra, are now configurable choices for the Phoenix transaction processing backend, enabled by the newly introduced Transaction Abstraction Layer (TAL) API. The integration required introducing many new features and operations to Omid and will become generally available in early 2018.
In this talk, we walk through the challenges of the project, focusing on the new use cases introduced by Phoenix and how we address them in Omid.
Speaker
Ohad Shacham, Senior Research Scientist, Yahoo Research, Oath
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn... (Srivatsan Ramanujam)
These are slides from my talk @ DataDay Texas, in Austin on 30 Mar 2013
(http://2013.datadaytexas.com/schedule)
Favorite and Fork PyMADlib on GitHub: https://github.com/gopivotal/pymadlib
MADlib: http://madlib.net
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu (Databricks)
Catalyst, the optimizer in Spark SQL, provides an open interface for rule-based optimization in the planning stage. However, static rule-based optimization cannot take runtime data distribution into account. A technology called Adaptive Execution was introduced in Spark 2.0 to cover this gap, but it is still at an early stage. We enhanced the existing Adaptive Execution feature, focusing on adjusting the execution plan at runtime according to the intermediate outputs of completed stages: setting partition numbers for joins and aggregations, avoiding unnecessary data shuffling and disk I/O, handling data-skew cases, and even optimizing the join order as a CBO does. In our benchmark experiments, this feature saves huge manual effort in tuning parameters like the shuffle partition number, which is error-prone and misleading. In this talk, we will present the new adaptive execution framework, task scheduling, failover retry mechanism, runtime plan switching, and more. Finally, we will share our experience benchmarking TPCx-BB at the 100-300 TB scale on a bare-metal Spark cluster of hundreds of nodes.
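The ideas described in this talk later landed in mainline Spark as Adaptive Query Execution (AQE), enabled by default since Spark 3.2. Turning on the runtime behaviors discussed above is now a matter of configuration in any Spark 3.x session:

```python
# The runtime re-planning described in this talk ships in mainline
# Spark 3.x as Adaptive Query Execution (AQE). These are stock Spark
# configuration keys, shown here for a PySpark session.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("aqe-demo")
         # Re-optimize the plan between stages using runtime statistics.
         .config("spark.sql.adaptive.enabled", "true")
         # Merge small shuffle partitions instead of hand-tuning their number.
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         # Split skewed partitions in sort-merge joins at runtime.
         .config("spark.sql.adaptive.skewJoin.enabled", "true")
         .getOrCreate())
```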
We are in the era of Cloud Computing and Big Data. Computing resources are becoming cheaper and data is becoming more valuable, so programmers have a unique ability to scale businesses economically and strategically.
Mid-Tier businesses have opportunities to expand their scope of business. The role of the Atlas Software Group is to build capacity within companies by leveraging technical resources to improve operational efficiencies.
All of the material inside is unlicensed; kindly use it for educational purposes only, and please do not commercialize it.
In the spirit of 'ilman nafi'an (beneficial knowledge), we hope this file is beneficial for you.
Thank you.
EMC Isilon Multitenancy for Hadoop Big Data Analytics (EMC)
This white paper discusses the EMC Isilon scale-out storage platform, which provides multitenancy through access zones that segregate tenants and their data sets for a scalable, multitenant storage solution for Hadoop and other analytics applications.
HAWQ: a massively parallel processing SQL engine in Hadoop (BigData Research)
HAWQ, developed at Pivotal, is a massively parallel processing SQL engine sitting on top of HDFS. As a hybrid of an MPP database and Hadoop, it inherits merits from both parties. It adopts a layered architecture and relies on the distributed file system for data replication and fault tolerance. In addition, it is standard SQL compliant, and unlike other SQL engines on Hadoop, it is fully transactional. This paper presents the novel design of HAWQ, including query processing, the scalable software interconnect based on the UDP protocol, transaction management, fault tolerance, read-optimized storage, the extensible framework for supporting various popular Hadoop-based data stores and formats, and the various optimization choices we considered to enhance query performance. An extensive performance study shows that HAWQ is about 40x faster than Stinger, which is reported to be 35x-45x faster than the original Hive.
Organizations adopt different databases for big data, which is huge in volume and has varied data models. Querying big data is challenging yet crucial for any business. Data warehouses traditionally built with On-line Transaction Processing (OLTP)-centric technologies must be modernized to scale to the ever-growing demand for data. With rapidly changing requirements, it is important to get near-real-time responses from the big data gathered so that business decisions needed to address new challenges can be made in a timely manner. The main focus of our research is to improve the performance of query execution for big data.
Performance evaluation and estimation model using regression method for hadoo... (redpel dot com)
Performance evaluation and estimation model using the regression method for Hadoop word count.
For more IEEE papers / full abstracts / implementations, visit www.redpel.com
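In the spirit of the paper, a runtime-estimation model can be as simple as a linear regression over job parameters. The training rows below are invented for illustration; the paper fits a comparable model to measured runs.

```python
# Illustrative regression model estimating Hadoop word-count runtime
# from job parameters. Training rows are invented; the paper fits a
# similar model to measured runs.

from sklearn.linear_model import LinearRegression

# Features per run: [input_size_gb, num_mappers, num_reducers]
X = [
    [1, 2, 1],
    [2, 4, 2],
    [4, 4, 2],
    [4, 8, 4],
    [8, 8, 4],
]
y = [95, 110, 190, 150, 260]   # measured runtimes in seconds (made up)

model = LinearRegression().fit(X, y)
print(model.predict([[6, 8, 4]]))   # estimated runtime for an unseen job
```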
sudoers: Benchmarking Hadoop with ALOJA (Nicolas Poggi)
Presentation for the sudoers Barcelona group, Oct 06 2015, on benchmarking Hadoop with the ALOJA open source benchmarking platform. The presentation was mostly a live demo; these slides are posted for the people who could not attend.
http://lanyrd.com/2015/sudoers-barcelona-october/
MIGRATION OF AN OLTP SYSTEM FROM ORACLE TO MYSQL AND COMPARATIVE PERFORMANCE ... (cscpconf)
Across the various RDBMS vendors, Oracle has more than 60% [6] of the market share, with a complete, feature-rich, and secure offering. This has made Oracle the default database choice for systems of all sizes.
There are many open source databases, such as MySQL and PostgreSQL, which have now evolved into complete, feature-rich offerings and come with zero licensing fees. This makes it an attractive proposition to migrate from Oracle to an open source distribution to cut down on licensing costs.
Migrating an application from a commercial vendor to open source raises the typical concerns of functionality and performance. Though there are various tools and offerings available for migration, there currently exist no reference points for the exact effort and impact of migration on the application. We therefore studied the impact and effort involved in migrating an OLTP application. We successfully migrated the application and did a performance comparison, which is covered in the paper. The paper also covers the tool and methodology used, along with the limitations of MySQL, and presents the learnings of the entire exercise.
Performance evaluation of Map-reduce jar pig hive and spark with machine lear... (IJECEIAES)
Big data poses some of the biggest challenges, as we need systems with huge processing power and good algorithms to make decisions. We need a Hadoop environment with Pig, Hive, machine learning, and the Hadoop ecosystem components. The data comes from industries, from the many devices and sensors around us, and from social media sites. According to McKinsey, there will be a shortage of 1.5 million big data professionals by the end of 2020. There are many technologies to solve the problems of big data storage and processing, such as Apache Hadoop, Apache Spark, and Apache Kafka. Here we analyze the processing speed for 4 GB of data on CloudxLab with Hadoop MapReduce (varying the numbers of mappers and reducers), with Pig scripts and Hive queries, and in a Spark environment along with machine learning technology. From the results we can say that machine learning with Hadoop and Spark enhances processing performance; that Spark is better than Hadoop MapReduce, Pig, and Hive; and that Spark with Hive and machine learning delivers the best performance compared with Pig, Hive, and Hadoop MapReduce jars.
Benchmarking SQL-on-Hadoop Systems: TPC or not TPC? (Nicolas Morales)
Abstract. Benchmarks are important tools to evaluate systems, as long as their results are transparent, reproducible, and they are conducted with due diligence. Today, many SQL-on-Hadoop vendors use the data generators and the queries of existing TPC benchmarks, but fail to adhere to the rules, producing results that are not transparent. As the SQL-on-Hadoop movement continues to gain more traction, it is important to bring some order to this "wild west" of benchmarking. First, new rules and policies should be defined to satisfy the demands of the new generation SQL systems. The new benchmark evaluation schemes should be inexpensive, effective, and open enough to embrace the variety of SQL-on-Hadoop systems and their corresponding vendors. Second, adhering to the new standards requires industry commitment and collaboration. In this paper, we discuss the problems we observe in the current practices of benchmarking, and present our proposal for bringing standardization to the SQL-on-Hadoop space.
Introduction to GCP DataFlow Presentation (Knoldus Inc.)
In this session, we will learn how Dataflow, a fully managed streaming analytics service, minimizes latency, processing time, and cost through autoscaling and batch processing.
This whitepaper covers how big data engines are used for exploring and preparing data, building pipelines, and delivering data sets to ML applications.
https://www.qubole.com/resources/white-papers/big-data-engineering-for-machine-learning
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD (EMC)
CloudBoost is a cloud-enabling solution from EMC. It facilitates secure, automatic, efficient data transfer to private and public clouds for Long-Term Retention (LTR) of backups, and seamlessly extends existing data protection solutions to elastic, resilient, scale-out cloud storage.
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO (EMC)
With the EMC XtremIO all-flash array, improve:
1) your competitive agility with real-time analytics & development
2) your infrastructure agility with elastic provisioning for performance & capacity
3) your TCO with 50% lower capex and opex and double the storage lifecycle.
• Citrix & EMC XtremIO: Better Together
• XtremIO Design Fundamentals for VDI
• Citrix XenDesktop & XtremIO
-- Image Management & Storage
-- Demonstrations
-- XtremIO XenDesktop Integration
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES (EMC)
Explore findings from the EMC Forum IT Study and learn how cloud computing, social, mobile, and big data megatrends are shaping IT as a business driver globally.
Reference architecture with the Mirantis OpenStack Platform. IT is being disrupted by changes in technology, business, and culture; to solve these issues, IT has to move from traditional models to a broker/provider model.
Force Cyber Criminals to Shop Elsewhere
Learn the value of having an Identity Management and Governance solution and how retailers today are benefiting by strengthening their defenses and bolstering their Identity Management capabilities.
Container-based technology has experienced a recent revival and is becoming adopted at an explosive rate. For those that are new to the conversation, containers offer a way to virtualize an operating system. This virtualization isolates processes, providing limited visibility and resource utilization to each, such that the processes appear to be running on separate machines. In short, allowing more applications to run on a single machine. Here is a brief timeline of key moments in container history.
This white paper provides an overview of EMC's data protection solutions for the data lake - an active repository to manage varied and complex Big Data workloads
This infographic highlights key stats and messages from the analyst report from J.Gold Associates that addresses the growing economic impact of mobile cybercrime and fraud.
This white paper describes how an intelligence-driven governance, risk management, and compliance (GRC) model can create an efficient, collaborative enterprise GRC strategy across IT, Finance, Operations, and Legal areas.
The Trust Paradox: Access Management and Trust in an Insecure Age (EMC)
This white paper discusses the results of a CIO UK survey on a "Trust Paradox," defined as employees and business partners being both the weakest link in an organization's security as well as trusted agents in achieving the company's goals.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
Orca: A Modular Query Optimizer Architecture for Big Data
Mohamed A. Soliman, Lyublena Antova, Venkatesh Raghavan, Amr El-Helw, Zhongxian Gu, Entong Shen, George C. Caragea, Carlos Garcia-Alvarado, Foyzur Rahman, Michalis Petropoulos, Rhonda Baldwin (Pivotal Inc., Palo Alto, USA); Florian Waas (Datometry Inc., San Francisco, USA); Konstantinos Krikellas (Google Inc., Mountain View, USA); Sivaramakrishnan Narayanan (Qubole Inc., Mountain View, USA)
ABSTRACT
The performance of analytical query processing in data management systems depends primarily on the capabilities of the system's query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer.
In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development uniting state-of-the-art query optimization technology with our own original research, resulting in a modular and portable optimizer architecture.
In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems - Query processing; Distributed databases

Keywords
Query Optimization, Cost Model, MPP, Parallel Processing
1. INTRODUCTION
Big Data has brought about a renewed interest in query optimization, as a new breed of data management systems has pushed the envelope in terms of unprecedented scalability, availability, and processing capabilities (cf. e.g., [9, 18, 20, 21]), which makes large datasets of hundreds of terabytes or even petabytes readily accessible for analysis through SQL or SQL-like interfaces. Differences between good and mediocre optimizers have always been known to be substantial [15]. However, the increased amount of data these systems have to process magnifies optimization mistakes and stresses the importance of query optimization more than ever.
Despite a plethora of research in this area, most existing query optimizers in both commercial and open source projects are still primarily based on technology dating back to the early days of commercial database development [22], and are frequently prone to produce suboptimal results.
Realizing this significant gap between research and practical implementations, we have set out to devise an architecture that meets current requirements, yet promises enough headroom for future developments.
In this paper, we describe Orca, the result of our recent research and development efforts at Greenplum/Pivotal. Orca is a state-of-the-art query optimizer specifically designed for demanding analytics workloads. It is distinguished from other optimizers in several important ways:
Modularity. Using a highly extensible abstraction of metadata and system description, Orca is no longer confined to a specific host system like traditional optimizers. Instead, it can be ported to other data management systems quickly through plug-ins supported by its Metadata Provider SDK.

Extensibility. By representing all elements of a query and its optimization as first-class citizens of equal footing, Orca avoids the trap of multi-phase optimization, where certain optimizations are dealt with as an afterthought. Multi-phase optimizers are notoriously difficult to extend, as new optimizations or query constructs often do not match the previously set phase boundaries.

Multi-core ready. Orca deploys a highly efficient multi-core-aware scheduler that distributes individual fine-grained optimization subtasks across multiple cores to speed up the optimization process.

Verifiability. Orca has special provisions for ascertaining correctness and performance through built-in mechanisms. Besides improving engineering practices, these tools enable rapid development with high confidence and reduce turnaround time for both new features and bug fixes.

Performance. Orca is a substantial improvement over our previous system and in many cases offers query speed-ups of 10x up to 1000x.
We describe the architecture of Orca and highlight some of the advanced features enabled by its design. We provide a blueprint of various components and detail the engineering practices we have pioneered and deployed to realize this project. Lastly, we give performance results based on the TPC-DS benchmark comparing Orca to other systems. In particular, we focus on query processing systems contributed to the open source space.

Figure 1: High level GPDB architecture
The remainder of this paper is organized as follows. We give preliminaries on the computing architecture in Section 2. In Section 3, we present the architecture of Orca and describe its components. Section 4 presents the query optimization workflow. Section 5 describes how Orca exchanges metadata with the backend database system. We describe in Section 6 the tools we built to maintain a verifiable query optimizer. Section 7 presents our experimental study, followed by a discussion of related work in Section 8. We summarize this paper with final remarks in Section 9.
2. PRELIMINARIES
We give preliminaries on massively parallel processing databases (Section 2.1) and Hadoop query engines (Section 2.2).

2.1 Massively Parallel Processing
Pivotal's Greenplum Database (GPDB) [20] is a massively parallel processing (MPP) analytics database. GPDB adopts a shared-nothing computing architecture with two or more cooperating processors. Each processor has its own memory, operating system and disks. GPDB leverages this high-performance system architecture to distribute the load of petabyte data warehouses, and uses system resources in parallel to process a given query.
Figure 1 shows a high level architecture of GPDB. Storage and processing of large amounts of data are handled by distributing the load across several servers or hosts to create an array of individual databases, all working together to present a single database image. The master is the entry point to GPDB, where clients connect and submit SQL statements. The master coordinates work with other database instances, called segments, to handle data processing and storage. When a query is submitted to the master, it is optimized and broken into smaller components dispatched to segments to work together on delivering the final results. The interconnect is the networking layer responsible for inter-process communication between the segments; it uses a standard Gigabit Ethernet switching fabric.
During query execution, data can be distributed to segments in multiple ways, including hashed distribution, where tuples are distributed to segments based on some hash function; replicated distribution, where a full copy of a table is stored at each segment; and singleton distribution, where the whole distributed table is gathered from multiple segments to a single host (usually the master).
2.2 SQL on Hadoop
Processing analytics queries on Hadoop is becoming increasingly popular. Initially, queries were expressed as MapReduce jobs and Hadoop's appeal was attributed to its scalability and fault-tolerance. Coding, manually optimizing and maintaining complex queries in MapReduce is hard, however, so SQL-like declarative languages, such as Hive [28], were developed on top of Hadoop. HiveQL queries are compiled into MapReduce jobs and executed by Hadoop. HiveQL accelerated the coding of complex queries but also made apparent that an optimizer is needed in the Hadoop ecosystem, since the compiled MapReduce jobs exhibited poor performance.
Pivotal responded to the challenge by introducing HAWQ [21], a massively parallel SQL-compliant engine on top of HDFS. HAWQ employs Orca at its core to devise efficient query plans that minimize the cost of accessing data in Hadoop clusters. The architecture of HAWQ combines an innovative state-of-the-art cost-based optimizer with the scalability and fault-tolerance of Hadoop to enable interactive processing of data at petabyte scale.
Recently, a number of other efforts, including Cloudera's Impala [17] and Facebook's Presto [7], have introduced new optimizers to enable SQL processing on Hadoop. Currently, these efforts support only a subset of the SQL standard features, and their optimizations are restricted to rule-based ones. In comparison, HAWQ has a full-fledged standards-compliant SQL interface and a cost-based optimizer, both of which are unprecedented features in Hadoop query engines. We illustrate in our experimental study in Section 7 the key role that Orca plays in differentiating HAWQ from other Hadoop SQL engines on both the functional and performance sides.
3. ORCA ARCHITECTURE
Orca is the new query optimizer for Pivotal data management products, including GPDB and HAWQ. Orca is a modern top-down query optimizer based on the Cascades optimization framework [13]. While many Cascades optimizers are tightly coupled with their host systems, a unique feature of Orca is its ability to run outside the database system as a stand-alone optimizer. This ability is crucial to supporting products with different computing architectures (e.g., MPP and Hadoop) using one optimizer. It also allows leveraging the extensive legacy of relational optimization in new query processing paradigms like Hadoop [7, 10, 16, 17]. Furthermore, running the optimizer as a stand-alone product enables elaborate testing without going through the monolithic structure of a database system.
DXL. Decoupling the optimizer from the database system requires building a communication mechanism to process queries. Orca includes a framework for exchanging information between the optimizer and the database system called Data eXchange Language (DXL).
Figure 2: Interaction of Orca with database system
Figure 3: Orca architecture
The framework uses an XML-based language to encode the necessary information for communication, such as input queries, output plans and metadata. Overlaid on DXL is a simple communication protocol to send the initial query structure and retrieve the optimized plan. A major benefit of DXL is packaging Orca as a stand-alone product.
Figure 2 shows the interaction between Orca and an external database system. The input to Orca is a DXL query; the output of Orca is a DXL plan. During optimization, the database system can be queried for metadata (e.g., table definitions). Orca abstracts metadata access details by allowing the database system to register a metadata provider (MD Provider) that is responsible for serializing metadata into DXL before it is sent to Orca. Metadata can also be consumed from regular files containing metadata objects serialized in DXL format.
The database system needs to include translators that consume/emit data in DXL format. The Query2DXL translator converts a query parse tree into a DXL query, while the DXL2Plan translator converts a DXL plan into an executable plan. The implementation of such translators is done completely outside Orca, which allows multiple systems to use Orca by providing the appropriate translators.
The architecture of Orca is highly extensible; all components can be replaced individually and configured separately. Figure 3 shows the different components of Orca, which we briefly describe as follows.
Memo. The space of plan alternatives generated by the optimizer is encoded in a compact in-memory data structure called the Memo [13]. The Memo structure consists of a set of containers called groups, where each group contains logically equivalent expressions. Memo groups capture the different sub-goals of a query (e.g., a filter on a table, or a join of two tables). Group members, called group expressions, achieve the group goal in different logical ways (e.g., different join orders). Each group expression is an operator that has other groups as its children. This recursive structure of the Memo allows compact encoding of a huge space of possible plans, as we illustrate in Section 4.1.
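To make the recursive structure concrete, the following is a minimal C++ sketch (illustrative types only, not Orca's actual classes) of a Memo in which group expressions reference child groups by id rather than pointing at concrete child expressions:

#include <cstdio>
#include <string>
#include <vector>

// A group expression: an operator whose children are groups, not expressions.
struct GroupExpression {
    std::string op;              // e.g. "InnerJoin", "Get(T1)"
    std::vector<int> children;   // ids of child groups
};

// A group: a container of logically equivalent expressions (one sub-goal).
struct Group {
    std::vector<GroupExpression> exprs;
};

struct Memo {
    std::vector<Group> groups;

    int AddGroup() { groups.push_back({}); return (int)groups.size() - 1; }

    void AddExpr(int group, GroupExpression e) {
        groups[group].exprs.push_back(std::move(e));
    }
};

int main() {
    Memo memo;
    int g1 = memo.AddGroup();   // sub-goal: retrieve T1
    int g2 = memo.AddGroup();   // sub-goal: retrieve T2
    int g0 = memo.AddGroup();   // root sub-goal: join the two tables
    memo.AddExpr(g1, {"Get(T1)", {}});
    memo.AddExpr(g2, {"Get(T2)", {}});
    memo.AddExpr(g0, {"InnerJoin", {g1, g2}});
    // Exploration adds an equivalent expression to the *same* group:
    memo.AddExpr(g0, {"InnerJoin", {g2, g1}});
    std::printf("root group has %zu equivalent expressions\n",
                memo.groups[g0].exprs.size());
    return 0;
}

Because expressions share child groups instead of child subtrees, adding one alternative to a group implicitly multiplies the alternatives of every plan that references that group, which is what makes the encoding compact.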
Search and Job Scheduler. Orca uses a search mechanism to navigate through the space of possible plan alternatives and identify the plan with the least estimated cost. The search mechanism is enabled by a specialized Job Scheduler that creates dependent or parallel work units to perform query optimization in three main steps: exploration, where equivalent logical expressions are generated; implementation, where physical plans are generated; and optimization, where required physical properties (e.g., sort order) are enforced and plan alternatives are costed. We discuss the details of optimization job scheduling in Section 4.2.
Transformations [13]. Plan alternatives are generated by applying transformation rules that can produce either equivalent logical expressions (e.g., InnerJoin(A,B) → InnerJoin(B,A)), or physical implementations of existing expressions (e.g., Join(A,B) → HashJoin(A,B)). The results of applying transformation rules are copied in to the Memo, which may result in creating new groups and/or adding new group expressions to existing groups. Each transformation rule is a self-contained component that can be explicitly activated/deactivated in Orca configurations.
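As an illustration of rules as self-contained, individually switchable components, a hypothetical C++ interface could look as follows; the class and method names are ours, not Orca's API:

#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Expr {
    std::string op;
    std::vector<std::shared_ptr<Expr>> children;
};

// Each rule is a self-contained component that can be toggled in configuration.
struct Rule {
    bool enabled = true;
    virtual ~Rule() = default;
    virtual bool Matches(const Expr& e) const = 0;
    virtual Expr Apply(const Expr& e) const = 0;
};

// Logical-to-logical rule: InnerJoin(A,B) -> InnerJoin(B,A).
struct JoinCommutativity : Rule {
    bool Matches(const Expr& e) const override {
        return e.op == "InnerJoin" && e.children.size() == 2;
    }
    Expr Apply(const Expr& e) const override {
        Expr out = e;
        std::swap(out.children[0], out.children[1]);
        return out;
    }
};

// Logical-to-physical rule: InnerJoin(A,B) -> HashJoin(A,B).
struct InnerJoin2HashJoin : Rule {
    bool Matches(const Expr& e) const override { return e.op == "InnerJoin"; }
    Expr Apply(const Expr& e) const override {
        Expr out = e;
        out.op = "HashJoin";
        return out;
    }
};

int main() {
    Expr join{"InnerJoin",
              {std::make_shared<Expr>(Expr{"Get(T1)", {}}),
               std::make_shared<Expr>(Expr{"Get(T2)", {}})}};
    JoinCommutativity commute;
    InnerJoin2HashJoin implement;
    if (commute.enabled && commute.Matches(join))
        std::cout << commute.Apply(join).children[0]->op << "\n";  // Get(T2)
    if (implement.enabled && implement.Matches(join))
        std::cout << implement.Apply(join).op << "\n";             // HashJoin
    return 0;
}

In Orca the results of Apply would be copied in to the Memo (with duplicate detection) rather than returned to the caller.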
Property Enforcement. Orca includes an extensible framework for describing query requirements and plan characteristics based on formal property specifications. Properties have different types, including logical properties (e.g., output columns), physical properties (e.g., sort order and data distribution), and scalar properties (e.g., columns used in join conditions). During query optimization, each operator may request specific properties from its children. An optimized child plan may either satisfy the required properties on its own (e.g., an IndexScan plan delivers sorted data), or an enforcer (e.g., a Sort operator) needs to be plugged into the plan to deliver the required property. The framework allows each operator to control enforcer placement based on child plans' properties and the operator's local behavior. We describe this framework in more detail in Section 4.1.
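The following hedged C++ sketch (names are illustrative, not Orca's) captures the core decision: reuse a child plan that already delivers a required property, or plug an enforcer in on top of it:

#include <iostream>
#include <memory>
#include <string>

struct Plan {
    std::string op;
    std::string sortOrder;   // sort property this plan delivers ("" = none)
    std::shared_ptr<Plan> child;
};

// If the optimized child plan already satisfies the required sort order
// (e.g. an IndexScan), return it unchanged; otherwise plug in a Sort enforcer.
std::shared_ptr<Plan> EnforceSort(std::shared_ptr<Plan> p,
                                  const std::string& required) {
    if (required.empty() || p->sortOrder == required) return p;
    auto sort = std::make_shared<Plan>();
    sort->op = "Sort(" + required + ")";
    sort->sortOrder = required;   // the enforcer delivers the property
    sort->child = p;
    return sort;
}

int main() {
    auto indexScan = std::make_shared<Plan>(Plan{"IndexScan(T1)", "T1.a", nullptr});
    auto tableScan = std::make_shared<Plan>(Plan{"Scan(T1)", "", nullptr});
    std::cout << EnforceSort(indexScan, "T1.a")->op << "\n";  // IndexScan(T1)
    std::cout << EnforceSort(tableScan, "T1.a")->op << "\n";  // Sort(T1.a)
    return 0;
}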
Metadata Cache. Since metadata (e.g., table definitions) changes infrequently, shipping it with every query incurs an overhead. Orca caches metadata on the optimizer side and only retrieves pieces of it from the catalog if something is unavailable in the cache, or has changed since the last time it was loaded into the cache. The metadata cache also abstracts the database system details from the optimizer, which is particularly useful during testing and debugging.
GPOS. In order to interact with operating systems with possibly different APIs, Orca uses an OS abstraction layer called GPOS. The GPOS layer provides Orca with an extensive infrastructure including a memory manager, primitives for concurrency control, exception handling, file I/O and synchronized data structures.
4. QUERY OPTIMIZATION
We describe Orca's optimization workflow in Section 4.1. We then show how the optimization process can be conducted in parallel in Section 4.2.
<?xml version="1.0" encoding="UTF-8"?>
<dxl:DXLMessage xmlns:dxl="http://greenplum.com/dxl/v1">
  <dxl:Query>
    <dxl:OutputColumns>
      <dxl:Ident ColId="0" Name="a" Mdid="0.23.1.0"/>
    </dxl:OutputColumns>
    <dxl:SortingColumnList>
      <dxl:SortingColumn ColId="0" OpMdid="0.97.1.0"/>
    </dxl:SortingColumnList>
    <dxl:Distribution Type="Singleton"/>
    <dxl:LogicalJoin JoinType="Inner">
      <dxl:LogicalGet>
        <dxl:TableDescriptor Mdid="0.1639448.1.1" Name="T1">
          <dxl:Columns>
            <dxl:Ident ColId="0" Name="a" Mdid="0.23.1.0"/>
            <dxl:Ident ColId="1" Name="b" Mdid="0.23.1.0"/>
          </dxl:Columns>
        </dxl:TableDescriptor>
      </dxl:LogicalGet>
      <dxl:LogicalGet>
        <dxl:TableDescriptor Mdid="0.2868145.1.1" Name="T2">
          <dxl:Columns>
            <dxl:Ident ColId="2" Name="a" Mdid="0.23.1.0"/>
            <dxl:Ident ColId="3" Name="b" Mdid="0.23.1.0"/>
          </dxl:Columns>
        </dxl:TableDescriptor>
      </dxl:LogicalGet>
      <dxl:Comparison Operator="=" Mdid="0.96.1.0">
        <dxl:Ident ColId="0" Name="a" Mdid="0.23.1.0"/>
        <dxl:Ident ColId="3" Name="b" Mdid="0.23.1.0"/>
      </dxl:Comparison>
    </dxl:LogicalJoin>
  </dxl:Query>
</dxl:DXLMessage>
Listing 1: DXL query message
4.1 Optimization Workflow
We illustrate the query optimization workflow using the following running example:

SELECT T1.a FROM T1, T2
WHERE T1.a = T2.b
ORDER BY T1.a;

where the distribution of T1 is Hashed(T1.a) and the distribution of T2 is Hashed(T2.a) (cf. Section 2.1).
Listing 1 shows the representation of the previous query in DXL, where we give the required output columns, sorting columns, data distribution and logical query. Metadata (e.g., table and operator definitions) are decorated with metadata ids (Mdid's) to allow requesting further information during optimization. An Mdid is a unique identifier composed of a database system identifier, an object identifier and a version number. For example, '0.96.1.0' refers to GPDB's integer equality operator with version '1.0'. Metadata versions are used to invalidate cached metadata objects that have gone through modifications across queries. We discuss metadata exchange in more detail in Section 5.
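As a small illustration of the format described above, a sketch that splits an Mdid string into its components could look like this (the struct and field names are ours; the paper does not publish a parsing API):

#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

// Split an Mdid such as "0.96.1.0" into its dot-separated components:
// system id, object id, and a two-part version ("1.0").
struct Mdid {
    int systemId = 0;
    int objectId = 0;
    std::string version;   // e.g. "1.0", used to invalidate stale cache entries
};

Mdid ParseMdid(const std::string& s) {
    std::vector<std::string> parts;
    std::stringstream ss(s);
    std::string tok;
    while (std::getline(ss, tok, '.')) parts.push_back(tok);
    Mdid m;
    m.systemId = std::stoi(parts.at(0));
    m.objectId = std::stoi(parts.at(1));
    m.version = parts.at(2) + "." + parts.at(3);
    return m;
}

int main() {
    Mdid m = ParseMdid("0.96.1.0");   // GPDB's integer equality operator, v1.0
    std::printf("system=%d object=%d version=%s\n",
                m.systemId, m.objectId, m.version.c_str());
    return 0;
}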
The DXL query message is shipped to Orca, where it is parsed and transformed into an in-memory logical expression tree that is copied in to the Memo. Figure 4 shows the initial contents of the Memo. The logical expression creates three groups for the two tables and the InnerJoin operation. We omit the join condition for brevity. Group 0 is called the root group since it corresponds to the root of the logical expression. The dependencies between operators in the logical expression are captured as references between groups. For example, InnerJoin[1,2] refers to Group 1 and Group 2 as children. Optimization takes place as described in the following steps.
Figure 4: Copying-in initial logical expression
(1) Exploration. Transformation rules that generate logically equivalent expressions are triggered. For example, a Join Commutativity rule is triggered to generate InnerJoin[2,1] out of InnerJoin[1,2]. Exploration results in adding new group expressions to existing groups and possibly creating new groups. The Memo structure has a built-in duplicate detection mechanism, based on expression topology, to detect and eliminate any duplicate expressions created by different transformations.
(2) Statistics Derivation. At the end of exploration, the Memo maintains the complete logical space of the given query. Orca's statistics derivation mechanism is then triggered to compute statistics for the Memo groups. A statistics object in Orca is mainly a collection of column histograms used to derive estimates for cardinality and data skew. Derivation of statistics takes place on the compact Memo structure to avoid expanding the search space.
In order to derive statistics for a target group, Orca picks the group expression with the highest promise of delivering reliable statistics. Statistics promise computation is expression-specific. For example, an InnerJoin expression with a small number of join conditions is more promising than another equivalent InnerJoin expression with a larger number of join conditions (this situation could arise when generating multiple join orders). The rationale is that the larger the number of join conditions, the higher the chance that estimation errors are propagated and amplified. Computing a confidence score for cardinality estimation is challenging due to the need to aggregate confidence scores across all nodes of a given expression. We are currently exploring several methods to compute confidence scores in the compact Memo structure.
After picking the most promising group expression in the target group, Orca recursively triggers statistics derivation on the child groups of the picked group expression. Finally, the target group's statistics object is constructed by combining the statistics objects of child groups.
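The paper does not give Orca's estimation formulas; purely as an illustration of deriving a parent estimate from child statistics, a textbook equi-join cardinality estimate from row counts and distinct-value counts looks like this:

#include <algorithm>
#include <cstdio>

// Textbook equi-join cardinality estimate (not Orca's actual model):
// |T1 join T2 on a=b|  ~=  |T1| * |T2| / max(ndv(a), ndv(b))
double EstimateJoinRows(double rows1, double ndv1,
                        double rows2, double ndv2) {
    return rows1 * rows2 / std::max(ndv1, ndv2);
}

int main() {
    // e.g. T1 has 1M rows with 100K distinct join keys,
    //      T2 has 500K rows with 50K distinct join keys.
    std::printf("estimated join rows: %.0f\n",
                EstimateJoinRows(1e6, 1e5, 5e5, 5e4));
    return 0;
}

Orca works with full column histograms rather than single distinct counts, which is what allows it to also estimate data skew.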
Figure 5 illustrates the statistics derivation mechanism for the running example. First, a top-down pass is performed, where a parent group expression requests statistics from its child groups. For example, InnerJoin(T1,T2) on (a=b) requests histograms on T1.a and T2.b. The requested histograms are loaded on demand from the catalog through the registered MD Provider, parsed into DXL, and stored in the MD Cache to service future requests. Next, a bottom-up pass is performed to combine child statistics objects into a parent statistics object. This results in (possibly modified) histograms on columns T1.a and T2.b, since the join condition could impact the columns' histograms.
Figure 5: Statistics derivation mechanism. (a) Top-down statistics requests; (b) computed statistics are attached to child groups; (c) bottom-up statistics derivation; (d) combined statistics are attached to the parent group.
Constructed statistics objects are attached to individual groups, where they can be incrementally updated (e.g., by adding new histograms) during optimization. This is crucial to keep the cost of statistics derivation manageable.
(3) Implementation. Transformation rules that create physical implementations of logical expressions are triggered. For example, the Get2Scan rule is triggered to generate a physical table Scan out of a logical Get. Similarly, the InnerJoin2HashJoin and InnerJoin2NLJoin rules are triggered to generate Hash Join and Nested Loops Join implementations.
(4) Optimization. In this step, properties are enforced and plan alternatives are costed. Optimization starts by submitting an initial optimization request to the Memo's root group specifying query requirements such as result distribution and sort order. Submitting a request r to a group g corresponds to requesting the least-cost plan satisfying r with a root physical operator in g.
For each incoming request, each physical group expression passes corresponding requests to child groups, depending on the incoming requirements and the operator's local requirements. During optimization, many identical requests may be submitted to the same group. Orca caches computed requests in a group hash table; an incoming request is computed only if it does not already exist in the group hash table. Additionally, each physical group expression maintains a local hash table mapping incoming requests to the corresponding child requests. Local hash tables provide the linkage structure used when extracting a physical plan from the Memo, as we show later in this section.
Figure 6 shows optimization requests in the Memo for the running example. The initial optimization request is req. #1: {Singleton, <T1.a>}, which specifies that query results are required to be gathered to the master based on the order given by T1.a. (Required properties also include output columns, rewindability, common table expressions and data partitioning; we omit these properties due to space constraints.) We also show the group hash tables, where each request is associated with the best group expression (GExpr) that satisfies it at the least estimated cost. The black boxes indicate enforcer operators that are plugged into the Memo to deliver sort order and data distribution. The Gather operator gathers tuples from all segments to the master. The GatherMerge operator gathers sorted data from all segments to the master, while keeping the sort order. The Redistribute operator distributes tuples across segments based on the hash value of a given argument.
Figure 7 shows the optimization of req. #1 by InnerHashJoin[1,2]. For this request, one of the alternative plans aligns child distributions based on the join condition, so that tuples to be joined are co-located. (There can be many other alternatives, e.g., requesting children to be gathered to the master and performing the join there; Orca allows extending each operator with any number of possible optimization alternatives and cleanly isolates these alternatives through the property enforcement framework.) This is achieved by requesting Hashed(T1.a) distribution from group 1 and Hashed(T2.b) distribution from group 2. Both groups are requested to deliver Any sort order. After child best plans are found, InnerHashJoin combines child properties to determine the delivered distribution and sort order. Note that the best plan for group 2 needs to hash-distribute T2 on T2.b, since T2 is originally hash-distributed on T2.a, while the best plan for group 1 is a simple Scan, since T1 is already hash-distributed on T1.a.
When it is determined that delivered properties do not satisfy the initial requirements, unsatisfied properties have to be enforced. Property enforcement in Orca is a flexible framework that allows each operator to define the behavior of enforcing required properties based on the properties delivered by child plans and the operator's local behavior. For example, an order-preserving NL Join operator may not need to enforce a sort order on top of the join if the order is already delivered by its outer child.
Enforcers are added to the group containing the group expression being optimized. Figure 7 shows two possible plans that satisfy req. #1 through property enforcement. The left plan sorts join results on segments, and then gather-merges the sorted results at the master. The right plan gathers join results from segments to the master, and then sorts them. These different alternatives are encoded in the Memo, and it is up to the cost model to differentiate their costs.
Finally, the best plan is extracted from the Memo based on the linkage structure given by optimization requests. Figure 6 illustrates plan extraction for the running example. We show the local hash tables of relevant group expressions. Each local hash table maps an incoming optimization request to the corresponding child optimization requests.
We first look up the best group expression of req. #1 in the root group, which leads to the GatherMerge operator. The corresponding child request in the local hash table of GatherMerge is req. #3. The best group expression for req. #3 is Sort. Therefore, we link GatherMerge to Sort. The corresponding child request in the local hash table of Sort is req. #4. The best group expression for req. #4 is InnerHashJoin[1,2]. We thus link Sort to InnerHashJoin. The same procedure is followed to complete plan extraction, leading to the final plan shown in Figure 6.
The extracted plan is serialized in DXL format and shipped to the database system for execution. The DXL2Plan translator at the database system translates the DXL plan into an executable plan based on the underlying query execution framework.
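The extraction walk can be sketched as follows (a toy encoding of the linkage tables from Figure 6; all structures and names are illustrative, not Orca's):

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// (group id, request id) -> best group expression for that request.
using BestExpr = std::map<std::pair<int, int>, std::string>;
// (best expression, incoming request) -> child (group, request) pairs.
using ChildReqs = std::map<std::pair<std::string, int>,
                           std::vector<std::pair<int, int>>>;

// Recursively extract and print the plan rooted at (group, request).
void Extract(const BestExpr& best, const ChildReqs& kids,
             int group, int req, int depth) {
    const std::string& op = best.at({group, req});
    std::cout << std::string(2 * depth, ' ') << op << "\n";
    auto it = kids.find({op, req});
    if (it == kids.end()) return;
    for (auto [g, r] : it->second) Extract(best, kids, g, r, depth + 1);
}

int main() {
    // Hand-encoded linkage for the running example (reqs #1, #3, #4, #7, #9, #10).
    BestExpr best = {
        {{0, 1}, "GatherMerge(T1.a)"}, {{0, 3}, "Sort(T1.a)"},
        {{0, 4}, "InnerHashJoin"},
        {{1, 7}, "Scan(T1)"},
        {{2, 10}, "Redistribute(T2.b)"}, {{2, 9}, "Scan(T2)"},
    };
    ChildReqs kids = {
        {{"GatherMerge(T1.a)", 1}, {{0, 3}}},
        {{"Sort(T1.a)", 3}, {{0, 4}}},
        {{"InnerHashJoin", 4}, {{1, 7}, {2, 10}}},
        {{"Redistribute(T2.b)", 10}, {{2, 9}}},
    };
    Extract(best, kids, /*group=*/0, /*req=*/1, 0);  // prints the final plan
    return 0;
}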
Figure 6: Processing optimization requests in the Memo (group hash tables, Memo contents, and the extracted final plan: GatherMerge(T1.a) over Sort(T1.a) over InnerHashJoin of Scan(T1) and Redistribute(T2.b) over Scan(T2))
Figure 7: Generating InnerHashJoin plan alternatives. (a) Passing requests to child groups; (b) combining child groups' best plans; (c) enforcing missing properties to satisfy the {Singleton, <T1.a>} request.
Multi-Stage Optimization. Our ongoing work in Orca involves implementing multi-stage optimization. An optimization stage in Orca is defined as a complete optimization workflow using a subset of transformation rules and (optional) time-out and cost thresholds. A stage terminates when any of the following conditions is met: (1) a plan with cost below the cost threshold is found, (2) a time-out occurs, or (3) the subset of transformation rules is exhausted. The specification of optimization stages can be given by the user through Orca's configuration. This technique allows resource-constrained optimization where, for example, the most expensive transformation rules are configured to run in later stages to avoid increasing the optimization time. This technique is also a foundation for obtaining a query plan as early as possible to cut down the search space for complex queries.
Query Execution. A copy of the final plan is dispatched to each segment. During distributed query execution, a distribution enforcer on each segment acts as both sender and receiver of data. For example, a Redistribute(T2.b) instance running on segment S sends tuples on S to other segments based on the hash value of T2.b, and also receives tuples from other Redistribute(T2.b) instances on other segments.
4.2 Parallel Query Optimization
Query optimization is probably the most CPU-intensive process in a database system. Effective usage of CPUs translates to better query plans and hence better system performance. Parallelizing the query optimizer is crucial to benefit from advanced CPU designs that exploit an increasing number of cores.
Orca is a multi-core enabled optimizer. The optimization process is broken into small work units called optimization jobs. Orca currently has seven different types of optimization jobs (a minimal scheduling sketch follows the list):
• Exp(g): generate logically equivalent expressions of all group expressions in group g.
• Exp(gexpr): generate logically equivalent expressions of a group expression gexpr.
• Imp(g): generate implementations of all group expressions in group g.
• Imp(gexpr): generate implementation alternatives of a group expression gexpr.
• Opt(g, req): return the plan with the least estimated cost that is rooted by an operator in group g and satisfies optimization request req.
• Opt(gexpr, req): return the plan with the least estimated cost that is rooted by gexpr and satisfies optimization request req.
• Xform(gexpr, t): transform group expression gexpr using rule t.
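Below is a toy single-threaded C++ sketch of the suspension mechanism described next (illustrative only, not Orca's scheduler, which runs jobs on multiple threads): a parent job spawns children, is suspended while they are pending, and completes when the last child notifies it.

#include <cstdio>
#include <functional>
#include <queue>
#include <string>

struct Scheduler;  // forward declaration

struct Job {
    std::string name;
    int pendingChildren = 0;     // > 0 means the job is suspended
    Job* parent = nullptr;
    std::function<void(Scheduler&, Job&)> run;
};

struct Scheduler {
    std::queue<Job*> ready;

    void Spawn(Job* parent, Job* child) {
        child->parent = parent;
        if (parent) parent->pendingChildren++;   // parent is now suspended
        ready.push(child);
    }

    void Loop() {
        while (!ready.empty()) {
            Job* j = ready.front(); ready.pop();
            j->run(*this, *j);
            Finish(j);
        }
    }

    void Finish(Job* j) {
        if (j->pendingChildren > 0) return;      // still waiting for children
        std::printf("done: %s\n", j->name.c_str());
        if (j->parent && --j->parent->pendingChildren == 0)
            Finish(j->parent);                   // notify the suspended parent
    }
};

int main() {
    Scheduler s;
    Job opt_g0{"Opt(g0, req0)"}, exp_g1{"Exp(g1)"}, imp_g1{"Imp(g1)"};
    exp_g1.run = imp_g1.run = [](Scheduler&, Job&) {};
    opt_g0.run = [&](Scheduler& sch, Job& self) {
        sch.Spawn(&self, &exp_g1);   // in Orca these children could run on
        sch.Spawn(&self, &imp_g1);   // parallel threads; here, sequentially
    };
    s.Spawn(nullptr, &opt_g0);
    s.Loop();   // prints: done: Exp(g1), done: Imp(g1), done: Opt(g0, req0)
    return 0;
}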
For a given query, hundreds or even thousands of job instances of each type may be created. This introduces challenges for handling job dependencies. For example, a group expression cannot be optimized until its child groups are also optimized. Figure 8 shows a partial job graph, where optimization of group g0 under optimization request req0 triggers a deep tree of dependent jobs. Dependencies are encoded as child-parent links; a parent job cannot finish before its child jobs finish. While child jobs are progressing, the parent job needs to be suspended. This allows child jobs to pick up available threads and run in parallel, if they do not depend on other jobs. When all child jobs complete, the suspended parent job is notified to resume processing.
Orca includes a specialized job scheduler designed from scratch to maximize the fan-out of the job dependency graph and provide the required infrastructure for parallel query optimization.
Figure 8: Optimization jobs dependency graph
The scheduler provides APIs to define optimization jobs as re-entrant procedures that can be picked up by available processing threads. It also maintains the job dependency graph to identify opportunities for parallelism (e.g., running transformations in different groups), and to notify suspended jobs when the jobs they depend on have terminated.
During parallel query optimization, multiple concurrent requests to modify a Memo group might be triggered by different optimization requests. In order to minimize synchronization overhead among jobs with the same goal (e.g., exploring the same group), jobs should not know about the existence of each other. When an optimization job with some goal is under processing, all other incoming jobs with the same goal are forced to wait until they are notified about the completion of the running job. At this point, the suspended jobs can pick up the results of the completed job. This functionality is enabled by attaching a job queue to each group, such that incoming jobs are queued as long as there exists an active job with the same goal.
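A minimal C++ sketch of this "one active job per goal" pattern, using a mutex and condition variable (illustrative only; Orca's actual synchronization primitives live in GPOS):

#include <condition_variable>
#include <cstdio>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// The first job to arrive at a goal computes it; jobs arriving with the
// same goal wait for the result instead of repeating the work.
struct GoalTable {
    std::mutex mu;
    std::condition_variable cv;
    std::map<std::string, bool> running;   // goal -> job in flight?
    std::map<std::string, int>  result;    // goal -> finished result

    int RunOnce(const std::string& goal, std::function<int()> work) {
        std::unique_lock<std::mutex> lk(mu);
        if (result.count(goal)) return result[goal];         // already done
        if (running[goal]) {                                 // active job:
            cv.wait(lk, [&] { return result.count(goal) > 0; });  // queue up
            return result[goal];
        }
        running[goal] = true;
        lk.unlock();
        int r = work();                                      // compute unlocked
        lk.lock();
        result[goal] = r;
        cv.notify_all();                                     // wake waiters
        return r;
    }
};

int main() {
    GoalTable t;
    auto explore = [&] { t.RunOnce("Exp(g1)", [] {
        std::puts("exploring g1 once");   // printed exactly once
        return 42;
    }); };
    std::thread a(explore), b(explore);   // two jobs share the same goal
    a.join(); b.join();
    return 0;
}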
5. METADATA EXCHANGE
Orca is designed to work outside the database system. One major point of interaction between the optimizer and the database system is metadata exchange. For example, the optimizer may need to know whether indices are defined on a given table to devise an efficient query plan. Access to metadata is facilitated by a collection of Metadata Providers that are system-specific plug-ins to retrieve metadata from the database system.
Figure 9 shows how Orca exchanges metadata with different backend systems. During query optimization, all metadata objects accessed by Orca are pinned in an in-memory cache, and are unpinned when optimization completes or an error is thrown. All accesses to metadata objects are accomplished via the MD Accessor, which keeps track of the objects being accessed in the optimization session and makes sure they are released when they are no longer needed. The MD Accessor is also responsible for transparently fetching metadata from an external MD Provider if the requested metadata object is not already in the cache. Different MD Accessors serving different optimization sessions may have different external MD providers for fetching metadata.
In addition to system-specific providers, Orca implements a file-based MD Provider to load metadata from a DXL file, eliminating the need to access a live backend system.
Figure 9: Metadata exchange framework
Figure 10: Replay of AMPERe dump
Orca includes an automated tool for harvesting the metadata that the optimizer needs into a minimal DXL file. We show in Section 6.1 how this tool is used to replay optimization of customer queries while the backend database system is offline.
6. VERIFIABILITY
Testing a query optimizer is a task as challenging as building one. Orca has been built with testing in mind since the early development phase. There is a built-in testing scheme that makes it difficult for developers to introduce regressions as part of adding new features, and makes it simple for test engineers to add test cases to be verified with every build. In addition, we leverage several tools and testing frameworks we built to assure the quality and verifiability of Orca, including a cardinality estimation testing framework, a number of benchmark tests at various scales, a data generator that can generate data by reversing database statistics [24], and two unique testing tools we discuss next.
The first tool, discussed in Section 6.1, is automatic capture and replay of the optimizer's anomalies. The second tool, discussed in Section 6.2, implements an automated method to measure the accuracy of the optimizer's cost model.
6.1 Minimal Repros
AMPERe [3] is a tool for Automatic capture of Minimal Portable and Executable Repros. The motivation for building AMPERe was to be able to reproduce and debug customer issues in the optimizer without having access to the customer production system.
<?xml version="1.0" encoding="UTF-8"?>
<dxl:DXLMessage xmlns:dxl="http://greenplum.com/dxl/v1">
  <dxl:Thread Id="0">
    <dxl:Stacktrace>
      1 0x000e8106df gpos::CException::Raise
      2 0x000137d853 COptTasks::PvOptimizeTask
      3 0x000e81cb1c gpos::CTask::Execute
      4 0x000e8180f4 gpos::CWorker::Execute
      5 0x000e81e811 gpos::CAutoTaskProxy::Execute
    </dxl:Stacktrace>
    <dxl:TraceFlags Value="gp_optimizer_hashjoin"/>
    <dxl:Metadata SystemIds="0.GPDB">
      <dxl:Type Mdid="0.9.1.0" Name="int4"
                IsRedistributable="true" Length="4"/>
      <dxl:RelStats Mdid="2.688.1.1" Name="r" Rows="10"/>
      <dxl:Relation Mdid="0.688.1.1" Name="r"
                    DistributionPolicy="Hash" DistributionColumns="0">
        <dxl:Columns>
          <dxl:Column Name="a" Attno="1" Mdid="0.9.1.0"/>
        </dxl:Columns>
      </dxl:Relation>
    </dxl:Metadata>
    <dxl:Query>
      <dxl:OutputColumns>
        <dxl:Ident ColId="1" Name="a" Mdid="0.9.1.0"/>
      </dxl:OutputColumns>
      <dxl:LogicalGet>
        <dxl:TableDescriptor Mdid="0.688.1.1" Name="r">
          <dxl:Columns>
            <dxl:Column ColId="1" Name="a" Mdid="0.9.1.0"/>
          </dxl:Columns>
        </dxl:TableDescriptor>
      </dxl:LogicalGet>
    </dxl:Query>
  </dxl:Thread>
</dxl:DXLMessage>
Listing 2: Simplified AMPERe dump
An AMPERe dump is automatically triggered when an unexpected error is encountered, but can also be produced on demand to investigate suboptimal query plans. The dump captures the minimal amount of data needed to reproduce a problem, including the input query, optimizer configurations and metadata, serialized in DXL (cf. Section 3). If the dump is generated due to an exception, it also includes the exception's stack trace.
Listing 2 shows an example of a simplified AMPERe dump. The dump includes only the data necessary to reproduce the problem. For example, the dump captures the state of the MD Cache, which includes only the metadata acquired during the course of query optimization. AMPERe is also built to be extensible: any component in Orca can register itself with the AMPERe serializer to generate additional information in the output dump.
AMPERe allows replaying a dump outside the system where it was generated. Any Orca instance can load the dump to retrieve the input query, metadata and configuration parameters in order to invoke an optimization session identical to the one that triggered the problematic situation at hand. This process is depicted in Figure 10, where the optimizer loads the input query from the dump, creates a file-based MD Provider for the metadata, sets the optimizer's configurations and then spawns the optimization threads to reproduce the problem instantly.
AMPERe is also used as a testing framework, where a dump acts as a test case that contains an input query and its expected plan.
Figure 11: Plan space (plans p1-p4 plotted by estimated vs. actual cost)
When replaying the dump file, Orca might generate a plan different from the expected one (e.g., because of changes in the cost model). Such a discrepancy causes the test case to fail and triggers an investigation of the root cause of the plan difference. Using this framework, any bug with an accompanying AMPERe dump, whether filed by internal testing or through customer reports, can be automatically turned into a self-contained test case.
6.2 Testing Optimizer Accuracy
The accuracy of Orca's cost model can be impacted by a number of error sources, including inaccurate cardinality estimates and improperly adjusted cost model parameters. As a result, the cost model provides an imperfect prediction of the wall clock time for the execution of a plan. Quantifying the optimizer's accuracy is crucial to avoid performance regressions introduced by bug fixes and newly added features.
Orca includes a built-in tool called TAQO [15] for Testing the Accuracy of the Query Optimizer. TAQO measures the ability of the optimizer's cost model to order any two given plans correctly, i.e., the plan with the higher estimated cost will indeed run longer. For example, in Figure 11, the optimizer orders (p1, p3) correctly, since their actual cost is directly proportional to the computed cost estimates. On the other hand, the optimizer orders (p1, p2) incorrectly, since their actual cost is inversely proportional to the computed cost estimates.
TAQO measures the optimizer's accuracy by costing and executing plans that the optimizer considers when optimizing a given query. Evaluating every single plan in the search space is infeasible in general. This limitation can be overcome by sampling plans uniformly from the search space. The optimization requests' linkage structure (cf. Section 4.1) provides the infrastructure used by TAQO to build a uniform plan sampler based on the method introduced in [29].
Given a sample of plans from the search space of a given query, TAQO computes a correlation score between the ranking of sampled plans based on estimated costs and their ranking based on actual costs. The correlation score combines a number of measures, including the importance of plans (the score penalizes the optimizer more for cost mis-estimation of very good plans) and the distance between plans (the score does not penalize the optimizer for small differences in the estimated costs of plans that are actually close in execution time). The correlation score also allows benchmarking the optimizers of different database systems to evaluate their relative quality. We discuss the testing methodology implemented in TAQO in more detail in [15].
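The exact scoring is deferred to [15]; purely as a stand-in, a Kendall-style pairwise ordering score over sampled plans can be computed as follows (illustrative, not the TAQO formula, which also weighs plan importance and distance):

#include <cstdio>
#include <vector>

// Fraction of plan pairs (i, j) ordered the same way by estimated and
// actual cost -- a simple proxy for a rank-correlation score.
double OrderingAccuracy(const std::vector<double>& est,
                        const std::vector<double>& act) {
    int agree = 0, total = 0;
    for (size_t i = 0; i < est.size(); ++i)
        for (size_t j = i + 1; j < est.size(); ++j) {
            ++total;
            if ((est[i] < est[j]) == (act[i] < act[j])) ++agree;
        }
    return total ? (double)agree / total : 1.0;
}

int main() {
    // Sampled plans p1..p4: estimated vs. measured cost.
    std::vector<double> est = {10, 20, 30, 40};
    std::vector<double> act = {12, 50, 33, 41};   // p2 is ordered incorrectly
    std::printf("ordering accuracy: %.2f\n", OrderingAccuracy(est, act));
    return 0;
}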
7. EXPERIMENTS
In our experimental study, we chose to conduct an end-to-end evaluation of a database system equipped with Orca, rather than evaluating Orca's individual components, to highlight the added value of our new query optimizer. We first compare Orca to the legacy query optimizer of Pivotal GPDB. We then compare Pivotal HAWQ (which employs Orca at its core) to other popular SQL-on-Hadoop solutions.
7.1 TPC-DS Benchmark
Our experiments are based on the TPC-DS benchmark [1]. TPC-DS is a widely adopted decision support benchmark that consists of a set of complex business analytics queries. It has superseded the well-known TPC-H by providing a much richer schema and a larger variety of business problems, ranging from business reporting and ad-hoc exploration to iterative queries and data mining. In our development process we have observed that TPC-H often lacks the sophistication of the workloads of our enterprise customers. TPC-DS, on the other hand, with its 25 tables, 429 columns and 99 query templates, can well represent a modern decision support system and is an excellent benchmark for testing query optimizers. The rich SQL syntax (WITH clause, window functions, subqueries, outer joins, CASE statements, INTERSECT, EXCEPT, etc.) in the TPC-DS queries is a serious SQL compliance test for any query engine.
7.2 MPP Databases
In this part, we compare the performance of Orca with the GPDB legacy query optimizer (a.k.a. Planner), which inherits part of its design from the PostgreSQL optimizer. The Planner is a robust optimizer that has been serving hundreds of production systems well, and has been improving over the past decade.
7.2.1 Experiment Setup
For the comparison between Orca and Planner, we use a cluster of 16 nodes connected with 10Gbps Ethernet. Each node has dual Intel Xeon processors at 3.33GHz, 48GB of RAM and twelve 600GB SAS drives in two RAID-5 groups. The operating system is Red Hat Enterprise Linux 5.5. We installed two isolated instances of the same version of GPDB (one using Orca and the other using Planner). We use a 10TB TPC-DS benchmark with partitioned tables for performance evaluation.
7.2.2 Performance
We generated 111 queries out of the 99 templates of TPC-DS. Both Orca and Planner support all the queries in their original form, without any rewriting. The full SQL compliance provides maximum compatibility with BI tools and ease of use for data analysts from different backgrounds. As we show in the SQL-on-Hadoop experiments in Section 7.3, many Hadoop SQL engines currently support only a small subset of the TPC-DS queries out of the box.
The performance speed-up of Orca compared to Planner for all queries is shown in Figure 12, where bars above the speed-up ratio of 1 indicate a performance improvement by Orca. We observe that Orca is able to produce a similar or better query plan for 80% of the queries. For the entire TPC-DS suite, Orca shows a 5x speed-up over Planner.
In particular, for 14 queries Orca achieves a speed-up ratio of at least 1000x; this is due to a timeout we enforced at 10,000 seconds. These queries took more than 10,000 seconds with Planner's plan, while they were able to finish with Orca's plan in minutes.
The performance improvement provided by Orca is due to a combination of salient features, including the following:
• Join Ordering. Orca includes a number of join ordering optimizations based on dynamic programming, left-deep join trees and cardinality-based join ordering.
• Correlated Subqueries. Orca adopts and extends a unified representation of subqueries to detect deeply correlated predicates and pull them up into joins to avoid repeated execution of subquery expressions.
• Partition Elimination. Orca introduces a novel framework for on-the-fly pruning of partitioned tables [2]. This feature is implemented by extending Orca's enforcers framework to accommodate new properties.
• Common Expressions. Orca introduces a new producer-consumer model for the WITH clause. The model allows evaluating a complex expression once, and consuming its output by multiple operators.
The interplay of the previous features is enabled by Orca's architecture and component abstractions. Each feature is designed, implemented and tested with minimal changes in the behavior of other features. The combined benefit and clean interaction of the features are manifested in Figure 12.
For a smaller number of queries, Orca produced suboptimal plans with up to 2x slow-down compared to Planner. These suboptimal plans are partly due to cardinality estimation errors or suboptimal cost model parameters that need further tuning. We are actively investigating these issues and constantly improving Orca.
We have also measured optimization time and Orca's memory footprint when using the full set of transformation rules. The average optimization time is around 4 seconds, while the average memory footprint is around 200MB. As we mentioned in Section 4.1, our ongoing work involves implementing techniques to short-cut optimization and improve resource consumption for complex queries.
7.3 SQL on Hadoop
Hadoop has quickly become a popular analytics ecosystem due to its scalability. In recent years, many Hadoop systems have been developed with SQL or SQL-like query interfaces. In this section, we compare the performance of Pivotal HAWQ (powered by Orca) against three Hadoop SQL engines: Impala [17], Presto [7], and Stinger [16]. Please refer to Section 8 for a discussion of these systems.
7.3.1 Experiment Setup
The experiments are conducted on a cluster of 10 nodes: two hosting the HDFS name node and the coordinator services of the SQL engines, and eight hosting HDFS data nodes and worker nodes.
Each node has dual Intel Xeon eight-core processors at
2.7GHz, 64GB of RAM and 22 900GB disks in JBOD. The
operating system is Red Hat Enterprise Linux 6.2.
We used CDH 4.4 and Impala 1.1.1 for Impala, Presto
0.52, and Hive 0.12 for Stinger. We made our best effort to tune each system's configuration, including
enabling short circuit read, allocating as much memory as
possible to workers and having one standalone node for co-
ordinator services. For HAWQ, we used Pivotal HD version
1.1 in our experiment.
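For reference, short-circuit reads are controlled by standard HDFS client settings. The excerpt below is a generic illustration for CDH-era HDFS, not our exact cluster configuration; the property names are the stock HDFS ones, while the socket path is an assumption.

    <!-- Illustrative hdfs-site.xml excerpt (not the exact experiment config):
         short-circuit reads let a client co-located with a DataNode read
         block files directly instead of going through the DataNode's TCP
         path. -->
    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.domain.socket.path</name>
      <value>/var/run/hdfs-sockets/dn</value>  <!-- assumed path -->
    </property>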
Optimization of TPC-DS queries in different systems
turned out to be quite challenging because these systems currently have limited SQL support. For example, Impala does not
yet support window functions, ORDER BY without LIMIT, and some analytic functions like ROLLUP and CUBE.
[Figure 12: Speed-up ratio of Orca vs Planner (TPC-DS 10TB). Per-query speed-up ratio on a log scale; every other query ID is labeled due to space limits, and queries capped by the timeout are labeled 1000x.]
[Figure 13: HAWQ vs Impala speed-up ratio (TPC-DS 256GB). Per-query HAWQ speed-up ratio on a log scale; bars marked with '*' indicate queries that run out of memory in Impala.]
Presto does not yet support non-equi joins. Stinger currently does not support the WITH clause or CASE statements. In
addition, none of the systems supports INTERSECT, EXCEPT,
disjunctive join conditions and correlated subqueries. These
unsupported features have forced us to rule out a large num-
ber of queries from consideration.
After excluding unsupported queries, we needed to re-
write the remaining queries to work around parsing limita-
tions in different systems. For example, Stinger and Presto do not support implicit cross-joins (a comma-separated FROM list with the join predicate in the WHERE clause, which we rewrote as an explicit JOIN ... ON) or some data types. Af-
ter extensive filtering and rewriting, we finally managed to
get query plans for 31 queries in Impala, 19 queries in Stinger
and 12 queries in Presto, out of the total 111 queries.
7.3.2 Performance
Our first attempt was to evaluate the different systems using the 10TB TPC-DS benchmark. However, most of the queries
from Stinger did not return within a reasonable time limit,
and almost all the queries from Impala and Presto failed due
to an out of memory error. This mainly happens due to the
inability of these systems to spill partial results to disk when
an operator’s internal state overflows the memory limits.
To obtain better coverage across the different systems, we used the 256GB TPC-DS benchmark, considering that the total working memory of our cluster is about 400GB (50GB × 8 nodes). Unfortunately, even with this setting, we were
1"
10"
100"
3" 12" 17" 18" 20" 22" 25" 29" 37" 42" 52" 55" 67" 76" 82" 84" 86" 90" 98"
HAWQspeed-upratio
Query ID
Figure 14: HAWQ vs Stinger (TPC-DS 256GB)
111
31
12
19
111
20
0
19
0
111
HAWQ
Impala
Presto
S7nger
#
of
queries
op7miza7on
execu7on
Figure 15: TPC-DS query support
unable to successfully run any TPC-DS query in Presto (al-
though we managed to run much simpler join queries in
Presto). For Impala and Stinger, we managed to run a num-
ber of TPC-DS queries, as we discuss next.
Figure 15 summarizes the number of supported queries
in all the systems. We show the number of queries that
each system can optimize (i.e., return a query plan), and
the number of queries that can finish execution and return
query results for the 256GB dataset.
Figure 13 and Figure 14 show the speedup ratio of HAWQ
over Impala and Stinger. Since not all the queries are
supported by the two systems, we only list the successful
queries. The bars marked with ‘∗’ in Figure 13 indicate the
queries that ran out of memory. For queries 46, 59 and 68, Impala and HAWQ have similar performance.
For queries where HAWQ has the largest speedups, we find
that Impala and Stinger handle join orders as literally spec-
ified in the query, while Orca explores different join orders
to suggest the best one using a cost-based approach. For example, in query 25, Impala joins the two fact tables store_sales and store_returns first and then joins this huge intermediate result with another fact table, catalog_sales, which
is quite inefficient. In comparison, Orca joins the fact tables
with dimension tables first to reduce intermediate results.
In general, join ordering is a non-trivial optimization that
requires extensive infrastructure on the optimizer side.
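A rough, purely illustrative calculation (all cardinalities and selectivities below are invented, not TPC-DS statistics) shows why deferring the fact-to-fact join pays off:

    # Invented numbers, for illustration only: two fact tables, one dimension.
    fact_a, fact_b, dim = 1_000_000_000, 1_000_000_000, 100_000
    sel_fact_join = 1e-9    # assumed fact-to-fact join selectivity
    sel_dim_filter = 0.01   # assumed selective filter on the dimension

    # Plan 1: join the two fact tables first, then the dimension.
    inter_fact_first = fact_a * fact_b * sel_fact_join      # 1e9 rows onward
    # Plan 2: join a fact table with the filtered dimension first, assuming
    # the dimension's filter carries through the join.
    inter_dim_first = fact_a * sel_dim_filter               # 1e7 rows onward

    print(f"fact-to-fact first: {inter_fact_first:,.0f} intermediate rows")
    print(f"dimension first:    {inter_dim_first:,.0f} intermediate rows")

Under these assumptions the dimension-first order feeds 100x fewer rows into the remaining joins, which is the qualitative effect observed in query 25.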
Impala recommends that users write joins in descending order of the sizes of the joined tables. However, this suggestion ignores the filters (which may be selective) on the tables, adds non-trivial overhead for database users writing complex queries, and may not be followed by BI tools that automatically generate queries. The lack of join ordering optimizations in a query optimizer has a negative impact on the quality of the produced plans. Other possible reasons for HAWQ's speedups, such as resource management and query execution, are outside the scope of this paper.
The average speedup ratio of HAWQ in this set of experiments is 6x against Impala and 21x against Stinger. Note that the queries listed in Figure 13 and Figure 14 are relatively simple queries in the TPC-DS benchmark. More complex queries (e.g., queries with correlated sub-queries) are not yet supported by the other systems, while being completely supported by Orca. We plan to revisit the TPC-DS performance evaluation in the future when all the queries are supported by the other systems.
8. RELATED WORK
Query optimization has been a fertile field for several groundbreaking innovations over the past decades. In this
section, we discuss a number of foundational query optimiza-
tion techniques, and recent proposals in the space of MPP
databases and Hadoop-based systems.
8.1 Query Optimization Foundations
Volcano Parallel Database [12] introduced basic princi-
ples for achieving parallelism in databases. The proposed
framework introduced exchange operators, which enable two
means of parallelism, namely inter-operator parallelism, via
pipelining, and intra-operator parallelism, via partitioning of
tuples across operators running on different processes. The
proposed design allows each operator to execute indepen-
dently on local data, as well as work in parallel with other
copies of the operator running in other processes. Several
MPP databases [6,8,18,20,23] make use of these principles
to build commercially successful products.
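As a minimal illustration of the exchange idea (a toy sketch, not Volcano's implementation), the operator below hash-partitions tuples pulled from its child across queues, so parallel copies of a downstream operator each consume one partition while the producer thread keeps the pipeline running concurrently:

    import threading, queue

    # Toy exchange operator: repartitions rows from a child iterator across
    # N queues; each downstream worker consumes one partition (intra-operator
    # parallelism), while the producer thread pipelines with its consumers
    # (inter-operator parallelism).
    class Exchange:
        def __init__(self, child, n_partitions, key):
            self.queues = [queue.Queue() for _ in range(n_partitions)]
            self.key = key
            threading.Thread(target=self._pump, args=(child,),
                             daemon=True).start()

        def _pump(self, child):
            for row in child:
                self.queues[hash(self.key(row)) % len(self.queues)].put(row)
            for q in self.queues:   # end-of-stream marker per partition
                q.put(None)

        def partition(self, i):
            # Iterator view of partition i, consumed by one parallel worker.
            while (row := self.queues[i].get()) is not None:
                yield row

    rows = ({"id": i, "v": i * 10} for i in range(8))
    ex = Exchange(rows, n_partitions=2, key=lambda r: r["id"])
    for i in range(2):
        print(i, list(ex.partition(i)))

The key design point, as in Volcano, is that the operators above and below the exchange are oblivious to parallelism: each copy runs the plain iterator protocol on its own partition.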
Cascades [13] is an extensible optimizer framework
whose principles have been used to build MS-SQL Server,
SCOPE [6], PDW [23], and Orca, the optimizer we present
in this paper. The popularity of this framework is due to
its clean separation of the logical and physical plan spaces.
This is primarily achieved by encapsulating operators and
transformation rules into self-contained components. This
modular design enables Cascades to group logically equiv-
alent expressions to eliminate redundant work, allows rules
to be triggered on demand in contrast to Volcano’s [14] ex-
haustive approach, and permits ordering the application of
rules based on their usefulness to a given operator.
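The group-based deduplication at the heart of this design can be sketched in a few lines (a toy Memo, assuming nothing about Cascades' or Orca's actual data structures):

    # Toy Memo: logically equivalent expressions are interned into shared
    # groups; children are referenced by group id rather than by subtree,
    # so a subexpression explored once is never re-derived for another parent.
    class Memo:
        def __init__(self):
            self.groups = []   # group id -> set of member expressions
            self.index = {}    # expression -> group id

        def insert(self, op, *child_groups):
            expr = (op, child_groups)
            if expr in self.index:        # duplicate work eliminated here
                return self.index[expr]
            gid = len(self.groups)
            self.groups.append({expr})
            self.index[expr] = gid
            return gid

        def add_equivalent(self, gid, op, *child_groups):
            # Record a transformation-rule output as another member
            # of an existing group.
            expr = (op, child_groups)
            self.groups[gid].add(expr)
            self.index[expr] = gid

    memo = Memo()
    a, b = memo.insert("Scan(A)"), memo.insert("Scan(B)")
    ab = memo.insert("Join", a, b)
    memo.add_equivalent(ab, "Join", b, a)   # e.g., join commutativity
    print(memo.groups)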
Building on the principles of Cascades, a parallel opti-
mization framework is proposed in [30] that enables building
Cascades-like optimizers for multi-core architectures. The
parallel query optimization framework in Orca (cf. Sec-
tion 4.2) is based on the principles introduced in [30].
8.2 SQL Optimization On MPP Databases
The exponential growth in the amount of data being
stored and queried has translated into a wider usage of
Massively Parallel Processing (MPP) systems such as Tera-
data [27], Oracle’s Exadata [31], Netezza [25], Pivotal Green-
plum Database [20], and Vertica [18]. Due to space limitations, we summarize a few recent efforts in re-designing the
query optimizer to meet the challenges of big data.
SQL Server Parallel Data Warehouse (PDW) [23] makes extensive re-use of Microsoft's established SQL Server
optimizer. For each query, PDW triggers an optimization
request to the SQL Server optimizer that works on a shell
database that maintains only the metadata and statistics of
the database and not its user data. The plan alternatives ex-
plored by SQL Server optimizer are then shipped to PDW’s
Data Movement Service (DMS) where these logical plans
are retrofitted with distribution information. While this ap-
proach avoids building an optimizer from scratch, it makes
debugging and maintenance harder since the optimization
logic is spread across two different processes and codebases.
Structured Computations Optimized for Parallel Execution
(SCOPE) [6] is Microsoft’s data analysis platform that lever-
ages characteristics of both parallel databases and MapRe-
duce systems. SCOPE’s scripting language, like Hive [28],
is based on SQL. SCOPE is developed for the Cosmos dis-
tributed data platform that employs an append-only file sys-
tem, while Orca is designed with a vision to work with mul-
tiple underlying data management systems.
SAP HANA [11] is a distributed in-memory database sys-
tem that handles business analytical and OLTP queries. An-
alytical queries in MPP databases can potentially generate
a large amount of intermediate results. Concurrent analyti-
cal queries can exhaust the available memory, most of which is already consumed to store and index raw data, triggering data to be spilled to disk, which negatively impacts query performance.
Vertica [18] is the commercialized MPP version of the C-
Store project [26] where the data is organized into projec-
tions and each projection is a subset of table attributes. The
initial StarOpt and its modified StratifiedOpt optimizer were
custom designed for queries over star/snowflake schemas,
where the join keys of the same range are co-located. When
data co-location cannot be achieved, the pertinent projec-
tions are replicated on all nodes to improve performance, as
addressed by Vertica’s V2Opt optimizer.
8.3 SQL On Hadoop
The classic approach of executing SQL on Hadoop is
converting queries into MapReduce jobs using Hive [28].
MapReduce performance can be unsatisfactory for interac-
tive analysis. Stinger [16] is an initiative to optimize query
evaluation on Hadoop by leveraging and extending Hive.
This approach, however, could entail a significant redesign of the MapReduce computing framework to optimize passes over the data and the materialization of intermediate results on disk.
Several efforts have addressed interactive processing on
Hadoop by creating specialized query engines that allow
SQL-based processing of data in HDFS without the need
to use MapReduce. Impala [17], HAWQ [21] and Presto [7]
are key efforts in this direction. These approaches are dif-
ferent in the design and capabilities of their query optimiz-
ers and execution engines, both of which are differentiating
factors for query performance. Co-location of DBMS and
Hadoop technologies allows data to be processed natively
on each platform, using SQL in the DBMS and MapReduce
in HDFS. Hadapt [4] pioneered this approach. Microsoft has
also introduced PolyBase [10] to offer the ability to join ta-
bles from PDW [23] with data on HDFS in order to optimize
data exchange from one platform to another.
AsterixDB [5] is an open-source effort to efficiently store,
index and query semi-structured information based on a
NoSQL-style data model. Currently, AsterixDB's query planner is driven by user hints rather than a cost-driven approach like Orca's. Dremel [19] is a scalable columnar solution from Google used to analyze the outputs of MapReduce pipelines. Dremel provides a high-level scripting language, similar to AsterixDB's AQL [5] and SCOPE's [6], to process read-only nested data.
9. SUMMARY
With the development of Orca, we aimed to build a query optimization platform that not only represents the
state-of-the-art but is also powerful and extensible enough to
support rapid development of new optimization techniques
and advanced query features.
In this paper, we described the engineering effort needed
to build such a system entirely from scratch. Integrating a number of technical safeguards into Orca required a significant investment, yet it has already paid significant dividends in the form of a rapid pace of development and the resulting high quality of the software. Orca's modularity allows it to
be adapted easily to different data management systems by
encoding a system’s capabilities and metadata using a clean
and uniform abstraction.
10. REFERENCES
[1] TPC-DS. http://www.tpc.org/tpcds, 2005.
[2] L. Antova, A. ElHelw, M. Soliman, Z. Gu, M. Petropoulos,
and F. Waas. Optimizing Queries over Partitioned Tables
in MPP Systems. In SIGMOD, 2014.
[3] L. Antova, K. Krikellas, and F. M. Waas. Automatic
Capture of Minimal, Portable, and Executable Bug Repros
using AMPERe. In DBTest, 2012.
[4] K. Bajda-Pawlikowski, D. J. Abadi, A. Silberschatz, and
E. Paulson. Efficient Processing of Data Warehousing
Queries in a Split Execution Environment. In SIGMOD,
2011.
[5] A. Behm, V. R. Borkar, M. J. Carey, R. Grover, C. Li,
N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou,
and V. J. Tsotras. ASTERIX: Towards a Scalable,
Semistructured Data Platform for Evolving-world Models.
Dist. Parallel Databases, 29(3), 2011.
[6] R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey,
D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and
Efficient Parallel Processing of Massive Data Sets. PVLDB,
1(2), 2008.
[7] L. Chan. Presto: Interacting with petabytes of data at
Facebook. http://prestodb.io, 2013.
[8] Y. Chen, R. L. Cole, W. J. McKenna, S. Perfilov, A. Sinha,
and E. Szedenits, Jr. Partial Join Order Optimization in
the Paraccel Analytic Database. In SIGMOD, 2009.
[9] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J.
Furman, S. Ghemawat, A. Gubarev, C. Heiser,
P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li,
A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan,
R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor,
R. Wang, and D. Woodford. Spanner: Google’s
Globally-distributed Database. In OSDI, 2012.
[10] D. J. DeWitt, A. Halverson, R. Nehme, S. Shankar,
J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling.
Split Query Processing in Polybase. In SIGMOD, 2013.
[11] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and
W. Lehner. SAP HANA Database: Data Management for
Modern Business Applications. SIGMOD Rec., 40(4), 2012.
[12] G. Graefe. Encapsulation of Parallelism in the Volcano
Query Processing System. In SIGMOD, 1990.
[13] G. Graefe. The Cascades Framework for Query
Optimization. IEEE Data Eng. Bull., 18(3), 1995.
[14] G. Graefe and W. J. McKenna. The Volcano Optimizer
Generator: Extensibility and Efficient Search. In ICDE,
1993.
[15] Z. Gu, M. A. Soliman, and F. M. Waas. Testing the
Accuracy of Query Optimizers. In DBTest, 2012.
[16] Hortonworks. Stinger, Interactive query for Apache Hive.
http://hortonworks.com/labs/stinger/, 2013.
[17] M. Kornacker and J. Erickson. Cloudera Impala:
Real-Time Queries in Apache Hadoop, for Real.
http://www.cloudera.com/content/cloudera/en/
products-and-services/cdh/impala.html, 2012.
[18] A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandiver,
L. Doshi, and C. Bear. The Vertica Analytic Database:
C-store 7 Years Later. VLDB Endow., 5(12), 2012.
[19] S. Melnik, A. Gubarev, J. J. Long, G. Romer,
S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel:
Interactive Analysis of Web-Scale Datasets. PVLDB,
3(1):330–339, 2010.
[20] Pivotal. Greenplum Database.
http://www.gopivotal.com/products/pivotal-greenplum-
database, 2013.
[21] Pivotal. HAWQ. http://www.gopivotal.com/sites/
default/files/Hawq_WP_042313_FINAL.pdf, 2013.
[22] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A.
Lorie, and T. G. Price. Access Path Selection in a
Relational Database Management System. In SIGMOD,
1979.
[23] S. Shankar, R. Nehme, J. Aguilar-Saborit, A. Chung,
M. Elhemali, A. Halverson, E. Robinson, M. S.
Subramanian, D. DeWitt, and C. Galindo-Legaria. Query
Optimization in Microsoft SQL Server PDW. In SIGMOD,
2012.
[24] E. Shen and L. Antova. Reversing Statistics for Scalable
Test Databases Generation. In Proceedings of the Sixth
International Workshop on Testing Database Systems,
pages 7:1–7:6, 2013.
[25] M. Singh and B. Leonhardi. Introduction to the IBM
Netezza Warehouse Appliance. In CASCON, 2011.
[26] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen,
M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J.
O’Neil, P. E. O’Neil, A. Rasin, N. Tran, and S. B. Zdonik.
C-Store: A Column-oriented DBMS. In VLDB, 2005.
[27] Teradata. http://www.teradata.com/, 2013.
[28] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka,
N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - A
Petabyte Scale Data Warehouse using Hadoop. In ICDE,
2010.
[29] F. Waas and C. Galindo-Legaria. Counting, Enumerating,
and Sampling of Execution Plans in a Cost-based Query
Optimizer. In SIGMOD, 2000.
[30] F. M. Waas and J. M. Hellerstein. Parallelizing Extensible
Query Optimizers. In SIGMOD Conference, pages 871–878,
2009.
[31] R. Weiss. A Technical Overview of the Oracle Exadata
Database Machine and Exadata Storage Server, 2012.