Hive contributors have striven to improve the capability of Hive in terms of both performance and functionality. We assert that understanding and analyzing Apache Hive query execution plan is crucial for performance debugging. In this talk, we study why Apache Hive’s query execution plan today are so difficult to analyze. We identify a set of pain points from our Apache Hive performance engineers, development engineers as well as real users/customers. We propose and show a new presentation data model that can well address the pain points. The three most critical parts of the presentation are (1) the estimated query execution cost, which is the planner's guess at how long it will take to run the query (measured in #rows); (2) the orchestration of the operator tree across consecutive M/R (Tez) jobs; and (3) integration and extension support with other presentation tools, e.g., Apache Ambari.
We can see that Hive old style explain is quite verbose, is it necessary?
We pick a real query out of the 500 queries.
Although Reduce Output Operator is crucial as it defines the boundary between a map task and a reduce task, we are thinking how much information of the reduce output operator is.... SQL users care about more about relational operators rather than MR operators.
We can also see more details about this join.
It seems that Postgres sets up a good example for hive to improve on.
So this is the basic introduction of the new style hive explain plan
This query involves 3 tables.
For Query 87
The next version of Hive explain plan will look like this. We highlight the changes that we want to make in red color.
We thank the audience for your attention.
I am ready to take any questions.
How to understand and analyze Apache Hive query execution plan for performance debugging