Query optimization techniques in Apache Hive

Query Optimization Techniques in Hive: A Comparative Analysis
Presenter: Zara Tariq - 1573119
Supervisor: Mr. Rahim Hasnani
MSCS Batch 2015 | Independent Study

INTRODUCTION
• Today, organizations of all sizes and types use Apache Hive to process and access their data in Hadoop in a familiar way, either through the broad set of BI (business intelligence) tools or directly through this proven interface. [1]
• Hive is being adopted by an increasingly wide range of organizations [2], so there is a need for more efficient and flexible techniques (within Hive) to optimize query performance.
• Query optimization depends directly on the infrastructure, the size and organization of the data, the storage formats, and the data readers/processors.

PROBLEM STATEMENT
Many query optimization techniques have recently been introduced in Hive, but it is still unclear which technique is the most effective, that is, which gives the shortest query processing time with the lowest latency.
Objective: To determine which technique, or combination of techniques, in Apache Hive gives the lowest latency (the shortest response time) when running queries on large datasets, among the recent technical advancements in Hive query processing engines [3].
Hypothesis: Cost-Based Optimization (CBO) optimizes each job's execution plan physically and logically before submitting it for final execution.

PURPOSE OF STUDY
The purpose of this research is to conduct a comparative analysis of several different query processing techniques and to identify the most efficient among them in the Apache Hive query execution engine.
The techniques selected for this study are:
1. MapReduce
2. ORCFile (Optimized Row Columnar File)
3. Vectorization
4. Cost-Based Optimization
SIGNIFICANCE OF RESEARCH
This study provides a practical comparison of different efficient techniques that were introduced theoretically in recent advancements in Apache Hive [3] (2015). The results can help data analysts improve response time and reduce job overhead by applying these techniques as best practices in daily work.

LITERATURE REVIEW
MapReduce
• MapReduce simplifies the deployment of massively parallel processing (MPP) systems by providing two user-friendly interfaces: map and reduce.
• Breaks the workload down into chunks that are distributed across a cluster
• No need to wait for any other job to complete first
• Different key-value pairs are generated [4] [5]
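The map and reduce interfaces described above can be illustrated with a minimal single-process sketch. It is not Hive or Hadoop code: the function names, the chunk size, and the sample records are assumptions chosen for illustration, and a real cluster would run the chunks in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit one (key, 1) pair for every record in a chunk."""
    return [(record, 1) for record in chunk]


def reduce_phase(key, values):
    """Combine all values that share a key into a single result."""
    return key, sum(values)


def run_mapreduce(records, chunk_size=2):
    # Break the workload into chunks; each chunk is independent of the others.
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

    # Shuffle step: group the emitted key-value pairs by key.
    grouped = defaultdict(list)
    for chunk in chunks:
        for key, value in map_phase(chunk):
            grouped[key].append(value)

    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical records standing in for one column of a table.
counts = run_mapreduce(["plant", "animal", "plant", "fungus", "plant"])
print(counts)
```

Because each chunk is mapped without reference to any other chunk, the map calls could be scheduled in any order, which is the property the second and third bullets rely on.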
Vectorization
• The dataset is processed in batches of rows, and the values of each column in a batch are represented as a column vector.
• Each row batch, together with its column vectors, is sized to fit in the processor's cache; the batch size is configurable (1024 rows by default) [3].
• Vectorization works more efficiently [6] when it is applied to ORC-formatted tables in Hive.
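The batch-of-column-vectors layout can be sketched in a few lines. This is only an illustration of the idea, not Hive's implementation: the helper names, the two-column sample table, and the filter are assumptions, and in Hive itself vectorized execution is enabled through configuration (for example the `hive.vectorized.execution.enabled` property) rather than written by hand.

```python
BATCH_SIZE = 1024  # default batch size mentioned in the text


def to_column_batches(rows, columns, batch_size=BATCH_SIZE):
    """Yield batches; each batch maps a column name to its column vector (a list)."""
    for start in range(0, len(rows), batch_size):
        batch_rows = rows[start:start + batch_size]
        yield {name: [row[i] for row in batch_rows] for i, name in enumerate(columns)}


def sum_where(batches, value_col, filter_col, threshold):
    """Sum value_col over rows whose filter_col exceeds threshold, one batch at a time."""
    total = 0
    for batch in batches:
        values = batch[value_col]
        flags = batch[filter_col]
        # The inner loop touches two contiguous column vectors, not whole rows.
        total += sum(v for v, f in zip(values, flags) if f > threshold)
    return total


# Hypothetical table with 3000 rows and two columns.
rows = [(i, i % 10) for i in range(3000)]
batches = list(to_column_batches(rows, ["id", "score"]))
batch_sizes = [len(batch["id"]) for batch in batches]
total = sum_where(batches, "id", "score", 8)
print(batch_sizes, total)
```

Processing one column vector at a time keeps the inner loop tight and cache-friendly, which is why the batch size is tied to the processor's cache.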

LITERATURE REVIEW (cont.)
Cost-Based Optimization
CBO uses two types of statistics, stored in the Hive Metastore, to produce an optimal query plan:
• Table stats
• Column stats
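How table and column statistics can drive plan selection is sketched below. This is a textbook-style estimate, not Hive's actual cost model: the table names other than `species`, all row counts except 376,859, the distinct-value counts, and the assumption that all three tables join on one shared key are hypothetical.

```python
from itertools import permutations

# Hypothetical statistics of the kind a metastore might hold:
# table stats (row counts) and column stats (distinct values of the join key).
STATS = {
    "species": {"rows": 376_859, "ndv": 376_859},
    "observations": {"rows": 1_000_000, "ndv": 300_000},
    "status": {"rows": 5_000, "ndv": 5_000},
}


def estimate_join(left, right):
    """Classic equi-join estimate: |L| * |R| / max(ndv_L, ndv_R)."""
    rows = left["rows"] * right["rows"] / max(left["ndv"], right["ndv"])
    return {"rows": rows, "ndv": min(left["ndv"], right["ndv"])}


def plan_cost(order):
    """Cost of a left-deep plan, taken here as the total rows its joins produce."""
    current = STATS[order[0]]
    cost = 0.0
    for name in order[1:]:
        current = estimate_join(current, STATS[name])
        cost += current["rows"]
    return cost


# Enumerate every join order and keep the cheapest one.
best_order = min(permutations(STATS), key=plan_cost)
print(best_order, round(plan_cost(best_order), 1))
```

With these assumed statistics the cheapest plan joins the two tables that yield the smallest intermediate result first and leaves the largest table for last, which is the kind of decision an optimizer cannot make without up-to-date statistics.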
Optimized Row Columnar File
ORC is an efficient file format for storing all the kinds of data supported by Apache Hive.
• This format improves read, write, and processing operations, as well as the performance of complex jobs.

RESEARCH METHODOLOGY
This research is divided into five phases:
1. Identify efficient and effective query processing techniques that have recently been introduced in Apache Hive [9].
2. Select and load the dataset.
3. Design a set of 10 queries with different complexity levels [10].
4. Test all of these queries on the selected Hive query optimization techniques.
5. Compare and summarize the results, and suggest which technique worked best in most scenarios (queries).
Figure 1: Research Methodology

EXPERIMENT
Dataset
The dataset selected for the analysis is taken from the DBpedia 2014 ontology release, which consists of 47 classes and 70+ GB of data. Only a single class, Species, is used in this research.
Design Test Queries
Ten queries were designed, based on the benchmark [10] of interactive queries on text data.

Data Format     Size      # of Tables   # of Rows   # of Columns
Original Data   1.79 GB   26            376,859     1,059
Table 1: Details of the dataset used in the experiment

                  Original Data (Text file)   RCFile   ORC SerDe
Data Size (GB)    1.79                        1.50     0.38
Table 2: Change in the size of the dataset used for analysis (in GB)
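The size change in Table 2 can be restated as a percentage reduction relative to the original text file. The snippet below is a small worked calculation using only the three sizes from the table; the variable names are illustrative.

```python
# Sizes in GB, taken from Table 2.
sizes_gb = {"Text": 1.79, "RCFile": 1.50, "ORC": 0.38}

baseline = sizes_gb["Text"]
reduction_pct = {
    fmt: round(100 * (baseline - size) / baseline, 1)
    for fmt, size in sizes_gb.items()
}
print(reduction_pct)
```

On these figures the ORC copy is roughly 79% smaller than the text file, while the RCFile copy is roughly 16% smaller.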

Figure 2: Details of the set of queries designed for analysis
Interactive Queries
  Simple Queries: Q1, Q2, Q3
  Single Join Queries: Q4, Q5, Q8, Q9
  Nested Join Queries: Q6, Q7, Q10

EXPERIMENT (cont.)
Phase I Results:
The set of 10 queries was tested on the four core query optimization techniques in the Hive query execution engine.

            Simple Queries          Single Join Queries             Nested Join Queries
Techniques  Q1      Q2      Q3      Q4      Q5      Q8      Q9      Q6      Q7      Q10     Avg. RT
Hive        264.5   122.8   141.1   174.7   267.6   316.6   83.4    5825.8  1103.3  6433.3  1473.3
MapReduce   139.7   126.9   112.3   202.9   325.7   152.3   26.6    4899.1  1050.4  4808.4  1184.4
ORC File    64.7    26.0    0.9     313.3   339.4   68.7    30.3    4923.6  1420.1  6348.8  1353.6
Vectorize   217.9   188.9   125.8   231.5   347.5   199.1   151.8   4646.3  1664.2  6718.9  1449.2
CBO         203.3   145.6   135.5   296.2   240.3   152.2   43.9    5186.1  1323.7  5119.6  1284.6
Table 3: Response time of queries at Phase I (in seconds)
Avg. RT = Average Response Time
20%: reduction in average response time, MapReduce vs. default Hive
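The 20% figure can be reproduced from the Hive and MapReduce rows of Table 3. The sketch below recomputes the two averages and the relative reduction; the variable names are illustrative and the per-query times are copied from the table.

```python
# Per-query response times in seconds (Q1, Q2, Q3, Q4, Q5, Q8, Q9, Q6, Q7, Q10).
hive = [264.5, 122.8, 141.1, 174.7, 267.6, 316.6, 83.4, 5825.8, 1103.3, 6433.3]
mapreduce = [139.7, 126.9, 112.3, 202.9, 325.7, 152.3, 26.6, 4899.1, 1050.4, 4808.4]

avg_hive = sum(hive) / len(hive)
avg_mapreduce = sum(mapreduce) / len(mapreduce)

# Relative reduction in average response time compared with default Hive.
improvement_pct = 100 * (avg_hive - avg_mapreduce) / avg_hive
print(round(avg_hive, 1), round(avg_mapreduce, 1), round(improvement_pct, 1))
```

Note that the average is dominated by the three nested join queries, so a technique that is slower on some simple or single join queries can still come out ahead overall.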

Figure 3: MapReduce responded most efficiently compared with the other techniques (bar chart of average response time in seconds for Hive, MapReduce, ORCFile, Vectorize, and CBO)

EXPERIMENT (cont.)
Phase II Results:
The same set of queries was tested again on the ORC-formatted dataset, using three techniques in combination with ORC: MapReduce, CBO, and Vectorization.

                    Simple Queries          Single Join Queries             Nested Join Queries
Techniques          Q1      Q2      Q3      Q4      Q5      Q8      Q9      Q6      Q7      Q10     Avg. RT
Hive                264.5   122.8   141.1   174.7   267.6   316.6   83.4    5826    1103    6433    1473
Vectorize+ORC File  179.4   5.748   1.068   402.8   241.4   54.12   35.48   4042    1333    6010    1230
CBO+ORC File        71.17   1.435   0.428   363     163     45.58   38.75   4605    1897    5990    1318
MR+ORC File         55.54   0.634   0.144   263.8   191.5   45.09   26.77   4075    1203    4779    1064
Table 4: Response time of queries at Phase II (in seconds)
Avg. RT = Average Response Time
28%: reduction in average response time, MR+ORC File vs. default Hive

Figure 4: MapReduce + ORC responded most efficiently compared with the other combinations of techniques (bar chart of average response time in seconds)

RESULT
Techniques          Simple  Single Join  Nested Join  Avg. Response Time  Improvement over Hive
Hive                176.1   210.6        4454.1       1613.6              -
MapReduce           126.3   176.9        3585.9       1296.4              20%
ORCFile             30.5    188.0        4230.8       1483.1              8%
Vectorize           177.5   232.5        4343.1       1584.4              2%
CBO                 161.4   183.1        3876.5       1407.0              13%
Vectorize+ORCFile   62.1    183.5        3794.8       1346.8              17%
CBO+ORCFile         24.3    152.6        4164.0       1447.0              11%
MR+ORCFile          18.8    131.8        3352.3       1167.6              28%
Table 5: Most efficient (in terms of speed) technique in Hive for query processing (times in seconds)
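The averages in Table 5 appear to be the mean of the three per-category averages, and the improvement column works out to a percentage reduction relative to default Hive. The sketch below checks this for the Hive and MR+ORC rows using the per-query times from Tables 3 and 4; the helper name and the grouping of queries into categories follow Figure 2, and the rest is illustrative.

```python
def category_average(simple, single_join, nested_join):
    """Average each query category first, then average the three category means."""
    means = [sum(group) / len(group) for group in (simple, single_join, nested_join)]
    return sum(means) / len(means)


# Response times in seconds, grouped as Simple / Single Join / Nested Join.
hive_avg = category_average(
    [264.5, 122.8, 141.1],
    [174.7, 267.6, 316.6, 83.4],
    [5825.8, 1103.3, 6433.3],
)
mr_orc_avg = category_average(
    [55.54, 0.634, 0.144],
    [263.8, 191.5, 45.09, 26.77],
    [4075, 1203, 4779],
)

improvement_pct = 100 * (hive_avg - mr_orc_avg) / hive_avg
print(round(hive_avg, 1), round(mr_orc_avg, 1), round(improvement_pct, 1))
```

Averaging by category gives the three simple queries the same weight as the three nested join queries, so these averages differ from the per-query averages reported in Tables 3 and 4.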

Figure 5: Most Efficient Query Processing Technique in Hive (28% improvement over default Hive)

CONCLUSION
After applying different query optimization techniques and executing the set of queries on Apache Hive, this research concludes that:
• Applying the MapReduce technique to ORCFile-formatted data improved performance by about 20% to 28% in average response time relative to default Hive, which makes this combination the most effective approach to optimizing interactive queries on this type of dataset.
• The most effective standalone technique was also MapReduce, which completed the Hive jobs with an average response time about 20% lower than that of the default Hive execution engine.
FUTURE DIRECTION
This research can be carried forward by performing the same analysis on other big data analytics platforms, such as MapR, Hortonworks, Microsoft Azure, and any other ecosystem that supports HiveQL.

REFERENCES
[1] Chen-Ching Liu,.: John Wiley & Sons, 2016, 01-Aug-2016, pp. 436-440.
[2] Reynold S. Xin, Josh Rosen, Matei Zaharia, et al., "Shark: SQL and Rich Analytics at Scale," in SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, 2013.
[3] Yin Huai, et al., "Major Technical Advancements in Apache Hive," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014.
[4] Sai Wu, et al., "Query optimization for massively parallel data processing," in Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011.
[5] Todd Lipcon. (2011, June) MapReduce. [Online]. https://wiki.apache.org/hadoop/MapReduce/
[6] Dominique A. Heger, "A Brief Introduction to Apache Tez," DHTechnologies & Data Nubes.
[7] Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard, Practical Hive: A Guide to Hadoop's Data Warehouse System, 1st ed.: Apress, 2016.
[8] Ofer Mendelevitch. (2015, February) 5 Ways to Make Your Hive Queries Run Faster. [Online]. http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
[9] Apache Hive TM. (2016, June) Index of /dist/hive/hive-2.1.0. [Online]. http://www-us.apache.org/dist/hive/hive-2.1.0/
[10] Carter Shanklin. (2014, June) Benchmarking Apache Hive 13 for Enterprise Hadoop. [Online]. http://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/
