Data management has become a complex and essential concern for every industry, organization and individual that has to work with existing and continuously growing data. Organizations of all types are seeing exponential growth in their data volumes, together with demands for immediate results and readily available metrics, delivered cost-effectively, to support strategic decisions. This paper compares the query optimization techniques recently introduced in Hive, including storage formats, to find which gives the lowest query response time, by testing a set of interactive queries on a single-node Big Data cluster.
Our results show that applying MapReduce to ORCSerde-formatted data reduces average response time by up to about 28% relative to the Hive baseline, which makes this combination the most effective method of optimizing interactive queries on this type of dataset. Among the techniques applied on their own, MapReduce is the most efficient: it completed the overall set of Hive jobs with an average response time about 20% lower than the Hive default execution engine.
Query optimization techniques in Apache Hive
1. Query Optimization Techniques in Hive: A Comparative Analysis
Presenter: Zara Tariq - 1573119
Supervisor: Mr. Rahim Hasnani
MSCS Batch 2015 | Independent Study
2. INTRODUCTION
Today, organizations of all sizes and types use Apache Hive to process and access their data in Hadoop in a familiar way, either through the broad set of BI (business intelligence) tools or directly through this proven interface. [1]
Hive is being adopted in an increasingly wide range of organizations [2], so there is a need for more efficient and flexible techniques within Hive to optimize query performance.
Query optimization depends directly on infrastructure, data size, data organization, storage formats and the data readers/processors.
3. PROBLEM STATEMENT
Many query optimization techniques have recently been introduced in Hive, but it is still unclear which technique is the most effective, that is, which gives the lowest query processing time and latency.
Objective: To determine which technique, or combination of techniques, among the recent technical advancements in Hive query processing engines [3] gives the lowest response time when running queries on large datasets.
Hypothesis: Cost-Based Optimization (CBO) optimizes each job's execution plan, both physically and logically, before submitting it for final execution.
4. PURPOSE OF STUDY
The purpose of this research is to conduct a comparative analysis of several efficient techniques and to identify the most efficient of them in the Apache Hive query execution engine.
The following techniques were selected for this study:
1. MapReduce
2. ORCFile (Optimized Row Columnar File)
3. Vectorization (query execution model)
4. Cost-Based Optimization
SIGNIFICANCE OF RESEARCH
This study provides a practical comparison of efficient techniques that were introduced theoretically in the recent advancements in Apache Hive [3] (2015). The comparison can help data analysts improve response time and reduce job load overheads by applying these techniques as everyday best practice.
5. LITERATURE REVIEW
MapReduce
MapReduce simplifies the deployment of Massively Parallel Processing systems by providing two user-friendly interfaces: map and reduce.
Breaks the workload down into chunks that are distributed across the cluster
A task does not need to wait for any other job to complete first
Key-value pairs are generated by the mappers and merged by the reducers [4] [5]
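The map and reduce interfaces described above can be illustrated with a minimal, single-process Python sketch. It is not Hadoop's or Hive's implementation: the sample records and the function names (`map_phase`, `shuffle`, `reduce_phase`) are assumptions chosen for illustration. Each chunk is mapped independently into key-value pairs, and the reducer then merges the values for each key.

```python
from collections import defaultdict

# Hypothetical input, already split into chunks as a cluster would hold it.
chunks = [
    ["Felidae", "Canidae"],
    ["Felidae", "Ursidae", "Felidae"],
]

def map_phase(chunk):
    """Emit one (key, 1) pair per record; each chunk is processed independently."""
    return [(family, 1) for family in chunk]

def shuffle(mapped_chunks):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Merge the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

mapped = [map_phase(chunk) for chunk in chunks]  # these calls could run in parallel
result = reduce_phase(shuffle(mapped))
print(result)
```

Because no map call depends on another chunk, the map step is the part a cluster can spread across nodes.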
Vectorization
Data is processed in batches of rows, and within each batch the values of a column are represented as a column vector.
The batch size is configurable so that a batch fits in the processor's cache (the default is 1024 rows) [3].
Vectorization works more efficiently when it is applied to ORC-formatted tables within Hive [6].
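As a rough illustration of this batch-at-a-time model, the Python sketch below evaluates a filter over a column in vectors of 1024 values instead of row by row. Only the default batch size of 1024 comes from the text above; the column, the predicate and the helper names are hypothetical, and Hive's actual vectorized operators are implemented differently.

```python
BATCH_SIZE = 1024  # default batch size mentioned in the text

def batches(column, size=BATCH_SIZE):
    """Yield consecutive slices (column vectors) of at most `size` values."""
    for start in range(0, len(column), size):
        yield column[start:start + size]

def count_greater_than(column, threshold):
    """Apply the predicate to one whole column vector at a time."""
    total = 0
    batch_count = 0
    for vector in batches(column):
        total += sum(1 for value in vector if value > threshold)
        batch_count += 1
    return total, batch_count

# Hypothetical column holding the values 0, 1, ..., 2499.
lengths = list(range(2500))
matches, n_batches = count_greater_than(lengths, 1999)
print(matches, n_batches)
```

The per-batch loop is where a real engine gains: the inner work touches one tightly packed column vector that fits in cache.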
6. LITERATURE REVIEW (cont.)
Cost Based Optimization
CBO uses two types of statistics, stored in the Hive Metastore, to produce an optimal query plan:
Table stats
Column stats
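To show why these two kinds of statistics matter, here is a small Python sketch of a textbook join-size estimate that combines row counts (table stats) with the number of distinct join-key values (column stats). The estimate rows_left * rows_right / max(ndv_left, ndv_right) is a standard textbook formula and is not claimed to be Hive's exact cost model; the table names and statistics are made up for illustration.

```python
def estimated_join_rows(rows_left, rows_right, ndv_left, ndv_right):
    """Textbook equi-join cardinality estimate from row counts and distinct values."""
    return rows_left * rows_right / max(ndv_left, ndv_right)

# Hypothetical statistics for two candidate first joins in a multi-join query.
candidates = {
    "species JOIN genus": estimated_join_rows(400_000, 50_000, 45_000, 50_000),
    "species JOIN observation": estimated_join_rows(400_000, 2_000_000, 400_000, 350_000),
}

# A cost-based planner would prefer the join with the smaller intermediate result.
best = min(candidates, key=candidates.get)
print(candidates, best)
```

Without the statistics, a planner has no basis for choosing between the two orders, which is why missing stats limit what CBO can do.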
Optimized Row Columnar File
ORC is an efficient way of storing table data of all the types supported within Apache Hive.
It improves read, write and processing operations, and the performance of complex jobs.
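The read benefit of a columnar layout can be sketched in a few lines of Python. This is a conceptual illustration only, not the ORC file format itself: the rows and column names are invented, and real ORC files add further structure such as indexes and compression on top of this idea.

```python
# Hypothetical row-oriented records.
rows = [
    {"id": 1, "name": "Panthera leo", "family": "Felidae"},
    {"id": 2, "name": "Canis lupus", "family": "Canidae"},
    {"id": 3, "name": "Felis catus", "family": "Felidae"},
]

def to_columnar(rows):
    """Regroup row records so that each column's values are stored together."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

columns = to_columnar(rows)

# A query that needs only `family` reads one column, not every field of every row.
felidae_count = columns["family"].count("Felidae")
print(felidae_count)
```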
7. RESEARCH METHODOLOGY
This research was divided into five phases:
1. Identify efficient and effective query processing techniques that have been recently introduced in Apache Hive [9].
2. Select and load the dataset.
3. Design a set of 10 queries with different complexity levels [10].
4. Test all of these queries on the selected set of Hive query optimization techniques.
5. Compare and summarize the results, and suggest which technique worked best in most of the scenarios (queries).
Figure 1: Research Methodology
8. EXPERIMENT
Dataset
The dataset selected for the analysis is taken from the DBpedia ontology released in 2014, which consists of 47 classes and more than 70 GB of data. Only a single class, Species, is used in this research.
Design Test Queries
Ten interactive queries were designed, based on the benchmark in [10], and run on the text data.
Data Format      Size      # of Tables   # of Rows   # of Columns
Original Data    1.79 GB   26            376,859     1,059
Table 1: Details of the dataset used in the experiment
                  Original Data (Text file)   RCFile   ORCSerde
Data Size in GB   1.79                        1.50     0.38
Table 2: Change in the size of the dataset used for analysis, by storage format (in GB)
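The size figures in Table 2 translate into storage reductions as follows. This is a small arithmetic sketch in Python; the dictionary keys and the helper name `reduction_percent` are illustrative, and the sizes are the ones reported in the table.

```python
# Dataset sizes in GB, as reported in Table 2.
sizes_gb = {"Text": 1.79, "RCFile": 1.50, "ORCSerde": 0.38}

def reduction_percent(original, converted):
    """Percentage by which `converted` is smaller than `original`."""
    return (original - converted) / original * 100

reductions = {
    fmt: round(reduction_percent(sizes_gb["Text"], size), 1)
    for fmt, size in sizes_gb.items()
    if fmt != "Text"
}
print(reductions)
```

On these numbers the ORC copy is a little under 79% smaller than the text original, while the RCFile copy is about 16% smaller.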
9. Figure 2: Details of the set of queries designed for analysis
Interactive Queries: Simple Queries (Q1, Q2, Q3); Single Join Queries (Q4, Q5, Q8, Q9); Nested Join Queries (Q6, Q7, Q10)
10. EXPERIMENT (cont.)
Phase I Results:
The set of 10 queries was tested on the four core query optimization techniques in the Hive query execution engine.
Techniques   Q1     Q2     Q3     Q4     Q5     Q8     Q9     Q6      Q7      Q10     Avg. RT
Hive         264.5  122.8  141.1  174.7  267.6  316.6  83.4   5825.8  1103.3  6433.3  1473.3
MapReduce    139.7  126.9  112.3  202.9  325.7  152.3  26.6   4899.1  1050.4  4808.4  1184.4
ORC File     64.7   26.0   0.9    313.3  339.4  68.7   30.3   4923.6  1420.1  6348.8  1353.6
Vectorize    217.9  188.9  125.8  231.5  347.5  199.1  151.8  4646.3  1664.2  6718.9  1449.2
CBO          203.3  145.6  135.5  296.2  240.3  152.2  43.9   5186.1  1323.7  5119.6  1284.6
Query groups: Simple Queries (Q1, Q2, Q3); Single Join Queries (Q4, Q5, Q8, Q9); Nested Join Queries (Q6, Q7, Q10)
Table 3: Response time of queries at Phase I (in seconds)
Avg. RT = Average Response Time
MapReduce: about 20% lower Avg. RT than Hive
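The Avg. RT column of Table 3, and the relative improvement of each technique over Hive, can be recomputed from the per-query values. The Python sketch below does this; the variable names are illustrative, and the improvement is expressed as the percentage reduction in average response time relative to the Hive row.

```python
# Per-query response times from Table 3, in the order Q1 Q2 Q3 Q4 Q5 Q8 Q9 Q6 Q7 Q10.
phase1 = {
    "Hive":      [264.5, 122.8, 141.1, 174.7, 267.6, 316.6, 83.4, 5825.8, 1103.3, 6433.3],
    "MapReduce": [139.7, 126.9, 112.3, 202.9, 325.7, 152.3, 26.6, 4899.1, 1050.4, 4808.4],
    "ORC File":  [64.7, 26.0, 0.9, 313.3, 339.4, 68.7, 30.3, 4923.6, 1420.1, 6348.8],
    "Vectorize": [217.9, 188.9, 125.8, 231.5, 347.5, 199.1, 151.8, 4646.3, 1664.2, 6718.9],
    "CBO":       [203.3, 145.6, 135.5, 296.2, 240.3, 152.2, 43.9, 5186.1, 1323.7, 5119.6],
}

# Average response time per technique, rounded as in the table.
avg_rt = {name: round(sum(times) / len(times), 1) for name, times in phase1.items()}

# Percentage reduction in average response time relative to Hive.
baseline = avg_rt["Hive"]
reduction = {name: round((baseline - avg) / baseline * 100) for name, avg in avg_rt.items()}

print(avg_rt)
print(reduction)
```

The averages are dominated by the three nested join queries, so a technique that is much faster on simple queries (such as ORC File on Q3) can still show only a modest overall gain.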
11. Figure 3: MapReduce responded most efficiently compared with the other techniques
[Bar chart of average response time, in seconds: Hive 1473.3; MapReduce 1184.4; ORCFile 1353.6; Vectorize 1449.2; CBO 1284.6]
12. EXPERIMENT (cont.)
Phase II Results:
The same set of queries was tested again on the ORC File formatted dataset, combining ORC with each of three techniques: MapReduce, CBO and Vectorization.
Techniques           Q1     Q2     Q3     Q4     Q5     Q8     Q9     Q6    Q7    Q10   Avg. RT
Hive                 264.5  122.8  141.1  174.7  267.6  316.6  83.4   5826  1103  6433  1473
Vectorize+ORC File   179.4  5.748  1.068  402.8  241.4  54.12  35.48  4042  1333  6010  1230
CBO+ORC File         71.17  1.435  0.428  363    163    45.58  38.75  4605  1897  5990  1318
MR+ORC File          55.54  0.634  0.144  263.8  191.5  45.09  26.77  4075  1203  4779  1064
Query groups: Simple Queries (Q1, Q2, Q3); Single Join Queries (Q4, Q5, Q8, Q9); Nested Join Queries (Q6, Q7, Q10)
Table 4: Response time of queries at Phase II (in seconds)
Avg. RT = Average Response Time
MR+ORC File: about 28% lower Avg. RT than Hive
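Besides the averages, the Phase II values can be used to check which technique was fastest on each individual query. The Python sketch below counts per-query wins from the figures in Table 4; the variable names are illustrative.

```python
from collections import Counter

queries = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q8", "Q9", "Q6", "Q7", "Q10"]

# Per-query response times from Table 4, in the column order listed above.
phase2 = {
    "Hive":               [264.5, 122.8, 141.1, 174.7, 267.6, 316.6, 83.4, 5826, 1103, 6433],
    "Vectorize+ORC File": [179.4, 5.748, 1.068, 402.8, 241.4, 54.12, 35.48, 4042, 1333, 6010],
    "CBO+ORC File":       [71.17, 1.435, 0.428, 363, 163, 45.58, 38.75, 4605, 1897, 5990],
    "MR+ORC File":        [55.54, 0.634, 0.144, 263.8, 191.5, 45.09, 26.77, 4075, 1203, 4779],
}

# Fastest technique for each query.
winners = {
    query: min(phase2, key=lambda name: phase2[name][i])
    for i, query in enumerate(queries)
}

# Number of queries won by each technique.
wins = Counter(winners.values())
print(winners)
print(wins)
```

On these numbers MR+ORC File is fastest on six of the ten queries, while plain Hive remains fastest on Q4 and Q7, so the combination is best in most scenarios rather than all of them.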
13. Figure 4: MapReduce + ORC responded most efficiently compared with the other combinations of techniques
[Bar chart of average response time, in seconds; plotted values: 1473.3, 1230.5, 1314.5, 1064, 1284.6]
14. RESULT
Techniques          Simple   Single Join   Nested Join   Avg. Response Time   Improvement over Hive
Hive                176.1    210.6         4454.1        1613.6               -
MapReduce           126.3    176.9         3585.9        1296.4               20%
ORCFile             30.5     188.0         4230.8        1483.1               8%
Vectorize           177.5    232.5         4343.1        1584.4               2%
CBO                 161.4    183.1         3876.5        1407.0               13%
Vectorize+ORCFile   62.1     183.5         3794.8        1346.8               17%
CBO+ORCFile         24.3     152.6         4164.0        1447.0               11%
MR+ORCFile          18.8     131.8         3352.3        1167.6               28%
Table 5: Most efficient (in terms of speed) technique in Hive for query processing (times in seconds)
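The last two columns of Table 5 can be derived from the three category averages. In the Python sketch below, the overall figure is the mean of the three category averages (which is why it differs from the per-query Avg. RT in Tables 3 and 4), and the improvement is the percentage reduction of that figure relative to Hive. The variable names are illustrative; the results agree with the table to within a rounding step.

```python
# Category averages from Table 5: (simple, single join, nested join).
categories = {
    "Hive":              (176.1, 210.6, 4454.1),
    "MapReduce":         (126.3, 176.9, 3585.9),
    "ORCFile":           (30.5, 188.0, 4230.8),
    "Vectorize":         (177.5, 232.5, 4343.1),
    "CBO":               (161.4, 183.1, 3876.5),
    "Vectorize+ORCFile": (62.1, 183.5, 3794.8),
    "CBO+ORCFile":       (24.3, 152.6, 4164.0),
    "MR+ORCFile":        (18.8, 131.8, 3352.3),
}

# Overall average: unweighted mean of the three category averages.
overall = {name: sum(values) / len(values) for name, values in categories.items()}

# Percentage reduction in the overall average relative to Hive.
baseline = overall["Hive"]
improvement = {
    name: round((baseline - value) / baseline * 100)
    for name, value in overall.items()
    if name != "Hive"
}

for name, value in overall.items():
    print(name, round(value, 1), improvement.get(name, "-"))
```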
15. Figure 5: Most Efficient Query Processing Technique in Hive
[Chart; highlighted value: 28%]
16. CONCLUSION
After applying different query optimization techniques and executing a set of queries on Apache Hive, this research concludes that:
Applying the MapReduce technique to ORCFile-formatted data improves average response time by about 20% to 28%, which makes this combination of techniques the most effective approach to optimizing interactive queries on this type of dataset.
The most effective independent technique is also MapReduce itself, which completed the overall set of Hive jobs with an average response time about 20% lower than the Hive default execution engine.
FUTURE DIRECTION
This research can be carried forward by performing the same analysis on other big data analytics platforms such as MapR, Hortonworks, Microsoft Azure and any other ecosystem that supports HiveQL.
17. REFERENCES
[1] Chen-Ching Liu, John Wiley & Sons, 01-Aug-2016, pp. 436-440.
[2] Reynold S. Xin, Josh Rosen, Matei Zaharia, et al., "Shark: SQL and Rich Analytics at Scale," in SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, 2013.
[3] Yin Huai, et al., "Major Technical Advancements in Apache Hive," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014.
[4] Sai Wu, et al., "Query optimization for massively parallel data processing," in Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011.
[5] Todd Lipcon. (2011, June) MapReduce. [Online]. https://wiki.apache.org/hadoop/MapReduce/
[6] Dominique A. Heger, "A Brief Introduction to Apache Tez," DHTechnologies & Data Nubes.
[7] Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard, Practical Hive: A Guide to Hadoop's Data Warehouse System, 1st ed. Apress, 2016.
[8] Ofer Mendelevitch. (2015, February) 5 Ways to Make Your Hive Queries Run Faster. [Online]. http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
[9] Apache Hive TM. (2016, June) Index of /dist/hive/hive-2.1.0. [Online]. http://www-us.apache.org/dist/hive/hive-2.1.0/
[10] Carter Shanklin. (2014, June) Benchmarking Apache Hive 13 for Enterprise Hadoop. [Online]. http://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/
Editor's Notes
The Hive query execution engine continually strives to meet its users' requirements for processing ever larger volumes of data efficiently and scalably [18].
This opens up many possibilities for query optimization; this research focuses on a selection of the techniques available in Hive.
LIMITATION of my Research
This research was conducted on Apache Hadoop on the Cloudera QuickStart VM (single-node cluster).
Specifications of the machine:
CentOS 6.7
Cloudera Enterprise 6.2.0
Hadoop 2.6.0 CDH 5.8.0
Hive 1.1.0-cdh5.8.0
4 GB RAM
VMware Workstation 12 Player
64-bit Windows OS
The MapReduce technique is used on the Hive engine to manage very large workloads, reducing processing time by distributing complex query tasks across multiple nodes in parallel [6].
ORCFile (Optimized Row Columnar File) is a recently introduced file format. It records data in a columnar storage organization, which provides efficient data access and high storage efficiency with low overhead.
Vectorization is a query execution model introduced in the recent advancements of Hive. It reduces query response time and runtime by making better use of modern CPUs.
SIGNIFICANCE OF RESEARCH
It will be beneficial for data analysts who currently use Apache Hive in their organizations for data analytics and processing.
It also provides a practical comparison of efficient techniques, which can help data analysts improve response time and reduce job load overheads by applying them as everyday best practice.
MapReduce
MR has been widely recognized as an efficient tool for large-scale data analysis.
MapReduce simplifies the deployment of Massively Parallel Processing systems by providing two user-friendly interfaces: map and reduce.
It breaks the workload down into chunks that are distributed across the cluster.
A task does not need to wait for any other job to complete first, and the algorithm introduces no data dependencies between tasks.
Key-value pairs are generated and can be saved on the distributed filesystem, where the reducers then merge the results at the end.
Use of Vectorization
It is an execution model for query optimization.
Data is processed in batches of rows, and within each batch the values of a column are represented as a column vector.
The batch size is configurable so that a batch fits in the processor's cache (the default is 1024 rows) [3].
This technique lowers job latency by fetching 1024 rows at a time, which optimizes container usage.
Vectorization works more efficiently when it is applied to ORC-formatted tables within Hive [6], reducing CPU time and improving cluster utilization.
CBO: This feature is taken from RDBMS systems. It is most useful for two types of queries: queries with multiple joins and queries with more than one step of operations.
Table stats: Contain the size of a table without compressing the data.
Column stats: Contain the number of distinct values in the data, used to optimize the cost of querying big data.
ORC: A highly efficient way to store table data in Hive. It decreases the response time of query processing [8].
Design Test Queries:
Ten interactive queries were designed, based on the benchmark in [10], and first run on the text data. The data was then converted to ORC so that the same queries could be tested on the same data in ORC file format, which reduced the storage size by up to 78%.
CONCLUSION
Due to time constraints and the aforementioned limitations, this research was conducted solely on the CDH 5.8 image of the Cloudera ecosystem for the VMware platform.