Query Optimization Techniques in
Hive: A Comparative Analysis
Presenter: Zara Tariq - 1573119
Supervisor: Mr. Rahim Hasnani
MSCS Batch 2015 | Independent Study
INTRODUCTION
• Today, organizations of all sizes and types use Apache Hive to process and access their data in Hadoop in a familiar way, either through a broad set of BI (business intelligence) tools or directly through this proven interface. [1]
• Hive is being adopted by an increasingly wide range of organizations [2], so there is a need for more efficient and flexible techniques (within Hive) to optimize query performance.
• Query optimization depends directly on the infrastructure, the size and organization of the data, the storage formats, and the data readers/processors.
PROBLEM STATEMENT
Many query optimization techniques have recently been introduced in Hive, but it is still unclear which technique is the most effective, that is, which one gives the shortest query processing time with the lowest latency.
Objective: To determine which technique, or combination of techniques, among the recent technical advancements in Hive query processing engines [3] is the most efficient, measured as the lowest response time when running queries on large datasets.
Hypothesis: Cost-Based Optimization (CBO) optimizes each job's execution plan, both logically and physically, before it is submitted for final execution.
PURPOSE OF STUDY
The purpose of this research is to conduct a comparative analysis of several different techniques and to identify the most efficient among them in the Apache Hive query execution engine.
The techniques selected for this study are:
1. MapReduce
2. ORCFile (Optimized Row Columnar File)
3. Vectorization for query optimization model
4. Cost-Based Optimization
SIGNIFICANCE OF RESEARCH
This study provides a practical comparison of several efficient techniques that were introduced theoretically in recent advancements in Apache Hive [3] (2015). The results can help data analysts improve response time and reduce job overhead by applying these techniques as day-to-day best practices.
LITERATURE REVIEW
MapReduce
• MapReduce simplifies the deployment of massively parallel processing systems by providing two user-friendly interfaces: map and reduce.
• It breaks the workload down into chunks that are processed across a cluster.
• A chunk does not need to wait for any other job to be completed first.
• Different key-value pairs are generated [4] [5].
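As an illustration of the map and reduce interfaces described above, the following Python sketch counts words across two chunks of input. It is a single-process toy rather than Hadoop code; the function names `map_phase`, `reduce_phase`, and `run_job` and the sample records are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (key, value) pair per word in a text record."""
    for word in record.split():
        yield (word.lower(), 1)


def reduce_phase(key, values):
    """Reduce: combine all values that share the same key."""
    return (key, sum(values))


def run_job(chunks):
    """Map every chunk, group the emitted pairs by key, then reduce."""
    grouped = defaultdict(list)
    # Each chunk is mapped independently of the others, which is what
    # lets a real cluster process chunks on different nodes in parallel.
    for chunk in chunks:
        for record in chunk:
            for key, value in map_phase(record):
                grouped[key].append(value)  # shuffle step: group by key
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, already split into two chunks.
chunks = [["hive on hadoop"], ["hive on orc"]]
counts = run_job(chunks)
print(counts)
```

The grouping dictionary stands in for the shuffle stage that sits between map and reduce in a real cluster.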
Vectorization
• The dataset is processed in batches of rows, and within each batch the values of a column are represented as a column vector.
• The batch size is configurable so that a complete batch of rows and columns fits in the processor's cache (the default is 1024 rows) [3].
• Vectorization works most efficiently [6] when it is applied to ORC-formatted tables within Hive.
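The difference between row-at-a-time and batch processing can be sketched in Python as follows. This is only a model of the data layout, under the assumption of a hypothetical table with `qty` and `price` columns; in pure Python the batch version is not faster, whereas a real engine gains from tight loops over primitive arrays that stay in cache.

```python
BATCH_SIZE = 1024  # default batch size mentioned in the slides [3]


def row_at_a_time(rows, threshold):
    """Evaluate the filter and the sum one row at a time."""
    total = 0
    for row in rows:
        if row["qty"] > threshold:
            total += row["price"]
    return total


def to_batches(rows, batch_size=BATCH_SIZE):
    """Split rows into batches and store each column as its own vector."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {
            "qty": [r["qty"] for r in chunk],
            "price": [r["price"] for r in chunk],
        }


def vectorized(rows, threshold):
    """Evaluate the filter over a whole column vector, then sum the matches."""
    total = 0
    for batch in to_batches(rows):
        # Step 1: the filter scans only the qty vector and records positions.
        selected = [i for i, q in enumerate(batch["qty"]) if q > threshold]
        # Step 2: the aggregate reads only the selected price entries.
        total += sum(batch["price"][i] for i in selected)
    return total


# Hypothetical data: 3000 rows, which yields three batches.
rows = [{"qty": i % 10, "price": i} for i in range(3000)]
print(row_at_a_time(rows, 5), vectorized(rows, 5))
```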
LITERATURE REVIEW (cont.)
Cost Based Optimization
It uses two types of statistics, stored in the Hive Metastore, to produce an optimal query plan:
• Table statistics
• Column statistics
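To show how statistics can drive plan selection, here is a toy cost model in Python that picks a join order from row counts and join selectivities. It is a sketch of the general idea only: the table names, row counts, selectivities, and the cost function (sum of intermediate result sizes) are all assumptions for illustration, and Hive's actual optimizer, which is built on Apache Calcite, is far more elaborate.

```python
from itertools import permutations

# Hypothetical table statistics: number of rows per table.
TABLE_ROWS = {"species": 400_000, "genus": 20_000, "family": 1_500}

# Hypothetical column statistics, reduced to a join selectivity per
# table pair (roughly 1 / number of distinct join-key values).
SELECTIVITY = {
    frozenset(("species", "genus")): 1 / 20_000,
    frozenset(("genus", "family")): 1 / 1_500,
}


def plan_cost(order):
    """Cost of a left-deep join order: sum of estimated intermediate sizes."""
    joined = [order[0]]
    rows = float(TABLE_ROWS[order[0]])
    cost = 0.0
    for table in order[1:]:
        selectivity = 1.0  # no join condition means a cross product
        for previous in joined:
            selectivity *= SELECTIVITY.get(frozenset((previous, table)), 1.0)
        rows = rows * TABLE_ROWS[table] * selectivity
        cost += rows
        joined.append(table)
    return cost


def best_plan(tables):
    """Enumerate all join orders and keep the cheapest one."""
    return min(permutations(tables), key=plan_cost)


best = best_plan(["species", "genus", "family"])
print(best, plan_cost(best))
```

With these made-up numbers the cheapest orders join the two small tables first and bring in the largest table last, which is the kind of decision that is not possible without statistics.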
Optimized Row Columnar File
ORC is an efficient way of storing all the kinds of data supported within Apache Hive.
• This format improves read, write, and processing operations, as well as the performance of complex jobs.
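A rough intuition for why a columnar format can be smaller and faster to scan is sketched below in Python: values of one column are stored together, so a query can read only the columns it needs, and repeated values compress well. The sample rows and the helper names are assumptions; real ORC files add stripes, indexes, and several encodings that this sketch does not model.

```python
def to_columnar(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


# Hypothetical rows, loosely modelled on a species table.
rows = [
    {"id": 1, "kingdom": "Animalia", "name": "Panthera leo"},
    {"id": 2, "kingdom": "Animalia", "name": "Canis lupus"},
    {"id": 3, "kingdom": "Plantae", "name": "Quercus robur"},
]

columns = to_columnar(rows)
# A query on "kingdom" only needs this one list, not every field of every row.
kingdom_runs = run_length_encode(columns["kingdom"])
print(kingdom_runs)
```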
RESEARCH METHODOLOGY
This research has been divided into five phases:
1. Identify efficient and effective query processing techniques that have been recently introduced in Apache Hive [9].
2. Select and load the dataset.
3. Design a set of 10 queries with different complexity levels [10].
4. Test all these queries on the selected set of Hive query optimization techniques.
5. Compare and summarize the results and suggest which technique worked best in most of the scenarios (queries).
Figure 1: Research Methodology
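For step 4, each technique has to be switched on before a query runs. The Python sketch below assembles a HiveQL script for a chosen combination. The property names (`hive.vectorized.execution.enabled`, `hive.cbo.enable`, `hive.compute.query.using.stats`, `hive.stats.fetch.column.stats`) are standard Hive settings, but mapping them to the techniques of this study is an assumption, as the slides do not list the exact configuration; the table names `species_text` and `species_orc` are hypothetical.

```python
# Assumed mapping from technique name to the HiveQL statements that enable it.
TECHNIQUE_SETTINGS = {
    "Vectorize": [
        "SET hive.vectorized.execution.enabled=true;",
    ],
    "CBO": [
        "SET hive.cbo.enable=true;",
        "SET hive.compute.query.using.stats=true;",
        "SET hive.stats.fetch.column.stats=true;",
    ],
    "ORC File": [
        # Hypothetical table names; copies the text table into ORC format.
        "CREATE TABLE species_orc STORED AS ORC AS SELECT * FROM species_text;",
    ],
}


def build_script(techniques, query):
    """Return the settings for each technique followed by the query."""
    lines = []
    for technique in techniques:
        lines.extend(TECHNIQUE_SETTINGS[technique])
    lines.append(query.rstrip(";") + ";")
    return "\n".join(lines)


script = build_script(["Vectorize", "CBO"], "SELECT COUNT(*) FROM species_orc")
print(script)
```

Keeping the settings in one table makes it easy to run the same query under every combination and compare the timings.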
EXPERIMENT
Dataset
The dataset selected for the analysis is taken from the DBpedia 2014 ontology release, which consists of 47 classes and 70+ GB of data. Only a single class, Species, is used in this research.
Design Test Queries
Ten queries were designed based on the benchmark [10] of interactive queries on text data.
Data Format     Size      # of Tables   # of Rows   # of Columns
Original Data   1.79 GB   26            376,859     1,059
Table 1: Details of the dataset used in the experiment

                  Original Data (Text file)   RCFile   ORCSerde
Data Size in GB   1.79                        1.50     0.38
Table 2: Change in the size of the dataset used for analysis across storage formats (in GB)
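The sizes in Table 2 can be turned into relative reductions with a few lines of Python. The dictionary keys and the helper name `reduction_pct` are illustrative; the three sizes are the ones reported in the table.

```python
# Dataset size in GB per storage format, as reported in Table 2.
SIZES_GB = {"text": 1.79, "rcfile": 1.50, "orc": 0.38}


def reduction_pct(original, converted):
    """Size reduction relative to the original, in percent."""
    return round((original - converted) / original * 100, 1)


rcfile_reduction = reduction_pct(SIZES_GB["text"], SIZES_GB["rcfile"])
orc_reduction = reduction_pct(SIZES_GB["text"], SIZES_GB["orc"])
print(rcfile_reduction, orc_reduction)
```

By this calculation the ORC copy is roughly 79% smaller than the original text data, against roughly 16% for the RC file copy.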
Figure 2: Details of the set of queries designed for analysis
Interactive queries:
• Simple queries: Q1, Q2, Q3
• Single join queries: Q4, Q5, Q8, Q9
• Nested join queries: Q6, Q7, Q10
EXPERIMENT (cont.)
Phase I Results:
A set of 10 queries was tested on the four core query optimization techniques in the Hive query execution engine.
Techniques        Q1      Q2      Q3      Q4      Q5      Q8      Q9      Q6      Q7     Q10 Avg. RT
Hive           264.5   122.8   141.1   174.7   267.6   316.6    83.4  5825.8  1103.3  6433.3  1473.3
MapReduce      139.7   126.9   112.3   202.9   325.7   152.3    26.6  4899.1  1050.4  4808.4  1184.4
ORC File        64.7    26.0     0.9   313.3   339.4    68.7    30.3  4923.6  1420.1  6348.8  1353.6
Vectorize      217.9   188.9   125.8   231.5   347.5   199.1   151.8  4646.3  1664.2  6718.9  1449.2
CBO            203.3   145.6   135.5   296.2   240.3   152.2    43.9  5186.1  1323.7  5119.6  1284.6
Simple queries: Q1-Q3; single join queries: Q4, Q5, Q8, Q9; nested join queries: Q6, Q7, Q10
Table 3: Response time (in seconds) of queries in Phase I
Avg. RT = Average Response Time
Figure 3: MapReduce responded most efficiently compared to the other techniques (bar chart of average response time in seconds: Hive 1473.3, MapReduce 1184.4, ORCFile 1353.6, Vectorize 1449.2, CBO 1284.6; MapReduce is marked as a 20% improvement)
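The Avg. RT column of Table 3 is the plain mean of the ten per-query response times, which can be checked with the short Python sketch below. The numbers are copied from Table 3; the variable names are illustrative.

```python
from statistics import mean

# Response times in seconds from Table 3, in the table's column order:
# Q1, Q2, Q3, Q4, Q5, Q8, Q9, Q6, Q7, Q10.
PHASE1 = {
    "Hive": [264.5, 122.8, 141.1, 174.7, 267.6, 316.6, 83.4, 5825.8, 1103.3, 6433.3],
    "MapReduce": [139.7, 126.9, 112.3, 202.9, 325.7, 152.3, 26.6, 4899.1, 1050.4, 4808.4],
    "ORC File": [64.7, 26.0, 0.9, 313.3, 339.4, 68.7, 30.3, 4923.6, 1420.1, 6348.8],
    "Vectorize": [217.9, 188.9, 125.8, 231.5, 347.5, 199.1, 151.8, 4646.3, 1664.2, 6718.9],
    "CBO": [203.3, 145.6, 135.5, 296.2, 240.3, 152.2, 43.9, 5186.1, 1323.7, 5119.6],
}

# Average response time per technique, rounded to one decimal as in the table.
avg_rt = {name: round(mean(times), 1) for name, times in PHASE1.items()}

# The technique with the lowest average response time.
fastest = min(avg_rt, key=avg_rt.get)
print(avg_rt, fastest)
```

Because the three nested join queries take thousands of seconds each, they dominate this mean, so the ranking mostly reflects nested join performance.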
EXPERIMENT (cont.)
Phase II Results:
The same set of queries was tested again on the ORC-formatted dataset, with ORC combined in turn with each of three techniques: MapReduce, CBO, and Vectorization.
Techniques                Q1      Q2      Q3      Q4      Q5      Q8      Q9      Q6      Q7     Q10 Avg. RT
Hive                   264.5   122.8   141.1   174.7   267.6   316.6    83.4    5826    1103    6433    1473
Vectorize+ORC File     179.4   5.748   1.068   402.8   241.4   54.12   35.48    4042    1333    6010    1230
CBO+ORC File           71.17   1.435   0.428     363     163   45.58   38.75    4605    1897    5990    1318
MR+ORC File            55.54   0.634   0.144   263.8   191.5   45.09   26.77    4075    1203    4779    1064
Simple queries: Q1-Q3; single join queries: Q4, Q5, Q8, Q9; nested join queries: Q6, Q7, Q10
Table 4: Response time (in seconds) of queries in Phase II
Avg. RT = Average Response Time
Figure 4: MapReduce + ORC responded most efficiently compared to the other combinations of techniques (bar chart of average response time in seconds; MapReduce + ORC is marked as a 28% improvement)
RESULT
Techniques           Simple  Single Join  Nested Join  Avg. Response Time  Improvement over Hive
Hive                  176.1        210.6       4454.1              1613.6                      -
MapReduce             126.3        176.9       3585.9              1296.4                    20%
ORCFile                30.5        188.0       4230.8              1483.1                     8%
Vectorize             177.5        232.5       4343.1              1584.4                     2%
CBO                   161.4        183.1       3876.5              1407.0                    13%
Vectorize+ORCFile      62.1        183.5       3794.8              1346.8                    17%
CBO+ORCFile            24.3        152.6       4164.0              1447.0                    11%
MR+ORCFile             18.8        131.8       3352.3              1167.6                    28%
Table 5: Most efficient (in terms of speed) technique in Hive for query processing (response times in seconds)
Figure 5: Most efficient query processing technique in Hive (highlighted improvement: 28%)
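The improvement figures in Table 5 can be recomputed from the average response times as the relative reduction against the Hive baseline. The Python sketch below does this with the averages copied from the table; the helper name `improvement_pct` is illustrative. The results line up with the reported figures when these are read as percentages, except that CBO+ORCFile comes out at about 10 rather than the reported 11.

```python
# Average response time in seconds per technique, from Table 5.
AVG_RT = {
    "Hive": 1613.6,
    "MapReduce": 1296.4,
    "ORCFile": 1483.1,
    "Vectorize": 1584.4,
    "CBO": 1407.0,
    "Vectorize+ORCFile": 1346.8,
    "CBO+ORCFile": 1447.0,
    "MR+ORCFile": 1167.6,
}


def improvement_pct(baseline, value):
    """Relative reduction in response time against the baseline, in percent."""
    return (baseline - value) / baseline * 100


baseline = AVG_RT["Hive"]
improvement = {
    name: round(improvement_pct(baseline, value))
    for name, value in AVG_RT.items()
    if name != "Hive"
}
print(improvement)
```

A 28% reduction in response time corresponds to a speedup factor of about 1.4, which is worth keeping in mind when comparing these results with benchmarks that report speedups as multiples.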
CONCLUSION
After applying different query optimization techniques and executing a set of queries on Apache Hive, this research concludes that:
• Applying the MapReduce technique to ORC-formatted data improves performance by about 20% to 28% with low latency, which makes this combination the most effective approach for optimizing interactive queries on this type of dataset.
• The most effective standalone technique is also MapReduce, which completed the Hive jobs about 20% faster on average than the default Hive execution engine.
FUTURE DIRECTION
This research can be carried forward by performing the same analysis on other big data analytics platforms such as MapR, Hortonworks, Microsoft Azure, and any other ecosystem that supports HiveQL.
REFERENCES
[1] Chen-Ching Liu,.: John Wiley & Sons, 2016, 01-Aug-2016, pp. 436-440.
[2] Reynold S. Xin, Josh Rosen, Matei Zaharia, "Shark: SQL and Rich Analytics at Scale," in SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, 2013.
[3] Yin Huai et al., "Major Technical Advancements in Apache Hive," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ACM, 2014.
[4] Sai Wu et al., "Query optimization for massively parallel data processing," in Proceedings of the 2nd ACM Symposium on Cloud Computing, ACM, 2011.
[5] Todd Lipcon. (2011, June) MapReduce. [Online]. https://wiki.apache.org/hadoop/MapReduce/
[6] Dominique A. Heger, "A Brief Introduction to Apache Tez," DHTechnologies & Data Nubes.
[7] Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard, Practical Hive: A Guide to Hadoop's Data Warehouse System, 1st ed., Apress, 2016.
[8] Ofer Mendelevitch. (2015, February) 5 Ways to Make Your Hive Queries Run Faster. [Online]. http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
[9] Apache Hive TM. (2016, June) Index of /dist/hive/hive-2.1.0. [Online]. http://www-us.apache.org/dist/hive/hive-2.1.0/
[10] Carter Shanklin. (2014, June) Benchmarking Apache Hive 13 for Enterprise Hadoop. [Online]. http://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/

 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 

Query optimization techniques in Apache Hive

  • 1. Query Optimization Techniques in Hive: A Comparative Analysis Presenter: Zara Tariq - 1573119 Supervisor: Mr. Rahim Hasnani MSCS Batch 2015 | Independent Study
  • 2. INTRODUCTION
      • Today, organizations of all sizes and types use Apache Hive to process and access their data in Hadoop in a familiar way, either through a broad set of BI (business intelligence) tools or directly through this proven interface. [1]
      • Hive is being adopted by a growing range of organizations [2], so there is a need for more efficient and flexible techniques (within Hive) to optimize query performance.
      • Query optimization depends directly on infrastructure, data size, data organization, storage formats, and the data readers/processors.
  • 3. PROBLEM STATEMENT
      Many query optimization techniques have recently been introduced in Hive, but it is still unclear which technique is the most effective, that is, which gives the shortest query processing time with the lowest latency.
      Objective: To determine which technique, or combination of techniques, in Apache Hive is the most efficient, in terms of low latency and low response time when running queries on large datasets, among the recent technical advancements in Hive query processing engines [3].
      Hypothesis: Cost-Based Optimization (CBO) optimizes each job's execution plan physically and logically before submitting it for final execution.
  • 4. PURPOSE OF STUDY
      The purpose of this research is to conduct a comparative analysis of several efficient query optimization techniques and to identify the most efficient among them in the Apache Hive query execution engine. The techniques selected for this study are:
      1. MapReduce
      2. ORCFile (Optimized Row Columnar File)
      3. Vectorization for query optimization model
      4. Cost-Based Optimization
      SIGNIFICANCE OF RESEARCH
      This study provides a practical comparison of several efficient techniques that were introduced theoretically in recent advancements in Apache Hive [3] (2015). The comparison can help data analysts improve response time and reduce job load overhead by applying these techniques as everyday best practice.
  • 5. LITERATURE REVIEW
      MapReduce
      • MapReduce simplifies the deployment of Massively Parallel Processing systems by providing two user-friendly interfaces: map and reduce.
      • It breaks the workload down into chunks distributed across a cluster.
      • There is no need to wait for any other job to complete first.
      • Different key-value pairs are generated [4] [5].
      Vectorization
      • The dataset is represented in batches of rows, and the values of each column are represented as a column vector.
      • The batch size is configurable so that a batch fits in the processor's cache (the default is 1024 rows) [3].
      • Vectorization works more efficiently [6] when it is applied to ORC-formatted tables in Hive.
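The map and reduce interfaces described on this slide can be illustrated with a minimal single-process Python sketch. This is not Hadoop or Hive code: the function names `map_phase`, `shuffle` and `reduce_phase` and the sample records are hypothetical, chosen only to show how each chunk emits key-value pairs independently and how the pairs are then merged per key.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit one (key, 1) pair per record; each chunk is processed independently."""
    return [(record, 1) for record in chunk]

def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Merge the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input, already split into two chunks (one per worker).
chunks = [["plant", "animal", "plant"], ["fungus", "animal", "plant"]]

pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

On a real cluster the chunks would be processed on different nodes and the grouped pairs would travel over the network; the sketch keeps only the data flow.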
  • 6. LITERATURE REVIEW
      Cost Based Optimization
      CBO uses two types of statistics, stored in the Hive Metastore, to produce an optimal query plan:
      • Table stats
      • Column stats
      Optimized Row Columnar File
      ORC is an efficient way of storing all sorts of data supported within Apache Hive.
      • This technique improves read, write, and processing operations, and the performance of complex jobs.
  • 7. RESEARCH METHODOLOGY
      This research is divided into five phases:
      1. Identify efficient and effective query processing techniques that have recently been introduced in Apache Hive [9].
      2. Select and load the dataset.
      3. Design a set of 10 queries with different complexity levels [10].
      4. Test all these queries on the selected Hive query optimization techniques.
      5. Compare and summarize the results, and suggest which technique worked best in most scenarios (queries).
      Figure 1: Research Methodology
  • 8. EXPERIMENT
      Dataset
      The dataset selected for the analysis is taken from the DBPedia ontology released in 2014, which consists of 47 classes and 70+ GB of data. Only a single class, named Species, is used in this research.
      Design Test Queries
      Ten queries were designed, following the benchmark [10] of interactive queries, on text data.

      Data Format    Size     # of Tables  # of Rows  # of Columns
      Original Data  1.79 GB  26           3,76,859   1,059
      Table 1: Details of the dataset used in the experiment

                       Original Data (Text file)  RCF File  ORCSerde
      Data Size in GB  1.79                       1.50      0.38
      Table 2: Change in size of the dataset used for analysis (in GB)
  • 9. Figure 2: Details of the set of queries designed for analysis. Interactive queries: Simple Queries (Q1, Q2, Q3); Single Join Queries (Q4, Q5, Q8, Q9); Nested Join Queries (Q6, Q7, Q10).
  • 10. EXPERIMENT (cont.)
      Phase I Results: The set of 10 queries was tested on the four core query optimization techniques in the Hive query execution engine.

      Techniques  Q1     Q2     Q3     Q4     Q5     Q8     Q9     Q6      Q7      Q10     Avg. RT
      Hive        264.5  122.8  141.1  174.7  267.6  316.6  83.4   5825.8  1103.3  6433.3  1473.3
      MapReduce   139.7  126.9  112.3  202.9  325.7  152.3  26.6   4899.1  1050.4  4808.4  1184.4
      ORC File    64.7   26.0   0.9    313.3  339.4  68.7   30.3   4923.6  1420.1  6348.8  1353.6
      Vectorize   217.9  188.9  125.8  231.5  347.5  199.1  151.8  4646.3  1664.2  6718.9  1449.2
      CBO         203.3  145.6  135.5  296.2  240.3  152.2  43.9   5186.1  1323.7  5119.6  1284.6
      Query groups: Simple Queries (Q1, Q2, Q3); Single Join Queries (Q4, Q5, Q8, Q9); Nested Join Queries (Q6, Q7, Q10)
      Table 3: Response time of queries at Phase I (slide callout: 20%). Avg. RT = Average Response Time
  • 11. Figure 3: MapReduce responded most efficiently compared with the other techniques. [Bar chart of average response time in seconds: Hive 1473.3, MapReduce 1184.4, ORCFile 1353.6, Vectorize 1449.2, CBO 1284.6]
  • 12. EXPERIMENT (cont.)
      Phase II Results: The same set of queries was tested again on the ORC File formatted dataset with three techniques, MapReduce, CBO, and Vectorization, each in combination with ORC File.

      Techniques          Q1     Q2     Q3     Q4     Q5     Q8     Q9     Q6    Q7    Q10   Avg. RT
      Hive                264.5  122.8  141.1  174.7  267.6  316.6  83.4   5826  1103  6433  1473
      Vectorize+ORC File  179.4  5.748  1.068  402.8  241.4  54.12  35.48  4042  1333  6010  1230
      CBO+ORC File        71.17  1.435  0.428  363    163    45.58  38.75  4605  1897  5990  1318
      MR+ORC File         55.54  0.634  0.144  263.8  191.5  45.09  26.77  4075  1203  4779  1064
      Query groups: Simple Queries (Q1, Q2, Q3); Single Join Queries (Q4, Q5, Q8, Q9); Nested Join Queries (Q6, Q7, Q10)
      Table 4: Response time of queries at Phase II (slide callout: 28%). Avg. RT = Average Response Time
  • 13. Figure 4: MapReduce + ORC responded most efficiently compared with the other combinations of techniques. [Bar chart of average response time in seconds; data labels: 1473.3, 1230.5, 1314.5, 1064, 1284.6]
  • 14. RESULT

      Techniques         Simple  Single Join  Nested Join  Avg. Response Time  Improved Performance of Hive
      Hive               176.1   210.6        4454.1       1613.6              -
      MapReduce          126.3   176.9        3585.9       1296.4              20X
      ORCFile            30.5    188.0        4230.8       1483.1              8X
      Vectorize          177.5   232.5        4343.1       1584.4              2X
      CBO                161.4   183.1        3876.5       1407.0              13X
      Vectorize+ORCFile  62.1    183.5        3794.8       1346.8              17X
      CBO+ORCFile        24.3    152.6        4164.0       1447.0              11X
      MR+ORCFile         18.8    131.8        3352.3       1167.6              28X
      Table 5: Most efficient (in terms of speed) technique in Hive for query processing
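As a check on how the "Improved Performance of Hive" column relates to the averages in Table 5, the short Python sketch below computes the percentage reduction in average response time relative to default Hive. The averages are copied from Table 5; the helper name `percent_reduction` is only illustrative. The results come out close to the numbers in that column, which suggests (as an inference from the data, not a statement by the author) that the "X" values are percentage improvements rather than multiplicative speed-ups.

```python
# Average response times copied from Table 5.
avg_rt = {
    "Hive": 1613.6,
    "MapReduce": 1296.4,
    "ORCFile": 1483.1,
    "Vectorize": 1584.4,
    "CBO": 1407.0,
    "Vectorize+ORCFile": 1346.8,
    "CBO+ORCFile": 1447.0,
    "MR+ORCFile": 1167.6,
}

def percent_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0

baseline = avg_rt["Hive"]
reductions = {
    name: round(percent_reduction(baseline, rt), 1)
    for name, rt in avg_rt.items()
    if name != "Hive"
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% lower average response time than Hive")
```

For comparison, a literal 28-fold speed-up over the Hive average of 1613.6 would require an average response time of roughly 58, far below the 1167.6 reported for MR+ORCFile.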
  • 15. Figure 5: Most Efficient Query Processing Technique in Hive (slide callout: 28%)
  • 16. CONCLUSION
      After applying different query optimization techniques and executing a set of queries on Apache Hive, this research concludes that:
      • Applying the MapReduce technique to ORCFile-formatted data improved performance by roughly 20% to 28% (reported as 20X to 28X in Table 5) with low latency, which makes this combination the most effective and well-organized approach to optimizing interactive queries on this type of dataset.
      • The most effective independent technique is also MapReduce itself, which completed the Hive jobs with an average response time about 20% lower (20X in Table 5) than the default Hive execution engine.
      FUTURE DIRECTION
      This research can be carried forward by performing the same analysis on other big data analytics platforms such as MapR, Hortonworks, Microsoft Azure, and other ecosystems that support HiveQL.
  • 17. REFERENCES
      [1] Chen-Ching Liu,.: John Wiley & Sons, 2016, 01-Aug-2016, pp. 436-440.
      [2] Reynold S. Xin, Josh Rosen, Matei Zaharia, "Shark: SQL and Rich Analytics at Scale," in SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, 2013.
      [3] Yin Huai et al., "Major Technical Advancements in Apache Hive," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ACM, 2014.
      [4] Sai Wu et al., "Query optimization for massively parallel data processing," in Proceedings of the 2nd ACM Symposium on Cloud Computing, ACM, 2011.
      [5] Todd Lipcon. (2011, June) MapReduce. [Online]. Https://wiki.apache.org/hadoop/MapReduce/
      [6] Dominique A. Heger, "A Brief Introduction to Apache Tez," in DHTechnologies & Data Nubes.
      [7] Scott Shaw, Andreas François Vermeulen, Ankur Gupta, David Kjerrumgaard, Practical Hive: A Guide to Hadoop's Data Warehouse System, 1st ed.: Apress, 2016.
      [8] Ofer Mendelevitch. (2015, February) 5 WAYS TO MAKE YOUR HIVE QUERIES RUN FASTER. [Online]. http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
      [9] Apache Hive TM. (2016, June) Index of /dist/hive/hive-2.1.0. [Online]. http://www-us.apache.org/dist/hive/hive-2.1.0/
      [10] Carter Shanklin. (2014, June) BENCHMARKING APACHE HIVE 13 FOR ENTERPRISE HADOOP. [Online]. http://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/

Editor's Notes

  1. The Hive query execution engine constantly strives to satisfy its users by meeting their need to process progressively larger volumes of data in an efficient and scalable way [18]. This opens up thousands of possibilities in terms of query optimization, but this research focuses on only some of the techniques in Hive.
  2. LIMITATION of my Research: This research was conducted on Apache Hadoop on the Cloudera QuickStart VM (single-node cluster). Specifications of the machine: CentOS 6.7; Cloudera Enterprise 6.2.0; Hadoop 2.6.0 CDH 5.8.0; 4 GB RAM; VM Workstation 12 Player; 64-bit Windows OS; Hive 1.1.0-cdh5.8.0.
  3. The MapReduce technique is used in the Hive engine to manage very large workloads and to reduce processing time by distributing complex query tasks across multiple nodes simultaneously [6]. ORCFile (Optimized Row Columnar File) is a recently introduced file format that records data in a columnar storage organization, which provides efficient data access and high storage efficiency with low overhead. Vectorization, a query execution model introduced in the recent advancements of Hive, can shorten response time and reduce runtime considerably by making better use of modern CPUs. SIGNIFICANCE OF RESEARCH: This research will be beneficial for data analysts who currently use Apache Hive in their organizations for data analytics and processing. It also provides a practical comparison of several efficient techniques, which can help data analysts improve response time and reduce job load overhead by applying them as everyday best practice.
  4. MapReduce: MapReduce (MR) has been widely recognized as an efficient tool for large-scale data analysis. It simplifies the deployment of Massively Parallel Processing systems by providing two user-friendly interfaces: map and reduce. It breaks the workload down into chunks distributed across a cluster. A task does not need to wait for any other job to complete first, and the algorithm introduces no data dependency between tasks. Different key-value pairs are generated and can be saved on the distributed filesystem, where the reducers then merge the results at the end. Use of Vectorization: Vectorization is an execution model for query optimization. The dataset is represented in batches of rows, and the values of each column are represented as a column vector. The batch size is configurable so that a batch fits in the processor's cache (the default is 1024 rows) [3]. This technique lowers job latency by fetching 1024 rows at a time, which optimizes container usage. Vectorization works more efficiently [6] when applied to ORC-formatted tables in Hive, because it reduces CPU time and improves cluster utilization.
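The batch-of-1024-rows model described in this note can be sketched in plain Python. This only illustrates the data layout (columns handled as vectors, one batch at a time); it is not Hive's vectorized execution engine, and interpreted Python does not reproduce the CPU-level gains. The names `batches` and `vectorized_sum_where`, and the sample data, are hypothetical.

```python
BATCH_SIZE = 1024  # default batch size mentioned in the note

def batches(column, size=BATCH_SIZE):
    """Yield consecutive column vectors holding at most `size` values."""
    for start in range(0, len(column), size):
        yield column[start:start + size]

def vectorized_sum_where(values, flags):
    """Sum the entries of `values` whose flag is True, one batch at a time."""
    total = 0
    n_batches = 0
    for value_vec, flag_vec in zip(batches(values), batches(flags)):
        # The filter and the aggregate each run over a whole column vector.
        total += sum(v for v, keep in zip(value_vec, flag_vec) if keep)
        n_batches += 1
    return total, n_batches

# Hypothetical table with two columns and 3000 rows.
values = list(range(3000))
flags = [v % 2 == 0 for v in values]

total, n_batches = vectorized_sum_where(values, flags)
print(total, n_batches)
```

With 3000 rows the loop body runs three times (two full batches and one partial batch) instead of once per row, which is the property the note attributes to vectorization.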
  5. CBO: This feature is borrowed from RDBMS systems. It is most useful for two types of queries: queries with multiple joins and queries with more than one step of operations. Table stats: contain the size of a table without compressing the data. Column stats: contain the number of distinct values in the data, used to optimize the cost of querying big data. ORC: A highly efficient way to store table data in Hive. It decreases the response time of query processing [8].
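To make the role of table and column statistics in this note concrete, the sketch below applies a common textbook estimate for the size of an equi-join, rows(A) * rows(B) / max(ndv(A.key), ndv(B.key)), and picks the join with the smaller estimated result. It is an assumption-laden illustration and not Hive's actual cost model. Only the 376,859 row count comes from Table 1; the other table names, row counts, and distinct-value counts are invented for the example.

```python
def estimate_join_rows(rows_a, ndv_a, rows_b, ndv_b):
    """Textbook equi-join estimate: rows_a * rows_b / max(ndv_a, ndv_b)."""
    return rows_a * rows_b / max(ndv_a, ndv_b)

# (rows_a, ndv of the join key in A, rows_b, ndv of the join key in B)
# The species row count is taken from Table 1; everything else is hypothetical.
candidate_joins = {
    "species JOIN genus": (376_859, 50_000, 50_000, 50_000),
    "species JOIN sightings": (376_859, 376_859, 5_000_000, 300_000),
}

estimates = {
    name: estimate_join_rows(*stats) for name, stats in candidate_joins.items()
}

# A cost-based optimizer would prefer to run the join with the smaller
# estimated intermediate result first.
cheapest = min(estimates, key=estimates.get)
print(estimates)
print("join first:", cheapest)
```

Without the statistics, both joins look the same to the planner, which is why collecting table and column stats matters most for multi-join queries.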
  6. Design Test Queries: Ten queries were designed, following the benchmark [10] of interactive queries, on text data. The data was then converted to ORC so that the same queries could be tested on data stored in the ORC file format, which reduced the storage size by up to 78%.
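The 78% figure in this note can be reproduced from the sizes in Table 2. The sketch below is plain arithmetic; the helper name `size_reduction_pct` is illustrative only.

```python
# Data sizes in GB, copied from Table 2.
sizes_gb = {"Text file": 1.79, "RCF File": 1.50, "ORCSerde": 0.38}

def size_reduction_pct(original, converted):
    """Percentage of storage saved when going from `original` to `converted`."""
    return (original - converted) / original * 100.0

original = sizes_gb["Text file"]
savings = {
    fmt: round(size_reduction_pct(original, size), 1)
    for fmt, size in sizes_gb.items()
    if fmt != "Text file"
}
print(savings)
```

The ORC figure comes out at a little under 79%, consistent with the "up to 78%" stated in the note, while the RCF File format saves about 16%.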
  7. CONCLUSION: After applying different query optimization techniques and executing a set of queries on Apache Hive, this research concludes the following. Applying the MapReduce technique to ORCFile-formatted data improved performance by roughly 20% to 28% (reported as 20X to 28X in Table 5) with low latency, which makes this combination the most effective and well-organized approach to optimizing interactive queries on this type of dataset. The most effective independent technique is also MapReduce itself, which completed the Hive jobs with an average response time about 20% lower (20X in Table 5) than the default Hive execution engine. Because of time constraints and the limitations mentioned above, this research was conducted solely on the CDH 5.8 image of the Cloudera ecosystem for the VMware platform. It can be carried forward by performing the same analysis on other big data analytics platforms such as MapR, Hortonworks, Microsoft Azure, and other ecosystems that support HiveQL.