*Author for correspondence
Indian Journal of Science and Technology, Vol 10(18), DOI: 10.17485/ijst/2017/v10i18/111370, May 2017
ISSN (Print) : 0974-6846
ISSN (Online) : 0974-5645
Optimization of Multiple Correlated Queries by
Detecting Similar Data Source with Hadoop/Hive
Puneet Kansal*, Safdar Tanweer and Kavita Bhatnagar
Jamia Hamdard (Hamdard University), Hamdard Nagar, New Delhi – 110 062, India; Puneetkansal.iaf@gmail.com,
safdartanweer@yahoo.com, bhatnagarkavita@yahoo.in
Keywords: DBGEN, Hive Query Language (HiveQL), Online Analytical Processing (OLAP), Transaction Processing
Performance Council (TPC), TPC-H Queries
Abstract
Objectives: To generate a new single Hive query (HiveQL) by detecting similar operations and a common data source in two or more input queries, and to compare the total execution time of the merged query against that of the original queries. Methods/Statistical Analysis: The MapReduce concept of Hadoop Hive is utilized in this paper. A new single query is generated from two or more input queries, and three data samples of 2, 5 and 10 GB are generated using the free database generation tool DBGEN. TPC-H queries are executed on this data, and the total execution times of the original and merged queries are compared to evaluate performance. Findings: Hive executes a single query at a time, so in this research multiple queries are provided to Hive by converting them into a single query. This approach reduces the number of operations performed while executing the query, which in turn reduces the execution time and improves the performance of Hive. Hive processes the structured data of a data warehouse system, so with this approach structured data can be processed and analyzed in an easy and convenient manner. Structured data is used for processing OLAP (Online Analytical Processing) queries, so Hive also helps to process OLAP queries. Hive works in conjunction with Hadoop and executes queries on data stored in Hadoop; therefore Hadoop must be running on the system before Hive queries can be used. This research requires a large amount of data for testing, so sample data is generated using DBGEN, a free data generation tool provided by TPC (Transaction Processing Performance Council). TPC also provides different types of queries for testing the performance of query execution tools, so TPC-H queries are used in this research. Application/Improvements: Using the approach shown in this research, the total execution time of Hive queries can be reduced drastically and the performance of Hive can be increased.
1. Introduction
Nowadays everybody is generating or collecting data from various sources in different forms, such as PDF, MP3, movie, document, spreadsheet and image files, and storing it at unprecedented rates. Due to this huge amount of data, traditional database management systems and currently available processing tools are unable to process it, or take a very long time to produce meaningful output. According to Gartner, the term Big Data comprises 3 V's, that is, Volume, Velocity and Variety1. Additionally, there are more V's in the concept: Visualization, Veracity (reliability), Variability and Value2. This huge amount of data, or big data, can be collected, stored, processed and analyzed using various approaches and tools available in the market. Hadoop is open source software based on a Java programming framework. It uses the MapReduce algorithm for processing large data sets. Hadoop was initially developed at Yahoo in 2006. It has the capability to process huge data sets, in the range of petabytes, zettabytes or more, stored on a cluster of interconnected commodity computers3. Large data sets that require distributed processing are handled by the Hadoop software library framework4. It is a highly scalable storage platform, designed to process very large data sets across thousands of computing nodes that operate in parallel, and it provides a cost-effective storage solution for large data volumes with no formal requirements, using the MapReduce processing technique.
The MapReduce algorithm performs two important steps for processing large data sets, viz. Map and Reduce. In the first step (Map), it splits the large data set into smaller parts and converts them into (key, value) pairs; in the second step (Reduce), the output of the first step becomes the input and the reduce operation is performed5-7. These systems are extensively used for analytics and data warehousing, either directly or through the use of a high-level query language that is compiled down to a parallel dataflow graph for execution3. Data warehousing is a centralized repository of integrated data from various sources. The data stored in a data warehouse is historical in nature and is used to create analytical reports for the enterprise. Data warehousing is used to process OLAP queries and is accompanied by a batch-oriented query workload that is relatively static. Data warehousing applications are non-interactive in nature, which provides freedom to reorder and optimize queries to improve overall performance.
A MapReduce task on Hadoop uses three functions, viz. mapping, shuffling and reducing. The input data format that Hadoop accepts is the (key, value) pair, which is provided by the Hadoop Distributed File System (HDFS). As Hadoop uses the MapReduce framework, the input is assigned to a mapper in the form of (key, value) pairs. The mapper then processes the input and produces a set of (key, value) pairs as output. The mapper's output (key, value) pairs become the input for the next step, which is reduce. Before the reduce operation, the set of (key, value) pairs goes through the shuffle phase: pairs with key k1 are assigned to the k1 reducer, pairs with key k2 to the k2 reducer, and so on, as shown in Figure 1. Finally, the reducer performs an aggregation operation by grouping the values for each key (k1, k2, etc.) and produces the output. Hadoop does not perform any optimization or global analysis to share work between similar jobs, because Map-Reduce operations are expressed as user-defined functions.
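The map, shuffle and reduce phases described above can be sketched in miniature in Python (a simplified word-count simulation of our own, not Hadoop's actual API; all function names here are illustrative):

```python
from collections import defaultdict

def map_phase(records):
    """Map: split each input record into (key, value) pairs."""
    pairs = []
    for record in records:
        for word in record.split():
            pairs.append((word, 1))
    return pairs

def shuffle_phase(pairs):
    """Shuffle: group all pairs with the same key for one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

records = ["big data", "big cluster"]
result = reduce_phase(shuffle_phase(map_phase(records)))
# result == {"big": 2, "data": 1, "cluster": 1}
```

Each phase only sees (key, value) pairs, which is why Hadoop cannot, by itself, recognize that two user-defined jobs scan the same data.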
Optimization in Hadoop can be achieved by using Hive or Pig as a declarative query language8. Both perform OLAP operations. The difference between Hive and Pig is that Pig supports structured as well as unstructured data, while Hive can process only structured data. Structured data is generally used in a data warehouse. Pig is less suited to data warehousing because its ability to handle both structured and unstructured data makes it most suitable for streaming data. Data warehouses are not updated frequently, and a fast response time in processing queries is not required9. In this paper, the Hive framework is chosen for the experiment for the following reasons:
•	 Hive processes structured data, which is the basis of a data warehouse system.
•	 The applications available in a data warehouse system are developed using SQL-based frameworks such as RDBMSs, which are also supported by Hive.
Based on the above reasons, it is very easy for a user to migrate a data warehouse application to a Hadoop application. Hive executes each query independently, i.e. it supports the "query-at-a-time" model. Due to this, it is possible to optimize query execution by applying optimization techniques10,11. Two or more queries can be converted into a single query by detecting correlation and a similar data source, and execution of this single query drastically reduces the total execution time.
Figure 1. MapReduce tasks.
1.1 HIVE
Hive12,13 is an open source SQL-like interface that works on a distributed data warehouse system. Hive resides on top of the Hadoop framework and provides execution of schema-based structured data queries, as shown in Figure 2. Hive cannot run independently; it requires Hadoop to be running first. Hive stores its metadata (tables and partitions) in a database such as Derby, MySQL or Oracle. It facilitates reading, writing and managing large datasets residing in distributed storage using SQL, and provides data summarization, query and analysis. It provides a command line tool and a JDBC driver to connect users to the Hive console (refer to Figure 2). Facebook, having a huge amount of data (exabytes, petabytes) on its servers, developed Hive to manage it; some time later the Apache Software Foundation adopted it, made it open source and declared it Apache Hive. As Hive is open source, many other companies, such as Netflix and Amazon, are also developing it. Hive converts SQL queries into MapReduce and uses HiveQL for querying the data stored in a Hadoop cluster. In terms of storage, Hive can use any file system supported by Hadoop, although HDFS is by far the most common. Hive provides its own query language, HiveQL, which is similar to SQL, for querying data on a Hadoop cluster. It can manage data in HDFS and run jobs in MapReduce without translating the queries into Java. While executing MapReduce jobs, Hive does not generate Java MapReduce programs; instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the "job plan". Hive queries are translated to a graph of Hadoop MapReduce jobs that are executed on the Hadoop grid. Hive Query Language (HQL) is based on SQL, and many of its constructs are familiar from SQL. The concept of storing databases, tables and data in Hive is similar to an RDBMS. As in SQL, Hive stores "tables" in a container called a "database". The data is stored in a table according to the "schema" defined by the user; the schema is defined in the form of rows and columns. Hive is not suitable for executing real-time queries because it processes OLAP (Online Analytical Processing) queries.
1.1.1 HiveQL
The query language for executing queries in Hive is known as HiveQL; it is used to organize and query the data stored in Hive. The current version of HiveQL provides the ability to CREATE and DROP tables and partitions. In Hive, a table can be split into multiple parts on a specified partition key, and queried with SELECT statements. HiveQL does not yet support UPDATE and DELETE functionality.
Figure 2. Hive external architecture.
The following are the most important functionalities supported by SELECT statements in HiveQL:
•	 Tables can be joined on a common key,
•	 Data can be filtered using row selection techniques,
•	 Columns can be projected.
The above functionalities are similar to those provided to the user in an RDBMS (relational database management system). The following example shows how a SELECT statement looks in HiveQL:
SELECT o1.o1_orderkey, o1.o1_custkey, c1.c1_custkey
FROM customer c1
JOIN orders o1 ON c1.c1_custkey = o1.o1_custkey
JOIN lineitem l1 ON o1.o1_orderkey = l1.l1_orderkey;
In this example, a simple join between the tables customer, orders and lineitem is performed on their respective join keys, custkey and orderkey. The query also shows projections in the first part of the statement, which are pushed down to the table scan level by the framework. Multiple HiveQL queries can also be executed in one single operation in Hive; this is known as parallelizing execution in Hive. The example also shows that HiveQL statements over multiple tables require specific columns on which those tables are joined: Hive does not support cross products, which are commonly supported in database management systems such as DB2 or Oracle.
1.1.2 Hive Architecture
Various components of Hive's internal architecture are depicted in Figure 3, and their introduction is appended below:
User Interface: This is the first layer of the Hive data warehouse infrastructure. Its purpose is to allow users to submit queries and other operations to the system. Hive provides different mechanisms for user interaction, viz.
•	 Web UI,
•	 Hive command line, and
•	 HDInsight.
Metastore: This is an object store backed by a database. The purpose of this database is to store the schema or metadata for tables. It also stores information about databases, columns and their types, and HDFS mappings13. There are two ways in which the Metastore can be configured:
•	 Remote
•	 Embedded
HiveQL Process Engine: HiveQL is a query language similar to SQL. Its syntax for creating, editing, deleting and querying tables closely follows SQL. Customized MapReduce scripts can also be embedded by the user in HiveQL.
Execution Engine: This works in conjunction with the HiveQL process engine and MapReduce. Its purpose is to execute the plan created by the compiler, which is in the form of DAG stages, and it provides the connectivity between Hive and Hadoop needed to execute a query. The results it produces are the same as those produced by MapReduce directly.
Figure 3. Hive internal architecture.
2. TPC (Transaction Processing
Performance Council)
TPC is a nonprofit organization founded in 1988. It provides different types of benchmarks, such as TPC-H, TPC-C and TPC-E, shown in Figure 4, to define the performance of transaction processing. Performance is measured on the basis of the speed and accuracy of results: speed is measured as the number of queries executed per second or minute, and accuracy is expressed as a percentage.
2.1 TPC-H14
TPC-H consists of various business-oriented queries, and data modification by the user is also permitted. The database can be populated by the user from 2 GB up to terabytes; more data gives a more meaningful performance result, and a minimum of 100 GB is considered a good database size for running TPC queries. In this paper, TPC-H queries are merged based on their correlation and similar data source, and executed on different data sizes. The execution times of the queries on the different database sizes are compared to check the performance of Hive query execution. TPC-H supports a performance metric called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size); the TPC-H Price/Performance metric is expressed as $/QphH@Size.
Figure 4. Different benchmarks of TPC.
3. DBGEN
DBGEN is used with TPC-H to generate the database15. It can generate databases from gigabytes to terabytes; generation takes from minutes to days or weeks, with larger databases taking longer. Generally, a minimum of 100 GB of data is considered good for executing TPC-H queries. The tool is written in ANSI C for portability and supports various systems. Its source code is also provided, which makes the process of building a compliant database population as simple as possible.
For multiple query optimization, different optimization techniques have been experimentally evaluated16. The authors proposed "the concept of SharedHive for improving the performance of execution of queries"16, and experimentally showed that using the "SharedHive concept the total execution time is significantly reduced for query execution". In17, it was experimentally shown that the execution time of a query can be drastically reduced if correlated queries are detected before execution. In this manuscript, it is experimentally shown that if correlated queries sharing a similar data source are combined into a new query and then executed, the execution time is drastically reduced. The reduction depends on the number of correlated queries that share a similar data source: the more such queries, the greater the reduction in time; the fewer such queries, the smaller the reduction.
4. Proposed Model/Solution
Here, brief information is provided about the architecture of the proposed Hive (Correlated Query Optimizer Hive), a modified version of Hadoop Hive with a new Multiple Query Optimization (MQO) component, shown in Figure 5.
Figure 5. Proposed hive architecture.
All queries entered into Hive, either through the Command Line Interface or the Web Interface, pass through a layer called the "Multiple Query Optimizer". The work of this layer is to examine the input queries for similar operations and the same data source; if these are found, it generates a single HiveQL query. If it does not find any similarity or correlation between an input query and the other queries, that query is appended at the end. The generated correlated query is passed as input to the next layer, "compiler-optimizer-executor". The Metastore stores the relations, attributes and partitions of the RDBMSs. The execution control of HiveQL queries is maintained by the Driver component. All input queries pass through the proposed "Multiple Query Optimizer" layer, whose job is to check all input queries for similar types of operation and the same data source. If this layer finds such a relation between the input queries, it creates a new single query; if it finds no similarity, it does not create one. The layer scans all input queries one by one and keeps appending similar operations into the new single query for as long as it finds them. When the proposed layer finishes this job and generates the single query, the query passes to the next layer, the "Driver". The Driver layer consists of three sub-parts, viz. compiler, optimizer and executor. The purpose of the Driver layer is session management; the input query is sent to the compiler, which receives metadata from the Metastore and creates an execution plan, and finally this execution plan is executed by the execution engine. In a data warehouse, tables and partitions have structure information such as row and column details, column types, and relations between columns and attributes; all this information is stored in the Metastore.
5. Processing of the Query in the
Proposed System
When a Hive query is submitted to the system, either through the Command Line Interface, the web user interface or the Thrift interface, it is scanned by the proposed system for any similar operations on the same data source. After scanning, a new single query is generated based on the similarity of operations on the same data source. This merged query is submitted to the Driver component for further execution. In the Driver layer, the query passes through various stages, viz. compiler, optimizer and executor, and the execution plan of the query and its structure information are created. This execution plan is then executed by the execution engine. The compiler creates the logical execution plan of the query, and a directed acyclic graph (DAG) is produced from the HiveQL. The DAG defines the MapReduce tasks to be executed by the execution engine. In this system, a new layer called the "Multiple Query Optimizer" is proposed, and integrating this layer into the existing system does not require any major changes, so the proposed system is easy to use and integrate.
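A minimal sketch of what such an optimizer layer might do, assuming single-table FROM clauses and using only simple string matching (the function names, regular expressions and sample queries are our own illustration, not the actual implementation):

```python
import re
from collections import defaultdict

def data_source(query):
    """Extract the table named in a single-table FROM clause."""
    match = re.search(r"\bFROM\s+(\w+)", query, re.IGNORECASE)
    return match.group(1).lower() if match else None

def merge_queries(queries):
    """Group input queries by shared data source and rewrite each
    shared group as one Hive multi-insert (FROM t INSERT ... INSERT ...)."""
    groups = defaultdict(list)
    for q in queries:
        groups[data_source(q)].append(q)
    merged = []
    for table, group in groups.items():
        if len(group) == 1:
            merged.append(group[0])  # no correlation found: keep as-is
        else:
            # drop each query's own FROM clause; the shared one is emitted once
            bodies = [re.sub(r"\bFROM\s+\w+\s*", "", q, flags=re.IGNORECASE).strip()
                      for q in group]
            merged.append("FROM " + table + "\n" + "\n".join(bodies))
    return merged

q1 = "INSERT OVERWRITE TABLE r1 SELECT sum(l1_quantity) FROM lineitem WHERE l1_quantity < 26"
q6 = "INSERT OVERWRITE TABLE r2 SELECT sum(l1_discount) FROM lineitem WHERE l1_discount > 0.05"
print(merge_queries([q1, q6])[0])
```

Both sample queries read lineitem, so they collapse into one FROM lineitem multi-insert statement of the kind shown in Section 6.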
Table 1 shows the execution times of TPC-H queries using the Multiple Query Optimization algorithm. The proposed algorithm is run on 2 GB, 5 GB and 10 GB of data; the execution times at the different sizes clearly show that after merging queries Q1 and Q6, the execution time drops drastically.
Table 1. Q1 and Q6 execution results with and without MQO

DATA             | 2 GB   | 5 GB   | 10 GB
Q1               | 31.177 | 30.733 | 306.184
Q6               | 3.934  | 3.857  | 61.164
Q1+Q6 (with MQO) | 12.1   | 24.073 | 45.671
6. Methodology
Before executing a query, the proposed Hive architecture scans the input queries for similar data sources and similar tasks, and creates a single query in the form of HiveQL. The number of operations in the newly generated single HiveQL query is less than in the two or more input queries.
Input: Group of input queries Q = {q1, ..., qn}.
Output: Generated new single HiveQL query Q' = {q1, ..., qm}, where m ≤ n and the merged query covers the same data sources (mds = nds).
The proposed system reduces the number of operations in query execution and therefore reduces the total execution time. The reduction is achieved by generating a new single query from two or more input queries. If t(i) is the execution time of operation i, and m and n are the numbers of operations in the new and original queries respectively, then the execution time difference between the new query and the old ones can be expressed as:

∑_{i=1}^{m} t(i) < ∑_{i=1}^{n} t(i)
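The inequality can be made concrete with a small numeric sketch (the operation times below are purely hypothetical, chosen only to illustrate the model): merging means the shared table scan is paid for once instead of once per query.

```python
# Hypothetical per-operation execution times, in arbitrary units.
scan = 20.0      # scan of the common data source (shared by both queries)
work_q1 = 5.0    # Q1's own aggregation work
work_q6 = 1.0    # Q6's own filter/aggregation work

# n operations: each input query repeats the scan of the same table.
separate = (scan + work_q1) + (scan + work_q6)

# m < n operations: the merged query performs the shared scan once.
merged = scan + work_q1 + work_q6

assert merged < separate  # i.e. sum_{i=1}^{m} t(i) < sum_{i=1}^{n} t(i)
```

The saving grows with the number of merged queries, since each additional correlated query would otherwise repeat the shared scan.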
The table structure, consisting of tuples and columns, will be the same for the new single query and the input queries, so the output of the new single query will not differ from the output of the input queries. The main difference between the output of the old queries and the new query is the number of tables: the input queries generate a table each, whereas the new merged single query generates only a single table for the output. There is also a difference in the number of operations: the input queries require more operations than the new single query, because similar operations have been merged in the new query. As the number of operations in the execution of the single query is smaller, there are also fewer MapReduce operations, which increases the performance of Hive query execution.
The resulting query, after searching the input queries (Q1 and Q6) for similar operations on the same data source, is:
DROP TABLE q1_price_smmry_rpt;
DROP TABLE q6_rvnu_chng;
CREATE TABLE q1_price_smmry_rpt (
	l1_returnflag string,
	l1_linestatus string,
	sum_qty_psr int,
	sum_base_price_psr double,
	sum_disc_price_psr double,
	sum_charge_psr double,
	avg_qty_psr int,
	avg_price_psr double,
	avg_disc_psr double,
	count_order_psr int);
CREATE TABLE q6_rvnu_chng (revenue double);
FROM lineitem
INSERT OVERWRITE TABLE q1_price_smmry_rpt
select	l1_returnflag,
	l1_linestatus,
	sum(l1_quantity),
	sum(l1_extendedprice),
	sum(l1_extendedprice * (1 - l1_discount)),
	sum(l1_extendedprice * (1 - l1_discount) * (1 + l1_tax)),
	avg(l1_quantity),
	avg(l1_extendedprice),
	avg(l1_discount),
	count(*)
where
	l1_shipdate <= '1998-09-16'
group by
	l1_returnflag,
	l1_linestatus
order by
	l1_returnflag,
	l1_linestatus
INSERT OVERWRITE TABLE q6_rvnu_chng
select
	sum(l1_extendedprice * l1_discount) as revenue
where
	l1_shipdate >= '1993-02-02'
	and l1_shipdate < '1994-01-02'
	and l1_discount between 0.06 - 0.01 and 0.06 + 0.01
	and l1_quantity < 26;
The underlying SQL-to-MapReduce translator of Hive uses a one-operation-to-one-job model18 and opens a new job for each operation (table scan, join, group by, etc.) in a SQL statement. Significant performance increases can be obtained by reducing the number of MapReduce tasks in these jobs.
7. Summary and Conclusion
In this paper, it is experimentally shown how the performance of conventional Hive can be enhanced. As Hive does not support the execution of multiple queries, queries from a given query set that have a similar data source and correlated operations are identified, and a single query is created based on these parameters. All the queries are then executed in Hive and the execution times compared (Graph 1). It is clear from the graph that queries with a similar data source take far less time to execute when merged than when executed individually. For the detection of correlated queries with similar data sources, a new layer, "Multiple Query Optimization", is introduced; the job of this layer is to scan the input queries and generate a new query based on correlation and similar data sources. Testing this approach requires a huge data set, for which the TPC-H query set was chosen. For comparison of execution times, the queries were executed on three database sizes, viz. 2 GB, 5 GB and 10 GB. The performance of the generated new query depends completely on the degree of correlation and data-source similarity: more correlation gives more performance enhancement, less correlation gives less. As Hive works on structured data, future work could apply the same approach using Pig, which handles both structured and unstructured data, and compare the results of the two frameworks (Hive and Pig).
8. References
1.	 Gartner IT Glossary. Available from: http://www.gartner.com/it-glossary/big-data/
2.	 The 7 V's of Big Data. Available from: https://www.impactradius.com/blog/7-vs-big-data/
3.	 What is Apache Hadoop. Available from: http://hadoop.apache.org/
4.	 Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM. 2008 Jan; 51(1):107–13.
5.	 Stonebraker M, Abadi D, Madden S, Paulson E, Pavlo A, Rasin A. MapReduce and parallel DBMSs: friends or foes? Commun ACM. 2010 Jan; 53(1):64–71.
6.	 DeWitt D, Stonebraker M. MapReduce: a major step backwards. 2008 Jan. p. 1–9.
7.	 He Y, Lee R, Huai Y, Shao Z, Jain N, Zhang X, Xu Z. RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. International Conference on Data Engineering; 2011 Apr. p. 1199–208.
8.	 Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig Latin: a not-so-foreign language for data processing. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'08); 2008. p. 1099–110.
9.	 Stewart RJ, Trinder PW, Loidl HW. Comparing high level MapReduce query languages. International Workshop on Advanced Parallel Processing Technologies; 2011. p. 58–72.
10.	 Sellis TK. Multiple-query optimization. ACM Transactions on Database Systems. 1988; 13(1):23–52.
11.	 Bayir MA, Toroslu IH, Cosar A. Genetic algorithm for the multiple query optimization problem. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews. 2007; 37(1):147–53.
12.	 Hive Foundation project home page. Available from: http://hive.apache.org
13.	 Design - Hive Architecture. Available from: https://cwiki.apache.org/confluence/display/Hive/Design
14.	 TPC-H query benchmark reference. Available from: http://www.tpc.org/tpch/default.asp
15.	 DBGEN – query generator tool reference. Available from: https://github.com/electrum/tpch-dbgen
16.	 Dokeroglu T, Ozal S, Bayir MA, Cinar MS, Cosar A. Improving the performance of Hadoop Hive by sharing scan and computation tasks. Journal of Cloud Computing. 2014; 3:12.
17.	 Garg V. Optimization of multiple queries for big data with Apache Hadoop/Hive. International Conference on Computational Intelligence and Communication Networks; 2015. p. 938–41.
18.	 Giannikis G, Alonso G, Kossmann D. SharedDB: killing one thousand queries with one stone. Proc VLDB Endow. 2012; 5(6):526–37.

A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introductionsaisreealekhya
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010BOSC 2010
 
Big Data & Analytics (CSE6005) L6.pptx
Big Data & Analytics (CSE6005) L6.pptxBig Data & Analytics (CSE6005) L6.pptx
Big Data & Analytics (CSE6005) L6.pptxAnonymous9etQKwW
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopIOSR Journals
 
Finding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopFinding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopNushrat
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Design architecture based on web
Design architecture based on webDesign architecture based on web
Design architecture based on webcsandit
 
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...cscpconf
 

Similar to OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE WITH HADOOP/HIVE (20)

Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
M017619194
M017619194M017619194
M017619194
 
Hadoop add-on API for Advanced Content Based Search & Retrieval
Hadoop add-on API for Advanced Content Based Search & RetrievalHadoop add-on API for Advanced Content Based Search & Retrieval
Hadoop add-on API for Advanced Content Based Search & Retrieval
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Big Data & Analytics (CSE6005) L6.pptx
Big Data & Analytics (CSE6005) L6.pptxBig Data & Analytics (CSE6005) L6.pptx
Big Data & Analytics (CSE6005) L6.pptx
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
G017143640
G017143640G017143640
G017143640
 
Unit-3_BDA.ppt
Unit-3_BDA.pptUnit-3_BDA.ppt
Unit-3_BDA.ppt
 
Finding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopFinding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache Hadoop
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Paper ijert
Paper ijertPaper ijert
Paper ijert
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Design architecture based on web
Design architecture based on webDesign architecture based on web
Design architecture based on web
 
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
 

Recently uploaded

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 

Recently uploaded (20)

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 

OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE WITH HADOOP/HIVE

*Author for correspondence
Indian Journal of Science and Technology, Vol 10(18), DOI: 10.17485/ijst/2017/v10i18/111370, May 2017
ISSN (Print): 0974-6846
ISSN (Online): 0974-5645

Optimization of Multiple Correlated Queries by Detecting Similar Data Source with Hadoop/Hive

Puneet Kansal*, Safdar Tanweer and Kavita Bhatnagar
Jamia Hamdard (Hamdard University), Hamdard Nagar, New Delhi – 110 062, India; Puneetkansal.iaf@gmail.com, safdartanweer@yahoo.com, bhatnagarkavita@yahoo.in

Keywords: DBGEN, Hive Query Language (HiveQL), Online Analytical Processing (OLAP), Transaction Processing Performance Council (TPC), TPC-H Queries

Abstract
Objectives: To generate a new single Hive query (HiveQL) by finding similar operations and a common data source in two or more input queries, and to compare the total execution time of both approaches. Methods/Statistical Analysis: The MapReduce concept of Hadoop Hive is utilized in this paper. A new single query is generated from two or more input queries, and three data samples of 2, 5 and 10 GB are generated using the free database generation tool DBGEN. TPC-H queries are executed on this data, and the total execution time of both approaches is compared to evaluate performance. Findings: Hive executes a single query at a time; in this research, multiple queries are provided to Hive by converting them into a single query. This approach reduces the number of operations performed while executing the query, which in turn reduces the execution time and improves the performance of Hive. Hive processes the structured data of a data warehouse system, so with this approach structured data can be processed and analyzed in an easy and convenient manner. Structured data is used for processing OLAP (Online Analytical Processing) queries, so Hive also helps to process OLAP queries. Hive works in conjunction with Hadoop and executes queries on data stored in Hadoop, so Hadoop must first be running on the system before Hive queries can be used.
This research requires a huge amount of data for testing; sample data is therefore generated using DBGEN, the free data generation tool provided by the TPC (Transaction Processing Performance Council). The TPC also provides different types of queries for testing the performance of query execution tools, so TPC-H queries are utilized in this research. Application/Improvements: By using the approach shown in this research, the total execution time of Hive queries can be reduced drastically and the performance of Hive can be increased.

1. Introduction

Nowadays everybody is generating or collecting data from various sources in different forms, such as PDF, MP3, movie, document, spreadsheet and image files, and storing it at unprecedented rates. Due to this huge amount of data, traditional database management systems and currently available processing tools are unable to process it, or take a very long time to produce meaningful output. According to Gartner, the term Big Data comprises three V's: Volume, Velocity and Variety1. Additionally, more V's have been added to the concept: Visualization, Veracity (reliability), Variability and Value2. This huge amount of data, or big data, can be collected, stored, processed and analyzed using various approaches and tools available in the market. Hadoop is open source software based on a Java programming framework. It uses the MapReduce algorithm for processing large data sets. Hadoop was initially developed at Yahoo in 2006. It has the capability to process huge data sets, in the range of petabytes, zettabytes or more, stored on a cluster of commodity hardware (computers) connected to each other3. Large data sets that require distributed processing are processed by the Hadoop software library framework4. It is a highly scalable storage platform designed to process very large data sets across thousands of computing nodes operating in parallel. It provides a cost
effective storage solution for large data volumes with no formal requirements, and it uses the MapReduce processing technique. The MapReduce algorithm performs two important steps for processing large data sets, viz. Map and Reduce. In the first step (Map), it splits the large data set into smaller parts and converts them into (key, value) pairs; in the second step (Reduce), the output of the first step becomes the input and the reduce operation is performed5-7. These systems are extensively used for analytics and data warehousing, either directly or through a high-level query language that is compiled down to a parallel dataflow graph for execution3. A data warehouse is a centralized repository of integrated data from various sources. The data stored in a data warehouse is historical in nature and is used to create analytical reports for the enterprise. Data warehousing is used to process OLAP queries and is accompanied by a batch-oriented query workload that is relatively static. Data warehousing applications are non-interactive in nature, which provides the freedom to reorder and optimize queries to improve overall performance. A MapReduce task on Hadoop uses three functions, viz. mapping, shuffling and reducing. The input data format that Hadoop accepts is the (key, value) pair, which is provided by the Hadoop Distributed File System (HDFS). As Hadoop uses the MapReduce framework, the input is assigned to a mapper in the form of (key, value) pairs. The mapper then processes the input and produces a set of (key, value) pairs as output. The mapper's output (key, value) pairs become the input for the next step, which is reduce. Before the reduce operation, the set of (key, value) pairs goes through the shuffle phase.
In the shuffle phase, pairs with key k1 are assigned to the k1 reducer, pairs with key k2 are assigned to the k2 reducer, and so on, as shown in Figure 1. Finally, the reducer performs an aggregation operation by grouping the pairs that share the same k1 value, k2 value, etc., and produces the output. Hadoop does not perform any optimization or global analysis to share work between similar jobs, because MapReduce operations are expressed as user-defined functions. Optimization in Hadoop can be achieved by using Hive or Pig as a declarative query language8. Both perform OLAP operations. The difference between Hive and Pig is that Pig supports structured as well as unstructured data, while Hive can process only structured data. Structured data is generally used in a data warehouse. Pig is not the natural choice for a data warehouse, because its ability to handle both structured and unstructured data makes it most suitable for streaming data. Data warehouses are not updated frequently, and a fast response time in processing queries is not required9. In this paper, the Hive framework is chosen for the experiment for the following reasons:
• Hive processes structured data, which is the base of a data warehouse system.
• The applications available in a data warehouse system are developed using SQL-based frameworks like RDBMSs, which are also supported by Hive.
For these reasons, it is very easy for a user to migrate a data warehouse application into a Hadoop application. Hive executes each query independently, i.e. it supports the "query-at-a-time" model, which makes it possible to optimize query execution by applying the optimization technique10-11. Two or more queries can be converted into a single query by detecting their correlation and similar data source, and executing this single query drastically reduces the total query execution time.
Figure 1. MapReduce tasks.
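The map, shuffle and reduce steps described above can be illustrated with a short Python simulation (an illustrative sketch of the data flow, not Hadoop code): the map phase emits (key, value) pairs, the shuffle phase groups the pairs that share a key so they reach the same reducer, and the reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(records):
    # Map: split each record and emit (key, value) pairs, e.g. (word, 1).
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, so pairs with key k1 go to the
    # k1 reducer, pairs with key k2 to the k2 reducer, and so on.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values of each key (here, a sum).
    return {key: sum(values) for key, values in groups.items()}

records = ["big data", "big hadoop", "hadoop hive"]
result = reduce_phase(shuffle_phase(map_phase(records)))
print(result)  # {'big': 2, 'data': 1, 'hadoop': 2, 'hive': 1}
```

This is exactly the aggregation pattern a Hive GROUP BY query is compiled into when it is translated to a MapReduce job.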
1.1 Hive
Hive12,13 is an open source SQL-like interface that works on a distributed data warehouse system. Hive resides on top of the Hadoop framework and provides execution of schema-based structured data queries, as shown in Figure 2. Hive cannot run independently; it requires Hadoop to be running first. Hive stores metadata (tables and partitions) in a database such as Derby, MySQL or Oracle. It facilitates reading, writing and managing large datasets residing in distributed storage using SQL, and it provides data summarization, query and analysis. It provides a command line tool and a JDBC driver to connect users to the Hive console (refer to Figure 2). Facebook has a huge amount of data, in exabytes and petabytes, on its servers; for managing such a huge
amount of data, Facebook developed Hive, which was later adopted by the Apache Software Foundation. After adoption, Apache made Hive open source and declared it Apache Hive. As Hive is open source, many other companies, such as Netflix and Amazon, also contribute to its development. Hive converts SQL queries into MapReduce jobs and uses HiveQL for querying the data stored in a Hadoop cluster. In terms of storage, Hive can use any file system supported by Hadoop, although HDFS is by far the most common. Hive provides its own query language, HiveQL, which is similar to SQL, for querying data on a Hadoop cluster. It can manage data in HDFS and run MapReduce jobs without the queries being translated into Java. While executing MapReduce jobs, Hive does not generate Java MapReduce programs; instead, it uses built-in, generic Mapper and Reducer modules driven by an XML file representing the "job plan". Hive queries are translated into a graph of Hadoop MapReduce jobs that are executed on the Hadoop grid. Hive Query Language (HQL) is based on SQL, and many of its constructs are familiar from SQL. The concept of storing databases, tables and data in Hive is similar to an RDBMS. As in SQL, Hive stores "tables" in a container called a "Database". The data is stored in a table according to the "schema" defined by the user, in the form of rows and columns. Hive is not suitable for executing real-time queries, because it processes OLAP (Online Analytical Processing) queries.

1.1.1 HiveQL
The query language for executing queries in Hive is known as HiveQL; it is used to organize and query the data stored in Hive. The current version of HiveQL provides the ability to CREATE and DROP tables and partitions.
In Hive, a table can be split into multiple parts on a specified partition key, and those parts can be queried with SELECT statements. HiveQL does not yet support UPDATE and DELETE functionality.
Figure 2. Hive external architecture.

The most important functionalities supported by SELECT statements in HiveQL are:
• tables can be joined on a common key,
• data can be filtered using row selection techniques, and
• columns can be projected.
The above functionalities are similar to those provided to the user in an RDBMS (relational database management system). The following example shows how a SELECT statement looks in HiveQL:
SELECT o1.o1_orderkey, o1.o1_custkey, c1.c1_custkey
FROM customer c1
JOIN orders o1 ON c1.c1_custkey = o1.o1_custkey
JOIN lineitem l1 ON o1.o1_orderkey = l1.l1_orderkey;
In this example, a simple join between the customer, orders and lineitem tables is initiated on their respective join keys, custkey and orderkey. The query also shows projections in the first part of the statement, which are pushed down to the table scan level by the framework. Multiple HiveQL queries can also be executed in one single operation in Hive; this is known as parallelizing the execution in Hive. The example also shows that HiveQL statements with multiple tables require the specific columns on which those tables are joined: Hive does not support cross products, which are commonly supported in database management systems such as DB2 or Oracle.

1.1.2 Hive Architecture
The various components of Hive's internal architecture are depicted in Figure 3 and introduced below.
User Interface: This is the first layer of the Hive data warehouse infrastructure. The user interface provides the facility for users to submit queries and other operations to the system.
Hive provides different mechanisms for user interaction, viz.:
• Web UI,
• Hive Command Line, and
• HDInsight.
Metastore: This is an object store backed by a database. The purpose of this database is to store the schema, or metadata, for tables. It also stores information about databases, columns and their types, and the HDFS mapping13. The Metastore can be configured in two ways:
• Remote
• Embedded
HiveQL Process Engine: HiveQL is a query language similar to SQL; its syntax for creating, editing, deleting and querying tables closely follows SQL. Customized MapReduce scripts can also be embedded by the user in HiveQL.
Execution Engine: This works in conjunction with the HiveQL Process Engine and MapReduce. The purpose of the execution engine is to execute the plan created by the compiler, which is in the form of DAG stages. It provides the connectivity between Hive and Hadoop for query execution, and the result it produces is the same as that produced by MapReduce.
Figure 3. Hive internal architecture.

2. TPC (Transaction Processing Performance Council)
The TPC is a nonprofit organization founded in 1988. It provides different types of benchmarks, such as TPC-H, TPC-C and TPC-E, shown in Figure 4, for defining the performance of transaction processing. Performance can be measured on the basis of the speed and accuracy of results: speed is measured in terms of the number of queries executed per second or minute, and accuracy results are expressed as a percentage.

2.1 TPC-H14
TPC-H consists of various business-oriented queries, and data modification by the user is also permitted. The database can be populated by the user with anywhere from 2 GB to terabytes of data; more data gives a more representative performance result. A minimum of 100 GB of data is considered a good database for running TPC queries. In this paper, TPC-H queries are merged based on their correlation and similar data source and executed on different data sizes. The execution time of the queries on different database sizes is compared to check the performance of Hive query execution. TPC-H supports a performance metric called the TPC-H Composite Query-per-Hour Performance Metric (QphH@size); the TPC-H Price/Performance metric is expressed as $/QphH@size.
Figure 4. Different benchmarks of TPC.

3. DBGEN
DBGEN is used with TPC-H to generate the database15. It can generate databases from gigabytes to terabytes in size; data generation takes anywhere from minutes to days or weeks, with more data taking more time. Generally, a minimum of 100 GB of data is considered good for executing TPC-H queries. The tool is written in ANSI C for portability and supports various systems. It also provides its source code, which makes the process of building a compliant database population as simple as possible.

For multiple query optimization, different optimization techniques have been experimentally evaluated16: the authors proposed "the concept of SharedHive for improving the performance of execution of queries"16 and showed experimentally that with "the SharedHive concept the total execution time is significantly reduced for query execution". In17 it was shown experimentally that the execution time of a query can be drastically reduced if correlated queries are detected before execution. So, in this manuscript, it is shown experimentally that if correlated queries that have a similar data source are combined into a new query which is then executed, the total execution time is drastically reduced.
The reduction depends on the number of correlated queries that share a data source: the more such queries, the greater the reduction in execution time; the fewer such queries, the smaller the reduction.

4. Proposed Model/Solution

This section briefly describes the architecture of the proposed Hive (Correlated Query Optimizer Hive), a modified version of Hadoop Hive with a new Multiple Query Optimization (MQO) component, shown in Figure 5.

Figure 5. Proposed Hive architecture.

All queries entered into Hive, whether through the Command Line Interface or the Web Interface, pass through a layer called the "Multiple Query Optimizer". This layer examines the input queries for similar operations and a common data source; if these are found, it generates a single HiveQL query. If a query shares no similarity or correlation with the other queries, it is appended at the end. The generated correlated query is passed as input to the next layer, "compiler-optimizer-executor". The Metastore stores the relations, attributes, and partitions of the RDBMS. The execution control of HiveQL queries is maintained by the Driver component. All input queries pass through the proposed "Multiple Query Optimizer" layer, which checks the complete set of input queries for similar types of operations on the same data source. If such a relation is found among the input queries, the layer creates a new single query; otherwise it does not. The layer scans all input queries one by one and keeps appending similar operations to the new single query as long as it finds them. When the proposed layer finishes and generates the single query, this query is passed to the next layer, the "Driver". The Driver layer consists of three sub-parts, viz. Compiler, Optimizer, and Executor, and handles session management: the input query is sent to the Compiler, which receives metadata from the MetaStore and creates an execution plan, and this plan is finally executed by the execution engine. In a data warehouse, tables and partitions carry structural information such as row and column details, column types, and relations between columns and attributes; all this information is stored in the MetaStore.

5. Processing of the Query in the Proposed System

When a Hive query is submitted to the system through the Command Line Interface, the web user interface, or the Thrift interface, it is scanned by the proposed system for any operation similar to another on the same data source. After scanning, a new single query is generated based on the similarity of operations on the same data source. This merged query is submitted to the Driver component for further execution. In the Driver layer the query passes through various stages, viz. Compiler, Optimizer, and Executor, where the execution plan and structural information of the query are created; the execution plan is then executed by the execution engine. The Compiler creates the logical execution plan of the query, and a Directed Acyclic Graph (DAG) is produced from the HiveQL. The DAG defines the MapReduce tasks to be executed by the execution engine. In this system a new layer called "Multiple Query Optimizer" is proposed, and integrating this layer into the existing system does not require any major changes, so the proposed system is easy to use and integrate.

Table 1 shows the execution time of TPC-H queries using the Multiple Query Optimization algorithm. The proposed algorithm was run on 2 GB, 5 GB and 10 GB of data; the execution times on the different sizes clearly show that after merging queries Q1 and Q6, the execution time drops drastically.

Table 1. Q1 and Q6 execution results with and without MQO

Data               2 GB     5 GB     10 GB
Q1                 31.177   30.733   306.184
Q6                 3.934    3.857    61.164
Q1+Q6 (with MQO)   12.1     24.073   45.671

6. Methodology

Before query execution, the proposed Hive architecture scans the input queries for similar data sources and similar tasks and creates a single query in the form of HiveQL. The number of operations in the newly generated single HiveQL query is less than in the two or more input queries.

Input: Group of input queries Q = {q1, ..., qn}.
Output: Generated new single HiveQL query Q' = {q1, ..., qm}, where m ≤ n and m_ds = n_ds.
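The merge step that produces the single output query can be sketched at the string level, exploiting Hive's multi-insert form (one FROM clause followed by several INSERT ... SELECT bodies), which is the shape of the merged Q1+Q6 query shown later. This is a hypothetical sketch; the function name and the abbreviated clause fragments are assumptions standing in for the full HiveQL bodies.

```python
def merge_hive_queries(source_table, insert_clauses):
    """Combine several INSERT ... SELECT bodies that all read the same
    source table into one Hive multi-insert statement, so the source
    table is scanned once instead of once per query."""
    return "FROM " + source_table + "\n" + "\n".join(insert_clauses)

# Abbreviated fragments standing in for the full Q1 and Q6 bodies.
merged = merge_hive_queries(
    "lineitem",
    [
        "INSERT OVERWRITE TABLE q1_price_smmry_rpt select l1_returnflag, sum(l1_quantity) where l1_shipdate <= '1998-09-16' group by l1_returnflag",
        "INSERT OVERWRITE TABLE q6_rvnu_chng select sum(l1_extendedprice * l1_discount) as revenue where l1_quantity < 26",
    ],
)
```

Because the merged statement has a single FROM clause, Hive plans one table scan feeding both insert branches, which is where the reduction in MapReduce jobs comes from.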
The proposed system reduces the number of operations in query execution and therefore reduces the total execution time. The reduction is achieved by generating a new single query from two or more input queries. If t(i) is the execution time of the i-th operation and m and n (m < n) are the numbers of operations in the new query and the original queries respectively, then the relation between the execution times of the new query and the old queries can be written as:

Σ_{i=1}^{m} t(i) < Σ_{i=1}^{n} t(i)

The table structure, consisting of tuples and columns, will be the same for the new single query and the input queries, so
that the output of the new single query does not differ from the output of the input queries. The main difference between the output of the old queries and the new query is the number of tables: the input queries generate a table for each query, whereas the new merged single query generates only a single output table. There is also a difference in the number of operations: the input queries contain more operations and the new single query fewer, because similar operations have been merged in the new query. As fewer operations are executed in the single query, there are also fewer MapReduce operations, which increases the performance of Hive query execution.

The resulting query, after detecting the similar operations on the same data source in the input queries (Q1 and Q6), is:

DROP TABLE q1_price_smmry_rpt;
DROP TABLE q6_rvnu_chng;
CREATE TABLE q1_price_smmry_rpt (
  l1_returnflag string,
  l1_linestatus string,
  sum_qty_psr int,
  sum_base_price_psr double,
  sum_disc_price_psr double,
  sum_charge_psr double,
  avg_qty_psr int,
  avg_price_psr double,
  avg_disc_psr double,
  count_order_psr int);
CREATE TABLE q6_rvnu_chng (revenue double);
FROM lineitem
INSERT OVERWRITE TABLE q1_price_smmry_rpt
  select l1_returnflag, l1_linestatus,
    sum(l1_quantity),
    sum(l1_extendedprice),
    sum(l1_extendedprice * (1 - l1_discount)),
    sum(l1_extendedprice * (1 - l1_discount) * (1 + l1_tax)),
    avg(l1_quantity),
    avg(l1_extendedprice),
    avg(l1_discount),
    count(*)
  where l1_shipdate <= '1998-09-16'
  group by l1_returnflag, l1_linestatus
  order by l1_returnflag, l1_linestatus
INSERT OVERWRITE TABLE q6_rvnu_chng
  select sum(l1_extendedprice * l1_discount) as revenue
  where l1_shipdate >= '1993-02-02' and l1_shipdate < '1994-01-02'
    and l1_discount between 0.06 - 0.01 and 0.06 + 0.01
    and l1_quantity < 26;

The
underlying SQL-to-MapReduce translator of Hive uses a one-operation-to-one-job model18 and opens a new job for each operation (table scan, join, group by, etc.) in a SQL statement. Significant performance gains can therefore be obtained by reducing the number of MapReduce tasks of these jobs.

7. Summary and Conclusion

In this paper, it is experimentally shown how the performance of conventional Hive can be enhanced. As Hive does not support the execution of multiple queries, those queries in a given query set that have a similar data source and correlated operations are identified, and a single query is created based on these parameters. All the queries are then executed in Hive and the execution times are compared (Graph 1). It is clear from the graph that queries sharing a data source take far less time to execute than the individual queries. For the detection of correlated queries with similar data sources, a new layer, "Multiple Query Optimization", is introduced; its job is to scan the existing queries and generate a new query based on their correlation and common data source. Testing this approach requires a huge data set, for which the TPC-H query set was chosen. To compare execution times, the queries were executed on three database sizes, viz. 2 GB, 5 GB and 10 GB. The performance of the newly generated query depends entirely on the degree of correlation and data-source sharing: more correlation yields more performance enhancement, less correlation less. As Hive works on structured data, as future work the same approach can be applied using Pig, because Pig handles both structured and unstructured data, and the results of the two frameworks (Hive and Pig) can be compared.

8. References
1. Gartner IT Glossary. Available from: http://www.gartner.com/it-glossary/big-data/
2. The 7 V's of Big Data. Available from: https://www.impactradius.com/blog/7-vs-big-data/
3. What is Apache Hadoop. Available from: http://hadoop.apache.org/
4. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM. 2008 Jan; 51(1):107–13.
5. Stonebraker M, Abadi D, Madden S, Paulson E, Pavlo A, Rasin A. MapReduce and parallel DBMSs: friends or foes? Commun ACM. 2010 Jan; 53(1):64–71.
6. DeWitt D, Stonebraker M. MapReduce: a major step backwards. 2008 Jan. p. 1–9.
7. He Y, Lee R, Huai Y, Shao Z, Jain N, Zhang X, Xu Z. RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. International Conference on Data Engineering; 2011 Apr. p. 1199–208.
8. Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig Latin: a not-so-foreign language for data processing. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'08); 2008. p. 1099–110.
9. Stewart RJ, Trinder PW, Loidl HW. Comparing high level MapReduce query languages. International Workshop on Advanced Parallel Processing Technologies; 2011. p. 58–72.
10. Sellis TK. Multiple-query optimization. ACM Transactions on Database Systems. 1988; 13(1):23–52.
11. Bayir MA, Toroslu IH, Cosar A. Genetic algorithm for the multiple query optimization problem. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews. 2007; 37(1):147–53.
12. Hive Foundation project home page.
Available from: http://hive.apache.org
13. Design - Hive Architecture. Available from: https://cwiki.apache.org/confluence/display/Hive/Design
14. TPC-H query benchmark reference. Available from: http://www.tpc.org/tpch/default.asp
15. DBGEN query generator tool reference. Available from: https://github.com/electrum/tpch-dbgen
16. Dokeroglu T, Ozal S, Bayir MA, Cinar MS, Cosar A. Improving the performance of Hadoop Hive by sharing scan and computation tasks. Journal of Cloud Computing. 2014; 3:12.
17. Garg V. Optimization of multiple queries for big data with Apache Hadoop/Hive. International Conference on Computational Intelligence and Communication Networks; 2015. p. 938–41.
18. Giannikis G, Alonso G, Kossmann D. SharedDB: killing one thousand queries with one stone. Proc VLDB. 2012; 5(6):526–37.