SQL Optimizer vs Hive

A STUDY OF SQL OPTIMIZER AND HIVE
COEN 380 Project

Project Group : 1
Bhide, Aishwarya
Patnaik, Anita
Sekar, Vishaka Balasubramanian
Yoloye, Mose

Project Goal
● Understand how SQL Optimizer works
● Generate query plans using Oracle Explain
● Understand the basic principles of Hive
● Execute queries on Hive
● Compare query execution using Oracle and Hive

The SQL Optimizer
● Why do we need the optimizer?
SELECT * FROM Books WHERE author = 'Ernest Hemingway';
Two ways to execute it:
• Full table scan
• Index on author
Is there a difference?
• 10 rows
• 10 million rows
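The two execution strategies can be sketched with Python's built-in sqlite3 module. This is only a stand-in: the project used Oracle Explain, and SQLite's plan text differs from Oracle's. The Books column layout, the generated rows, and the index name idx_books_author are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in database; the schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO Books (title, author) VALUES (?, ?)",
    [("Book %d" % i, "Ernest Hemingway" if i % 1000 == 0 else "Author %d" % i)
     for i in range(10000)],
)

QUERY = "SELECT * FROM Books WHERE author = 'Ernest Hemingway'"

def plan(connection, sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)

plan_without_index = plan(conn, QUERY)   # no index yet: every row is scanned
conn.execute("CREATE INDEX idx_books_author ON Books (author)")
plan_with_index = plan(conn, QUERY)      # the index on author is now available

print(plan_without_index)
print(plan_with_index)
```

The query text is identical in both runs; only the access path chosen by the engine changes, which is the decision the optimizer makes on the user's behalf.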

The SQL Optimizer
● SQL is a declarative language
○ Query specifies what, the SQL engine decides how
○ How does understanding the SQL optimizer help?

Data Set Up
Reference: Database System Concepts - Silberschatz, Korth, Sudarshan

Data Set Up
● Queries
○ single relation
○ join ( 2-way and 3-way join)
○ aggregate function
○ aggregates with grouping
○ set operations: Union, Intersect, Except (MINUS)
○ subqueries
○ subqueries using the WITH clause
○ Update and Delete

Project Execution
● Set up Oracle database
● Generate query optimizer plans using Oracle Explain
● Set up tables and insert data in Hive
● Execute queries on Hive

Oracle Query Plan Results - 1
Query using single relation
SELECT title FROM course WHERE dept_name = 'Comp. Sci.' AND credits = 3;

Query using 2-way join
SELECT DISTINCT ID FROM takes WHERE (takes.course_id , takes.sec_id, takes.semester,
takes.year) IN (SELECT course_id, sec_id,semester, year FROM teaches NATURAL JOIN
instructor WHERE name = 'Einstein');
Relational algebra expression:
[Figures: plan based on Oracle's generated query plan; self-created query plan]

Query using 3-way join
SELECT name, title FROM (instructor NATURAL JOIN teaches) JOIN course USING
(course_id);
Equivalent expression:
[Figures: plan based on Oracle's generated query plan; self-created query plan]
Schema:
instructor(ID, name, dept_name, salary)
teaches(ID, course_id, sec_id, semester, year)
course(course_id, title, dept_name, credits)

Query using aggregate function
SELECT MAX(salary) FROM instructor;

Query for aggregate with grouping
SELECT COUNT(ID), course_id, sec_id FROM section NATURAL JOIN takes
WHERE semester='Fall' AND year=2009 GROUP BY course_id, sec_id;

Query using union operation
(SELECT course_id FROM section WHERE semester = 'Fall' AND year = 2009) UNION
(SELECT course_id FROM section WHERE semester='Spring' AND year=2010);
[Figures: expected plan and Oracle plan for the union query]

Query using intersect operation
(SELECT course_id FROM section WHERE semester = 'Fall' AND year = 2009)
INTERSECT (SELECT course_id FROM section WHERE semester='Spring' AND
year=2010);
[Figures: expected plan and Oracle plan for the intersect query]

Query using a subquery
SELECT name FROM instructor WHERE salary = (SELECT MAX(salary) FROM
instructor);
[Figures: expected plan and Oracle plan for the subquery]

Oracle Query Plan Results # 9
Query using subquery and rename operation
SELECT MAX(enrollment), course_id FROM (SELECT Count(ID) as enrollment, sec_id, course_id
FROM takes WHERE year=2009 and semester='Fall' GROUP BY sec_id, course_id) GROUP BY
course_id;
The expected plan for query # 9 matches Oracle's plan.

Find the maximum enrollment across all sections in Fall 2009
WITH enrollment(course_id, sec_id, total) AS (SELECT course_id, sec_id, COUNT(ID) FROM
section NATURAL JOIN takes WHERE semester='Fall' AND year=2009 GROUP BY course_id,
sec_id) SELECT MAX(total) FROM enrollment;

Query # 10 subquery and aggregation
Plan steps (inner query first):
1. SELECT COUNT(ID) AS id FROM section NATURAL JOIN takes WHERE semester='Fall' AND year=2009 GROUP BY course_id, sec_id
2. SELECT MAX(id) over the result of step 1
The expected plan matches Oracle's plan.

Oracle Query Plan Results -11
Increase the salary of each instructor in the Comp. Sci. dept. by 10%
UPDATE instructor SET salary = salary * 1.10 WHERE dept_name = 'Comp. Sci.';

Query #11 update query
instructor ← ∏ ID, name, dept_name, salary*1.10 (σ dept_name = 'Comp. Sci.' (instructor))
∪ σ dept_name <> 'Comp. Sci.' (instructor)
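The reading of the update as "raise the selected rows, union the untouched rows" can be checked with a small sketch. It assumes Python's sqlite3 as a stand-in for Oracle, and the three instructor rows are illustrative sample data, not the project's data set.

```python
import sqlite3

# Illustrative rows in schema order: (ID, name, dept_name, salary).
rows = [
    ("10101", "Srinivasan", "Comp. Sci.", 65000.0),
    ("12121", "Wu", "Finance", 90000.0),
    ("45565", "Katz", "Comp. Sci.", 75000.0),
]

def algebra_update(instructor):
    """Projection with salary*1.10 over the Comp. Sci. selection,
    unioned with the selection of all other departments."""
    raised = {(i, n, d, round(s * 1.10, 2))
              for (i, n, d, s) in instructor if d == "Comp. Sci."}
    unchanged = {(i, n, d, round(s, 2))
                 for (i, n, d, s) in instructor if d != "Comp. Sci."}
    return raised | unchanged

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instructor (ID TEXT PRIMARY KEY, name TEXT, dept_name TEXT, salary REAL)"
)
conn.executemany("INSERT INTO instructor VALUES (?, ?, ?, ?)", rows)
conn.execute("UPDATE instructor SET salary = salary * 1.10 WHERE dept_name = 'Comp. Sci.'")

# Round to cents so floating-point noise does not affect the comparison.
sql_result = {(i, n, d, round(s, 2))
              for (i, n, d, s) in conn.execute("SELECT * FROM instructor")}

print(sql_result == algebra_update(rows))
```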

Oracle Query Plan Results -12
Delete all courses that have never been offered
DELETE FROM course
WHERE course_id IN (SELECT course_id FROM course MINUS SELECT course_id FROM course
NATURAL JOIN section);

Oracle Optimizer - Summary
The purpose of the Oracle Optimizer is to determine the most efficient
execution plan for each query
EXPLAIN PLAN shows the execution plan the optimizer chose for a statement
The optimizer chooses the best plan by reviewing four key elements of a query:
cardinality, access methods, join methods, and join orders

Hive
● Why Hive?
Rapidly increasing size of datasets - 700TB data set
Warehouse built using RDBMS failed to scale
Need for scalable analysis on large data sets
Hadoop was not easy for the end users
Need for improved querying capability
Need for diverse applications and users

Hive is NOT
A relational database
Designed for Online Transaction Processing (OLTP)
A language for real-time queries and row-level updates

Hive - Features
● Features of Hive
○ It stores the schema in a database (the metastore) and the processed data in HDFS.
○ It is designed for OLAP.
○ It provides an SQL-like query language called HiveQL or HQL.
○ It is familiar, fast, scalable, and extensible.

HiveQL - Query Language
HiveQL is a SQL-like language that supports a subset of SQL queries
Metadata browsing capabilities
Explain plan capabilities (naive rule-based optimizer)
Seamless plugging in of map-reduce programs, e.g.
FROM (
  MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
  FROM docs
  CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
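The slides do not show the contents of wc_mapper.py and wc_reduce.py. The sketch below is an assumption about what such scripts might do, written as plain functions over lines of text so it can be run without Hive. It relies on Hive's streaming convention of passing rows to the script as tab-separated text; the function names wc_map and wc_reduce and the sample documents are hypothetical.

```python
from itertools import groupby

def wc_map(lines):
    """Mapper sketch: emit one 'word<TAB>1' line per word of each document."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def wc_reduce(lines):
    """Reducer sketch: sum the counts per word. Rows for the same word must
    arrive next to each other, which CLUSTER BY word is meant to guarantee."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(cnt) for _, cnt in group))

# Hypothetical input documents.
docs = ["the old man and the sea", "the sun also rises"]
shuffled = sorted(wc_map(docs))      # sorting stands in for CLUSTER BY word
counts = list(wc_reduce(shuffled))
print(counts)
```

As standalone scripts, each function would be fed from sys.stdin and its output written line by line to sys.stdout.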

Data Model and Query Language
HiveQL - Limitations
No support for subqueries in the WHERE clause (not in the initial version)
Only equality predicates are supported for joins
Does not support inserting into an existing table (UPDATE, DELETE and INSERT INTO are not supported)
Why is this not a problem at Facebook?
Almost all queries can be expressed using equi-joins
Data is loaded in separate partitions
No complex locking protocol is required

Hive Query Execution
Parse the query
Type checking and semantic analysis
Optimization
○ performs a chain of transformations
○ walks the DAG, checks whether each rule's condition is met, and executes the matching rules

Hive - Query Optimizer
Query Optimizer - Transformations
● Column pruning
● Predicate pushdown
● Partition pruning
● Map-side joins
○ small tables are kept in every mapper's memory
○ minimizes the cost of sorting and merging
● Join reordering
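Two of these transformations, predicate pushdown and the map-side join, can be illustrated with a toy sketch in plain Python. This is not Hive's implementation: the function map_side_join, the splits, and the sample rows are assumptions chosen only to show the idea that each mapper joins its share of the big table against an in-memory copy of the small table, and filters rows before joining them.

```python
def map_side_join(big_split, small_table, predicate=None):
    """One mapper's work: join its split of the big table against a hash
    table built from the small table, with no shuffle or sort phase."""
    lookup = {row["course_id"]: row for row in small_table}   # held in mapper memory
    for row in big_split:
        if predicate is not None and not predicate(row):      # pushed-down filter
            continue
        match = lookup.get(row["course_id"])
        if match is not None:
            yield {**row, **match}

# Small table (fits in memory) and two splits of the big table; illustrative data.
course = [
    {"course_id": "CS-101", "title": "Intro. to Computer Science"},
    {"course_id": "BIO-101", "title": "Intro. to Biology"},
]
takes_splits = [
    [{"ID": "00128", "course_id": "CS-101", "year": 2009},
     {"ID": "12345", "course_id": "CS-101", "year": 2010}],
    [{"ID": "98988", "course_id": "BIO-101", "year": 2009}],
]

def in_2009(row):
    return row["year"] == 2009

# Filter pushed below the join: fewer rows are looked up and emitted.
pushed = [r for split in takes_splits for r in map_side_join(split, course, in_2009)]
# Same filter applied after the join: same answer, more intermediate rows.
late = [r for split in takes_splits for r in map_side_join(split, course) if in_2009(r)]

print(pushed == late, len(pushed))
```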

Hive: Comparison with RDBMS
● Hive
○ designed for analytics performed on static data
○ lacks record-level update/delete functionality
○ write once, read many times
○ processes massive amounts of data
○ supports a subset of SQL queries
● RDBMS
○ designed for transaction processing and analytics on dynamic data
○ supports record-level update/delete
○ read and write many times

Hive Query Execution Results
(Simple Select Query)
SELECT title FROM course WHERE dept_name = 'Comp. Sci.' AND credits = 3

(subquery in FROM clause)

(Aggregation & Join)

(subquery)
SELECT name,salary FROM instructor i WHERE salary = (SELECT MAX(salary) FROM
instructor)

Hive Query Execution Inference
● Queries which include subqueries in the WHERE or HAVING clause failed, e.g.
SELECT t.sec_id, t.course_id FROM takes t WHERE t.year=2009 AND
t.semester='Fall' HAVING count(t.ID) IN (SELECT MAX(enrollment) FROM
(SELECT COUNT(tin.ID) AS enrollment, tin.sec_id, tin.course_id FROM takes
tin WHERE tin.year=2009 AND tin.semester='Fall' GROUP BY
tin.sec_id,tin.course_id))
● Queries which include subqueries in the FROM clause executed, e.g.
SELECT MAX(enrollment), s.course_id FROM (SELECT Count(t.ID) as
enrollment, t.sec_id, t.course_id FROM takes t WHERE t.year=2009 and
t.semester='Fall' GROUP BY t.sec_id,t.course_id) s GROUP BY s.course_id
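One possible workaround is to move a WHERE-clause subquery into the FROM clause and connect it with an equi-join. The sketch below only checks that the two forms return the same rows, using Python's sqlite3 as a stand-in; it was not run on Hive, and the instructor rows are illustrative sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instructor (ID TEXT PRIMARY KEY, name TEXT, dept_name TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO instructor VALUES (?, ?, ?, ?)",
    [("10101", "Srinivasan", "Comp. Sci.", 65000.0),
     ("22222", "Einstein", "Physics", 95000.0),
     ("12121", "Wu", "Finance", 90000.0)],
)

# Form with the subquery in the WHERE clause.
where_form = (
    "SELECT name, salary FROM instructor i "
    "WHERE salary = (SELECT MAX(salary) FROM instructor)"
)

# Rewritten form: the subquery sits in the FROM clause and is attached
# with an equality join on salary.
from_form = (
    "SELECT i.name, i.salary FROM instructor i "
    "JOIN (SELECT MAX(salary) AS max_salary FROM instructor) m "
    "ON i.salary = m.max_salary"
)

where_rows = sorted(conn.execute(where_form).fetchall())
from_rows = sorted(conn.execute(from_form).fetchall())
print(where_rows == from_rows, from_rows)
```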

Hive - Use cases
● Hive should be used for analytical querying of data collected over a period of
time, for instance to calculate trends or analyze website logs.
● Hive should not be used for real-time querying.
● It provides data warehousing facilities on top of an existing Hadoop
cluster, along with an SQL-like interface that makes the work easier.
● Tables can be created in Hive and data stored there. Existing HBase
tables can also be mapped to Hive and queried from it.

Hive Query execution inference
Data size: 20MB
● Hardware configuration
○ HADOOP: Cloudera CDH-5.6 with YARN (MapReduce v2) and Spark (1.5); 24 worker nodes; 96 cores (4 cores per node); 192 threads; 768GB RAM
○ ORACLE: AMD A8-4555M APU with Radeon HD Graphics, 1.60 GHz; 4 cores; 8GB RAM; 64-bit operating system
● Average execution time of queries
○ HADOOP: 31.85 seconds
○ ORACLE: 1 second

Hive Query execution inference
● Executed queries
○ simple SELECT queries
○ join
○ subqueries within FROM clause
○ Union
○ Intersection (sub-queries within FROM clause)
○ Aggregation with grouping
● Failed queries
○ Update
○ Delete
○ Queries with 'WITH' clause
○ Sub queries within WHERE clause

Demo
● Oracle Explain
● Hive
