Apache Hive
Mr. Inthra Onsap (Neng)
Agenda
• Apache Hive
• What is Hive?
• Hive in Hadoop Ecosystem
• Built-in Data Types, Operators and Functions
• User-Defined Functions
• Partitions and Buckets
• File Formats
• Demo
• Conclusion
What is Hive?
• Software that facilitates reading, writing and managing large datasets residing
in distributed storage using SQL-like syntax
• HiveQL (Hive Query Language)
• Originally developed by Facebook
• Now an Apache top-level project (Apache License)
• Not designed for online transaction processing (OLTP)
• Does not offer real-time queries
• Best suited to batch processing jobs
• Schema-on-read
• Used mostly for data warehousing
• Hive connectors support Java, Python, PHP, Node.js, Ruby and C++
Hive in Hadoop Ecosystem
Built-in Data Types, Operators
and Functions
• Data Types
• Primitive: boolean, tinyint, smallint, int, bigint, float, double, decimal, string, varchar, char,
binary, timestamp, date
• Complex: array, map, struct, union
• Operators
• Relational: =(==), <>(!=), <, <=, >, >=, BETWEEN, NOT BETWEEN, IS NULL, IS NOT NULL,
LIKE, NOT LIKE and RLIKE(REGEXP)
• Arithmetic: +, -, *, /, %, &, ^ and ~
• Logical: AND(&&), OR(||), NOT(!), IN, NOT IN, EXISTS and NOT EXISTS
• Functions
• String Functions, Conditional Functions, Date Functions, Mathematical Functions and
Aggregate Functions
• Plus the usual clauses: WHERE, JOIN, ORDER BY, GROUP BY and SORT BY
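The complex types and built-in operators above can be combined in a single table and query. A minimal HiveQL sketch (table and column names are illustrative, not part of the demo):

```sql
-- Hypothetical table mixing primitive and complex types
CREATE TABLE user_profile (
  id      int,
  name    string,
  emails  array<string>,
  prefs   map<string, string>,
  address struct<city:string, zip:string>
);

SELECT upper(name),          -- string function
       size(emails),         -- collection function
       prefs['theme'],       -- map element access
       address.city          -- struct field access
FROM user_profile
WHERE id BETWEEN 1 AND 100
  AND name RLIKE '^[A-Z]';   -- relational operator with a regex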
User-Defined Functions
• Three types of user-defined functions:
• UDF - User-Defined Function
• UDAF - User-Defined Aggregate Function
• UDTF - User-Defined Table-generating Function
• Writing UDFs requires Java skills
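Once a UDF is compiled into a jar, it is registered from the Hive session before use. A sketch of the registration flow (the jar path, class name and function name here are hypothetical):

```sql
-- Make the compiled jar visible to the session (path is hypothetical)
ADD JAR /tmp/my_udfs.jar;

-- Bind a SQL-callable name to the Java class implementing the UDF
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';

-- Then call it like any built-in function
SELECT my_lower(firstname) FROM employee;
```

TEMPORARY functions live only for the current session; permanent functions can be created with CREATE FUNCTION.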
Partitions and Buckets
• Partitions
• Buckets
CREATE TABLE logs (ts int, line string)
PARTITIONED BY (created date, country string);
CREATE TABLE user_bucket (id int, name string)
CLUSTERED BY (id) INTO 4 BUCKETS;
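Partition columns behave like ordinary columns in queries, and filtering on them lets Hive prune whole partition directories instead of scanning the full table. A sketch against the logs table above (the filter values are illustrative):

```sql
-- Only the matching partition directories are scanned
SELECT line
FROM logs
WHERE created = '2015-01-01'
  AND country = 'TH';

-- List the partitions Hive knows about
SHOW PARTITIONS logs;
```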
File Formats
Storage footprint relative to TEXTFILE (poor → best):
• TEXTFILE — baseline (0%)
• RCFILE — ~15% smaller
• PARQUET — ~60% smaller
• ORC — ~75% smaller
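The storage format is chosen per table at creation time; the columnar formats also accept a compression codec via table properties. A sketch (table names are illustrative):

```sql
-- Columnar ORC storage with Snappy compression
CREATE TABLE events_orc (id int, payload string)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- Parquet equivalent
CREATE TABLE events_parquet (id int, payload string)
STORED AS PARQUET;
```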
DEMO
• Input files: department.csv, employee.csv, salary.csv
• Put the files into HDFS
• Create the schema
• Run queries
DEMO
Commands
$> hadoop fs -copyFromLocal /hive_data /user/hdfs
$> CREATE TABLE employee (
id int,
firstname varchar(50),
lastname varchar(50),
dept_id int
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
$> CREATE TABLE department (
id int,
name varchar(50)
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
$> CREATE TABLE salary (
id int,
user_id int,
salary decimal(12, 2),
created timestamp
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
$> LOAD DATA INPATH '/user/hdfs/hive_data/employee.csv'
OVERWRITE INTO TABLE employee;
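Note that LOAD DATA moves the file into the table's directory as-is and does no format conversion, so loading a CSV straight into a table declared STORED AS ORC will leave unreadable data. A common workaround is a TEXTFILE staging table plus INSERT ... SELECT (the staging table name is an assumption, not from the demo):

```sql
-- Staging table matching the raw CSV layout
CREATE TABLE employee_staging (
  id int, firstname varchar(50), lastname varchar(50), dept_id int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the raw CSV here instead of into the ORC table
LOAD DATA INPATH '/user/hdfs/hive_data/employee.csv'
OVERWRITE INTO TABLE employee_staging;

-- Rewrite into the ORC table, converting formats on the way
INSERT OVERWRITE TABLE employee SELECT * FROM employee_staging;
```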
$> SELECT e.firstname, d.name, sum(s.salary)
FROM employee e
JOIN department d ON (e.dept_id = d.id)
JOIN salary s ON (e.id = s.user_id)
GROUP BY e.firstname, d.name;
$> CREATE TABLE salary_bucket(
id int,
user_id int,
salary decimal(12, 2),
created date
)
PARTITIONED BY(dt date)
CLUSTERED BY(user_id) INTO 10 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
$> INSERT OVERWRITE TABLE salary_bucket PARTITION (dt)
SELECT id, user_id, salary, created, created as dt FROM
salary;
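Dynamic partitioning (deriving the dt partition value from the SELECT, as above) is disabled under Hive's default strict mode, so this insert typically needs a few session settings first; on older Hive versions, bucketed inserts also needed bucketing enforcement enabled:

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.enforce.bucketing = true;  -- needed on Hive < 2.0; always on afterwards

INSERT OVERWRITE TABLE salary_bucket PARTITION (dt)
SELECT id, user_id, salary, created, created AS dt FROM salary;
```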
Conclusion
• Hive suits batch processing workloads such as data warehousing.
• Hive runs on top of Hadoop HDFS and MapReduce.
• Hive's user-defined functions require Java skills.
• Partitioning and bucketing help improve query
performance.
• Hive suits SQL programmers (low learning curve).
• Parquet and ORC file formats are preferred.
Q & A
SQL vs HiveQL

Feature           SQL                           HiveQL
Updates           INSERT, UPDATE, DELETE        INSERT, UPDATE, DELETE
Transactions      Supported                     Limited support
Indexes           Supported                     Supported
Joins             Supported                     Supported
Subqueries        Supported                     Supported
Views             Updatable (materialized       Read-only
                  or non-materialized)
Extension points  User-defined functions,       User-defined functions,
                  stored procedures             MapReduce scripts