Apache Hive
Mr. Inthra Onsap (Neng)
Agenda
• Apache Hive
• What is Hive?
• Hive in Hadoop Ecosystem
• Built-in Data Types, Operators and Functions
• User-Defined Functions
• Partitions and Buckets
• File Formats
• Demo
• Conclusion
What is Hive?
• Software that facilitates reading, writing and managing large datasets residing
in distributed storage using SQL-like syntax
• HiveQL (Hive Query Language)
• Originally developed by Facebook
• Now an Apache top-level project (Apache License)
• Not designed for online transaction processing (OLTP)
• Does not offer real-time queries
• Best suited to batch processing jobs
• Schema-on-read
• Used mostly for data warehousing
• Hive connectors support Java, Python, PHP, Node.js, Ruby and C++
Hive in Hadoop Ecosystem
Built-in Data Types, Operators
and Functions
• Data Types
• Primitive: boolean, tinyint, smallint, int, bigint, float, double, decimal, string, varchar, char,
binary, timestamp, date
• Complex: array, map, struct, union
• Operators
• Relational: =(==), <>(!=), <, <=, >, >=, BETWEEN, NOT BETWEEN, IS NULL, IS NOT NULL,
LIKE, NOT LIKE and RLIKE(REGEXP)
• Arithmetic: +, -, *, /, %, &, ^ and ~
• Logical: AND(&&), OR(||), NOT(!), IN, NOT IN, EXISTS and NOT EXISTS
• Functions
• String Functions, Conditional Functions, Date Functions, Mathematical Functions and
Aggregate Functions
• Plus the usual clauses: WHERE, JOIN, ORDER BY, GROUP BY and SORT BY
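The complex types and built-in operators above can be combined in a single table and query. A minimal HiveQL sketch (table and column names are illustrative, not part of the demo):

```sql
-- Hypothetical table mixing primitive and complex types
CREATE TABLE user_profile (
  id      int,
  name    string,
  emails  array<string>,
  prefs   map<string, string>,
  address struct<city:string, zip:string>
);

SELECT upper(name),          -- string function
       size(emails),         -- collection function
       prefs['theme'],       -- map element access
       address.city          -- struct field access
FROM user_profile
WHERE id BETWEEN 1 AND 100
  AND name RLIKE '^[A-Z]';   -- relational operator with a regex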
User-Defined Functions
• Three types of user-defined functions:
• UDF - User-Defined Function
• UDAF - User-Defined Aggregate Function
• UDTF - User-Defined Table-generating Function
• Writing UDFs requires Java skills
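Once a UDF is compiled into a jar, it is registered from the Hive session before use. A sketch of the registration flow (the jar path, class name and function name here are hypothetical):

```sql
-- Make the compiled jar visible to the session (path is hypothetical)
ADD JAR /tmp/my_udfs.jar;

-- Bind a SQL-callable name to the Java class implementing the UDF
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';

-- Then call it like any built-in function
SELECT my_lower(firstname) FROM employee;
```

TEMPORARY functions live only for the current session; permanent functions can be created with CREATE FUNCTION.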
Partitions and Buckets
• Partitions
• Buckets
CREATE TABLE logs (ts int, line string)
PARTITIONED BY (created date, country string);
CREATE TABLE user_bucket (id int, name string)
CLUSTERED BY (id) INTO 4 BUCKETS;
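Partition columns behave like ordinary columns in queries, and filtering on them lets Hive prune whole partition directories instead of scanning the full table. A sketch against the logs table above (the filter values are illustrative):

```sql
-- Only the matching partition directories are scanned
SELECT line
FROM logs
WHERE created = '2015-01-01'
  AND country = 'TH';

-- List the partitions Hive knows about
SHOW PARTITIONS logs;
```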
File Formats
Storage footprint relative to TEXTFILE (poor → best):
• TEXTFILE — baseline (0%)
• RCFILE — ~15% smaller
• PARQUET — ~60% smaller
• ORC — ~75% smaller
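The storage format is chosen per table at creation time; the columnar formats also accept a compression codec via table properties. A sketch (table names are illustrative):

```sql
-- Columnar ORC storage with Snappy compression
CREATE TABLE events_orc (id int, payload string)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- Parquet equivalent
CREATE TABLE events_parquet (id int, payload string)
STORED AS PARQUET;
```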
DEMO
• Input files: department.csv, employee.csv, salary.csv
• Put the files into HDFS
• Create the schema
• Run queries
DEMO
Commands
$> hadoop fs -copyFromLocal /hive_data /user/hdfs
$> CREATE TABLE employee (
id int,
firstname varchar(50),
lastname varchar(50),
dept_id int
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
$> CREATE TABLE department (
id int,
name varchar(50)
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
$> CREATE TABLE salary (
id int,
user_id int,
salary decimal(12, 2),
created timestamp
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
$> LOAD DATA INPATH '/user/hdfs/hive_data/employee.csv'
OVERWRITE INTO TABLE employee;
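Note that LOAD DATA moves the file into the table's directory as-is and does no format conversion, so loading a CSV straight into a table declared STORED AS ORC will leave unreadable data. A common workaround is a TEXTFILE staging table plus INSERT ... SELECT (the staging table name is an assumption, not from the demo):

```sql
-- Staging table matching the raw CSV layout
CREATE TABLE employee_staging (
  id int, firstname varchar(50), lastname varchar(50), dept_id int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the raw CSV here instead of into the ORC table
LOAD DATA INPATH '/user/hdfs/hive_data/employee.csv'
OVERWRITE INTO TABLE employee_staging;

-- Rewrite into the ORC table, converting formats on the way
INSERT OVERWRITE TABLE employee SELECT * FROM employee_staging;
```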
$> SELECT e.firstname, d.name, sum(s.salary)
FROM employee e
JOIN department d ON (e.dept_id = d.id)
JOIN salary s ON (e.id = s.user_id)
GROUP BY e.firstname, d.name;
$> CREATE TABLE salary_bucket(
id int,
user_id int,
salary decimal(12, 2),
created date
)
PARTITIONED BY(dt date)
CLUSTERED BY(user_id) INTO 10 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
$> INSERT OVERWRITE TABLE salary_bucket PARTITION (dt)
SELECT id, user_id, salary, created, created as dt FROM
salary;
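Dynamic partitioning (deriving the dt partition value from the SELECT, as above) is disabled under Hive's default strict mode, so this insert typically needs a few session settings first; on older Hive versions, bucketed inserts also needed bucketing enforcement enabled:

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.enforce.bucketing = true;  -- needed on Hive < 2.0; always on afterwards

INSERT OVERWRITE TABLE salary_bucket PARTITION (dt)
SELECT id, user_id, salary, created, created AS dt FROM salary;
```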
Conclusion
• Hive suits batch processing workloads such as data warehousing.
• Hive runs on top of Hadoop HDFS and MapReduce.
• Hive's user-defined functions require Java skills.
• Partitioning and bucketing help improve query
performance.
• Hive suits SQL programmers (low learning curve).
• Parquet and ORC file formats are preferred.
Q & A
SQL vs HiveQL

Feature           SQL                           HiveQL
Updates           INSERT, UPDATE, DELETE        INSERT, UPDATE, DELETE
Transactions      Supported                     Limited support
Indexes           Supported                     Supported
Joins             Supported                     Supported
Subqueries        Supported                     Supported
Views             Updatable (materialized       Read-only
                  or non-materialized)
Extension points  User-defined functions,       User-defined functions,
                  stored procedures             MapReduce scripts