In this session you will learn:
HIVE Overview
Working of Hive
Hive Tables
Hive - Data Types
Complex Types
Hive Database
HiveQL - Select-Joins
Different Types of Join
Partitions
Buckets
Strict Mode in Hive
Like and Rlike in Hive
Hive UDF
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
Classification: Restricted
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It makes querying and analyzing easy.
You should get comfortable with HiveQL to become a successful Hadoop developer using Hive.
Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by many companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Features of Hive
• It stores schema in a database and processed data in HDFS.
• It is designed for analytical processing.
• It provides an SQL-like language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
HIVE Overview
User Interface: Hive is data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).
Metastore: Hive uses a relational database server of your choice to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine: HiveQL is an SQL-like language for querying the schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program in Java: instead, we write a query, and it is processed as a MapReduce job.
Execution Engine: The conjunction of the HiveQL process engine and MapReduce is the Hive execution engine. The execution engine processes the query and generates the same results as MapReduce would.
HDFS or HBase: The Hadoop Distributed File System or HBase is the storage layer where the data itself lives.
HIVE
The following table describes how Hive interacts with the Hadoop framework:
Step 1: Execute Query. The Hive interface, such as the command line or Web UI, sends the query to the Driver (which accepts connections via JDBC, ODBC, etc.) for execution.
Step 2: Get Plan. The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan.
Step 3: Get Metadata. The compiler sends a metadata request to the Metastore (any database).
Step 4: Send Metadata. The Metastore sends the metadata as a response to the compiler.
Step 5: Send Plan. The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
Step 6: Execute Plan. The driver sends the execution plan to the execution engine.
Step 7: Execute Job. Internally, the execution is a MapReduce job. The execution engine sends the job to the JobTracker (on the NameNode), which assigns it to TaskTrackers (on the DataNodes). Here, the query runs as a MapReduce job.
Step 8: Fetch Result. The execution engine receives the results from the DataNodes.
Step 9: Send Results. The execution engine sends those resultant values to the driver.
Step 10: Send Results. The driver sends the results to the Hive interfaces.
The Hive metastore service stores the metadata for Hive tables and
partitions in a relational database, and provides clients (including Hive)
access to this information via the metastore service API.
HIVE
For an internal (managed) table:
CREATE TABLE internal1 (col1 string);
Hive multi-table insert: insert data into multiple Hive tables from a single scan of the source table:
FROM sethu
INSERT OVERWRITE TABLE tab1
  SELECT sethu.column_one, sethu.column_two
INSERT OVERWRITE TABLE table_two
  SELECT sethu.column_two;
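Conceptually, a multi-table insert scans the source once and feeds every INSERT clause from that single pass. A rough Python sketch of that behaviour (the row values here are made up purely for illustration):

```python
# Simulate Hive's multi-table INSERT: one scan of the source table,
# with each row routed to every target table's SELECT list.
source_rows = [("a", 1), ("b", 2), ("c", 3)]  # (column_one, column_two)

table_one, table_two = [], []
for row in source_rows:                  # single pass over the source
    table_one.append((row[0], row[1]))   # SELECT column_one, column_two
    table_two.append((row[1],))          # SELECT column_two

print(table_one)  # [('a', 1), ('b', 2), ('c', 3)]
print(table_two)  # [(1,), (2,), (3,)]
```

The point is that the source is read only once, no matter how many targets are filled.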
Hive Tables
This chapter takes you through the different data types in Hive, which are involved in table creation. All the data types in Hive are classified as follows:
PRIMITIVE TYPES:
Integral Types
Integer data can be specified using the integral data types, primarily INT. When the data range exceeds the range of INT, use BIGINT; when it is smaller than INT's, you can use SMALLINT. TINYINT is smaller still than SMALLINT.
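These types have fixed widths (TINYINT is 1 byte, SMALLINT 2, INT 4, BIGINT 8), so each type's range follows directly from two's-complement arithmetic. A quick Python check:

```python
def int_range(num_bytes):
    """Two's-complement range for a signed integer of the given width."""
    bits = 8 * num_bytes
    return (-2 ** (bits - 1), 2 ** (bits - 1) - 1)

# Hive integral types and their storage widths in bytes
for name, size in [("TINYINT", 1), ("SMALLINT", 2), ("INT", 4), ("BIGINT", 8)]:
    print(name, int_range(size))

print(int_range(4))  # (-2147483648, 2147483647) -- INT covers about +/- 2.1 billion
```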
Floating Point Types
Floating point types are numbers with decimal points. Generally, this type of data uses the DOUBLE data type.
Dates
DATE values are described in year/month/day format, in the form YYYY-MM-DD.
Boolean Type
BOOLEAN: TRUE/FALSE
Hive - Data Types
String Types
String literals can be specified using single quotes (' ') or double quotes (" "). Hive provides two sized string data types, VARCHAR and CHAR, and follows C-style escape characters.
Decimals
The DECIMAL type in Hive is similar to Java's BigDecimal format. It is used to represent immutable arbitrary-precision numbers. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Precision is the total number of digits in a number. Scale is the number of digits to the right of the decimal point. For example, the number 123.45 has a precision of 5 and a scale of 2.
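Precision and scale can be read straight off the digit string; a small Python helper makes the 123.45 example concrete:

```python
def precision_scale(number_text):
    """Count total digits (precision) and digits after the point (scale)."""
    parts = number_text.lstrip("-").split(".")
    scale = len(parts[1]) if len(parts) > 1 else 0
    precision = len(parts[0]) + scale
    return precision, scale

print(precision_scale("123.45"))  # (5, 2) -> fits in DECIMAL(5,2)
print(precision_scale("10"))      # (2, 0) -> fits in DECIMAL(2,0)
```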
Hive - Data Types
Map<K,V>
Type parameters:
K: the type of keys maintained by this map
V: the type of mapped values
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
CREATE TABLE page_view(viewTime INT, userid BIGINT,
  page_url STRING, referrer_url STRING,
  friends ARRAY<BIGINT>, properties MAP<STRING, STRING>, ...)
Complex Types
Hive is a database technology that can define databases and tables to
analyze structured data. The theme for structured data analysis is to store
the data in a tabular manner, and pass queries to analyze it.
Create Database is a statement used to create a database in Hive. A database in Hive is a namespace or a collection of tables. The syntax for this statement is as follows:
CREATE DATABASE <database_name>;
The following query is executed to create a database named userdb:
hive> CREATE DATABASE userdb;
The following query lists the existing databases:
hive> SHOW DATABASES;
The following queries are used to drop a database. Let us assume that the
database name is userdb.
hive> DROP DATABASE userdb;
Hive Database
Create Table Statement
Create Table is a statement used to create a table in Hive. An example is as follows, including an optional comment and row-format clauses (field terminator, line terminator, and stored file type):
hive> CREATE TABLE employee ( eid int, name String,
  salary String, destination String)
  COMMENT 'Employee details'
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';
Hive Database
create table road (id int, name VARCHAR(20), des string, year int) row format delimited fields terminated by ',';
Then move the data file into Hadoop using:
hadoop fs -put /home/mishra/Desktop/hive /destination
After creating the table and defining the schema, the next job is to load data into Hive, which is done by:
load data inpath '/pnt' into table road;
Inserting individual rows is a heavyweight operation in Hive and is generally not done, but to insert an ad-hoc row such as (12, "xyz", "hr"), do this:
insert into table road select * from (select 12, "xyz", "hr") a;
Alter Table Statement
It is used to alter a table in Hive.
The following query renames the table from employee to emp.
hive> ALTER TABLE employee RENAME TO emp;
Change Statement
The following query renames the column name to ename, keeping its String data type:
hive> ALTER TABLE employee CHANGE name ename String;
Add Columns Statement
The following query adds a column named dept to the employee table:
hive> ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');
Hive Database
Drop Table Statement
The syntax is as follows:
DROP TABLE table_name;
The following query drops a table named employee:
hive> DROP TABLE employee;
Hive Database
You can save any result set data as a view. The usage of a view in Hive is the same as that of a view in SQL.
A view is nothing more than a statement that is stored in the database with an associated name. It can summarize data from various tables and be used to generate reports.
Creating Views:
Database views are created using the CREATE VIEW statement.
The basic CREATE VIEW syntax is as follows:
CREATE VIEW view_name AS
SELECT column1, column2, ...
FROM table_name
WHERE [condition];
Example:
Hive Database
Consider the CUSTOMERS table having the following records:
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
Now, following is the example to create a view from the CUSTOMERS table. This view selects the customer name and age from the CUSTOMERS table:
hive > CREATE VIEW CUSTOMERS_VIEW AS
SELECT name, age
FROM CUSTOMERS;
Now, you can query CUSTOMERS_VIEW in similar way as you query an actual table.
Following is the example:
Hive Database
hive > SELECT * FROM CUSTOMERS_VIEW;
This would produce the following result:
+----------+-----+
| name | age |
+----------+-----+
| Ramesh | 32 |
| Khilan | 25 |
| kaushik | 23 |
| Chaitali | 25 |
| Hardik | 27 |
| Komal | 22 |
| Muffy | 24 |
+----------+-----+
Dropping a View
Use the following syntax to drop a view:
DROP VIEW view_name;
Note that a Hive view is read-only: you cannot DELETE from it. Deleting rows (for example, the record having AGE = 22) must be done on the underlying table, and in Hive DELETE is supported only on ACID (transactional) tables.
Hive Database
JOIN is a clause that is used for combining specific fields from two tables by using
values common to each one. It is used to combine records from two or more tables in
the database. It is more or less similar to SQL JOIN.
Syntax
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference
join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition]
Example
We will use the following two tables in this chapter. Consider the following table named CUSTOMERS.
HiveQL - Select-Joins
Create a file for the ORDERS data on the desktop and paste:
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
Create a file for the CUSTOMERS data on the desktop and paste:
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
Put the data in HDFS:
hadoop fs -put /home/mishra/Desktop/untti /doc
hadoop fs -put /home/mishra/Desktop/delhi /docum
Now create the tables for them in Hive:
create table CUSTOMERS (ID int, NAME string, AGE int, ADDRESS string, SALARY string) row format delimited fields terminated by ',';
HiveQL - Select-Joins
create table ORDERS (OID int, date string, CUSTOMER_ID int, AMOUNT string) row format delimited fields terminated by ',';
Now load the data into the Hive tables:
load data inpath '/doc' into table ORDERS;
load data inpath '/docum' into table CUSTOMERS;
There are different types of joins given as follows:
1. JOIN
2. LEFT OUTER JOIN
3. RIGHT OUTER JOIN
4. FULL OUTER JOIN
JOIN
The JOIN creates a new result table by combining column values of two tables (table1
and table2) based upon the join-predicate. The query compares each row of table1 with
each row of table2 to find all pairs of rows which satisfy the join-predicate. When the
join-predicate is satisfied, column values for each matched pair of rows of A and B are
combined into a result row.
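The compare-each-row-pair description above is exactly a nested-loop join. A Python sketch over the sample CUSTOMERS and ORDERS rows (keeping only ID, NAME, AGE and AMOUNT):

```python
# (ID, NAME, AGE) and (OID, DATE, CUSTOMER_ID, AMOUNT) rows from the example
customers = [(1, "Ramesh", 32), (2, "Khilan", 25),
             (3, "kaushik", 23), (4, "Chaitali", 25)]
orders = [(102, "2009-10-08 00:00:00", 3, 3000),
          (100, "2009-10-08 00:00:00", 3, 1500),
          (101, "2009-11-20 00:00:00", 2, 1560),
          (103, "2008-05-20 00:00:00", 4, 2060)]

# JOIN: keep only row pairs satisfying the predicate c.ID = o.CUSTOMER_ID
result = [(c[0], c[1], c[2], o[3])
          for c in customers
          for o in orders
          if c[0] == o[2]]
print(result)
```

Ramesh (ID 1) has no orders, so he produces no result row, matching the JOIN output on the next slide.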
Different Types of Join
The following query executes a JOIN on the CUSTOMERS and ORDERS tables, and retrieves the matching records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
or
SELECT ID, NAME, AMOUNT, DATE
FROM CUSTOMERS
INNER JOIN ORDERS
ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID;
On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+
Different Types of Join
The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no
matches in the right table. This means, if the ON clause matches 0 (zero) records in the right
table, the JOIN still returns a row in the result, but with NULL in each column from the right
table.
A LEFT JOIN returns all the values from the left table, plus the matched values from the right
table, or NULL in case of no matching JOIN predicate.
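This NULL-padding behaviour can be sketched the same way: emit the matched pairs, and emit the left row padded with None when no order matches (a simplified subset of the sample rows is used here):

```python
customers = [(1, "Ramesh"), (2, "Khilan"), (3, "kaushik"), (5, "Hardik")]
orders = [(101, "2009-11-20 00:00:00", 2, 1560),
          (102, "2009-10-08 00:00:00", 3, 3000)]

result = []
for cid, name in customers:
    matches = [o for o in orders if o[2] == cid]
    if matches:
        result.extend((cid, name, o[3], o[1]) for o in matches)
    else:
        # no matching join predicate: right-side columns become NULL
        result.append((cid, name, None, None))
print(result)
```

Every left-table row appears exactly once (or once per match), which is why Ramesh and Hardik show up with NULL amounts in the output below.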
The following query demonstrates a LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
or (the regular SQL style):
SELECT ID, NAME, AMOUNT, DATE
FROM CUSTOMERS
LEFT JOIN ORDERS
ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID;
Different Types of Join
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+
RIGHT OUTER JOIN
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are
no matches in the left table. If the ON clause matches 0 (zero) records in the left table,
the JOIN still returns a row in the result, but with NULL in each column from the left table.
Different Types of Join
A RIGHT JOIN returns all the values from the right table, plus the matched values from
the left table, or NULL in case of no matching join predicate.
The following query demonstrates a RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
or
SELECT ID, NAME, AMOUNT, DATE
FROM CUSTOMERS
RIGHT JOIN ORDERS
ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID;
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
Different Types of Join
The HiveQL FULL OUTER JOIN combines the records of both the left and the right
outer tables that fulfil the JOIN condition. The joined table contains either all the
records from both the tables, or fills in NULL values for missing matches on either
side.
The following query demonstrates a FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
or
SELECT ID, NAME, AMOUNT, DATE
FROM CUSTOMERS
FULL JOIN ORDERS
ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID;
Different Types of Join
GROUP BY
The following query retrieves the number of employees in each department:
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;
ORDER BY
SELECT * FROM CUSTOMERS ORDER BY NAME;
Following is an example, which sorts the result in descending order by NAME:
SELECT * FROM CUSTOMERS ORDER BY NAME DESC;
GROUP BY and ORDER BY
Hive is a good tool for performing queries on large datasets, especially datasets that
require full table scans. But quite often there are instances where users need to filter
the data on specific column values. Generally, Hive users know about the domain of the
data that they deal with. With this knowledge they can identify common columns that
are frequently queried in order to identify columns with low cardinality which can be
used to organize data using the partitioning feature of Hive.
In non-partitioned tables, Hive would have to read all the files in a table’s data directory
and subsequently apply filters on it. This is slow and expensive—especially in cases of
large tables.
Partitions are essentially slices of data which allow larger sets of data to be separated
into more manageable chunks.
When a partitioned table is queried with one or both partition columns in criteria or in
the WHERE clause, what Hive effectively does is partition elimination by scanning only
those data directories that are needed. If no partitioned columns are used, then all the
directories are scanned (full table scan) and partitioning will not have any effect.
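Partition elimination amounts to filtering directory names before reading any data. A hypothetical sketch (the table name and paths are made up for illustration, following Hive's `column=value` directory naming):

```python
# Hypothetical partition directories for a table partitioned by day
partition_dirs = ["/user/hive/warehouse/logs/day=mon",
                  "/user/hive/warehouse/logs/day=tue",
                  "/user/hive/warehouse/logs/day=wed"]

def prune(dirs, column, value):
    """Keep only directories whose partition spec matches the WHERE filter."""
    wanted = f"{column}={value}"
    return [d for d in dirs if d.rsplit("/", 1)[-1] == wanted]

# WHERE day = 'tue' scans one directory instead of three
print(prune(partition_dirs, "day", "tue"))
```

A query with no partition-column filter would keep all three directories, i.e. a full table scan.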
Partitions
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partition columns such as date, city, and department. Using partitions, it is easy to query a portion of the data.
How to create a partitioned table:
create table anand(url string, page string) partitioned by (day string);
How to load data into a partition:
load data local inpath '/home/andy1/Desktop/1234.txt' into table anand
partition(day='tue');
The partition directories can be viewed under /user/hive/warehouse/anand
hive> select * from anand where day='tue';
Partitions in Hive
Tables or partitions are subdivided into buckets to provide extra structure to the data, which may be used for more efficient querying. Bucketing works based on the value of a hash function of some column of the table.
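Bucket assignment is just hash(column) mod number-of-buckets. A simplified Python sketch for an INT column (Hive hashes an int to the value itself; the real implementation also masks the sign bit, which this sketch ignores by using non-negative ids):

```python
def bucket_for(value, num_buckets):
    """Bucket index for a non-negative INT column value."""
    return value % num_buckets  # hash(int) == int in Hive's scheme

# With 4 buckets, ids 0..7 cycle through buckets 0..3
print([bucket_for(i, 4) for i in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Rows with the same id always land in the same bucket file, which is what makes bucketed sampling and bucketed map joins possible.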
How to create buckets:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Creating a sorted bucketed table:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
Buckets
If your partitioned table is very large, you can block full-table-scan queries by putting Hive into strict mode. In this mode, when a user submits a query that would result in a full table scan (i.e. a query without a filter on any partition column), an error is issued.
You can set strict mode in Hive with:
set hive.mapred.mode=strict;
Now query the data we partitioned earlier on a non-partition column:
select * from logs where line='wes';
This shows a semantic error. But if you query on the column you partitioned by, it shows the desired output.
You can undo strict mode with:
set hive.mapred.mode=nonstrict;
The same query now gives output such as:
2344 Wes 25feb india
Strict Mode in Hive
LIKE in Hive:
a LIKE b compares the string in column a against the pattern in column b.
create database andy;
use andy;
create table rat(id int, dep string, des string) row format delimited fields terminated by ',';
load data local inpath '/home/mishra/Desktop/naya' into table rat;
select * from rat;
SELECT * FROM rat WHERE des LIKE dep;
This command returns the rows where the string in des matches the pattern in dep.
RLIKE in Hive:
A RLIKE B is true if any substring of A matches the regular expression B, otherwise false.
Suppose the data is:
id dep des
1 hr hr
2 hr man
3 peon staff
NULL NULL NULL
1 hr shr
2 hman man
Like and Rlike in Hive
SELECT * FROM rat WHERE des RLIKE dep;   -- rows where dep occurs inside des
Output:
1 hr hr
1 hr shr
SELECT * FROM rat WHERE dep RLIKE des;   -- rows where des occurs inside dep
Output:
1 hr hr
2 hman man
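RLIKE is a regular-expression match, so A RLIKE B corresponds to Python's `re.search(B, A)`. Replaying the sample rows confirms both query outputs:

```python
import re

rows = [(1, "hr", "hr"), (2, "hr", "man"), (3, "peon", "staff"),
        (1, "hr", "shr"), (2, "hman", "man")]  # (id, dep, des); NULL row omitted

def rlike(a, b):
    """A RLIKE B: true if the regex B matches anywhere inside A."""
    return a is not None and b is not None and re.search(b, a) is not None

match_des = [r for r in rows if rlike(r[2], r[1])]  # des RLIKE dep
match_dep = [r for r in rows if rlike(r[1], r[2])]  # dep RLIKE des
print(match_des)  # [(1, 'hr', 'hr'), (1, 'hr', 'shr')]
print(match_dep)  # [(1, 'hr', 'hr'), (2, 'hman', 'man')]
```

The NULL row is omitted because RLIKE on NULL yields NULL, so it never appears in either result.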
Like and Rlike in Hive
Hive UDF
Hive has built-in functions, such as LIKE and RLIKE, which we can use in our Hive programs without adding any extra code. But sometimes a user's requirement is not covered by the built-in functions; in that case the user can write a custom user-defined function, or UDF.
The process is:
Open Eclipse and create a package named xyz.
Save the class with the name ToUpper.java.
Paste the following user-defined code in it:
package xyz;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class ToUpper extends UDF {
  public Text evaluate(Text s) {
    if (s == null) {
      return null;
    }
    return new Text(s.toString().toUpperCase());
  }
}
Add the external jars to your Eclipse project. The two most important jars are the hadoop-common jar, found in the common folder under /usr/local/work/hadoop/share/hadoop/common, and the hive-exec jar, found in the lib folder under /usr/local/work/hive/lib.
Now add your jar in Hive using the add jar command:
hive> add jar /home/ands/Desktop/hiveudf.jar;
Create a temporary function, using CREATE TEMPORARY FUNCTION, with the name by which you want to run your UDF:
hive> create temporary function toupper as 'xyz.ToUpper';
hive> create table anda (name string, age int) row format delimited fields terminated by ',';
load data local inpath '/home/ands/Desktop/expudf' into table anda;
select toupper(name) from anda;
Hive UDF