2. Hive
● Hive is a data warehouse infrastructure tool.
● It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
● Hive was initially developed by Facebook.
● Later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
3. Hive is not
● A relational database.
● A design for OnLine Transaction Processing (OLTP).
● A language for real-time queries and
row-level updates.
4. Features of Hive
● It stores the schema in a database and the processed data in HDFS.
● It is designed for OLAP.
● It provides an SQL-type language for querying called HiveQL or HQL.
● It is familiar, fast, scalable, and extensible.
6. Architecture of Hive
Unit Name | Operation
User Interface | Hive is data warehouse infrastructure software that lets the user interact with HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight.
Meta Store | Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and HDFS mapping.
HiveQL Process Engine | HiveQL is similar to SQL for querying the schema information in the Metastore.
Execution Engine | The Hive Execution Engine connects the HiveQL Process Engine and MapReduce. It processes the query and generates the same results as MapReduce.
HDFS or HBASE | The Hadoop Distributed File System or HBase is the storage technique used to store data in the file system.
8. Hive - Data Types
● All the data types in Hive are classified into four types:
– Column Types
– Literals
– Null Values
– Complex Types
9. Column Types
● Column types are used as the column data types of Hive.
● Integral Types
Type | Postfix | Example
TINYINT | Y | 10Y
SMALLINT | S | 10S
INT | - | 10
BIGINT | L | 10L
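The postfix on an integral literal selects its type. As a rough illustration, the Python sketch below maps each postfix to a type and checks the value against that type's signed range (1, 2, 4 and 8 bytes). The function name `parse_integral_literal` is hypothetical; this only models the rule and is not how Hive itself parses literals.

```python
# Signed ranges for Hive's integral types (1, 2, 4 and 8 bytes).
RANGES = {
    "TINYINT": (-2**7, 2**7 - 1),
    "SMALLINT": (-2**15, 2**15 - 1),
    "INT": (-2**31, 2**31 - 1),
    "BIGINT": (-2**63, 2**63 - 1),
}
POSTFIX = {"Y": "TINYINT", "S": "SMALLINT", "L": "BIGINT"}


def parse_integral_literal(text):
    """Return (type_name, value) for a literal such as '10Y' or '10'."""
    suffix = text[-1].upper()
    if suffix in POSTFIX:
        type_name, digits = POSTFIX[suffix], text[:-1]
    else:
        type_name, digits = "INT", text  # no postfix means INT
    value = int(digits)
    low, high = RANGES[type_name]
    if not low <= value <= high:
        raise ValueError(f"{text} does not fit in {type_name}")
    return type_name, value


for literal in ["10Y", "10S", "10", "10L"]:
    print(literal, "->", parse_integral_literal(literal))
```

A literal such as 200Y is rejected by this sketch because 200 is outside the TINYINT range of -128 to 127.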
10. Column Types
● String Types
– Specified using single quotes (' ') or double quotes (" ").
– It contains two data types: VARCHAR and CHAR.
– Hive follows C-style escape characters.
Data Type | Length
VARCHAR | 1 to 65535
CHAR | 255
● Maps
– MAP<primitive_type, data_type>
11. Column Types
Timestamp | Dates | Decimals
YYYY-MM-DD HH:MM:SS.fffffffff | YYYY-MM-DD | DECIMAL(precision, scale)
java.sql.Timestamp | 1982-01-14 | decimal(10,0)
Union Types
Union is a collection of heterogeneous data types.
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
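Each value above is printed as {tag:value}, where the tag is the position of the member type inside the UNIONTYPE declaration. A minimal Python sketch of that idea follows; the Python types standing in for int, double, array and struct, and the helper name `make_union`, are assumptions of this sketch.

```python
# Member types of UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>,
# modelled with rough Python equivalents (an assumption of this sketch).
MEMBER_TYPES = [int, float, list, dict]


def make_union(tag, value):
    """Build a {tag: value} pair after checking value against member type `tag`."""
    expected = MEMBER_TYPES[tag]
    if not isinstance(value, expected):
        raise TypeError(f"tag {tag} expects {expected.__name__}")
    return {tag: value}


rows = [
    make_union(0, 1),
    make_union(1, 2.0),
    make_union(2, ["three", "four"]),
    make_union(3, {"a": 5, "b": "five"}),
]
for row in rows:
    print(row)
```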
Floating Point Types
Decimal Type
Null Value – NULL
12. Hive - Create Database
hive> show databases;
OK
default
Time taken: 13.112 seconds
hive> create database retail;
OK
Time taken: 0.113 seconds
hive> show databases;
OK
default
retail
Time taken: 0.058 seconds
13. Hive - Drop Database
hive> show databases;
OK
default
retail
userdb
Time taken: 0.058 seconds
hive> DROP DATABASE IF EXISTS userdb;
OK
Time taken: 4.841 seconds
hive> show databases;
OK
default
retail
Time taken: 0.07 seconds
hive> DROP DATABASE IF EXISTS financials CASCADE; -- CASCADE drops the tables in the database first
14. Hive - Create Table & Load Data
SNO | Field Name | Data Type
1 | Eid | int
2 | Name | String
3 | Salary | Float
4 | Designation | String
CREATE TABLE IF NOT EXISTS retail.employee (eid int, name String, salary float, designation String)
COMMENT 'Employee Details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/home/hduser/emp.txt'
> OVERWRITE INTO TABLE retail.employee;
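To make the ROW FORMAT clause concrete, the Python sketch below splits one line of a text file on the tab delimiter and converts each field to the column type from the table above. The sample line is made up (the slides do not show the contents of emp.txt), and `parse_row` is a hypothetical helper that only approximates how a delimited text table is read.

```python
# Column names and converters mirroring the retail.employee schema.
SCHEMA = [("eid", int), ("name", str), ("salary", float), ("designation", str)]


def parse_row(line, field_sep="\t"):
    """Split one text line on the field delimiter and convert each field."""
    fields = line.rstrip("\n").split(field_sep)
    row = {}
    for index, (column, convert) in enumerate(SCHEMA):
        raw = fields[index] if index < len(fields) else None
        try:
            row[column] = convert(raw) if raw is not None else None
        except ValueError:
            # A field that does not match the column type reads back as NULL.
            row[column] = None
    return row


# Made-up line standing in for one record of emp.txt.
sample = "1201\tGopal\t45000\tTechnical manager\n"
print(parse_row(sample))
```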
15. Hive - Alter Table
● ALTER TABLE name RENAME TO new_name
● ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
● ALTER TABLE name DROP [COLUMN] column_name
● ALTER TABLE name CHANGE column_name new_name new_type
● ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
hive> show tables;
OK
testtable
Time taken: 0.082 seconds
hive> ALTER TABLE testtable RENAME TO emp;
OK
Time taken: 1.837 seconds
hive> show tables;
OK
emp
Time taken: 0.08 seconds
17. Hive - Alter Table Example
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;
hive> ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');
Hive - Drop Table
DROP TABLE IF EXISTS employee;
18. Create DATABASE
hive> CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT 'STUDENT Details'
> WITH DBPROPERTIES('creator'='PRAKASH');
OK
Time taken: 0.496 seconds
hive> SHOW DATABASES;
OK
default
retail
students
Time taken: 0.086 seconds
hive>
19. Describe
DESCRIBE DATABASE STUDENTS;
OK
students STUDENT Details hdfs://localhost:54310/user/hive/warehouse/students.db
DESCRIBE DATABASE EXTENDED STUDENTS;
OK
students STUDENT Details hdfs://localhost:54310/user/hive/warehouse/students.db {creator=PRAKASH}
20. Alter and Describe
ALTER DATABASE STUDENTS SET DBPROPERTIES ('edited by' = 'SRINIDHI');
DESCRIBE DATABASE EXTENDED STUDENTS;
OK
students STUDENT Details hdfs://localhost:54310/user/hive/warehouse/students.db {edited by=SRINIDHI, creator=PRAKASH}
Time taken: 0.048 seconds
21. Tables – Managed Table
● Hive stores managed tables under its warehouse folder.
● The life cycle of the table and its data is managed by Hive.
● When the internal (managed) table is dropped, Hive drops the data as well as the metadata.
– CREATE TABLE IF NOT EXISTS STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
– DESCRIBE STUDENT;
OK
rollno int
name string
gpa float
22. External or Self-Managed Table
● When the table is dropped, it retains the data in the underlying location.
● The EXTERNAL keyword is used.
● A location needs to be specified to store the data set in that particular location.
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT, name STRING,
> gpa FLOAT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LOCATION '/STUDENT_INFO';
LOAD DATA LOCAL INPATH '/home/hduser/stu.tsv'
> OVERWRITE INTO TABLE EXT_STUDENT;
Copying data from file:/home/hduser/stu.tsv
Copying file: file:/home/hduser/stu.tsv
Loading data to table default.ext_student
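The difference between a managed and an external table shows up when the table is dropped. The toy Python model below keeps metadata and data in two dictionaries to mimic that behaviour; the class `TinyWarehouse`, the managed table path and the sample row are all made up for illustration and do not reflect Hive internals.

```python
class TinyWarehouse:
    """Toy model: `metadata` stands in for the metastore, `storage` for HDFS."""

    def __init__(self):
        self.metadata = {}  # table name -> (location, is_external)
        self.storage = {}   # location -> list of rows

    def create_table(self, name, location, external=False):
        self.metadata[name] = (location, external)
        self.storage.setdefault(location, [])

    def load(self, name, rows):
        location, _ = self.metadata[name]
        self.storage[location] = list(rows)  # OVERWRITE semantics

    def drop_table(self, name):
        location, external = self.metadata.pop(name)
        if not external:
            # Managed table: the data goes away together with the metadata.
            del self.storage[location]


wh = TinyWarehouse()
wh.create_table("student", "/user/hive/warehouse/student")
wh.create_table("ext_student", "/STUDENT_INFO", external=True)
wh.load("student", [(1001, "Prakash", 8.1)])      # made-up row
wh.load("ext_student", [(1001, "Prakash", 8.1)])  # made-up row
wh.drop_table("student")
wh.drop_table("ext_student")
print(sorted(wh.storage))  # only the external location is left
```

After both drops the metadata is empty, but the rows under /STUDENT_INFO are still there and could be picked up again by a new external table pointing at the same location.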
23. Work with Collection Data Types
1001, Prakash,BE:ME,FLA!65:CLE!76:DAA!89
1002, Ram,Btech:Mtech,FLA!35:CLE!66:DAA!54
CREATE TABLE STUDENT_INFO(rollno INT, name String, qualification ARRAY<STRING>,
> marks MAP<STRING,INT>)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> COLLECTION ITEMS TERMINATED BY ':'
> MAP KEYS TERMINATED BY '!';
LOAD DATA LOCAL INPATH '/home/hduser/studentinfo.csv'
> INTO TABLE STUDENT_INFO;
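The three delimiters work at three levels: ',' separates fields, ':' separates the items of an array or map, and '!' separates a map key from its value. A small Python sketch of that parsing, applied to the first sample record, is shown below; `parse_student` is a hypothetical helper, not part of Hive.

```python
def parse_student(line):
    """Parse one record using ',' for fields, ':' for items and '!' for map keys."""
    rollno, name, qualification, marks = line.split(",")
    return {
        "rollno": int(rollno),
        # strip() is a convenience of this sketch; Hive keeps the leading space.
        "name": name.strip(),
        "qualification": qualification.split(":"),
        "marks": {
            key: int(value)
            for key, value in (item.split("!") for item in marks.split(":"))
        },
    }


record = parse_student("1001, Prakash,BE:ME,FLA!65:CLE!76:DAA!89")
print(record["marks"]["FLA"])      # map lookup by key, like MARKS['FLA']
print(record["qualification"][0])  # array lookup by index, like QUALIFICATION[0]
```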
24. Querying Tables
SELECT * FROM EXT_STUDENT;
SELECT NAME, MARKS['FLA'] FROM STUDENT_INFO;
SELECT NAME, QUALIFICATION[0] FROM STUDENT_INFO;
25. Hive - Partitioning
● Hive organizes tables into partitions. Partitioning divides a table into related parts based on the values of partition columns such as date, city, and department. With partitions, it is easy to query a portion of the data.
● Partitions are fundamentally horizontal slices of data which allow large sets of data to be segmented into more manageable chunks.
– Assume that you are storing information about people in the entire world, spread across 196+ countries and spanning around 500 crore (5 billion) entries.
– If you want to query people from a particular country (Vatican City), in the absence of partitioning you have to scan all 500 crore entries even to fetch the thousand entries of that country.
– If you partition the table based on country, you can fine-tune the querying process by checking the data for only one country partition.
– A Hive partition creates a separate directory for each value of the partition column(s).
● Static Partition
– Column values are known at compile time.
● Dynamic Partition
– Column values are known at execution time.
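The country example can be sketched in a few lines of Python. The code below groups rows into "directories" named column=value, which is the naming Hive uses for partition directories, and then reads only the one directory a query needs. The rows and the helper names `write_partitioned` and `scan` are made up for illustration.

```python
from collections import defaultdict


def write_partitioned(rows, partition_column):
    """Group rows into 'directories' named <column>=<value>."""
    directories = defaultdict(list)
    for row in rows:
        row = dict(row)
        value = row.pop(partition_column)  # the value lives in the directory name
        directories[f"{partition_column}={value}"].append(row)
    return dict(directories)


def scan(directories, partition_column, value):
    """Read only the directory for one partition value (partition pruning)."""
    return directories.get(f"{partition_column}={value}", [])


# Made-up rows for illustration.
people = [
    {"name": "A", "country": "VA"},
    {"name": "B", "country": "IN"},
    {"name": "C", "country": "IN"},
]
layout = write_partitioned(people, "country")
print(sorted(layout))
print(scan(layout, "country", "VA"))
```

A query restricted to one country touches a single directory, however many rows the other partitions hold.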
26. Static Partition
CREATE TABLE IF NOT EXISTS STATIC_STUDENT(rollno INT, name STRING)
> PARTITIONED BY (gpa FLOAT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t';
INSERT OVERWRITE TABLE STATIC_STUDENT PARTITION (gpa = 8.1)
> SELECT ROLLNO, NAME FROM EXT_STUDENT WHERE GPA=8.1;
27. Dynamic Partition
CREATE TABLE IF NOT EXISTS DYNAMIC_STUDENT(rollno INT, name STRING)
> PARTITIONED BY (gpa FLOAT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t';
hive> SET hive.exec.dynamic.partition = true;
hive> SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE DYNAMIC_STUDENT PARTITION (gpa)
> SELECT rollno, name, gpa FROM EXT_STUDENT;
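The contrast with the static insert is where the partition value comes from. In the Python sketch below, the static variant is told the target partition by the caller, while the dynamic variant takes it from the last column of each selected row. Both helpers and the sample rows are hypothetical and only model the idea.

```python
def insert_static(table, partition_value, rows):
    """Static partition: the caller names the target partition up front."""
    table[f"gpa={partition_value}"] = [tuple(row) for row in rows]


def insert_dynamic(table, rows):
    """Dynamic partition: the last column of each row picks the partition."""
    produced = {}
    for *data, gpa in rows:
        produced.setdefault(f"gpa={gpa}", []).append(tuple(data))
    table.update(produced)  # OVERWRITE replaces only the partitions produced


# Made-up contents of EXT_STUDENT: (rollno, name, gpa).
ext_student = [(1001, "Prakash", 8.1), (1002, "Ram", 7.4), (1003, "Sita", 8.1)]

static_table = {}
insert_static(
    static_table, 8.1, [(r, n) for r, n, gpa in ext_student if gpa == 8.1]
)

dynamic_table = {}
insert_dynamic(dynamic_table, ext_student)

print(static_table)
print(dynamic_table)
```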
28. Bucketing
● Bucketing is similar to partitioning.
● With partitioning, you need to create a partition for each unique value of the column, which can lead to thousands of partitions.
● Bucketing limits that number by distributing rows into a fixed number of buckets.
● A bucket is a file.
29. Bucketing
• To create a bucketed table having 3 buckets.
CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT, name STRING, gpa FLOAT)
CLUSTERED BY (gpa) INTO 3 BUCKETS;
• Load data into the bucketed table.
FROM STUDENT INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno, name, gpa;
• To display the content of the first bucket.
SELECT DISTINCT gpa FROM STUDENT_BUCKET
TABLESAMPLE(BUCKET 1 OUT OF 3 ON gpa);
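Conceptually, a row's bucket is chosen by hashing the clustering column and taking the result modulo the number of buckets. The Python sketch below shows that rule under two simplifying assumptions: it buckets on the integer rollno rather than on gpa, and it uses the integer itself as the hash value. The hash Hive actually applies depends on the column type and Hive version, so treat this as a model of the principle only; the rows are made up.

```python
NUM_BUCKETS = 3


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Return a 1-based bucket number: hash the key, then take it modulo the count."""
    hash_value = key  # assumption: an integer key is used as its own hash
    return (hash_value & 0x7FFFFFFF) % num_buckets + 1


def write_buckets(rows, key_index=0):
    """Distribute rows into NUM_BUCKETS lists, one per bucket 'file'."""
    buckets = {n: [] for n in range(1, NUM_BUCKETS + 1)}
    for row in rows:
        buckets[bucket_for(row[key_index])].append(row)
    return buckets


# Made-up rows: (rollno, name).
rows = [(1001, "Prakash"), (1002, "Ram"), (1003, "Sita"), (1004, "Gita")]
buckets = write_buckets(rows)
for number, content in buckets.items():
    print(number, content)
```

However many distinct key values arrive, the number of buckets stays at three, which is the point of bucketing over partitioning.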
30. View
View support is available only from version 0.6 onward.
To create a view named "STUDENT_VIEW"
CREATE VIEW STUDENT_VIEW AS SELECT rollno, name FROM EXT_STUDENT;
Querying the view
SELECT * FROM STUDENT_VIEW LIMIT 4;
To drop the view
DROP VIEW STUDENT_VIEW;
31. Sub Query
LOAD DATA LOCAL INPATH '/home/hduser/Desktop/lines.txt'
OVERWRITE INTO TABLE docs;
CREATE TABLE word_count AS
> SELECT word , count(1) AS count FROM
> (SELECT explode (split (line, ' ')) AS word FROM docs) w
> GROUP BY word
> ORDER BY word;
● explode function – takes an array as input and outputs the elements of the array as separate rows.
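The same word count can be traced step by step in Python: split each line into an array, explode the arrays into one row per word, then group, count and sort. The sample lines are made up (the slides do not show lines.txt), and `explode` here is a plain Python generator that imitates the Hive function.

```python
from collections import Counter


def explode(arrays):
    """Yield every element of every array as its own row."""
    for array in arrays:
        for element in array:
            yield element


def word_count(lines):
    """Split each line on spaces, explode the words, count and sort by word."""
    words = explode(line.split(" ") for line in lines)
    return sorted(Counter(words).items())


docs = ["hive on hadoop", "hadoop stores data"]  # made-up contents of lines.txt
for word, count in word_count(docs):
    print(word, count)
```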
32. Joins
Joins in Hive are similar to SQL joins.
To create a JOIN between the Student and Department tables, we use RollNo from both tables as the join key.
1. CREATE TABLE IF NOT EXISTS STUDENT(rollno INT, name STRING, gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
2. LOAD DATA LOCAL INPATH '/home/hduser/Desktop/student.tsv' OVERWRITE INTO TABLE STUDENT;
3. CREATE TABLE IF NOT EXISTS DEPARTMENT(rollno INT, deptno INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
4. LOAD DATA LOCAL INPATH '/home/hduser/Desktop/department.tsv' OVERWRITE INTO TABLE DEPARTMENT;
5. SELECT a.rollno, a.name, a.gpa, b.deptno FROM STUDENT a JOIN DEPARTMENT b ON a.rollno = b.rollno;
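What the JOIN in step 5 returns can be modelled in Python as a lookup on rollno: every student row is paired with each department row that carries the same key, and students without a match are left out. The rows below are made up and `inner_join` is a hypothetical helper; it says nothing about how Hive executes the join.

```python
def inner_join(students, departments):
    """Pair each student with every department row that has the same rollno."""
    by_rollno = {}
    for rollno, deptno, _dept_name in departments:
        by_rollno.setdefault(rollno, []).append(deptno)
    result = []
    for rollno, name, gpa in students:
        for deptno in by_rollno.get(rollno, []):
            result.append((rollno, name, gpa, deptno))
    return result


# Made-up rows: STUDENT(rollno, name, gpa) and DEPARTMENT(rollno, deptno, name).
students = [(1001, "Prakash", 8.1), (1002, "Ram", 7.4), (1003, "Sita", 8.1)]
departments = [(1001, 10, "CSE"), (1002, 20, "ECE")]

for row in inner_join(students, departments):
    print(row)
```

Roll number 1003 has no department row, so it does not appear in the result.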
33. Aggregations
Hive supports aggregation functions such as avg and count.
To use the average and count aggregation functions:
SELECT avg(gpa) FROM STUDENT;
SELECT count(*) FROM STUDENT;
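One detail worth keeping in mind is how the two functions treat NULL: avg skips NULL values, while count(*) counts every row. The Python sketch below mirrors that behaviour on made-up rows, using None in place of NULL; the helper names are hypothetical.

```python
def avg(values):
    """Mean of the non-NULL values, or None when there are none."""
    present = [value for value in values if value is not None]
    return sum(present) / len(present) if present else None


def count_star(rows):
    """Number of rows, regardless of NULLs in any column."""
    return len(rows)


# Made-up rows: (rollno, name, gpa); the last gpa is NULL.
students = [(1001, "Prakash", 8.0), (1002, "Ram", 7.0), (1003, "Sita", None)]
print(avg([gpa for _, _, gpa in students]))
print(count_star(students))
```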