HivePart1.pptx

Hive
● Hive is a data warehouse infrastructure tool.
● It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
● Hive was initially developed by Facebook.
● Later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.

Hive is not
● A relational database.
● A design for OnLine Transaction Processing (OLTP).
● A language for real-time queries and row-level updates.

Features of Hive
● It stores the schema in a database and the processed data in HDFS.
● It is designed for OLAP.
● It provides an SQL-type language for querying called HiveQL or HQL.
● It is familiar, fast, scalable, and extensible.

Architecture of Hive
Unit Name: Operation
● User Interface: Hive is data warehouse infrastructure software that enables interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight.
● Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and HDFS mapping.
● HiveQL Process Engine: HiveQL is similar to SQL for querying on schema info on the Metastore.
● Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce.
● HDFS or HBASE: Hadoop Distributed File System or HBASE are the data storage techniques used to store data in the file system.

Hive - Data Types
● All the data types in Hive are classified into four types:
– Column Types
– Literals
– Null Values
– Complex Types

Column Types
● Column types are used as column data types of Hive.
● Integral Types
Type      Postfix  Example
TINYINT   Y        10Y
SMALLINT  S        10S
INT       -        10
BIGINT    L        10L
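The postfix table can be read as a lookup: the postfix selects the integral type, and the type fixes the range of values that fit. The following is a minimal Python sketch of that idea, not Hive code. The helper name `parse_integral_literal` is hypothetical, and the signed widths (1, 2, 4 and 8 bytes) are stated here as an assumption because the slide does not list them.

```python
# Hypothetical helper: models how the postfix of an integral literal selects its type.
# Assumed signed widths in bytes: TINYINT 1, SMALLINT 2, INT 4, BIGINT 8.
SUFFIX_TO_TYPE = {"Y": ("TINYINT", 1), "S": ("SMALLINT", 2), "L": ("BIGINT", 8)}


def parse_integral_literal(text):
    """Return (type_name, value) for literals such as '10Y', '10S', '10', '10L'."""
    suffix = text[-1]
    if suffix in SUFFIX_TO_TYPE:
        type_name, width = SUFFIX_TO_TYPE[suffix]
        digits = text[:-1]
    else:
        type_name, width = "INT", 4  # no postfix means INT
        digits = text
    value = int(digits)
    low = -(2 ** (8 * width - 1))
    high = 2 ** (8 * width - 1) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {type_name}")
    return type_name, value


for literal in ["10Y", "10S", "10", "10L"]:
    print(literal, "->", parse_integral_literal(literal))
```

The range check shows why the postfix matters: under the assumed widths, a value such as 200 fits in SMALLINT but not in TINYINT.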

Column Types
● String Types
– Specified using single quotes (' ') or double quotes (" ").
– It contains two data types: VARCHAR and CHAR.
– Hive follows C-style escape characters.
Data Type  Length
VARCHAR    1 to 65535
CHAR       255
● Maps
– MAP<primitive_type, data_type>

Column Types
● Timestamp: YYYY-MM-DD HH:MM:SS.fffffffff (java.sql.Timestamp format)
● Dates: YYYY-MM-DD, for example 1982-01-14
● Decimals: DECIMAL(precision, scale), for example decimal(10,0)
Union Types
Union is a collection of heterogeneous data types.
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
Floating Point Types
Decimal Type
Null Value – NULL

Hive - Create Database
hive> show databases;
OK
default
Time taken: 13.112 seconds
hive> create database retail;
OK
Time taken: 0.113 seconds
hive> show databases;
OK
default
retail

Hive - Drop Database
hive> show databases;
OK
default
retail
userdb
hive> DROP DATABASE IF EXISTS userdb;
OK
hive> show databases;
OK
default
retail
hive> DROP DATABASE IF EXISTS financials CASCADE; -- CASCADE drops the tables in the database first

Hive - Create Table & Load Data
SNO  Field Name   Data Type
1    Eid          int
2    Name         String
3    Salary       Float
4    Designation  String
CREATE TABLE IF NOT EXISTS retail.employee (eid int, name String, salary float, designation String)
COMMENT 'Employee Details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/home/hduser/emp.txt'
> OVERWRITE INTO TABLE retail.employee;

Hive - Alter Table
● ALTER TABLE name RENAME TO new_name
● ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
● ALTER TABLE name DROP [COLUMN] column_name
● ALTER TABLE name CHANGE column_name new_name new_type
● ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
hive> show tables;
OK
testtable
hive> ALTER TABLE testtable RENAME TO emp;
OK
hive> show tables;
OK
emp

Hive - Alter Table Example
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;
hive> ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');

Hive - Drop Table
DROP TABLE IF EXISTS employee;

Create DATABASE
hive> CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT 'STUDENT Details'
> WITH DBPROPERTIES('creator'='PRAKASH');
OK
hive> SHOW DATABASES;
OK
default
retail
students
hive>

Describe
DESCRIBE DATABASE STUDENTS;
OK
students STUDENT Details hdfs://localhost:54310/user/hive/warehouse/students.db
DESCRIBE DATABASE EXTENDED STUDENTS;
OK
hdfs://localhost:54310/user/hive/warehouse/students.db {creator=PRAKASH}

Alter and Describe
ALTER DATABASE STUDENTS SET DBPROPERTIES ('edited by' = 'SRINIDHI');
DESCRIBE DATABASE EXTENDED STUDENTS;
OK
hdfs://localhost:54310/user/hive/warehouse/students.db {edited by=SRINIDHI, creator=PRAKASH}

Tables – Managed Table
● Hive stores managed tables under its warehouse folder.
● The life cycle of the table and its data is managed by Hive.
● When a managed (internal) table is dropped, Hive drops the data as well as the metadata.
CREATE TABLE IF NOT EXISTS STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
DESCRIBE STUDENT;
OK
rollno int
name string
gpa float

External or Self-Managed Table
● When the table is dropped, the data is retained in the underlying location.
● The EXTERNAL keyword is used.
● A location needs to be specified so that the data set is stored in that particular location.
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT, name STRING,
> gpa FLOAT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LOCATION '/STUDENT_INFO';
LOAD DATA LOCAL INPATH '/home/hduser/stu.tsv'
> OVERWRITE INTO TABLE EXT_STUDENT;
Copying data from file:/home/hduser/stu.tsv
Copying file: file:/home/hduser/stu.tsv
Loading data to table default.ext_student
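The difference between managed and external tables shows up when a table is dropped. The following is a toy Python model of that behaviour, not Hive code. The two dictionaries stand in for the metastore and the file system, and the warehouse path used for the managed table is an assumption made for illustration.

```python
# Toy model of DROP TABLE. "metastore" holds metadata, "filesystem" holds data files.
# The managed-table path below is an assumed warehouse location, used only for illustration.
metastore = {
    "student": {"external": False, "location": "/user/hive/warehouse/student"},
    "ext_student": {"external": True, "location": "/STUDENT_INFO"},
}
filesystem = {
    "/user/hive/warehouse/student": ["stu.tsv"],
    "/STUDENT_INFO": ["stu.tsv"],
}


def drop_table(name):
    """Always remove the metadata; remove the data only for a managed table."""
    table = metastore.pop(name)
    if not table["external"]:
        filesystem.pop(table["location"], None)


drop_table("student")
drop_table("ext_student")
print(sorted(metastore))   # metadata is gone for both tables
print(sorted(filesystem))  # only the external table's location still holds data
```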

Work with Collection Data Types
1001, Prakash,BE:ME,FLA!65:CLE!76:DAA!89
1002, Ram,Btech:Mtech,FLA!35:CLE!66:DAA!54
CREATE TABLE STUDENT_INFO(rollno INT, name String, qualification ARRAY<STRING>, marks MAP<STRING,INT>)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> COLLECTION ITEMS TERMINATED BY ':'
> MAP KEYS TERMINATED BY '!';
LOAD DATA LOCAL INPATH '/home/hduser/studentinfo.csv'
> INTO TABLE STUDENT_INFO;

Querying Tables
SELECT * FROM EXT_STUDENT;
SELECT NAME, MARKS['FLA'] FROM STUDENT_INFO;
SELECT NAME, QUALIFICATION[0] FROM STUDENT_INFO;
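The three delimiters declared for STUDENT_INFO decide how one text row becomes an INT, a STRING, an ARRAY and a MAP. The following is a minimal Python sketch that splits the first sample row in the same order. The function name `parse_student_row` is hypothetical, and this illustrates the delimiter roles rather than Hive's actual parser.

```python
def parse_student_row(line):
    """Split one text row into (rollno, name, qualification, marks)."""
    # FIELDS TERMINATED BY ','
    rollno, name, qualification, marks = line.split(",")
    # COLLECTION ITEMS TERMINATED BY ':' separates array elements ...
    qualification_array = qualification.split(":")
    # ... and also separates the entries of the map.
    marks_map = {}
    for entry in marks.split(":"):
        # MAP KEYS TERMINATED BY '!' separates each key from its value.
        key, value = entry.split("!")
        marks_map[key] = int(value)
    return int(rollno), name, qualification_array, marks_map


row = parse_student_row("1001, Prakash,BE:ME,FLA!65:CLE!76:DAA!89")
rollno, name, qualification, marks = row
print(name, marks["FLA"])       # comparable to SELECT NAME, MARKS['FLA']
print(name, qualification[0])   # comparable to SELECT NAME, QUALIFICATION[0]
```

In this sketch the name keeps its leading space because the field is split on the comma alone, exactly as it appears in the sample row.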

Hive - Partitioning
● Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of partition columns such as date, city, and department. With partitions, it is easy to query a portion of the data.
● Partitions are fundamentally horizontal slices of data which allow large sets of data to be segmented into more manageable chunks.
– Assume that you are storing information about people across the entire world, spread over 196+ countries and spanning around 500 crore (5 billion) entries.
– If you want to query people from a particular country (Vatican City), then in the absence of partitioning you have to scan all 500 crore entries, even to fetch the thousand entries of that country.
– If you partition the table based on country, you can fine-tune the querying process by checking the data for only one country partition.
– Hive creates a separate directory for each value of the partition column(s).
● Static Partition
– Column values are known at compile time.
● Dynamic Partition
– Column values are known at execution time.
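The country example can be made concrete by counting how many rows each approach has to read. The following is a small Python model, not Hive code. The sample rows are made up for illustration, and a dictionary keyed by `country=<value>` stands in for the per-value directories.

```python
from collections import defaultdict

# Made-up sample rows: (name, country). A real table would hold far more rows.
people = [
    ("Anna", "Vatican City"),
    ("Ravi", "India"),
    ("Meera", "India"),
    ("Luca", "Vatican City"),
    ("Ken", "Japan"),
]


def full_scan(rows, country):
    """Without partitioning: every row is read to find the matches."""
    matches = [name for name, row_country in rows if row_country == country]
    return matches, len(rows)  # (result, rows read)


def build_partitions(rows):
    """With partitioning: one directory-like entry per distinct column value."""
    partitions = defaultdict(list)
    for name, country in rows:
        partitions[f"country={country}"].append(name)
    return partitions


def partition_scan(partitions, country):
    """Only the matching partition is read; the others are skipped."""
    rows = partitions.get(f"country={country}", [])
    return list(rows), len(rows)  # (result, rows read)


partitions = build_partitions(people)
print(full_scan(people, "Vatican City"))           # reads all 5 rows
print(partition_scan(partitions, "Vatican City"))  # reads only 2 rows
```

Both calls return the same names. Only the number of rows read differs, which is the saving that partitioning is meant to provide.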

Static Partition
CREATE TABLE IF NOT EXISTS STATIC_STUDENT(rollno INT, name STRING)
> PARTITIONED BY (gpa FLOAT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t';
INSERT OVERWRITE TABLE STATIC_STUDENT PARTITION (gpa = 8.1)
> SELECT ROLLNO, NAME FROM EXT_STUDENT WHERE GPA=8.1;

Dynamic Partition
CREATE TABLE IF NOT EXISTS DYNAMIC_STUDENT(rollno INT, name STRING)
> PARTITIONED BY (gpa FLOAT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t';
hive> SET hive.exec.dynamic.partition = true;
hive> SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE DYNAMIC_STUDENT PARTITION (gpa)
> SELECT rollno, name, gpa FROM EXT_STUDENT;
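The two INSERT statements differ in where the partition value comes from. The following is a minimal Python sketch of that difference, not Hive code. The rows are made-up sample data, and the function names `insert_static` and `insert_dynamic` are hypothetical.

```python
# Made-up sample rows standing in for EXT_STUDENT: (rollno, name, gpa).
ext_student = [(1001, "Prakash", 8.1), (1002, "Ram", 7.5), (1003, "Sita", 8.1)]


def insert_static(rows, gpa):
    """Static: the partition value is written in the statement itself,
    and the WHERE clause picks the rows that belong to it."""
    selected = [(rollno, name) for rollno, name, row_gpa in rows if row_gpa == gpa]
    return {f"gpa={gpa}": selected}


def insert_dynamic(rows):
    """Dynamic: the partition value is read from the last selected column
    of each row, so the partitions are only known while the query runs."""
    partitions = {}
    for rollno, name, gpa in rows:
        partitions.setdefault(f"gpa={gpa}", []).append((rollno, name))
    return partitions


print(insert_static(ext_student, 8.1))
print(insert_dynamic(ext_student))
```

The static call produces exactly one partition, while the dynamic call produces one partition per distinct gpa value found in the data.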

Bucketing
● Bucketing is similar to partitioning.
● With partitioning you need to create a partition for each unique value of the column, which can lead to thousands of partitions.
● Bucketing limits this number by distributing the rows into a fixed number of buckets.
● A bucket is a file.

Bucketing
• To create a bucketed table having 3 buckets:
CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT, name STRING, gpa FLOAT)
CLUSTERED BY (gpa) INTO 3 BUCKETS;
• To load data into the bucketed table:
FROM STUDENT INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno, name, gpa;
• To display the content of the first bucket:
SELECT DISTINCT gpa FROM STUDENT_BUCKET
TABLESAMPLE(BUCKET 1 OUT OF 3 ON gpa);
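Bucketing assigns each row to one of a fixed number of files based on a hash of the clustering column. The following is a rough Python sketch of that idea, not Hive code. It uses Python's built-in `hash()` as a stand-in, which is an assumption and not Hive's own hash function, so the bucket numbers will not match what Hive produces. The rows are made-up sample data.

```python
NUM_BUCKETS = 3

# Made-up sample rows standing in for STUDENT: (rollno, name, gpa).
students = [
    (1001, "Prakash", 8.1),
    (1002, "Ram", 7.5),
    (1003, "Sita", 8.1),
    (1004, "Gita", 6.9),
    (1005, "Arun", 9.2),
]


def bucket_for(gpa, num_buckets=NUM_BUCKETS):
    """Pick a bucket index from the clustering column.
    hash() is a stand-in here, NOT the hash function Hive uses."""
    return hash(gpa) % num_buckets


# Each bucket models one file of the bucketed table.
buckets = {index: [] for index in range(NUM_BUCKETS)}
for row in students:
    buckets[bucket_for(row[2])].append(row)

for index, rows in buckets.items():
    print(index, rows)
```

However many distinct gpa values exist, the number of files stays at three, and rows with the same gpa always land in the same bucket.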

View
View support is available only in Hive 0.6 and later.
To create a view named "STUDENT_VIEW":
CREATE VIEW STUDENT_VIEW AS SELECT rollno, name FROM EXT_STUDENT;
To query the view:
SELECT * FROM STUDENT_VIEW LIMIT 4;
To drop the view:
DROP VIEW STUDENT_VIEW;

Sub Query
LOAD DATA LOCAL INPATH '/home/hduser/Desktop/lines.txt'
OVERWRITE INTO TABLE docs;
CREATE TABLE word_count AS
> SELECT word, count(1) AS count FROM
> (SELECT explode(split(line, ' ')) AS word FROM docs) w
> GROUP BY word
> ORDER BY word;
● explode function – takes an array as input and outputs the elements of the array as separate rows.
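The word count query works in three stages: split each line into an array, explode the array into one row per word, then group and count. The following is a minimal Python sketch of those stages, not Hive code. The two input lines are made-up sample data standing in for the docs table.

```python
from collections import Counter

# Made-up sample lines standing in for the rows of the docs table.
docs = ["hive on hadoop", "hive is not a database"]


def explode(arrays):
    """Turn a sequence of arrays into one output row per array element."""
    for array in arrays:
        for element in array:
            yield element


# Inner query: split(line, ' ') builds an array per line, explode flattens it.
words = list(explode(line.split(" ") for line in docs))

# Outer query: GROUP BY word with count(1), then ORDER BY word.
word_count = sorted(Counter(words).items())

for word, count in word_count:
    print(word, count)
```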

Joins
Joins in Hive are similar to SQL joins.
To create a JOIN between the Student and Department tables, we use rollno from both tables as the join key.
1. CREATE TABLE IF NOT EXISTS STUDENT(rollno INT, name STRING, gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
2. LOAD DATA LOCAL INPATH '/home/hduser/Desktop/student.tsv' OVERWRITE INTO TABLE STUDENT;
3. CREATE TABLE IF NOT EXISTS DEPARTMENT(rollno INT, deptno INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
4. LOAD DATA LOCAL INPATH '/home/hduser/Desktop/department.tsv' OVERWRITE INTO TABLE DEPARTMENT;
5. SELECT a.rollno, a.name, a.gpa, b.deptno FROM STUDENT a JOIN DEPARTMENT b ON a.rollno = b.rollno;
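The final SELECT keeps only the students whose rollno also appears in the Department table. The following is a small Python sketch of that inner join, not Hive code. The rows are made-up sample data, and the function name `inner_join` is hypothetical.

```python
# Made-up sample rows.
# student: (rollno, name, gpa); department: (rollno, deptno, name)
student = [(1001, "Prakash", 8.1), (1002, "Ram", 7.5), (1003, "Sita", 6.9)]
department = [(1001, 10, "CSE"), (1002, 20, "ECE")]


def inner_join(students, departments):
    """Return (rollno, name, gpa, deptno) for every pair of rows with equal rollno."""
    deptnos_by_rollno = {}
    for rollno, deptno, _dept_name in departments:
        deptnos_by_rollno.setdefault(rollno, []).append(deptno)

    result = []
    for rollno, name, gpa in students:
        # Students without a matching department row produce no output.
        for deptno in deptnos_by_rollno.get(rollno, []):
            result.append((rollno, name, gpa, deptno))
    return result


for row in inner_join(student, department):
    print(row)
```

The student with rollno 1003 has no department row in this sample, so it is absent from the result, as expected for an inner join.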

Aggregations
Hive supports aggregation functions such as avg and count.
To use the average and count aggregation functions:
SELECT avg(gpa) FROM STUDENT;
SELECT count(*) FROM STUDENT;
