BIG DATA
PRESENTATION
GROUP - 2
APACHE HIVE
• Apache Hive is a data warehouse software project built on top
of Apache Hadoop for providing data query and analysis. Hive
gives an SQL-like interface to query data stored in various
databases and file systems that integrate with Hadoop.
• It is an open source data warehouse software for reading,
writing and managing large data set files that are stored
directly in either the Apache Hadoop Distributed File System
(HDFS) or other data storage systems such as Apache HBase.
• Functionality: SQL-like query engine designed for high
volume data stores. Multiple file-formats are supported.
• Processing Type: Batch processing using Apache Tez or
MapReduce compute frameworks.
• Latency: Medium to high, depending on the responsiveness
of the compute engine. The distributed execution model
provides superior performance compared to monolithic
query systems, like RDBMS, for the same data volumes.
• Hadoop Integration: Runs on top of Hadoop, with Apache
Tez or MapReduce for processing and HDFS or Amazon S3
for storage.
WHY HIVE?
• The motivation behind Apache Hive was to simplify query development and, in turn, open
up Hadoop's unstructured data to a wider group of users in organizations. Hive enables data
serialization/deserialization and increases flexibility in schema design by including a system
catalog called the Hive Metastore.
• Hive allows users to read, write, and manage petabytes of data using SQL. Hive is built on top
of Apache Hadoop, which is an open-source framework used to efficiently store and process
large datasets. As a result, Hive is closely integrated with Hadoop, and is designed to work
quickly on petabytes of data.
• Hive is designed for querying and managing only structured data stored in tables. Hive is
scalable, fast, and uses familiar concepts. Schema gets stored in a database, while processed
data goes into a Hadoop Distributed File System (HDFS).
HIVE DATA MODEL
Data in Hive is organized into:
• Tables
• Partitions
• Buckets
TABLES
- Analogous to relational tables
- Each table has a corresponding directory
in HDFS
- Data serialized and stored as files
within that directory
- Hive has default serialization built in
which supports compression and lazy
deserialization
- Users can specify custom serialization/
deserialization schemes (SerDes)
PARTITIONS
- Each table can be broken into partitions
- Partitions determine the distribution of data
within subdirectories
Example -
CREATE TABLE Sales (sale_id INT, amount
FLOAT)
PARTITIONED BY (country STRING, year INT,
month INT)
Each partition is stored in its own subdirectory,
e.g.
Sales/country=US/year=2012/month=12
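As a sketch of why partitioning matters (reusing the hypothetical Sales table above), a query that filters on the partition columns only scans the matching subdirectories:

```sql
-- Only the Sales/country=US/year=2012/month=12 directory is read;
-- all other partitions are pruned before any data is scanned.
SELECT sale_id, amount
FROM Sales
WHERE country = 'US' AND year = 2012 AND month = 12;
```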
BUCKETS
- Data in each partition is divided into
buckets
- Bucket assignment is based on a hash
function of a chosen column
- H(column) mod NumBuckets = bucket
number
- Each bucket is stored as a file in the
partition directory.
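A minimal sketch of declaring buckets in the table DDL, reusing the hypothetical Sales table from the previous slide; the bucketing column and the bucket count are illustrative:

```sql
-- Rows are assigned by hash(sale_id) mod 32;
-- each bucket becomes one file inside each partition directory.
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
CLUSTERED BY (sale_id) INTO 32 BUCKETS;
```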
HIVE ARCHITECTURE
HIVE CLIENTS
• Hive provides different drivers for communication with different types of
applications. For Thrift-based applications, it provides a Thrift client for
communication.
• For Java applications, it provides a JDBC driver; for other applications, an
ODBC driver. These clients and drivers in turn communicate with the Hive
server in the Hive services.
HIVE SERVICES
• Client interactions with Hive are performed through Hive
Services. If a client wants to perform any query-related
operation in Hive, it must communicate through Hive
Services.
• The CLI is the command-line interface that acts as a Hive
service for DDL (Data Definition Language) operations. All
drivers communicate with the Hive server and then with the
main driver in Hive Services, as shown in the architecture
diagram above.
• The Driver in the Hive Services is the main driver; it
communicates with all types of JDBC, ODBC, and other
client-specific applications.
• The Driver forwards requests from these applications to the
metastore and the file system for further processing.
HIVE STORAGE AND
COMPUTING
• Hive services such as the Metastore, File System,
and Job Client in turn communicate with Hive
storage and perform the following actions:
• Metadata for tables created in Hive is stored
in the Hive metastore database.
• Query results and data loaded into tables are
stored on the Hadoop cluster in HDFS.
HIVEQL
(HIVE QUERY LANGUAGE)
• Hive provides a CLI to write Hive queries using the Hive
Query Language (HiveQL). Generally, HiveQL syntax is
similar to the SQL syntax that most data analysts are
familiar with.
• Hive's SQL-inspired language shields the user from
the complexity of MapReduce programming. It reuses
familiar concepts from the relational database world,
such as tables, rows, columns, and schemas, to ease
learning.
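A minimal sketch of that familiar DDL/DML shape in HiveQL; the employee table matches the examples on the following slides, and the input file path is hypothetical:

```sql
-- Schema is recorded in the metastore; data files live in HDFS.
CREATE TABLE employee (Id INT, Name STRING, Dept STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load a delimited file into the table (path is illustrative).
LOAD DATA INPATH '/user/hive/input/employees.csv' INTO TABLE employee;
```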
SELECT-WHERE STATEMENT
• The SELECT statement is used to
retrieve data from a table.
• The WHERE clause acts as a
filter condition.
• It filters the data using the condition
and returns only the matching rows.
• Built-in operators and functions
can be combined into an expression
that evaluates the condition.
• Example:
• hive> SELECT * FROM
employee WHERE
salary > 30000;
• The above query selects all the
rows from the employee table
where salary is greater than
30000.
HIVEQL - SELECT-JOINS
• JOIN is a clause used for combining
specific fields from two tables, based
on values common to both.
• It is used to combine records from two or
more tables in the database. It is broadly
similar to SQL JOIN.
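A hedged sketch of a HiveQL join; the department table and its columns are assumed here for illustration, alongside the employee table used elsewhere in the deck:

```sql
-- Inner join: keep only employees whose Dept matches a department row.
SELECT e.Id, e.Name, d.DeptName
FROM employee e
JOIN department d ON e.Dept = d.DeptId;
```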
Select-Order By
• The ORDER BY clause is used to
retrieve the details based on one
column and sort the result set in
ascending or descending order.
• hive> SELECT Id, Name, Dept
FROM employee ORDER BY
Dept;
Select-Group By
• The GROUP BY clause is used to
group all the records in a result set
by a particular column. It is used
to query a group of records.
• hive> SELECT Dept, count(*)
FROM employee GROUP BY
Dept;
PROS AND CONS
PROS
• Framework
• Multi-user
• Data Analysis
• Storage
• Format conversion
CONS
• OLTP Processing issues
• No Updates
• Subqueries
HIVE VS PIG
1830069 – Tushar Singhal
1830095 – Hansa Maheshwari
18300112 – Rajarshi Chowdhury
1830113 – Rajat Sharma
1830118 – Ritwika Mitra
1830142 – Twinkle Sinha
