Hive - A theoretical overview in Detail.pptx

Table of Contents
• Introduction to HIVE

What is HIVE?
• It is a framework for data warehousing on top of
Hadoop.
• Hive grew from a need to manage and learn from the huge
volumes of data that Facebook was producing every day
from its burgeoning social network.
• Hive was created to make it possible for analysts with strong
SQL skills to run queries on the huge volumes of data that
Facebook stored in HDFS.

What is HIVE?
• Hive is a data warehouse infrastructure tool to process structured
data in Hadoop.
• It resides on top of Hadoop to summarize Big Data, and makes
querying and analysing easy.

What is HIVE?
• A system for querying and managing structured data built on top of Hadoop
• Uses Map-Reduce for execution
• Uses HDFS for storage, but works with any system that implements the Hadoop FS API
• Key Building Principles:
• Structured data with rich data types (structs, lists and maps)
• Directly query data from different formats (text/binary) and file formats
(Flat/Sequence)
• SQL as a familiar programming tool and for standard analytics
• Allow embedded scripts for extensibility and for non-standard applications
• Rich MetaData to allow data discovery and for optimization

What is HIVE?
• Hive is data warehouse software that facilitates reading, writing,
and managing large datasets residing in distributed storage using
SQL.
• Hive is not:
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates

Features of Hive
• Features of Hive are:
• Tools to enable easy access to data via SQL, thus enabling
data warehousing tasks such as extract/transform/load (ETL),
reporting, and data analysis.
• Apache Hive supports analysis of large datasets stored in
Hadoop's HDFS and compatible file systems such as Amazon
S3 filesystem.
• Access to files stored either directly in Apache HDFS or in
other data storage systems such as Apache HBase

Features of Hive
• It provides an SQL-like query language called HiveQL (HQL).
• HiveQL queries are implicitly converted into MapReduce, Tez, or
Spark jobs.
• Using HiveQL doesn't require knowledge of a programming
language; knowledge of basic SQL queries is enough.

Features of Hive
• Built-in user defined functions (UDFs) to manipulate dates,
strings, and other data-mining tools.
• Hive supports extending the UDF set to handle use-cases not
supported by built-in functions.
• Hive's SQL can also be extended with user code via user
defined functions (UDFs), user defined aggregates (UDAFs),
and user defined table functions (UDTFs).

Features of Hive
• Hive supports file formats including TextFile, SequenceFile,
RCFile, ORC, Avro, and Parquet, as well as LZO-compressed files.
• Operates on compressed data stored in the Hadoop
ecosystem using algorithms including DEFLATE, BWT, Snappy,
etc.

Features of Hive
• Supports external tables, which make it possible to
process data without actually storing it in HDFS.
• It stores the schema in a database and the processed data
in HDFS.
• Metadata storage in an RDBMS, significantly
reducing the time to perform semantic checks during
query execution.

Features of Hive
• It is designed for OLAP.
• It uses familiar SQL and is fast, scalable, and extensible.

Advantages
• Hive is designed to
• enable easy data summarization
• ad-hoc querying
• analysis of large volumes of data.
• Hive is built on Hadoop, so it inherits Hadoop's
capabilities, such as reliability, high performance,
and tolerance of node failures.

Advantages
• HiveQL statements are automatically translated into
MapReduce jobs
• Database developers need not learn Java
programming to write MapReduce programs for
retrieving data from a Hadoop system
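To make the translation concrete, the sketch below mimics in plain Java how a query such as SELECT word, COUNT(*) FROM docs GROUP BY word could be evaluated as a map step, a shuffle, and a reduce step. It is only an illustration under stated assumptions: the class GroupByAsMapReduce, the table name docs, and the in-memory shuffle are hypothetical, and neither Hive's planner nor the Hadoop API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not Hive's planner and not the Hadoop API.
class GroupByAsMapReduce {

    // Map phase: emit one (key, 1) pair per input row.
    static List<Map.Entry<String, Integer>> map(List<String> rows) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String row : rows) {
            pairs.add(Map.entry(row, 1));
        }
        return pairs;
    }

    // Shuffle phase: collect all emitted values under their key, in key order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values of each key, which yields COUNT(*) per group.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    // Chain the three phases, as a single MapReduce job would.
    static Map<String, Integer> countByKey(List<String> rows) {
        return reduce(shuffle(map(rows)));
    }

    public static void main(String[] args) {
        System.out.println(countByKey(List.of("hive", "pig", "hive")));
    }
}
```

With Hive, the analyst writes only the one-line query; the equivalent of the three methods above is generated automatically.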

Advantages
• High level query language - Simplifies working with
large amounts of data
• Lower learning curve than Pig or MapReduce -
HiveQL is much closer to SQL than Pig.
• Less trial and error than Pig

Disadvantages
• Hive is not designed for OLTP (online transaction processing).
• Not all 'standard' SQL is supported
• No support for inserting single rows in early releases
(INSERT ... VALUES was added in Hive 0.14)
• Updating data is complicated
• Mainly because of using HDFS
• Can add records
• Can overwrite partitions

Disadvantages
• Relatively limited number of built-in functions
• No transaction support in early releases (ACID transactions
arrived in Hive 0.13, with UPDATE and DELETE following in 0.14)
• No real-time access to data
• Use other means like HBase or Impala
• High latency

Running Hive
• Hive can be executed from:
• Hive web interface

Running Hive
• Hive shell
• $HIVE_HOME/bin/hive for interactive shell
• Or you can run queries directly:
• $HIVE_HOME/bin/hive -e 'select a.col from tab1 a'

Running Hive
• JDBC - Java Database Connectivity
• "jdbc:hive://host:port/dbname"

Running Hive
• Hive can also be used directly from Python, C, C++, and PHP.
• Java example using the JDBC driver:
// Requires "import java.sql.*;" and the Hive JDBC driver on the classpath.
// "jdbc:hive://" is the original HiveServer scheme; HiveServer2 uses "jdbc:hive2://".
Connection con =
    DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
Statement stmt = con.createStatement();
String tableName = "testHiveDriverTable";
// Drop and recreate the test table.
stmt.executeQuery("drop table " + tableName);
ResultSet res = stmt.executeQuery(
    "create table " + tableName + " (key int, value string)");
// show tables
String sql = "show tables '" + tableName + "'";
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
if (res.next()) {
    System.out.println(res.getString(1));
}

Pig Vs. Hive
• Hive is a good choice:
• if you are familiar with SQL
• when you want to query the data
• when you need an answer to a specific question
• Pig is a good choice:
• for ETL (Extract -> Transform -> Load)
• preparing your data so that it is easier to analyse
• when you have a long series of steps to perform
• Many businesses use both Pig and Hive together

Hive Vs. RDBMS
• Differences between Hive and RDBMSs (traditional relational databases).
• Examples of traditional relational databases are MySQL, PostgreSQL,
Oracle 11g, and MS SQL Server.
• Some of the key features of Hive that differ from an RDBMS:
• Hive resembles a traditional database by supporting an SQL interface, but it
is not a full database. Hive is better described as a data warehouse than as a
database.
• Hive enforces schema at read time, whereas an RDBMS enforces schema at write
time.

Hive Vs. RDBMS
• Key features of Hive that differ from RDBMS.
• In an RDBMS, a table's schema is enforced at data load time. If the
data being loaded doesn't conform to the schema, it is
rejected.
• This design is called schema on write.
• Hive doesn't verify the data when it is loaded, but rather when
a query is issued.
• This is called schema on read.

Hive Vs. RDBMS
• Schema on read makes for a very fast initial load, since the
data does not have to be read, parsed, and serialized to disk in
the database's internal format.
• The load operation is just a file copy or move.
• Schema on write makes query time performance faster, since
the database can index columns and perform compression on
the data but it takes longer to load data into the database.
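A minimal Java sketch of schema on read, under stated assumptions: the class SchemaOnRead, the comma delimiter, and the schema (key int, value string) are hypothetical, and the code is not Hive's own reader. It assumes, as with Hive's default text format, that a field which cannot be interpreted under the schema reads as NULL rather than causing the load to fail.

```java
import java.util.Arrays;

// Hypothetical illustration of schema on read; not Hive's SerDe code.
class SchemaOnRead {

    // Interpret a raw field as an int at read time. Malformed input becomes
    // null instead of being rejected when the data was loaded.
    static Integer parseIntOrNull(String field) {
        if (field == null) {
            return null;
        }
        try {
            return Integer.valueOf(field.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Apply the assumed schema (key int, value string) to one stored line.
    static Object[] readRow(String line) {
        String[] fields = line.split(",", -1);
        Integer key = parseIntOrNull(fields[0]);
        String value = fields.length > 1 ? fields[1] : null;
        return new Object[] {key, value};
    }

    public static void main(String[] args) {
        // The "load" stored these lines untouched; the schema is applied only now.
        String[] storedLines = {"1,apple", "oops,banana", "3"};
        for (String line : storedLines) {
            System.out.println(Arrays.toString(readRow(line)));
        }
    }
}
```

The load step never inspects the lines, which is why it can be a plain file copy; the cost of parsing is paid on every query instead.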

Hive Vs. RDBMS
• Hive is based on the notion of Write once, Read
many times.
• RDBMS is designed for Read and Write many times.

Hive Vs. RDBMS
• In an RDBMS, record-level updates, insertions and deletes,
transactions, and indexes are possible.
• Classic Hive does not allow these, because Hive was built to
operate over HDFS data using MapReduce, where full-table scans
are the norm and a table update is achieved by
transforming the data into a new table.

Hive Vs. RDBMS
• In an RDBMS, the maximum data size is typically in the
tens of terabytes.
• Hive can scale to hundreds of petabytes.

Hive Vs. RDBMS
• As Hadoop is a batch-oriented system, Hive doesn't support
OLTP (Online Transaction Processing). It is closer to OLAP
(Online Analytical Processing), but not ideal, since there is
significant latency between issuing a query and receiving a
reply, due to the overhead of MapReduce jobs and the size of
the data sets Hadoop was designed to serve.

Hive Vs. RDBMS
• RDBMS is best suited for dynamic data analysis and where fast
responses are expected
• Hive is suited for data warehouse applications, where relatively
static data is analyzed, fast response times are not required, and
when the data is not changing rapidly.
• To overcome the limitations of Hive, HBase is being integrated
with Hive to support record level operations and OLAP.

Hive Vs. RDBMS
• Hive scales out easily at low cost.
• An RDBMS is much less scalable, and scaling it up
is very costly.

Traditional databases Vs. Hive
Hive | Traditional Databases
SQL interface | SQL interface
Focus on batch analytics | Mostly online, interactive analytics
No transactions | Transactions are their way of life
No random inserts; updates are not natively supported (but possible) | Random inserts and updates
Distributed processing via MapReduce | Distributed processing capabilities vary
Scales to hundreds of nodes | Seldom scales beyond 20 nodes
Built for commodity hardware | Expensive, proprietary hardware
Low cost per petabyte | Does not support petabyte scale

HiveQL
• Hive query language provides the basic SQL like operations.
• These operations are:
• Ability to filter rows from a table using a where clause.
• Ability to select certain columns from the table using a select
clause.
• Ability to do equi-joins between two tables.
• Ability to evaluate aggregations on multiple "group by"
columns for the data stored in a table.

HiveQL
• These operations are:
• Ability to store the results of a query into another table.
• Ability to download the contents of a table to a local directory.
• Ability to store the results of a query in a hadoop dfs directory.
• Ability to manage tables and partitions (create, drop and
alter).
• Ability to use custom scripts in chosen language (for map /
reduce).

SQL Vs. HiveQL
Operations & Functions | SQL | Hive Query Language
Select | SQL-92 supports it. | Single table or view in the FROM clause. SORT BY is used for partial ordering. LIMIT is used to limit the number of rows returned. HAVING was not supported in early releases (it was added in Hive 0.7.0).
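SORT BY gives only a partial ordering: rows are sorted within each reducer, not across the whole result. The Java sketch below illustrates the idea under stated assumptions: the class SortBySketch is hypothetical, and it routes each row to a reducer by its value modulo the reducer count, which is a simplification of how Hive actually distributes rows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of partial ordering; not Hive's execution engine.
class SortBySketch {

    // Send each row to one of numReducers reducers, then sort each reducer's
    // rows independently. No ordering is established between reducers.
    static List<List<Integer>> sortBy(List<Integer> rows, int numReducers) {
        List<List<Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new ArrayList<>());
        }
        for (int row : rows) {
            // Assumed routing rule, chosen only to keep the example deterministic.
            reducers.get(Math.floorMod(row, numReducers)).add(row);
        }
        for (List<Integer> reducerRows : reducers) {
            Collections.sort(reducerRows);
        }
        return reducers;
    }

    public static void main(String[] args) {
        // Each inner list is sorted, but their concatenation is not.
        System.out.println(sortBy(List.of(5, 2, 4, 1, 3), 2));
    }
}
```

A total ordering (ORDER BY) needs all rows to pass through a single sort, which is why the cheaper per-reducer SORT BY exists as a separate clause.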

SQL Vs. HiveQL
Operations & Functions | SQL | Hive Query Language
Updates | UPDATE, INSERT, DELETE | INSERT OVERWRITE TABLE (populates a complete table or partition)
Data types | Integral, floating point, fixed point, text and binary strings, temporal | Integral, floating point, boolean, string, array, map, struct

SQL Vs. HiveQL
Operations & Functions | SQL | Hive Query Language
Default join types | Inner join | Equi-join
Built-in functions | Hundreds of built-in functions | Dozens of built-in functions
Multiple table inserts | Not supported in SQL | Supported in HiveQL

SQL Vs. HiveQL
Operations & Functions | SQL | Hive Query Language
Create table as select | Not valid in SQL, but may be found in some databases | Supported by HiveQL
Extension points | User-defined functions and stored procedures | User-defined functions and MapReduce scripts

SQL Vs. HiveQL
Operations & Functions | SQL | Hive Query Language
Transactions | Supported | Not supported in early releases
Indexes | Supported | Not supported
Latency | Sub-second | Minutes

Hive Data Model
• Hive structures data into well-defined database concepts:
tables, columns, rows, partitions, buckets, etc.
• Mapping to storage (from the slide diagram):
• DB and tables: HDFS directory
• Partitions: sub-directory
• Buckets: files

Hive Data Model
• Tables
• Typed columns (int, float, string, date, boolean)
• Supports array/map/struct for JSON-like data
• Partitions
• e.g., range-partition tables by date
• Buckets
• Hash partitions within ranges
• Useful for sampling and join optimization
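Bucketing assigns each row to a bucket by hashing the bucketing column and taking the result modulo the number of buckets. A minimal Java sketch, with assumptions flagged: the class BucketSketch is hypothetical and uses Java's hashCode(), whereas Hive applies its own per-type hash functions, so actual bucket numbers can differ.

```java
// Hypothetical illustration of hash bucketing; Hive's own hash functions differ per type.
class BucketSketch {

    // Map a column value to a bucket: non-negative hash modulo the bucket count.
    static int bucketFor(Object columnValue, int numBuckets) {
        return (columnValue.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        // Equal keys always land in the same bucket, which is what makes
        // bucket-wise sampling and bucketed joins possible.
        for (int userId : new int[] {7, 10, 11, 7}) {
            System.out.println("user " + userId + " -> bucket " + bucketFor(userId, numBuckets));
        }
    }
}
```

Because two tables bucketed the same way on the join key place matching keys in matching buckets, a join only has to compare corresponding buckets, and a sample can be taken by reading a single bucket.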

Metastore
• Database
• Namespace containing a set of tables
• Table
• Contains a list of columns and their types
• Partition
• Each partition can have its own columns and storage info
• Mapping to HDFS directories
• Statistics
• Info about the database

Hive Physical Layout
• Warehouse directory in HDFS
• Table row data is stored in a subdirectory of the warehouse directory
• Each partition creates a subdirectory within the table directory
• Actual data is stored in flat files
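The layout can be sketched as simple path construction. In the Java example below, the class WarehouseLayout, the table name logs, and the partition column dt are hypothetical; /user/hive/warehouse is assumed as the warehouse root (the usual default, configurable through hive.metastore.warehouse.dir), and partition directories are assumed to follow the column=value naming convention.

```java
// Hypothetical illustration of where Hive places table and partition data.
class WarehouseLayout {

    // Assumed warehouse root; real deployments may configure a different one.
    static final String WAREHOUSE = "/user/hive/warehouse";

    // Each table gets its own directory under the warehouse root.
    static String tableDir(String table) {
        return WAREHOUSE + "/" + table;
    }

    // Each partition gets a "column=value" subdirectory inside the table directory.
    static String partitionDir(String table, String column, String value) {
        return tableDir(table) + "/" + column + "=" + value;
    }

    public static void main(String[] args) {
        System.out.println(tableDir("logs"));
        System.out.println(partitionDir("logs", "dt", "2024-01-01"));
    }
}
```

A query that filters on the partition column only needs to read the matching subdirectories, which is the main performance benefit of partitioning.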
