3. Introduction
Hive is a data warehouse infrastructure used to process
structured data in Hadoop.
It provides an SQL-like language known as HiveQL (HQL) for
performing data analysis and queries on large datasets,
allowing users to interact with Hadoop data using familiar
SQL-like syntax.
It supports a wide range of data formats, including structured,
semi-structured, and unstructured data. Hive supports the
creation of tables, partitions, and views, enabling data
organization and management.
It provides functions, operators, and built-in transformations for
data manipulation and analysis.
Hive is widely used in big data analytics, data warehousing, and
business intelligence applications.
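As a minimal sketch of this SQL-like syntax (the table and column names below are hypothetical, not from any particular deployment), creating and querying a Hive table looks like ordinary SQL:

```sql
-- Hypothetical table of web-server hits stored in Hadoop.
CREATE TABLE page_views (
  user_id  BIGINT,
  url      STRING,
  ts       TIMESTAMP
)
STORED AS ORC;

-- Built-in functions and operators work much as in SQL.
SELECT url, COUNT(*) AS hits
FROM page_views
WHERE ts >= '2024-01-01'
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```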
4. Components of Hive
• Hive Client: the interface users employ to interact with
Hive and submit queries for data analysis and retrieval.
• Hive Services: the components that run Hive's services and
provide query-processing capabilities, including the Hive
Metastore and the Hive Server.
• Processing & Resource Management: the coordination
and allocation of computing resources, such as memory and
CPU, for executing Hive queries efficiently and optimizing
performance.
• Distributed Storage: the storage layer Hive uses to
store and manage large datasets across a cluster of
machines, typically a distributed file system such as the
Hadoop Distributed File System (HDFS).
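The client-to-service-to-metastore path above can be exercised with simple metadata commands from any Hive client; the Metastore answers these from its schema database without reading the data itself (the table name here is hypothetical):

```sql
SHOW DATABASES;
SHOW TABLES;
-- Schema, storage format, and HDFS location are served by the Metastore.
DESCRIBE FORMATTED page_views;
```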
5. Architecture
• User Interface: Hive is data warehouse infrastructure software
that mediates interaction between the user and HDFS. The user
interfaces Hive supports are the Hive Web UI, the Hive command
line, and Hive HDInsight (on Windows Server).
• Metastore: Hive uses a database server to store the schema,
or metadata, of tables, databases, columns in a table, their
data types, and their HDFS mappings.
• HiveQL Process Engine: HiveQL is an SQL-like language for
querying the schema information in the Metastore. It is one
replacement for the traditional approach of writing MapReduce
programs: instead of writing a MapReduce program in Java, the
user writes a HiveQL query, which Hive compiles into a
MapReduce job and executes.
6. Features
• SQL-like language for querying and interacting with data in Hadoop; SQL-style
queries let many Hadoop developers write queries with ease.
• Designed for data warehousing and analytics on large datasets.
• A fast SQL-like interface over data in HDFS, which helps in writing and
executing queries quickly.
• Highly scalable, easy to learn, and known to be highly extensible.
• Support for data partitioning to improve query performance.
• Built-in support for various data formats.
• Ad-hoc queries can be executed to analyze and explore data.
• Optimization and execution engines for efficient query processing.
• Scalability and fault tolerance for handling large volumes of data.
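Partitioning, one of the features above, can be sketched as follows (table and partition names are illustrative): a query that filters on the partition column reads only the matching HDFS directories instead of scanning the whole table.

```sql
CREATE TABLE sales (
  item   STRING,
  amount DOUBLE
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;

-- Only the sale_date='2024-06-01' partition is scanned (partition pruning).
SELECT item, SUM(amount)
FROM sales
WHERE sale_date = '2024-06-01'
GROUP BY item;
```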
7. Working
Step No. — Operation
1 Execute Query: The Hive interface, such as the Command Line or
Web UI, sends the query to the Driver (any database driver such as
JDBC, ODBC, etc.) to execute.
2 Get Plan: The driver takes the help of the query compiler, which
parses the query to check the syntax and to build the query plan, or
the requirement of the query.
3 Get Metadata: The compiler sends a metadata request to the
Metastore (any database).
4 Send Metadata: The Metastore sends the metadata as a response to
the compiler.
5 Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to
here, the parsing and compiling of the query is complete.
6 Execute Plan: The driver sends the execute plan to the execution engine.
7 Execute Job: Internally, the execution is a MapReduce job. The execution
engine sends the job to the JobTracker, which resides in the Name node, and the JobTracker
assigns the job to the TaskTracker, which resides in a Data node. Here, the query executes
as a MapReduce job.
7.1 Metadata Ops: Meanwhile, during execution, the execution engine can perform metadata
operations with the Metastore.
8 Fetch Result: The execution engine receives the results from the Data nodes.
9 Send Results: The execution engine sends those resultant values to the driver.
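The plan that the compiler produces in steps 2 through 5 can be inspected with EXPLAIN, which prints the planned stages without running the job (the query and table name are illustrative):

```sql
EXPLAIN
SELECT url, COUNT(*)
FROM page_views   -- hypothetical table
GROUP BY url;
```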