2. Introduction
Most data scientists use SQL queries to explore data and extract
valuable insights from it. Now that the volume of data is growing
at such a high pace, we need new dedicated tools to deal with big
volumes of data.
Initially, Hadoop came up and became one of the most popular
tools to process and store big data. But developers were
required to write complex map-reduce code to work with
Hadoop. This is where Apache Hive, developed at Facebook, came
to the rescue. It is another tool designed to work on top of
Hadoop. We can write SQL-like queries in Hive, and in the
backend it converts them into map-reduce jobs.
3. What is Apache Hive?
Apache Hive is a data warehouse system developed by Facebook to
process huge amounts of structured data in Hadoop. We know that to
process data using Hadoop, we need to write complex map-reduce
functions, which is not an easy task for most developers.
Hive makes this work very easy for us.
It uses a query language called HiveQL, which is very similar to
SQL. So now, we just have to write SQL-like commands, and in
the backend Hive will automatically convert them into map-reduce
jobs.
4. Apache Hive Architecture
Hive Clients: Hive allows us to write applications using
different types of clients, such as the Thrift server and the
JDBC driver for Java, and it also supports applications
that use the ODBC protocol.
Hive Services: As developers, if we wish to process any
data, we need to use the Hive services, such as the Hive CLI
(Command Line Interface). In addition to that, Hive also
provides a web-based interface to run Hive applications.
Hive Driver: It is capable of receiving queries from multiple
sources, like Thrift, JDBC, and ODBC, via the Hive server,
and directly from the Hive CLI and the web-based UI. After
receiving the queries, it transfers them to the compiler.
HiveQL Engine: It receives the execution plan from the driver
and converts the SQL-like query into map-reduce jobs.
Meta Store: Here Hive stores meta-information about the
databases, like the schema of each table, the data types of the
columns, the location in HDFS, etc.
HDFS: It is simply the Hadoop Distributed File System, used to
store the data.
6. Working of Apache Hive
1. In the first step, we write the query using the web interface or the command-line interface of Hive, which sends
it to the driver to execute the query.
2. In the next step, the driver sends the received query to the compiler, where the compiler verifies the syntax.
3. Once the syntax verification is done, the compiler requests metadata from the meta store.
4. Now, the meta store sends information like the database, the tables, and the data types of the columns back to
the compiler in response.
5. The compiler again checks all the requirements received from the meta store and sends the execution plan to the
driver.
6. Now, the driver sends the execution plan to the HiveQL process engine, where the engine converts the query into
a map-reduce job.
7. After the query is converted into a map-reduce job, the engine sends the task information to Hadoop, where the
processing of the query begins; at the same time, it updates the metadata about the map-reduce job in the meta
store.
8. Once the processing is done, the execution engine receives the results of the query.
9. The execution engine transfers the results back to the driver, which finally sends them to the Hive user interface,
from where we can see the results.
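The flow above applies to any HiveQL statement. As a minimal sketch (the `sales` table and its columns are hypothetical), even a simple aggregate passes through all nine steps, ending up as a map-reduce job where the map phase reads rows from HDFS and the reduce phase aggregates per group:

```sql
-- Hypothetical table: sales(region STRING, amount DOUBLE)
-- Compiled, planned via the metastore, then run as a map-reduce job
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;
```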
8. Data Types in Apache Hive
Hive data types are divided into the following 5 different categories:
Numeric Types: TINYINT, SMALLINT, INT, BIGINT
Date/Time Types: TIMESTAMP, DATE, INTERVAL
String Types: STRING, VARCHAR, CHAR
Complex Types: STRUCT, MAP, UNION, ARRAY
Misc Types: BOOLEAN, BINARY
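As an illustration of these categories (the table and column names are hypothetical), a single table definition can mix simple and complex types:

```sql
CREATE TABLE employee (
  id      INT,                             -- Numeric type
  name    STRING,                          -- String type
  joined  DATE,                            -- Date/Time type
  active  BOOLEAN,                         -- Misc type
  skills  ARRAY<STRING>,                   -- Complex type: ordered list
  phones  MAP<STRING, STRING>,             -- Complex type: key-value pairs
  address STRUCT<city:STRING, zip:STRING>  -- Complex type: named fields
);
```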
10. Create and Drop Database
Creating and dropping a database is very simple and
similar to SQL. We need to assign a unique
name to each database in Hive. If the
database already exists, Hive will raise an error;
to suppress this, you can add the
keywords IF NOT EXISTS after the DATABASE
keyword.
CREATE DATABASE <<database_name>> ;
Dropping a database is also very simple: you just
need to write DROP DATABASE and the name of the
database to be dropped. If you try to drop a database
that doesn't exist, Hive will give you a
SemanticException error.
DROP DATABASE <<database_name>> ;
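Putting both statements together with the guard keywords mentioned above (the database name is hypothetical):

```sql
-- IF NOT EXISTS suppresses the error when the database already exists
CREATE DATABASE IF NOT EXISTS sales_db;

-- IF EXISTS suppresses the SemanticException when it does not exist
DROP DATABASE IF EXISTS sales_db;
```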
11. Create Table
We use the create table statement to create a table and the
complete syntax is as follows.
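The syntax listing itself was on a slide image that did not survive; as a sketch using this document's `<<placeholder>>` convention, a typical HiveQL CREATE TABLE for delimited text data looks like this:

```sql
CREATE TABLE IF NOT EXISTS <<database_name.>><<table_name>> (
  <<column_1>> <<data_type_1>>,
  <<column_2>> <<data_type_2>>
)
ROW FORMAT DELIMITED          -- rows are plain delimited text
FIELDS TERMINATED BY ','      -- columns separated by commas
STORED AS TEXTFILE;           -- stored as a text file in HDFS
```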
12. Load Data into Table
Now that the table has been created, it's time to load
the data into it. We can load the data from a local
file on our system using the following syntax.
LOAD DATA LOCAL INPATH <<path of file on your local system>> INTO
TABLE <<database_name.>><<table_name>> ;
When we work with a huge amount of data, there is a possibility of having
unmatched data types in some of the rows. In that case, Hive will not throw
an error; instead, it will fill those fields with NULL values. This is a very useful
feature, as loading big data files into Hive is an expensive process and we
do not want to reload the entire dataset just because of a few malformed rows.
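For example (the file path and table name here are hypothetical), a local CSV file can be appended to a table, or the table's contents can be replaced by adding the OVERWRITE keyword:

```sql
-- Appends the rows from the local file to the table
LOAD DATA LOCAL INPATH '/home/user/sales.csv' INTO TABLE sales_db.sales;

-- Replaces the table's existing data instead of appending
LOAD DATA LOCAL INPATH '/home/user/sales.csv' OVERWRITE INTO TABLE sales_db.sales;
```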
13. Alter Table
In Hive, we can make multiple modifications to
existing tables, like renaming a table or adding more
columns to it. The commands to alter a
table are very similar to the SQL commands.
Here is the syntax to rename the table:
ALTER TABLE <<table_name>> RENAME TO
<<new_name>> ;
14. Syntax to add more columns to the table:
-- to add more columns
ALTER TABLE <<table_name>> ADD COLUMNS (
  new_column_name_1 data_type_1,
  new_column_name_2 data_type_2,
  . . .
  new_column_name_n data_type_n
) ;
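A concrete instance of both ALTER TABLE forms shown above (the table and column names are hypothetical):

```sql
-- Rename the table
ALTER TABLE sales RENAME TO sales_2020;

-- Add two new columns at the end of the schema
ALTER TABLE sales_2020 ADD COLUMNS (discount DOUBLE, coupon_code STRING);
```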
15. Advantages/Disadvantages of
Apache Hive
1. Uses a SQL-like query language that is already
familiar to most developers, which makes it easy
to use.
2. It is highly scalable; you can use it to process data
of any size.
3. Supports multiple databases, like MySQL, Derby,
Postgres, and Oracle, for its metastore.
4. Supports multiple data formats and also allows indexing,
partitioning, and bucketing for query optimization.
5. Can only deal with cold (historical) data and is not
suited for processing real-time data.
6. It is comparatively slower than some of its
competitors. If your use case is mostly batch
processing, though, Hive works well.