Hadoop Hive

Hadoop Developer Training
An Introduction to Hive
Madhur Nawandar
madhur.nawandar@clogeny.com
Cloud Computing, Enterprise Applications, Big Data, Storage, DevOps

Clogeny Technologies http://www.clogeny.com
(US) 408-556-9645
(India) +91 20 661 43 482
What is Hive
A data warehousing infrastructure built on Hadoop
Provides easy data summarization
Provides ad-hoc querying and analysis of large volumes of data
Comes with HiveQL, a query language based on SQL
Allows custom mappers and reducers to be plugged in

What Hive is NOT
Not suitable for small datasets due to high latency
Cannot be compared to systems like Oracle
Does not offer real-time queries or row-level updates

Hive Architecture

Data Models
Tables
• Made up of actual data and the associated metadata
• Actual data is stored in any Hadoop Filesystem
• Metadata is always stored in a relational database
• Managed Tables
  - Hive moves the data into its warehouse directory
  - CREATE TABLE managed_table (dummy STRING);
  - LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
• External Tables
  - Hive refers to the data at its existing location
  - CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/tom/external_table';
  - LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;

Data Models
Partitions
• A way of dividing a table into coarse-grained parts
• Based on the value of a partition column
• Supports multiple dimensions (more than one partition column)
• Defined at table creation time using the PARTITIONED BY clause
• At the filesystem level, partitions are simply nested subdirectories of the table directory

Data Models
• CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
• LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
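To make the "nested subdirectories" point concrete, here is a small Python sketch of how a partition specification could map to a directory path. It is an illustration only, not Hive code: the helper name partition_dir is made up for this example, and the warehouse root is assumed to be /user/hive/warehouse (the real value comes from hive.metastore.warehouse.dir).

```python
from pathlib import PurePosixPath

# Assumed warehouse root; in a real deployment this is whatever
# hive.metastore.warehouse.dir is set to.
WAREHOUSE_DIR = PurePosixPath("/user/hive/warehouse")


def partition_dir(table, partition_spec):
    """Return the directory that would hold one partition of a managed table.

    partition_spec is an ordered list of (column, value) pairs, in the same
    order as the columns in the PARTITIONED BY clause.
    """
    path = WAREHOUSE_DIR / table
    for column, value in partition_spec:
        # Each partition column adds one directory level named column=value.
        path = path / f"{column}={value}"
    return path


print(partition_dir("logs", [("dt", "2001-01-01"), ("country", "GB")]))
# prints /user/hive/warehouse/logs/dt=2001-01-01/country=GB
```

Under these assumptions, a file loaded into that partition would sit inside the country=GB directory, one level below dt=2001-01-01.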

Data Models
Buckets
• Divide a table (or partition) into a fixed number of buckets based on the hash of a column value
• Enable more efficient queries
• Make sampling more efficient
  - CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;
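As a rough illustration of how rows might be spread across the 4 buckets, the Python sketch below hashes the id and takes the remainder modulo the bucket count. It assumes that the hash of an INT column is the integer value itself; the function name bucket_for and the sample ids are hypothetical, and this is a simplification rather than Hive's actual implementation.

```python
NUM_BUCKETS = 4  # matches INTO 4 BUCKETS in the table definition


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Return the 0-based bucket index for an integer id.

    Assumption: the hash of an INT is the value itself, masked to be
    non-negative, then reduced modulo the number of buckets.
    """
    return (user_id & 0x7FFFFFFF) % num_buckets


# Hypothetical ids, grouped by the bucket each one would land in.
sample_ids = [0, 2, 3, 4, 6]
buckets = {}
for uid in sample_ids:
    buckets.setdefault(bucket_for(uid), []).append(uid)

print(buckets)
# prints {0: [0, 4], 2: [2, 6], 3: [3]}
```

Because rows with the same id always land in the same bucket, a sample or join on id only needs to read the matching buckets instead of the whole table.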

Column Data Types
Primitive
TYPE       DESCRIPTION                                      EXAMPLE
TINYINT    8-bit signed integer                             1
SMALLINT   16-bit signed integer                            1
INT        32-bit signed integer                            1
BIGINT     64-bit signed integer                            1
FLOAT      32-bit single-precision floating-point number    1.0
DOUBLE     64-bit double-precision floating-point number    1.0
BOOLEAN    true/false value                                 TRUE
STRING     Character string                                 'a', "a"
TIMESTAMP  Timestamp with nanosecond precision              '2012-01-02 03:04:05.123456789'

Column Data Types
Complex
TYPE    DESCRIPTION                                              EXAMPLE
ARRAY   An ordered collection of fields. The fields must all     array(1, 2)
        be of the same type
MAP     An unordered collection of key-value pairs. Keys must    map('a', 1, 'b', 2)
        be primitives; values may be any type. For a
        particular map, the keys must all be the same type,
        and the values must all be the same type
STRUCT  A collection of named fields. The fields may be of       struct('a', 1, 1.0)
        different types

Metastore
A central repository of Hive metadata
Consists of two parts:
• Metastore service
• Backing store for the metadata

Metastore deployment modes
1: Embedded Mode
This is the default metastore deployment mode for CDH. In this mode the metastore uses a Derby database.
Both the database and the metastore service run embedded in the main HiveServer process, and both are started for you when you start the HiveServer process.
This mode requires the least effort to configure, but it can support only one active user at a time and is not certified for production use.

2: Local Mode
In this mode the Hive metastore service runs in the same process as the
main HiveServer process, but the metastore database runs in a separate
process, and can be on a separate host.
The embedded metastore service communicates with the metastore
database over JDBC.

3: Remote Mode
In this mode the Hive metastore service runs in its own JVM process; other processes
communicate with it via the Thrift network API (configured via the hive.metastore.uris
property).
The metastore service communicates with the metastore database over JDBC (configured
via the javax.jdo.option.ConnectionURL property).

Metastore Properties
Property Name                          Type                  Description
hive.metastore.warehouse.dir           URI                   The directory in HDFS where managed tables are stored
hive.metastore.local                   Boolean               Flag selecting an embedded or local metastore rather than a remote one
hive.metastore.uris                    Comma-separated URIs  List of remote metastore URIs
javax.jdo.option.ConnectionURL         URI                   The JDBC URL of the metastore database
javax.jdo.option.ConnectionDriverName  String                The JDBC driver class name
javax.jdo.option.ConnectionUserName    String                The JDBC username
javax.jdo.option.ConnectionPassword    String                The JDBC password

Hive Packages
The following packages are available for Hive:
hive – base package that provides the complete language and runtime (required)
hive-metastore – provides scripts for running the metastore as a standalone service (optional)
hive-server – provides scripts for running the original HiveServer as a standalone service (optional)
hive-server2 – provides scripts for running the new HiveServer2 as a standalone service (optional)

Comparison with Traditional Databases
Schema on Read Versus Schema on Write
• In a traditional database, a table’s schema is enforced at
data load time
• If the data being loaded doesn’t conform to the schema,
then it is rejected
• Hive, on the other hand, doesn’t verify the data when it is
loaded, but rather when a query is issued
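The difference can be sketched in a few lines of Python. This is a toy model, not Hive code: the sample lines and the helper names parse_int and query are hypothetical, and None stands in for the NULL that a field would surface as when it cannot be interpreted under the declared type at query time.

```python
def parse_int(field):
    """Interpret a raw text field as an INT at read time; None plays the role of NULL."""
    try:
        return int(field)
    except ValueError:
        return None


# "Loading" under schema on read: the raw lines are stored as-is,
# with no validation against the (id INT, name STRING) schema.
raw_lines = ["1\tapple", "oops\tbanana", "3\tcherry"]  # hypothetical data


def query(lines):
    """Apply the schema only when the data is read."""
    for line in lines:
        raw_id, name = line.split("\t")
        yield parse_int(raw_id), name


rows = list(query(raw_lines))
print(rows)
# prints [(1, 'apple'), (None, 'banana'), (3, 'cherry')]
```

A schema-on-write system would instead reject the second line at load time, which makes loads slower but lets queries assume clean data.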
Updates, Transactions, and Indexes
• Updates, transactions, and indexes are mainstays of
traditional databases.
• Until recently, these features have not been considered a
part of Hive’s feature set

Installing Hive
We will install Hive with the metastore running as a standalone service
To do this, install the hive and hive-metastore packages:
$ yum -y install hive hive-metastore

Hive Configuration
Default configuration is in
• /etc/hive/conf/hive-default.xml
(Re)define properties in
• /etc/hive/conf/hive-site.xml
Use $HIVE_CONF_DIR to specify an alternate configuration directory
You can override Hadoop configuration properties in Hive's configuration
• e.g. mapred.reduce.tasks=1

Configure Metastore database
Step 1: Install and start MySQL if you have not
already done so
• $ yum install mysql-server
Step 2: Configure the MySQL Service and
Connector
• $ yum install mysql-connector-java
• $ ln -s /usr/share/java/mysql-connector-java-5.1.17.jar /usr/lib/hive/lib/mysql-connector-java-5.1.17.jar
Step 3: To set the MySQL root password:

Configure Metastore database cont…
Step 4: Make sure the MySQL server starts at boot
• $ /sbin/chkconfig mysqld on
Step 5: Create the database and user
• Create the initial database schema using the hive-schema-0.10.0.mysql.sql file located in the /usr/lib/hive/scripts/metastore/upgrade/mysql directory.
• Create a user for Hive, qualified with the hostname of the metastore.
• Grant the proper privileges to the user.

Step 6: Configure the metastore service to communicate with the MySQL database
• This step covers the configuration properties you need to set in hive-site.xml so that the metastore service can communicate with the MySQL database.
• Though you can use the same hive-site.xml on all hosts (client, metastore, HiveServer), hive.metastore.uris is the only property that must be configured on all of them; the others are used only on the metastore host.

Step 7: Create the Hive warehouse directory in HDFS
• $ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
• $ sudo -u hdfs hadoop fs -chmod og+rw /user/hive/warehouse
• $ sudo -u hdfs hadoop fs -chown -R hive /user/hive
Step 8: Set environment variables
• Add the following to the .bashrc file
• $ vim ~/.bashrc
• export HADOOP_HOME="/usr/lib/hadoop"
• PATH=$PATH:"/usr/lib/hadoop/bin"
• Run bash at the command prompt to pick up the changes
• $ bash

Starting the Metastore
You can run the metastore from the command line:
• $ hive --service metastore
Ensure that the command starts without reporting any errors
Use Ctrl-C to stop a metastore process running from the command line
To run the metastore as a daemon, use:
• $ service hive-metastore start

Starting the Hive Console
To start the Hive console:
• $ hive
To confirm that Hive is working, issue the SHOW TABLES command to list the Hive tables. Be sure to end the command with a semicolon:
• hive> SHOW TABLES;

Hive CLI Commands
Set a Hive or Hadoop conf property:
• hive> set propkey=value;
List all properties and values:
• hive> set -v;

Hive CLI Commands
Creating managed table
• $ cat input/hive/tables/data.txt
• $ hive
• hive> CREATE TABLE managed_table (dummy STRING);
• hive> LOAD DATA LOCAL INPATH 'input/hive/tables/data.txt' INTO TABLE managed_table;
• hive> SELECT * FROM managed_table;
• $ hadoop fs -cat /user/hive/warehouse/managed_table/data.txt

Hive CLI Commands
Creating external table
• Select a location in HDFS for the table
• Ensure that other users have write access to it
  - $ sudo -u hdfs hadoop fs -mkdir /user/joe/table
  - $ sudo -u hdfs hadoop fs -chmod a+w /user/joe/table
• Create the external table and load data into it:
  - hive> CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/joe/table';
  - hive> LOAD DATA LOCAL INPATH 'input/hive/tables/data.txt' INTO TABLE external_table;
  - hive> SELECT * FROM external_table;
• Check that the data file was placed in the external directory
  - $ sudo -u hdfs hadoop fs -cat /user/joe/table/data.txt

Hive CLI Commands
Create Partitioned table
• hive> CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
Load data into the table, specifying the partition for each file
• hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
• hive> LOAD DATA LOCAL INPATH ... INTO TABLE logs PARTITION (dt='2001-01-01', country='US');
• hive> LOAD DATA LOCAL INPATH ... INTO TABLE logs PARTITION (dt='2001-01-02', country='US');
See the table contents
• hive> select * from logs;
List all the partitions
• hive> SHOW PARTITIONS logs;

Hive CLI Commands
Create a bucketed table:
• Create a normal table, users, and populate a bucketed table, bucketed_users, from it
  - hive> set hive.enforce.bucketing=true;
  - hive> CREATE TABLE users (id INT, name STRING);
  - hive> LOAD DATA LOCAL INPATH 'input/hive/tables/users.txt' INTO TABLE users;
  - hive> CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
  - hive> INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;
• Check the contents of the whole table, then of a single bucket
  - hive> SELECT * FROM bucketed_users;
  - hive> SELECT * FROM bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);

Joins
Prerequisites
• Create two tables, sales and things, and load data into them from tab-delimited files
  - hive> CREATE TABLE sales (user STRING, id INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
  - hive> LOAD DATA LOCAL INPATH 'input/hive/joins/sales.txt' INTO TABLE sales;
  - hive> SELECT * FROM sales;
  - hive> CREATE TABLE things (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
  - hive> LOAD DATA LOCAL INPATH 'input/hive/joins/things.txt' INTO TABLE things;
  - hive> SELECT * FROM things;

Joins
Inner Join
• hive> SELECT sales.*, things.* FROM sales JOIN things
ON (sales.id = things.id);

Joins
Left Outer Join
• hive> SELECT sales.*, things.* FROM sales LEFT OUTER
JOIN things ON (sales.id = things.id);

Joins
Right Outer Join
• hive> SELECT sales.*, things.* FROM sales RIGHT
OUTER JOIN things ON (sales.id = things.id);

Joins
Full Outer Join
• hive> SELECT sales.*, things.* FROM sales FULL OUTER
JOIN things ON (sales.id = things.id);
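The inner, left outer, right outer, and full outer joins differ only in which unmatched rows they keep. The Python sketch below models that on a few hypothetical rows (the contents of sales.txt and things.txt are not shown in these slides, so this data is made up). None plays the role of NULL, and the row order is not meant to match what Hive would print.

```python
# Hypothetical sample rows: sales(user, id) and things(id, name).
sales = [("Joe", 2), ("Hank", 4), ("Eve", 3), ("Ali", 0)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]


def join(left, right, keep_left=False, keep_right=False):
    """Join sales rows to things rows on id.

    keep_left / keep_right control whether unmatched rows from each side
    are kept, padded with None for the missing columns.
    """
    rows = []
    matched_right = set()
    for user, sid in left:
        matches = [t for t in right if t[0] == sid]
        for t in matches:
            rows.append((user, sid) + t)
            matched_right.add(t)
        if not matches and keep_left:
            rows.append((user, sid, None, None))
    if keep_right:
        for t in right:
            if t not in matched_right:
                rows.append((None, None) + t)
    return rows


inner = join(sales, things)
left_outer = join(sales, things, keep_left=True)
right_outer = join(sales, things, keep_right=True)
full_outer = join(sales, things, keep_left=True, keep_right=True)

print(len(inner), len(left_outer), len(right_outer), len(full_outer))
# prints 3 4 4 5
```

With this data, the left outer join adds the unmatched sale ("Ali", 0), the right outer join adds the unmatched thing (1, "Scarf"), and the full outer join adds both.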

Joins
Semi Joins
• Hive 0.10 (the version used here) does not support IN subqueries, so this query is rejected:
  - SELECT * FROM things WHERE things.id IN (SELECT id FROM sales);
• The solution is a semi join:
  - hive> SELECT * FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
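A left semi join returns each row of the left table at most once, and only if at least one matching row exists in the right table. The following Python sketch shows that behavior on hypothetical rows (again, the real contents of the sales and things files are not shown in these slides).

```python
# Hypothetical sample rows: sales(user, id) and things(id, name).
# Note that id 2 appears twice in sales.
sales = [("Joe", 2), ("Hank", 4), ("Hank", 2), ("Eve", 3)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]

# Collect the distinct ids on the right-hand side of the semi join.
sales_ids = {sid for _, sid in sales}

# Keep a things row when its id appears in sales. Only columns from
# things are returned, and duplicates in sales do not multiply rows.
semi_join = [t for t in things if t[0] in sales_ids]

print(semi_join)
# prints [(2, 'Tie'), (4, 'Coat'), (3, 'Hat')]
```

This is the same result an IN subquery would give, which is why the semi join works as a substitute.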

Joins
Map Joins
• Used when one table is small enough to fit in memory. The join is done in the mappers, so no reducers are used
  - hive> SELECT /*+ MAPJOIN(things) */ sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
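The idea behind a map join can be sketched as an in-memory hash join: the small table is loaded into a lookup structure once, and the large table is streamed past it row by row. The Python below is a conceptual sketch on hypothetical rows, not a description of Hive's internals; the names things_by_id and map_side_join are made up for the example.

```python
# Hypothetical sample rows: things is the small table, sales the large one.
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0)]

# Build phase: hold the small table in memory, keyed by the join column.
things_by_id = {}
for thing_id, name in things:
    things_by_id.setdefault(thing_id, []).append((thing_id, name))


def map_side_join(sales_rows):
    """Probe phase: stream the large table and look up each row's matches.

    Every row is handled independently, so no shuffle or reduce step is
    needed to bring matching keys together.
    """
    for user, sale_id in sales_rows:
        for thing in things_by_id.get(sale_id, []):
            yield (user, sale_id) + thing


joined = list(map_side_join(sales))
print(joined)
# prints [('Joe', 2, 2, 'Tie'), ('Hank', 4, 4, 'Coat')]
```

The approach only pays off when the lookup table fits comfortably in each mapper's memory.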

Other Commands
CREATE TABLE…AS SELECT
• hive> CREATE TABLE target AS SELECT id from things;
Altering Tables
• hive> ALTER TABLE target RENAME TO source;
• hive> ALTER TABLE source ADD COLUMNS (col2 STRING);

Other Commands
Dropping Tables
• For managed tables, both the data and the metadata are deleted
• For external tables, only the metadata is deleted
  - hive> DROP TABLE <table_name>;

References
Hadoop: The Definitive Guide, 3rd Edition
• http://shop.oreilly.com/product/0636920021773.do
Hive Community page
• http://hive.apache.org/
