2. Clogeny Technologies http://www.clogeny.com
(US) 408-556-9645
(India) +91 20 661 43 482
What is Hive
A data warehousing infrastructure based on
Hadoop
Provides easy data summarization
Provides ad-hoc querying and analysis of large
volumes of data
Comes with HiveQL, a query language based on SQL
Allows plugging in custom mappers and reducers
What Hive is NOT
Not suitable for small datasets due to high latency
Not a replacement for a traditional relational database such as Oracle
Does not offer real-time queries and row level
updates
Data Models
Tables
• Made up of the actual data and the associated metadata
• The actual data can be stored in any Hadoop filesystem
• The metadata is always stored in a relational database
• Managed Tables
Hive moves the data into its warehouse directory
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
• External Tables
Hive refers to data at an existing location outside its warehouse directory
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
Data Models
Partitions
• A way of dividing a table into coarse-grained parts
• Based on the value of a partition column
• A table can be partitioned in multiple dimensions
• Defined at table creation time using the PARTITIONED BY
clause
• At the filesystem level, partitions are simply nested
subdirectories of the table directory.
Data Models
• CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
• LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
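The directory that this load ends up in can be sketched in a few lines of Python. This is only an illustration of the nested-subdirectory layout, not Hive code: the function name partition_path is hypothetical, and the warehouse root /user/hive/warehouse is assumed to be the value of hive.metastore.warehouse.dir.

```python
from pathlib import PurePosixPath


def partition_path(warehouse_dir, table, partition_spec):
    """Build the directory of one partition.

    Each partition column adds one nested "column=value" level,
    in the order the columns appear in PARTITIONED BY.
    """
    path = PurePosixPath(warehouse_dir) / table
    for column, value in partition_spec:
        path = path / f"{column}={value}"
    return str(path)


# Assumed warehouse root; the partition values come from the LOAD DATA example.
print(partition_path("/user/hive/warehouse", "logs",
                     [("dt", "2001-01-01"), ("country", "GB")]))
```

A query that filters on dt or country only has to read the matching subdirectories, which is why partitioning speeds up such queries.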
Data Models
Buckets
• Divide a table (or partition) into a fixed number of buckets based on the hash of a column value
• Enable more efficient queries
• Make sampling more efficient
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
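How rows might be spread over the four buckets can be sketched in Python. This is a toy model under two stated assumptions: that the hash of an INT column is the integer itself, and that the bucket is the non-negative hash modulo the number of buckets. The function name bucket_for and the sample users are hypothetical.

```python
def bucket_for(id_value, num_buckets=4):
    """Return the 0-based bucket index for an INT id (toy model)."""
    hash_code = id_value  # assumption: an INT hashes to itself
    # Mask to a non-negative 31-bit value before taking the remainder.
    return (hash_code & 0x7FFFFFFF) % num_buckets


# Hypothetical sample rows (id, name).
users = [(0, "Nat"), (2, "Joe"), (3, "Kay"), (4, "Ann")]

buckets = {}
for uid, name in users:
    buckets.setdefault(bucket_for(uid), []).append((uid, name))

for index in sorted(buckets):
    print(index, buckets[index])
```

Rows with the same id always land in the same bucket, which is what lets joins and sampling on the bucketed column avoid scanning the whole table.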
Column Data Types
Primitive
TYPE       DESCRIPTION                                     EXAMPLE
TINYINT    8-bit signed integer                            1
SMALLINT   16-bit signed integer                           1
INT        32-bit signed integer                           1
BIGINT     64-bit signed integer                           1
FLOAT      32-bit single-precision floating-point number   1.0
DOUBLE     64-bit double-precision floating-point number   1.0
BOOLEAN    true/false value                                TRUE
STRING     Character string                                'a', "a"
TIMESTAMP  Timestamp with nanosecond precision             '2012-01-02 03:04:05.123456789'
Column Data Types
Complex
TYPE    DESCRIPTION                                         EXAMPLE
ARRAY   An ordered collection of fields. The fields         array(1, 2)
        must all be of the same type.
MAP     An unordered collection of key-value pairs.         map('a', 1, 'b', 2)
        Keys must be primitives; values may be any type.
        For a particular map, the keys must all be the
        same type, and the values must all be the same
        type.
STRUCT  A collection of named fields. The fields may        struct('a', 1, 1.0)
        be of different types.
Metastore
A central repository of Hive metadata
Comprises two parts:
• Metastore service
• Backing store for the metadata
Metastore deployment modes
1: Embedded Mode
This is the default metastore deployment mode for CDH. In this
mode the metastore uses a Derby database.
Both the database and the metastore service run embedded in
the main HiveServer process, and both are started for you when you
start the HiveServer process.
This mode requires the least effort to configure,
but it supports only one active user at a time and is not
certified for production use.
Metastore deployment modes
2: Local Mode
In this mode the Hive metastore service runs in the same process as the
main HiveServer process, but the metastore database runs in a separate
process, and can be on a separate host.
The embedded metastore service communicates with the metastore
database over JDBC.
Metastore deployment modes
3: Remote Mode
In this mode the Hive metastore service runs in its own JVM process; other processes
communicate with it via the Thrift network API (configured via the hive.metastore.uris
property).
The metastore service communicates with the metastore database over JDBC (configured
via the javax.jdo.option.ConnectionURL property).
Metastore Properties
Property Name                          Type                  Description
hive.metastore.warehouse.dir           URI                   The directory in HDFS where managed tables are stored
hive.metastore.local                   Boolean               Flag selecting an embedded or local metastore (as opposed to a remote one)
hive.metastore.uris                    Comma-separated URIs  List of remote metastore URIs
javax.jdo.option.ConnectionURL         URI                   The JDBC URL of the metastore database
javax.jdo.option.ConnectionDriverName  String                The JDBC driver class name
javax.jdo.option.ConnectionUserName    String                The JDBC username
javax.jdo.option.ConnectionPassword    String                The JDBC password
Hive Packages
Hive is split into the following packages:
hive – base package that provides the complete
language and runtime (required)
hive-metastore – provides scripts for running the
metastore as a standalone service (optional)
hive-server – provides scripts for running the
original HiveServer as a standalone service
(optional)
hive-server2 – provides scripts for running the new
HiveServer2 as a standalone service (optional)
Comparison with Traditional Databases
Schema on Read Versus Schema on Write
• In a traditional database, a table’s schema is enforced at
data load time
• If the data being loaded doesn’t conform to the schema,
then it is rejected
• Hive, on the other hand, doesn’t verify the data when it is
loaded, but rather when a query is issued
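The difference can be illustrated with a small Python model of schema on read. It is a sketch, not Hive's implementation: the helper read_row and the sample lines are hypothetical, and the behaviour of turning an unparseable or missing field into None (standing in for NULL) is an assumption about how a text-backed table is typically read.

```python
def read_row(line, schema, delimiter="\t"):
    """Apply a schema to one raw text line at query time.

    A field that is missing or cannot be converted becomes None
    instead of causing the whole load to be rejected.
    """
    fields = line.split(delimiter)
    row = []
    for position, (_name, convert) in enumerate(schema):
        try:
            row.append(convert(fields[position]))
        except (ValueError, IndexError):
            row.append(None)
    return row


schema = [("id", int), ("name", str)]

# "Loading" only stores the lines; nothing is checked at this point.
raw_lines = ["1\talice", "oops\tbob", "3"]

# The schema is applied only when the data is read.
rows = [read_row(line, schema) for line in raw_lines]
print(rows)
```

Loading is fast because it is only a file copy or move; the cost of parsing, and the discovery of malformed data, is deferred to query time.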
Updates, Transactions, and Indexes
• Updates, transactions, and indexes are mainstays of
traditional databases.
• Until recently, these features have not been considered a
part of Hive’s feature set
Installing Hive
We will install Hive with the metastore as a standalone
service
To do this, install the hive and hive-metastore packages:
$ yum -y install hive hive-metastore
Hive Configuration
Default configuration in
• /etc/hive/conf/hive-default.xml
(Re)define properties in
• /etc/hive/conf/hive-site.xml
Use $HIVE_CONF_DIR to specify an alternate conf dir
location
You can override Hadoop configuration properties
in Hive’s configuration
• e.g. mapred.reduce.tasks=1
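The layering of these sources can be modelled with Python's ChainMap. This is a sketch of the lookup order only, assuming that a value set in the session wins over hive-site.xml, which in turn wins over hive-default.xml; the dictionaries and their values are illustrative, not read from real files.

```python
from collections import ChainMap

# Illustrative values only; real files would be parsed from XML.
hive_default = {
    "mapred.reduce.tasks": "-1",
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
}
hive_site = {"mapred.reduce.tasks": "1"}
session = {}  # values set interactively during a session

# Lookups search the maps left to right, so earlier maps override later ones.
conf = ChainMap(session, hive_site, hive_default)

print(conf["mapred.reduce.tasks"])           # taken from hive_site
print(conf["hive.metastore.warehouse.dir"])  # falls through to hive_default

session["mapred.reduce.tasks"] = "4"         # a session-level override
print(conf["mapred.reduce.tasks"])
```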
Configure Metastore database
Step 1: Install and start MySQL if you have not
already done so
• $ yum install mysql-server
Step 2: Configure the MySQL Service and
Connector
• $ yum install mysql-connector-java
• $ ln -s /usr/share/java/mysql-connector-java-5.1.17.jar /usr/lib/hive/lib/mysql-connector-java-5.1.17.jar
Step 3: To set the MySQL root password:
Configure Metastore database cont…
Step 4: To make sure the MySQL server starts at boot
• $ /sbin/chkconfig mysqld on
Step 5: Create the Database and User
• Create the initial database schema using the hive-schema-0.10.0.mysql.sql
file located in the /usr/lib/hive/scripts/metastore/upgrade/mysql directory.
• Create a user for hive with the hostname of the metastore.
• Grant proper privileges to the user.
Configure Metastore database cont…
Step 6: Configure the Metastore Service to
Communicate with the MySQL Database
• This step covers the configuration properties you need
to set in hive-site.xml so that the metastore
service can communicate with the MySQL database.
• You can use the same hive-site.xml on all hosts
(client, metastore, HiveServer), but
hive.metastore.uris is the only property that must be
configured on all of them; the others are used only on
the metastore host.
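As an illustration of what such a hive-site.xml could contain, the Python sketch below renders the metastore properties into the Hadoop-style configuration XML. Every value is a placeholder assumption: the host name metastore-host.example.com, port 9083, the database name metastore, the user hive and the password are hypothetical, and com.mysql.jdbc.Driver is assumed to be the driver class shipped in the 5.1 connector jar.

```python
import xml.etree.ElementTree as ET


def build_hive_site(properties):
    """Render a dict of properties as <configuration><property>... XML."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Placeholder values; replace them with the details of your own cluster.
sample_properties = {
    "hive.metastore.uris": "thrift://metastore-host.example.com:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://metastore-host.example.com/metastore",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "change-me",
}

print(build_hive_site(sample_properties))
```

Only the first property would be needed on client hosts; the javax.jdo.option.* entries matter on the metastore host.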
Configure Metastore database cont…
Step 7: Create the Hive warehouse directory in HDFS
• $ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
• $ sudo -u hdfs hadoop fs -chmod og+rw /user/hive/warehouse
• $ sudo -u hdfs hadoop fs -chown -R hive /user/hive
Step 8: Set Environment Variables:
• Add the following to the .bashrc file
• $ vim ~/.bashrc
• export HADOOP_HOME="/usr/lib/hadoop"
• export PATH=$PATH:"/usr/lib/hadoop/bin"
• Start a new shell so that the changes take effect
• $ bash
Starting the Metastore
You can run the metastore from the command line:
• $ hive --service metastore
Ensure that the command starts without errors
Use Ctrl-C to stop a metastore process that was started
from the command line.
To run the metastore as a daemon, the command
is:
• $ service hive-metastore start
Starting the Hive Console
To start the Hive console:
• $ hive
To confirm that Hive is working, issue the SHOW
TABLES; command to list the Hive tables; be sure to
end the command with a semicolon:
• hive> SHOW TABLES;
Hive CLI Commands
Set a Hive or Hadoop conf property:
• hive> set propkey=value;
List all properties and values:
• hive> set -v;
Hive CLI Commands
Create Partitioned table
• hive> CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt
STRING, country STRING);
Load data into the table, specifying the target partition
• hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO
TABLE logs PARTITION (dt='2001-01-01', country='GB');
• hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file2' INTO
TABLE logs PARTITION (dt='2001-01-01', country='US');
• hive> LOAD DATA LOCAL INPATH 'input/hive/partitions/file3' INTO
TABLE logs PARTITION (dt='2001-01-02', country='US');
See the table contents
• hive> select * from logs;
List all the partitions
• hive> SHOW PARTITIONS logs;
Hive CLI Commands
Create a bucketed table:
• Create a normal table users and populate a bucketed table named
bucketed_users from it
hive> set hive.enforce.bucketing=true;
hive> CREATE TABLE users (id INT, name STRING);
hive> LOAD DATA LOCAL INPATH 'input/hive/tables/users.txt' INTO table
users;
hive> CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED
BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
hive> INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;
• Check the contents of table per bucket
hive> select * from bucketed_users;
hive> select * from bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4
ON id);
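What the TABLESAMPLE clause selects can be approximated in Python. This is a toy model, assuming the same hashing as for bucketing (an INT hashes to itself) and that BUCKET x OUT OF y keeps the rows whose hash modulo y equals x - 1. The function name tablesample and the sample rows are hypothetical.

```python
def tablesample(rows, x, y, key=lambda row: row[0]):
    """Model TABLESAMPLE(BUCKET x OUT OF y ON key): x is 1-based."""
    return [row for row in rows if (key(row) & 0x7FFFFFFF) % y == x - 1]


# Hypothetical contents of bucketed_users as (id, name).
users = [(0, "Nat"), (2, "Joe"), (3, "Kay"), (4, "Ann")]

print(tablesample(users, 1, 4))  # roughly a quarter of the rows
print(tablesample(users, 1, 2))  # roughly half of the rows
```

When y matches the number of buckets (or is a factor or multiple of it), the sample can be answered by reading whole bucket files rather than scanning every row.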
Joins
Prerequisites
• Create two tables, sales and things, and load data from files
hive> CREATE TABLE sales (user STRING, id INT) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH 'input/hive/joins/sales.txt'
INTO TABLE sales;
hive> select * from sales;
hive> CREATE TABLE things (id INT, name STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH 'input/hive/joins/things.txt'
INTO TABLE things;
hive> select * from things;
Joins
Left Outer Join
• hive> SELECT sales.*, things.* FROM sales LEFT OUTER
JOIN things ON (sales.id = things.id);
Joins
Right Outer Join
• hive> SELECT sales.*, things.* FROM sales RIGHT
OUTER JOIN things ON (sales.id = things.id);
Joins
Full Outer Join
• hive> SELECT sales.*, things.* FROM sales FULL OUTER
JOIN things ON (sales.id = things.id);
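The difference between the left, right and full outer joins can be seen in a small Python model. It is a sketch of the semantics only (a nested loop, with None standing in for NULL), not how Hive executes joins; the helper outer_join and the sample rows are hypothetical.

```python
def outer_join(left, right, left_key, right_key,
               keep_left=False, keep_right=False):
    """Join two lists of tuples on the given column positions.

    keep_left / keep_right control whether unmatched rows from that
    side are kept, paired with None in place of the missing side.
    """
    rows = []
    matched_right = set()
    for left_row in left:
        hit = False
        for index, right_row in enumerate(right):
            if left_row[left_key] == right_row[right_key]:
                rows.append((left_row, right_row))
                matched_right.add(index)
                hit = True
        if not hit and keep_left:
            rows.append((left_row, None))
    if keep_right:
        rows.extend((None, right_row) for index, right_row in enumerate(right)
                    if index not in matched_right)
    return rows


# Hypothetical sample data: sales is (user, id), things is (id, name).
sales = [("Joe", 2), ("Ali", 0)]
things = [(2, "Tie"), (1, "Scarf")]

left_outer = outer_join(sales, things, 1, 0, keep_left=True)
right_outer = outer_join(sales, things, 1, 0, keep_right=True)
full_outer = outer_join(sales, things, 1, 0, keep_left=True, keep_right=True)
print(left_outer, right_outer, full_outer, sep="\n")
```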
Joins
Semi Joins
• The Hive version used here (0.10) does not support IN subqueries, so this query fails:
SELECT * FROM things WHERE things.id IN (SELECT id FROM sales);
• The solution is a semi join
hive> SELECT * FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
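The semantics of a left semi join can be sketched in Python: a row of the left table is kept if at least one matching row exists on the right, and it is returned only once, with no columns from the right table. The helper left_semi_join and the sample rows are hypothetical.

```python
def left_semi_join(left, right, left_key, right_key):
    """Keep each left row that has at least one match on the right."""
    right_keys = {row[right_key] for row in right}
    return [row for row in left if row[left_key] in right_keys]


# Hypothetical sample data: things is (id, name), sales is (user, id).
things = [(2, "Tie"), (4, "Coat"), (1, "Scarf")]
sales = [("Joe", 2), ("Hank", 4), ("Hank", 2)]

# Item 2 was sold twice but still appears only once in the result.
print(left_semi_join(things, sales, 0, 1))
```

This matches what the IN subquery would return, which is why the semi join works as a substitute for it.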
Joins
Map Joins
• Used when one table is small enough to fit in
memory; the join runs in the mappers, so no reducers are used
hive> SELECT /*+ MAPJOIN(things) */ sales.*, things.* FROM
sales JOIN things ON (sales.id = things.id);
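The idea behind a map join can be shown as a hash join in Python: the small table is loaded into an in-memory lookup, and the large table is streamed past it one row at a time, so no shuffle or reduce step is needed. This is a sketch of the technique, not Hive's implementation; map_join and the sample rows are hypothetical.

```python
def map_join(big_rows, small_rows, big_key, small_key):
    """Inner-join a streamed big table against an in-memory small table."""
    # Build phase: index the small table by its join key.
    lookup = {}
    for small_row in small_rows:
        lookup.setdefault(small_row[small_key], []).append(small_row)
    # Probe phase: each big row is handled independently, as in a mapper.
    for big_row in big_rows:
        for small_row in lookup.get(big_row[big_key], []):
            yield big_row + small_row


# Hypothetical sample data: sales is (user, id), things is (id, name).
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0)]
things = [(2, "Tie"), (4, "Coat")]

joined = list(map_join(sales, things, 1, 0))
print(joined)
```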
Other Commands
CREATE TABLE…AS SELECT
• hive> CREATE TABLE target AS SELECT id from things;
Altering Tables
• hive> ALTER TABLE target RENAME TO source;
• hive> ALTER TABLE source ADD COLUMNS (col2
STRING);
Other Commands
Dropping Tables
• For managed tables, both the data and the metadata are deleted
• For external tables, only the metadata is deleted; the data is left in place
hive> DROP TABLE <table_name>;
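The two behaviours can be contrasted with a toy Python model in which a dictionary plays the metastore and a set of paths plays the filesystem. The class ToyWarehouse and the paths are hypothetical and exist only to illustrate which side a DROP TABLE touches.

```python
class ToyWarehouse:
    """Minimal stand-in for a metastore plus a filesystem."""

    def __init__(self):
        self.metastore = {}  # table name -> {"location", "external"}
        self.files = set()   # data locations that currently exist

    def create_table(self, name, location, external=False):
        self.metastore[name] = {"location": location, "external": external}
        self.files.add(location)

    def drop_table(self, name):
        meta = self.metastore.pop(name)        # metadata always goes
        if not meta["external"]:
            self.files.discard(meta["location"])  # data goes only if managed


wh = ToyWarehouse()
wh.create_table("managed_table", "/user/hive/warehouse/managed_table")
wh.create_table("external_table", "/user/tom/external_table", external=True)

wh.drop_table("managed_table")
wh.drop_table("external_table")
print(wh.metastore)  # both table definitions are gone
print(wh.files)      # only the external table's data remains
```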