Learning Apache HIVE - Data Warehouse and Query Language for Hadoop

©2012, Cognizant
Data Warehouse and Query Language for Hadoop
August 2013
By Someshwar Kale

HIVE
• Data warehousing solution built on top of Hadoop
• Provides a SQL-like query language named HiveQL
– Minimal learning curve for people with SQL expertise
– Data analysts are the target audience
• Early Hive development work started at Facebook in 2007
– Today, Facebook counts 29% of its employees (and growing!) as Hive users.
– https://www.facebook.com/note.php?note_id=114588058858
• Today Hive is a top-level Apache project that runs on top of Hadoop
– http://hive.apache.org

Hive Provides
• Ability to bring structure to various data formats
• Simple interface for ad hoc querying, analyzing, and summarizing large amounts of data
• Access to files on various data stores such as HDFS and HBase

Hive
• Hive does NOT provide low-latency or real-time queries.
• Even querying small amounts of data may take minutes.
• Designed for scalability and ease of use rather than low-latency responses.

Hive
• Translates HiveQL statements into a set of MapReduce jobs, which are then executed on a Hadoop cluster.
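
One way to see this translation is Hive's EXPLAIN statement, which prints the plan of stages for a query instead of running it. The sketch below assumes a hypothetical posts table; its column names are illustrative and do not come from the slides.

```sql
-- Assumption: posts(author STRING, title STRING, posted_at BIGINT) is a hypothetical table.
-- EXPLAIN prints the stage plan (including any map/reduce stages) instead of executing the query.
EXPLAIN
SELECT author, COUNT(*) AS post_count
FROM posts
GROUP BY author;
```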

Hive Metastore
• To support features like schemas and data partitioning, Hive keeps its metadata in a relational database
• Packaged with Derby, a lightweight embedded SQL database
• The default Derby-based metastore is good for evaluation and testing
• The schema is not shared between users: each user has their own instance of embedded Derby, stored in the metastore_db directory, which resides in the directory that Hive was started from
• Can easily be switched to another SQL installation such as MySQL

Metastore Deployment Modes : Embedded Mode
• Default metastore deployment mode for CDH.
• Both the database and the metastore service run embedded in the main HiveServer process.
• Both are started for you when you start the HiveServer process.
• Supports only one active user at a time and is not certified for production use.

Metastore Deployment Modes : Local Mode
• The Hive metastore service runs in the same process as the main HiveServer process.
• The metastore database runs in a separate process, and can be on a separate host.
• The embedded metastore service communicates with the metastore database over JDBC.

Metastore Deployment Modes : Remote Mode

Hive Architecture

Hive Interface Options
• Command Line Interface (CLI)
– Used exclusively in these slides
• Hive Web Interface
– https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface
• Java Database Connectivity (JDBC)
– https://cwiki.apache.org/confluence/display/Hive/HiveClient
• Beeline for HiveServer2 (new in CDH4)
– http://sqlline.sourceforge.net/#manual

Data Types
Launch the Hive Command Line Interface (CLI):
[cts318692@aster4 ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.2.1.jar!/hive-log4j.properties
Hive history file=/tmp/cts318692/hive_job_log_cts318692_201308071622_2005272769.txt
hive>
The "Hive history file" line shows the location of the session’s log file.
Local shell commands can be run from within the CLI by placing the command between ! and ;
hive> !cat data/user-posts.txt;
user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394
hive>

Data Types
Numeric Types
• TINYINT
• SMALLINT
• INT
• BIGINT
• FLOAT
• DOUBLE
• DECIMAL (Note: Only available starting with Hive 0.11.0)
Date/Time Types
• TIMESTAMP (Note: Only available starting with Hive 0.8.0)
• DATE (Note: Only available starting with Hive 0.12.0)
Misc Types
• BOOLEAN
• STRING
• BINARY (Note: Only available starting with Hive 0.8.0)

Complex Data Types

Check physical storage of hive
[cts318692@aster4 ~]$ hive -S -e "set" | grep warehouse
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=true
This is the location where Hive stores its data.

Creating DataBase
hive> CREATE DATABASE IF NOT EXISTS som COMMENT 'my database'
> LOCATION '/user/cts318692/someshwar/hivestore/'
> WITH DBPROPERTIES ('creator'='someshwar kale','date'='2013-06-08');
OK
Time taken: 0.046 seconds
• IF NOT EXISTS: suppresses the error that would otherwise be raised if the database already exists
• som: the database name; Hive opens the default database when you start a new session
• LOCATION: physical storage for the som database; overrides the default location under ‘/user/hive/warehouse’ for the new directory
• WITH DBPROPERTIES: key-value properties attached to the database

Exploring Data
STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
Fields of complex data types (maps, arrays, structs)

Creating Table
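• Column separator definition (FIELDS TERMINATED BY)
• Separator between the elements of complex data types: maps, arrays, and structs (COLLECTION ITEMS TERMINATED BY)
• Separator between a map key and its value, e.g. ‘key’ ^C ‘value’, where \003 = Ctrl-C = ^C (MAP KEYS TERMINATED BY)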
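
The three separators can be seen together in a table definition. This is a minimal sketch, not the statement from the original slide: only the table name som.employees and the address STRUCT appear in the slides, and the other columns are assumptions added for illustration. The delimiters shown are Hive's defaults, written out explicitly.

```sql
-- Hypothetical schema: name, salary, subordinates and deductions are assumed columns.
CREATE TABLE IF NOT EXISTS som.employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'            -- column separator (^A)
  COLLECTION ITEMS TERMINATED BY '\002'  -- separator between array, struct and map elements (^B)
  MAP KEYS TERMINATED BY '\003'          -- separator between a map key and its value (^C)
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```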

hive> DESCRIBE FORMATTED som.employees;

Creating External Table

CREATE TABLE ... LIKE
• If you omit the EXTERNAL keyword and the original table is external, the new table will also be external.
• If you omit EXTERNAL and the original table is managed, the new table will also be managed. However, if you include the EXTERNAL keyword and the original table is managed, the new table will be external. Even in this scenario, the LOCATION clause will still be optional.
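
As a sketch of both variants, assuming som.employees already exists (the new table names and the path are hypothetical):

```sql
-- Copies only the schema of som.employees, not its data.
-- Without EXTERNAL, the new table's kind follows the rules described above.
CREATE TABLE IF NOT EXISTS som.employees_copy
LIKE som.employees;

-- With EXTERNAL, the new table is external; LOCATION is optional here.
CREATE EXTERNAL TABLE IF NOT EXISTS som.employees_ext
LIKE som.employees
LOCATION '/path/to/data';
```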

Select Clause

Describe External Table

Dropping DataBase and Table
By default, Hive won’t permit you to drop a database if it contains tables. You can either drop the tables first or append the CASCADE keyword to the command, which will cause Hive to drop the tables in the database first.
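
The statements below sketch the options; they are alternatives, not a script to run top to bottom. The names assume the som database and som.employees table used elsewhere in these slides.

```sql
-- Drop a single table.
DROP TABLE IF EXISTS som.employees;

-- Drop a database; this fails if the database still contains tables.
DROP DATABASE IF EXISTS som;

-- CASCADE drops the tables in the database first, then the database itself.
DROP DATABASE IF EXISTS som CASCADE;
```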

Partitions
• To increase performance, Hive has the capability to partition data
– The values of the partitioned column divide a table into segments
– Entire partitions can be ignored at query time
– Similar to relational databases’ indexes, but not as granular
• Partitions have to be properly created by users
– When inserting data, you must specify a partition
• At query time, whenever appropriate, Hive will automatically filter out partitions

Creating Partitioned Table
Partition the table based on the values of country and state
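
A minimal sketch of such a table, assuming a hypothetical employees table with two regular columns:

```sql
-- country and state are partition columns: they are declared in PARTITIONED BY,
-- not in the regular column list, and are not stored inside the data files.
CREATE TABLE IF NOT EXISTS employees (
  name   STRING,
  salary FLOAT
)
PARTITIONED BY (country STRING, state STRING);
```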

Loading data to table
• LOAD DATA LOCAL ... copies the local data to the final location in the distributed filesystem, while LOAD DATA ... (i.e., without LOCAL) moves the data to the final location.
• The PARTITION clause is necessary if the table into which we are loading the data is partitioned. This is known as static partitioning, as we provide the partition value in the query.
• Partitions are physically stored under separate directories.
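
A sketch of a static-partition load, assuming a hypothetical employees table partitioned by (country, state) and a hypothetical local input file:

```sql
-- LOCAL copies the file from the local filesystem into the table's storage location.
-- The PARTITION clause names the target partition explicitly (static partitioning).
LOAD DATA LOCAL INPATH 'data/employees-us-ca.txt'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'CA');

-- Lists the table's partitions; each is stored under its own directory,
-- e.g. .../employees/country=US/state=CA
SHOW PARTITIONS employees;
```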

Schema Violations
hive> LOAD DATA LOCAL INPATH
> 'data/user-posts-inconsistentFormat.txt'
> OVERWRITE INTO TABLE posts;
OK
hive> select * from posts;
OK
user1 Funny Story 1343182026191
user2 Cool Deal NULL
user4 Interesting Post 1343182154633
user5 Yet Another Blog 13431839394
NULL is set for any value that violates the predefined schema.

External Partitioned Tables

Contd.
• There is no difference in syntax
• When the partitioned column is specified in the WHERE clause, entire directories/partitions can be ignored
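
For example, assuming a hypothetical employees table partitioned by (country, state), the query below reads like any other SELECT, yet Hive only needs to scan the matching partition directory:

```sql
-- Both filter columns are partition columns, so all other partitions can be skipped.
SELECT name, salary
FROM employees
WHERE country = 'US' AND state = 'CA';
```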

Bucketing
• Break data into a set of buckets based on a hash function of a "bucket column"
– Capability to execute queries on a random subset of the data
• Hive doesn’t automatically enforce bucketing
– The user is required to set the number of reducers to match the number of buckets
hive> SET mapred.reduce.tasks = 256;
OR
hive> SET hive.enforce.bucketing = true;
Either manually set the number of reducers to the number of buckets, or use ‘hive.enforce.bucketing’, which will set it on your behalf.

Create and Use Table with Buckets
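A minimal sketch of creating, populating, and sampling a bucketed table. The table and column names are hypothetical, and a source table posts with the same three columns is assumed to exist.

```sql
-- Rows are assigned to one of 256 buckets by hashing the author column.
CREATE TABLE IF NOT EXISTS posts_bucketed (
  author    STRING,
  title     STRING,
  posted_at BIGINT
)
CLUSTERED BY (author) INTO 256 BUCKETS;

-- Let Hive match the number of reducers to the number of buckets on insert.
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE posts_bucketed
SELECT author, title, posted_at FROM posts;

-- Read only one of the 256 buckets, i.e. a sample of the data.
SELECT * FROM posts_bucketed TABLESAMPLE (BUCKET 1 OUT OF 256 ON author);
```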

ALTER TABLE

Contd.
Partition columns are not deleted

Inserting Data into Tables from Queries

Dynamic Partition Inserts
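With a dynamic partition insert, Hive derives the partition values from the query result instead of from a literal PARTITION specification. A minimal sketch, assuming a hypothetical unpartitioned source table staged_employees and a target table employees partitioned by (country, state):

```sql
-- Enable dynamic partitioning; nonstrict mode allows every partition column to be dynamic.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition values are taken from the last columns of the SELECT, in order.
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT name, salary, country, state
FROM staged_employees;
```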

Exporting Data
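One way to export query results is INSERT OVERWRITE ... DIRECTORY, which writes the result as files. A minimal sketch, with a hypothetical output path and a hypothetical employees table partitioned by (country, state):

```sql
-- Writes the result files into a directory on the local filesystem.
-- OVERWRITE replaces any existing contents of that directory.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary
FROM employees
WHERE country = 'US' AND state = 'CA';
```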

Table generating functions
Returns zero to many rows: one row for each element of the input array

Table generating functions
Only a single expression in the SELECT clause is supported with UDTFs.

LIMIT clause

Points to remember
• Only equality joins are allowed.
• More than two tables can be joined in the same query, e.g.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
is a valid join.
• Hive converts a join over multiple tables into a single map/reduce job if, for every table, the same column is used in the join clauses, e.g.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
• By contrast,
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
is converted into two map/reduce jobs, because the key1 column from b is used in the first join condition and the key2 column from b is used in the second one.

ORDER BY and SORT BY
• ORDER BY uses a single reducer to sort the data, which may take an unacceptably long time to execute for larger data sets.
• Hive adds an alternative, SORT BY, that orders the data only within each reducer, thereby performing a local ordering, where each reducer’s output will be sorted.
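
The difference can be sketched with three queries against a hypothetical employees table with name, salary, and country columns:

```sql
-- Total ordering: all rows pass through a single reducer.
SELECT name, salary FROM employees ORDER BY salary DESC;

-- Local ordering: each reducer sorts only the rows it receives.
SELECT name, salary FROM employees SORT BY salary DESC;

-- DISTRIBUTE BY sends all rows with the same country to the same reducer
-- before the per-reducer sort.
SELECT name, salary, country FROM employees
DISTRIBUTE BY country
SORT BY country, salary DESC;
```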

Casting
• What if a salary value is not a valid string for a floating-point number? In this case, Hive returns NULL.
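
A minimal sketch, assuming a hypothetical raw_employees table in which salary is stored as a STRING:

```sql
-- CAST converts the string to FLOAT; strings that are not valid numbers become NULL.
-- The NULL comparison in the WHERE clause then filters those rows out.
SELECT name, CAST(salary AS FLOAT) AS salary_num
FROM raw_employees
WHERE CAST(salary AS FLOAT) < 100000.0;
```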

UNION ALL and Nested select
• Each subquery of the union query must produce the same number of columns, and for each column, its type must match all the column types in the same position.
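
A sketch of UNION ALL inside a nested SELECT, assuming two hypothetical tables posts_2012 and posts_2013 with the same column layout:

```sql
-- Both branches return the same number of columns with matching types.
-- The union is wrapped in a subquery so the outer query can aggregate over it.
SELECT u.author, COUNT(*) AS post_count
FROM (
  SELECT author, title FROM posts_2012
  UNION ALL
  SELECT author, title FROM posts_2013
) u
GROUP BY u.author;
```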

Lateral view
• Lateral view is used in conjunction with user-defined table-generating functions such as explode().
• A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table having the supplied table alias.
• Syntax:
LATERAL VIEW udtf(expression) tableAlias AS columnAlias
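
As an illustration of the syntax, assuming a hypothetical employees table with a subordinates ARRAY<STRING> column:

```sql
-- explode() emits one row per array element; the lateral view joins each of those
-- rows back to the employee row it came from.
-- s is the table alias and subordinate is the column alias.
SELECT e.name, s.subordinate
FROM employees e
LATERAL VIEW explode(e.subordinates) s AS subordinate;
```

Employees whose array is empty produce no output rows in this form.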

UDF
• Hive uses reflection to find methods named evaluate whose parameters match the arguments used in the HiveQL function call.
• Hive can work with both Hadoop Writables and Java primitives, but it’s recommended to work with the Writables since they can be reused.
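• The argument types and the return type of evaluate do not have to match each other, but each must be a type Hive can work with, such as a Java primitive or a Hadoop Writable.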

BETWEEN operator
hive> select name,salary from employees2 where salary between 80000 and 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
....
OK
John Doe 100000.0
John Doe 100000.0
Mary Smith 80000.0
Mary Smith 80000.0
• Both values (lower and upper) are inclusive.

HiveServer2
• As of CDH4.1, you can deploy HiveServer2, an improved version of HiveServer that supports a new Thrift API tailored for JDBC and ODBC clients, Kerberos authentication, and multi-client concurrency.
• There is also a new CLI for HiveServer2 named Beeline.
• HiveServer2
– Connection URL: jdbc:hive2://<host>:<port>
– Driver class: org.apache.hive.jdbc.HiveDriver
• HiveServer1
– Connection URL: jdbc:hive://<host>:<port>
– Driver class: org.apache.hadoop.hive.jdbc.HiveDriver

BEELINE
$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000>


Thank You
