Data Warehousing on Hadoop
HIVE
Hadoop is great for large-data processing!
But writing Java programs for everything is verbose and slow
Analysts don’t want to (or can’t) write Java
Solution: develop higher-level data processing languages
Hive: HQL is like SQL
Pig: Pig Latin is a bit like Perl
Need for High-Level Languages
Problem: Data, data, and more data
200 GB per day in March 2008
2+ TB (compressed) raw data per day today
The Hadoop Experiment
Availability and scalability much superior to commercial DBs
Efficiency not that great; required more hardware
Partial availability/resilience/scale more important than ACID
Problem: Programmability and Metadata
MapReduce hard to program (users know SQL/bash/Python)
Need to publish data in well-known schemas
Why Hive??
HIVE: Components
Shell: allows interactive queries
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (MR, HDFS, metadata)
Metastore: schema, location in HDFS, SerDe
HIVE: Components
Tables
Typed columns (int, float, string, boolean)
Also list and map types (for JSON-like data)
Partitions
For example, range-partition tables by date
Command: PARTITIONED BY
Buckets
Hash partitions within ranges (useful for sampling, join optimization)
Command: CLUSTERED BY
Data Model
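The partition and bucket declarations above combine in a single DDL statement; a minimal sketch (the table and column names here are illustrative, not from the deck):
CREATE TABLE page_views (userid INT, pageid INT)
PARTITIONED BY (ds STRING)                 -- one subdirectory per date value
CLUSTERED BY (userid) INTO 32 BUCKETS;     -- hash rows into buckets within each partition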
Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Holds partitioning information
Can be stored in Derby, MySQL, and many other
relational databases
Metastore
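To inspect what the metastore has recorded for a table (columns, partitioning, HDFS location, SerDe), DESCRIBE can be used; a quick sketch against the sample table created in the DDL examples below:
DESCRIBE EXTENDED sample;   -- prints the schema plus the metastore's storage metadata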
Warehouse directory in HDFS
E.g., /user/hive/warehouse
Tables stored in subdirectories of warehouse
Partitions form subdirectories of tables
Actual data stored in flat files
Control char-delimited text, or SequenceFiles
With custom SerDe, can use arbitrary format
Physical Layout
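Putting the pieces together, a date-partitioned table such as sample lives under the warehouse directory roughly like this (an illustrative sketch; actual data file names vary):
/user/hive/warehouse/sample/ds=2012-02-24/part-00000
/user/hive/warehouse/sample/ds=2012-02-25/part-00000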
[Architecture diagram] On top: the Hive CLI (DDL, queries, browsing) and a Mgmt. Web UI. HiveQL flows through the Parser and Planner into Execution, which runs over Map Reduce and HDFS. The MetaStore is accessed through a Thrift API; SerDe implementations handle Thrift, Jute, JSON, etc.
HIVE: Components
CREATE TABLE sample (foo INT, bar STRING)
PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;
Examples – DDL Operations
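Delimiters and storage format can also be set at creation time instead of relying on the control-character defaults; a minimal sketch (sample2 is a hypothetical table):
CREATE TABLE sample2 (foo INT, bar STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'   -- tab-delimited instead of ^A
STORED AS TEXTFILE;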
LOAD DATA LOCAL INPATH './sample.txt'
OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
LOAD DATA INPATH '/user/falvariz/hive/sample.txt'
OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
Examples – DML Operations
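LOAD only moves files into place; populating a table from a query uses INSERT. A sketch (staging is a hypothetical source table with matching columns):
INSERT OVERWRITE TABLE sample PARTITION (ds='2012-02-25')
SELECT foo, bar FROM staging;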
FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script' AS (dt, uid)
  CLUSTER BY dt) map
INSERT INTO TABLE pv_users_reduced
  SELECT TRANSFORM(map.dt, map.uid)
  USING 'reduce_script' AS (date, count);
Running Custom Map/Reduce Scripts
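TRANSFORM streams rows to any executable as tab-separated text over stdin/stdout. A minimal sketch using /bin/cat as an identity "script", so no custom code is needed:
FROM pv_users
SELECT TRANSFORM(userid, date) USING '/bin/cat' AS (userid, date);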
(Simplified) Map Reduce Review
Input: Machine 1 holds <k1, v1>, <k2, v2>, <k3, v3>; Machine 2 holds <k4, v4>, <k5, v5>, <k6, v6>.
Local Map: Machine 1 emits <nk1, nv1>, <nk2, nv2>, <nk3, nv3>; Machine 2 emits <nk2, nv4>, <nk2, nv5>, <nk1, nv6>.
Global Shuffle: records are routed by key, so the nk2 records (<nk2, nv4>, <nk2, nv5>, <nk2, nv2>) land on one machine and the nk1/nk3 records (<nk1, nv1>, <nk3, nv3>, <nk1, nv6>) on the other.
Local Sort: each machine sorts its records by key, e.g. <nk1, nv1>, <nk1, nv6>, <nk3, nv3>.
Local Reduce: each key's values are aggregated, here into counts: <nk2, 3>, <nk1, 2>, <nk3, 1>.
SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14

user:
userid  age  gender
111     25   female
222     32   male

pv_users (page_view X user):
pageid  age
1       25
2       25
1       32
Hive QL – Join
Map: emit userid as key with a tagged value <tag, column>.
  From page_view (tag 1, value pageid): key 111 → <1,1>, <1,2>; key 222 → <1,1>
  From user (tag 2, value age): key 111 → <2,25>; key 222 → <2,32>
Shuffle and Sort: all values for a given userid meet at one reducer.
  Key 111: <1,1>, <1,2>, <2,25>
  Key 222: <1,1>, <2,32>
Reduce: for each key, cross the page_view values with the user values to produce (pageid, age) rows.
Hive QL – Join in Map Reduce
Outer Joins
INSERT INTO TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM page_view pv FULL OUTER JOIN user u
ON (pv.userid = u.userid)
WHERE pv.date = '2008-03-03';
Joins
Only equality joins with conjunctions supported
Future:
Pruning of values sent from map to reduce on the basis of projections
Make Cartesian product more memory efficient
Map-side joins:
Hash joins if one of the tables is very small (see the sketch below)
Exploit pre-sorted data by doing a map-side merge join
Join To Map Reduce
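A map-side hash join can be requested with a hint when the small table fits in memory; a minimal sketch against the page_view and user tables from the join example:
SELECT /*+ MAPJOIN(u) */ pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);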
SQL:
FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key
SELECT …
A:            B:            C:
key  av       key  bv       key  cv
1    111      1    222      1    333

Map Reduce job 1: A join B → AB
key  av   bv
1    111  222

Map Reduce job 2: AB join C → ABC
key  av   bv   cv
1    111  222  333
Hive Optimizations – Merge Sequential Map Reduce Jobs
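Because both joins use the same key (a.key), Hive can merge the two jobs above into a single Map Reduce job. A minimal sketch of the full query (the column names av, bv, cv are illustrative):
SELECT a.key, a.av, b.bv, c.cv
FROM a JOIN b ON (a.key = b.key) JOIN c ON (a.key = c.key);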
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
pv_users:
pageid  age
1       25
2       25
1       32
2       25

Result:
pageid  age  count
1       25   1
2       25   2
1       32   1
Hive QL – Group By
Map: each mapper reads a slice of pv_users and emits key <pageid, age> with value 1.
  Mapper 1 input (1, 25), (2, 25) → <1,25>:1, <2,25>:1
  Mapper 2 input (1, 32), (2, 25) → <1,32>:1, <2,25>:1
Shuffle and Sort: identical keys go to the same reducer.
  Reducer 1 receives <1,25>:1, <1,32>:1
  Reducer 2 receives <2,25>:1, <2,25>:1
Reduce: sum the values per key, yielding the counts shown on the previous slide.
Hive QL – Group By in Map Reduce
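Hive can also pre-aggregate inside the mappers to cut shuffle volume; a sketch of enabling map-side (hash-based) partial aggregation via the hive.map.aggr setting:
SET hive.map.aggr = true;   -- partial aggregates computed in the map phase
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;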
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid;
page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
2       111     9:08:20

Result:
pageid  count_distinct_userid
1       2
2       1
Hive QL – Group By with Distinct
Map: each mapper emits the pair <pageid, userid> as the key (no value needed).
  Mapper 1 input: (1, 111, 9:08:01), (2, 111, 9:08:13)
  Mapper 2 input: (1, 222, 9:08:14), (2, 111, 9:08:20)
Shuffle and Sort: duplicate <pageid, userid> pairs land next to each other.
  Reducer 1 receives <1,111>, <2,111>, <2,111>
  Reducer 2 receives <1,222>
Reduce: each reducer counts every distinct (pageid, userid) pair once, giving partial counts
  Reducer 1: pageid 1 → 1, pageid 2 → 1
  Reducer 2: pageid 1 → 1
which are then combined into the final result on the previous slide.
Hive QL – Group By with Distinct in Map Reduce
FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\013'
  SELECT pv_users.age, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;
Inserts into Files, Tables and Local Files
Thank You
Editor's Notes

  • #3 Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop".
  • #4 The Hadoop experiment: used SQL on Hadoop; required more hardware. In data warehousing, partial availability/resilience/scale is more important than ACID (Atomicity, Consistency, Isolation, Durability). Hive is still intended as a tool for long-running, batch-oriented queries over massive data; it's not "real-time" in any sense.
  • #5 Scribe is a server for aggregating log data. A file server is a computer attached to a network whose primary purpose is to provide shared disk access, i.e. shared storage of computer files (documents, sound files, photographs, movies, images, databases, etc.) that can be accessed by the workstations attached to the same network.
  • #6 Will talk more about the metastore.
  • #10 SerDe is short for Serializer/Deserializer. A SerDe allows Hive to read data from a table and write it back out to HDFS in any custom format. Apache Thrift is a software framework for scalable cross-language services development; it combines a software stack with a code generation engine. The planner is responsible for generating the execution plan for the parsed query.
  • #11–13 Creates a table (here, sample) with two columns and a partition column called ds. The partition column is a virtual column: it is not part of the data itself but is derived from the partition a particular dataset is loaded into. By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a). SHOW TABLES '.*s' lists all the tables that end with 's'; the pattern matching follows Java regular expressions.
  • #14 We assume there are only 2 mappers and 2 reducers. Each machine runs 1 mapper and 1 reducer.