Data Warehousing on Hadoop
HIVE
Hadoop is great for large-data processing!
But writing Java programs for everything is verbose and slow
Analysts don’t want to (or can’t) write Java
Solution: develop higher-level data processing languages
Hive: HQL is like SQL
Pig: Pig Latin is a bit like Perl
Need for High-Level Languages
Problem: Data, data, and more data
200 GB per day in March 2008
2+ TB (compressed) raw data per day today
The Hadoop Experiment
Availability and scalability much superior to commercial DBs
Efficiency not that great; required more hardware
Partial availability/resilience/scale more important than ACID
Problem: Programmability and Metadata
MapReduce hard to program (users know SQL/bash/Python)
Need to publish data in well-known schemas
Why Hive??
HIVE: Components
Shell: allows interactive queries
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (MR, HDFS, metadata)
Metastore: schema, location in HDFS, SerDe
HIVE: Components
Tables
Typed columns (int, float, string, boolean)
Also list and map types (for JSON-like data)
Partitions
For example, range-partition tables by date
Command: PARTITIONED BY
Buckets
Hash partitions within ranges (useful for sampling, join optimization)
Command: CLUSTERED BY
Data Model
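The partition and bucket declarations above combine in a single DDL statement; a minimal sketch (the table and column names here are illustrative, not from the deck):
CREATE TABLE page_views (userid INT, pageid INT)
PARTITIONED BY (ds STRING)                 -- one subdirectory per date value
CLUSTERED BY (userid) INTO 32 BUCKETS;     -- hash rows into buckets within each partition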
Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Holds partitioning information
Can be stored in Derby, MySQL, and many other
relational databases
Metastore
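To inspect what the metastore has recorded for a table (columns, partitioning, HDFS location, SerDe), DESCRIBE can be used; a quick sketch against the sample table created in the DDL examples below:
DESCRIBE EXTENDED sample;   -- prints the schema plus the metastore's storage metadata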
Warehouse directory in HDFS
E.g., /user/hive/warehouse
Tables stored in subdirectories of warehouse
Partitions form subdirectories of tables
Actual data stored in flat files
Control char-delimited text, or SequenceFiles
With custom SerDe, can use arbitrary format
Physical Layout
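Putting the pieces together, a date-partitioned table such as sample lives under the warehouse directory roughly like this (an illustrative sketch; actual data file names vary):
/user/hive/warehouse/sample/ds=2012-02-24/part-00000
/user/hive/warehouse/sample/ds=2012-02-25/part-00000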
[Architecture diagram] On top: the Hive CLI (DDL, queries, browsing) and a Mgmt. Web UI. HiveQL flows through the Parser and Planner into Execution, which runs over Map Reduce and HDFS. The MetaStore is accessed through a Thrift API; SerDe implementations handle Thrift, Jute, JSON, etc.
HIVE: Components
CREATE TABLE sample (foo INT, bar STRING)
PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;
Examples – DDL Operations
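Delimiters and storage format can also be set at creation time instead of relying on the control-character defaults; a minimal sketch (sample2 is a hypothetical table):
CREATE TABLE sample2 (foo INT, bar STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'   -- tab-delimited instead of ^A
STORED AS TEXTFILE;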
LOAD DATA LOCAL INPATH './sample.txt'
OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
LOAD DATA INPATH '/user/falvariz/hive/sample.txt'
OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
Examples – DML Operations
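LOAD only moves files into place; populating a table from a query uses INSERT. A sketch (staging is a hypothetical source table with matching columns):
INSERT OVERWRITE TABLE sample PARTITION (ds='2012-02-25')
SELECT foo, bar FROM staging;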
FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script' AS (dt, uid)
  CLUSTER BY dt) map
INSERT INTO TABLE pv_users_reduced
  SELECT TRANSFORM(map.dt, map.uid)
  USING 'reduce_script' AS (date, count);
Running Custom Map/Reduce Scripts
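TRANSFORM streams rows to any executable as tab-separated text over stdin/stdout. A minimal sketch using /bin/cat as an identity "script", so no custom code is needed:
FROM pv_users
SELECT TRANSFORM(userid, date) USING '/bin/cat' AS (userid, date);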
(Simplified) Map Reduce Review
Input: Machine 1 holds <k1, v1>, <k2, v2>, <k3, v3>; Machine 2 holds <k4, v4>, <k5, v5>, <k6, v6>.
Local Map: Machine 1 emits <nk1, nv1>, <nk2, nv2>, <nk3, nv3>; Machine 2 emits <nk2, nv4>, <nk2, nv5>, <nk1, nv6>.
Global Shuffle: records are routed by key, so the nk2 records (<nk2, nv4>, <nk2, nv5>, <nk2, nv2>) land on one machine and the nk1/nk3 records (<nk1, nv1>, <nk3, nv3>, <nk1, nv6>) on the other.
Local Sort: each machine sorts its records by key, e.g. <nk1, nv1>, <nk1, nv6>, <nk3, nv3>.
Local Reduce: each key's values are aggregated, here into counts: <nk2, 3>, <nk1, 2>, <nk3, 1>.
SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14

user:
userid  age  gender
111     25   female
222     32   male

pv_users (page_view X user):
pageid  age
1       25
2       25
1       32
Hive QL – Join
Map: emit userid as key with a tagged value <tag, column>.
  From page_view (tag 1, value pageid): key 111 → <1,1>, <1,2>; key 222 → <1,1>
  From user (tag 2, value age): key 111 → <2,25>; key 222 → <2,32>
Shuffle and Sort: all values for a given userid meet at one reducer.
  Key 111: <1,1>, <1,2>, <2,25>
  Key 222: <1,1>, <2,32>
Reduce: for each key, cross the page_view values with the user values to produce (pageid, age) rows.
Hive QL – Join in Map Reduce
Outer Joins
INSERT INTO TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM page_view pv FULL OUTER JOIN user u
ON (pv.userid = u.userid)
WHERE pv.date = '2008-03-03';
Joins
Only equality joins with conjunctions supported
Future:
Pruning of values sent from map to reduce on the basis of projections
Make Cartesian product more memory efficient
Map-side joins:
Hash joins if one of the tables is very small (see the sketch below)
Exploit pre-sorted data by doing a map-side merge join
Join To Map Reduce
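A map-side hash join can be requested with a hint when the small table fits in memory; a minimal sketch against the page_view and user tables from the join example:
SELECT /*+ MAPJOIN(u) */ pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);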
SQL:
FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key
SELECT …
A:            B:            C:
key  av       key  bv       key  cv
1    111      1    222      1    333

Map Reduce job 1: A join B → AB
key  av   bv
1    111  222

Map Reduce job 2: AB join C → ABC
key  av   bv   cv
1    111  222  333
Hive Optimizations – Merge Sequential Map Reduce Jobs
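Because both joins use the same key (a.key), Hive can merge the two jobs above into a single Map Reduce job. A minimal sketch of the full query (the column names av, bv, cv are illustrative):
SELECT a.key, a.av, b.bv, c.cv
FROM a JOIN b ON (a.key = b.key) JOIN c ON (a.key = c.key);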
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
pv_users:
pageid  age
1       25
2       25
1       32
2       25

Result:
pageid  age  count
1       25   1
2       25   2
1       32   1
Hive QL – Group By
Map: each mapper reads a slice of pv_users and emits key <pageid, age> with value 1.
  Mapper 1 input (1, 25), (2, 25) → <1,25>:1, <2,25>:1
  Mapper 2 input (1, 32), (2, 25) → <1,32>:1, <2,25>:1
Shuffle and Sort: identical keys go to the same reducer.
  Reducer 1 receives <1,25>:1, <1,32>:1
  Reducer 2 receives <2,25>:1, <2,25>:1
Reduce: sum the values per key, yielding the counts shown on the previous slide.
Hive QL – Group By in Map Reduce
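Hive can also pre-aggregate inside the mappers to cut shuffle volume; a sketch of enabling map-side (hash-based) partial aggregation via the hive.map.aggr setting:
SET hive.map.aggr = true;   -- partial aggregates computed in the map phase
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;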
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid;
page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
2       111     9:08:20

Result:
pageid  count_distinct_userid
1       2
2       1
Hive QL – Group By with Distinct
Map: each mapper emits the pair <pageid, userid> as the key (no value needed).
  Mapper 1 input: (1, 111, 9:08:01), (2, 111, 9:08:13)
  Mapper 2 input: (1, 222, 9:08:14), (2, 111, 9:08:20)
Shuffle and Sort: duplicate <pageid, userid> pairs land next to each other.
  Reducer 1 receives <1,111>, <2,111>, <2,111>
  Reducer 2 receives <1,222>
Reduce: each reducer counts every distinct (pageid, userid) pair once, giving partial counts
  Reducer 1: pageid 1 → 1, pageid 2 → 1
  Reducer 2: pageid 1 → 1
which are then combined into the final result on the previous slide.
Hive QL – Group By with Distinct in Map Reduce
FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\013'
  SELECT pv_users.age, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;
Inserts into Files, Tables and Local Files
Thank You
Editor's Notes

  • #3 Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop".
  • #4 The Hadoop experiment: used SQL on Hadoop; required more hardware. In data warehousing, partial availability/resilience/scale is more important than ACID (Atomicity, Consistency, Isolation, Durability). Hive is still intended as a tool for long-running, batch-oriented queries over massive data; it's not "real-time" in any sense.
  • #5 Scribe is a server for aggregating log data. A file server is a computer attached to a network whose primary purpose is to provide shared disk access, i.e. shared storage of computer files (documents, sound files, photographs, movies, images, databases, etc.) that can be accessed by the workstations attached to the same network.
  • #6 Will talk more about the metastore.
  • #10 SerDe is short for Serializer/Deserializer. A SerDe allows Hive to read data from a table and write it back out to HDFS in any custom format. Apache Thrift is a software framework for scalable cross-language services development; it combines a software stack with a code generation engine. The planner is responsible for generating the execution plan for the parsed query.
  • #11–13 Creates a table (here, sample) with two columns and a partition column called ds. The partition column is a virtual column: it is not part of the data itself but is derived from the partition a particular dataset is loaded into. By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a). SHOW TABLES '.*s' lists all the tables that end with 's'; the pattern matching follows Java regular expressions.
  • #14 We assume there are only 2 mappers and 2 reducers. Each machine runs 1 mapper and 1 reducer.