Data Warehousing on Hadoop
HIVE
Need for High-Level Languages
Hadoop is great for large-data processing!
But writing Java programs for everything is verbose and slow
Analysts don’t want to (or can’t) write Java
Solution: develop higher-level data-processing languages
Hive: HQL is like SQL
Pig: Pig Latin is a bit like Perl
Why Hive?
Problem: data, data and more data
200 GB per day in March 2008
2+ TB (compressed) raw data per day today
The Hadoop experiment
Availability and scalability much superior to commercial DBs
Efficiency not that great; required more hardware
Partial availability/resilience/scale more important than ACID
Problem: programmability and metadata
Map-reduce is hard to program (users know SQL/bash/Python)
Need to publish data in well-known schemas
HIVE: Components
Shell: allows interactive queries
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (MapReduce, HDFS, metadata)
Metastore: schema, location in HDFS, SerDe
Data Model
Tables
Typed columns (int, float, string, boolean)
Also list and map types (for JSON-like data)
Partitions
For example, range-partition tables by date
Command: PARTITIONED BY
Buckets
Hash partitions within ranges (useful for sampling and join optimization)
Command: CLUSTERED BY
Metastore
Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Holds partitioning information
Can be stored in Derby, MySQL, and many other relational databases
Physical Layout
Warehouse directory in HDFS
E.g., /user/hive/warehouse
Tables stored in subdirectories of the warehouse
Partitions form subdirectories of tables
Actual data stored in flat files
Control-character-delimited text, or SequenceFiles
With a custom SerDe, can use arbitrary formats
HIVE: Components (Architecture)
[Architecture diagram: the Hive CLI (DDL, Queries, Browsing) and a
Mgmt. Web UI sit on top; Hive QL passes through Parser, Planner, and
Execution down to Map Reduce and HDFS; the MetaStore is reached through
a Thrift API; the SerDe layer handles Thrift, Jute, JSON, and other formats]
Examples – DDL Operations
CREATE TABLE sample (foo INT, bar STRING)
PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;
Examples – DML Operations
LOAD DATA LOCAL INPATH './sample.txt'
OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
LOAD DATA INPATH '/user/falvariz/hive/sample.txt'
OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
Running Custom Map/Reduce Scripts
FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script'
    AS (dt, uid)
  CLUSTER BY (dt)) map
INSERT INTO TABLE pv_users_reduced
SELECT TRANSFORM(map.dt, map.uid) USING 'reduce_script' AS (date, count);
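Hive streams each input row to the TRANSFORM script as one tab-separated line on stdin and parses the script's stdout the same way. A minimal sketch of what the 'map_script' above might look like in Python (the exact column handling is an assumption, not taken from the slides):

```python
import sys

def map_line(line):
    """Turn one tab-separated (userid, date) input row into a (dt, uid) row."""
    userid, date = line.rstrip("\n").split("\t")
    # Emit the date first so CLUSTER BY (dt) can group rows by date downstream.
    return "\t".join([date, userid])

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hive pipes one row per line to the script and reads rows back the same way.
    for line in stdin:
        stdout.write(map_line(line) + "\n")
```

'reduce_script' would follow the same stdin/stdout convention, seeing all rows for one dt value consecutively thanks to CLUSTER BY.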
(Simplified) Map Reduce Review
Machine 1 holds <k1, v1>, <k2, v2>, <k3, v3>; Machine 2 holds <k4, v4>, <k5, v5>, <k6, v6>
Local Map: each machine turns its records into new pairs, e.g. <nk1, nv1>, <nk2, nv2>, <nk3, nv3> and <nk2, nv4>, <nk2, nv5>, <nk1, nv6>
Global Shuffle: all pairs with the same new key are routed to the same machine
Local Sort: each machine sorts its pairs by key, bringing <nk2, nv4>, <nk2, nv5>, <nk2, nv2> together
Local Reduce: values are aggregated per key, yielding <nk2, 3>, <nk1, 2>, <nk3, 1>
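The four phases above can be simulated in a few lines of Python. The `mapreduce` helper below is an illustrative sketch of the data flow, not Hadoop's API:

```python
from collections import defaultdict

def mapreduce(records, map_fn, reduce_fn):
    """Toy MapReduce: local map, global shuffle by key, local sort + reduce."""
    # Local Map: each input record may emit several <nk, nv> pairs.
    mapped = [pair for rec in records for pair in map_fn(rec)]
    # Global Shuffle: route all pairs with the same key together.
    groups = defaultdict(list)
    for nk, nv in mapped:
        groups[nk].append(nv)
    # Local Sort + Local Reduce: aggregate the values for each key.
    return {nk: reduce_fn(nk, nvs) for nk, nvs in sorted(groups.items())}

# Counting occurrences of each new key reproduces the <nk2, 3>, <nk1, 2>,
# <nk3, 1> result from the slide.
counts = mapreduce(
    ["nk1", "nk2", "nk3", "nk2", "nk2", "nk1"],
    map_fn=lambda rec: [(rec, 1)],
    reduce_fn=lambda nk, nvs: sum(nvs),
)
```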
Hive QL – Join
SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);

page_view:                      user:                  pv_users:
pageid  userid  time            userid  age  gender    pageid  age
1       111     9:08:01    X    111     25   female =  1       25
2       111     9:08:13         222     32   male      2       25
1       222     9:08:14                                1       32
Hive QL – Join in Map Reduce
page_view:                      user:
pageid  userid  time            userid  age  gender
1       111     9:08:01         111     25   female
2       111     9:08:13         222     32   male
1       222     9:08:14
Map tags each row with its table (1 = page_view, 2 = user), keyed on userid:
key   value          key   value
111   <1,1>          111   <2,25>
111   <1,2>          222   <2,32>
222   <1,1>
Shuffle & Sort brings both tags for a userid to one reducer:
key   value          key   value
111   <1,1>          222   <1,1>
111   <1,2>          222   <2,32>
111   <2,25>
Reduce joins the tagged tuples for each key.
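The tagging and per-key cross product can be sketched in Python. This is an illustration of the mechanics (the function name and tuple layout are mine), not Hive's implementation:

```python
from collections import defaultdict

def reduce_side_join(page_view, user):
    """Tag tuples by source table, shuffle on userid, join per key in reduce."""
    shuffle = defaultdict(list)
    # Map phase: tag page_view rows with 1 and user rows with 2.
    for pageid, userid, _time in page_view:
        shuffle[userid].append((1, pageid))
    for userid, age, _gender in user:
        shuffle[userid].append((2, age))
    # Reduce phase: cross the two tag groups for each userid.
    joined = []
    for userid, values in shuffle.items():
        pages = [v for tag, v in values if tag == 1]
        ages = [v for tag, v in values if tag == 2]
        joined.extend((pageid, age) for pageid in pages for age in ages)
    return joined

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]
pv_users = reduce_side_join(page_view, user)
```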
Joins
Outer Joins:
INSERT INTO TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM page_view pv FULL OUTER JOIN user u
ON (pv.userid = u.userid)
WHERE pv.date = '2008-03-03';
Join To Map Reduce
Only equality joins with conjunctions are supported
Future:
Pruning of values sent from map to reduce on the basis of projections
Make the Cartesian product more memory-efficient
Map-side joins:
Hash joins if one of the tables is very small
Exploit pre-sorted data by doing a map-side merge join
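A map-side hash join avoids the shuffle entirely when one table is small. A single-process Python sketch of the idea (in Hadoop, each mapper would hold its own copy of the small table's hash map; the function name is mine):

```python
def map_side_hash_join(big, small):
    """Build a hash map from the small table once, then each 'mapper' probes
    it locally -- the big table never needs to be shuffled."""
    lookup = {userid: age for userid, age in small}  # broadcast small table
    return [(pageid, lookup[userid])
            for pageid, userid in big if userid in lookup]

page_view = [(1, 111), (2, 111), (1, 222)]   # (pageid, userid)
user = [(111, 25), (222, 32)]                # (userid, age)
pv_users = map_side_hash_join(page_view, user)
```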
Hive Optimizations – Merge Sequential Map Reduce Jobs
SQL:
FROM (a join b on a.key = b.key) join c on a.key = c.key
SELECT …

A:              B:              C:
key  av         key  bv        key  cv
1    111        1    222       1    333

Map Reduce over A and B yields AB:
key  av   bv
1    111  222

A second Map Reduce over AB and C yields ABC:
key  av   bv   cv
1    111  222  333
Hive QL – Group By
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

pv_users:          Result:
pageid  age        pageid  age  count
1       25         1       25   1
2       25         2       25   2
1       32         1       32   1
2       25
Hive QL – Group By in Map Reduce
pv_users split across two mappers:
pageid  age        pageid  age
1       25         1       32
2       25         2       25
Map emits <(pageid, age), 1> pairs:
key     value      key     value
<1,25>  1          <1,32>  1
<2,25>  1          <2,25>  1
Shuffle & Sort brings equal keys to one reducer:
key     value      key     value
<1,25>  1          <2,25>  1
<1,32>  1          <2,25>  1
Reduce sums the counts per key.
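The map/shuffle/reduce steps above collapse into a short Python sketch (the function name is mine; a real job would run the map and reduce phases on separate machines):

```python
from collections import defaultdict

def group_by_count(pv_users):
    """Map each row to <(pageid, age), 1>, then sum the 1s per key in reduce."""
    counts = defaultdict(int)
    for pageid, age in pv_users:       # map + shuffle on the (pageid, age) key
        counts[(pageid, age)] += 1     # reduce: sum the counts for each key
    return dict(counts)

pv_users = [(1, 25), (2, 25), (1, 32), (2, 25)]
result = group_by_count(pv_users)
```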
Hive QL – Group By with Distinct
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid;

page_view:                  Result:
pageid  userid  time        pageid  count_distinct_userid
1       111     9:08:01     1       2
2       111     9:08:13     2       1
1       222     9:08:14
2       111     9:08:20
Hive QL – Group By with Distinct in Map Reduce
page_view split across two mappers:
pageid  userid  time        pageid  userid  time
1       111     9:08:01     1       222     9:08:14
2       111     9:08:13     2       111     9:08:20
Shuffle and Sort on the <pageid, userid> key:
key      v                  key      v
<1,111>                     <1,222>
<2,111>
<2,111>
Reduce counts the distinct userids seen for each pageid:
pageid  count               pageid  count
1       1                   1       1
                            2       1
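Because duplicates of the same <pageid, userid> pair always meet at one reducer, the reducer only has to count unique userids. A single-process Python sketch of that idea (function name is mine):

```python
from collections import defaultdict

def count_distinct_users(page_view):
    """Shuffle on <pageid, userid> so duplicate pairs collapse at one reducer,
    then count the unique userids per pageid."""
    seen = defaultdict(set)
    for pageid, userid, _time in page_view:
        seen[pageid].add(userid)       # duplicate <pageid, userid> pairs merge
    return {pageid: len(users) for pageid, users in seen.items()}

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"),
             (1, 222, "9:08:14"), (2, 111, "9:08:20")]
distinct = count_distinct_users(page_view)
```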
Inserts into Files, Tables and Local Files
FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, count_distinct(pv_users.userid)
  GROUP BY pv_users.gender
INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY pv_users.age
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY 013
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY pv_users.age;
Thank You
Speaker notes:
  • If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop".
  • The Hadoop experiment: used SQL on Hadoop; required more hardware.
    In data warehousing, partial availability/resilience/scale matter more than ACID (Atomicity, Consistency, Isolation, Durability).
    Hive is still intended as a tool for long-running, batch-oriented queries over massive data; it's not "real-time" in any sense.
  • Scribe is a server for aggregating log data.
    A file server is a computer attached to a network whose primary purpose is to provide shared disk access, i.e. shared storage of computer files (documents, sound files, photographs, movies, images, databases, etc.) that the workstations on the same network can access.
  • Will talk more about the metastore.
  • SerDe is short for Serializer/Deserializer. A SerDe allows Hive to read data from a table and write it back out to HDFS in any custom format.
    Apache Thrift is a software framework for scalable cross-language services development; it combines a software stack with a code-generation engine.
    The planner is responsible for generating the execution plan for the parsed query.
  • CREATE TABLE … PARTITIONED BY creates a table with a partition column (here, ds). The partition column is a virtual column: it is not part of the data itself but is derived from the partition a particular dataset is loaded into. By default, tables are assumed to be in text input format with ^A (Ctrl-A) delimiters.
    SHOW TABLES '.*s' lists all tables whose names end with 's'. The pattern matching follows Java regular expressions.
  • We assume there are only 2 mappers and 2 reducers. Each machine runs 1 mapper and 1 reducer.