Introduction to Hive for Hadoop

Provides an introduction to Hive. This was given at the 1st Boston Hadoop User Meetup Group on October 28th, 2009.


1. Hive
   - Boston Hadoop Meetup
   - October 2009
   - Ryan LeCompte
   - Sr. Software Engineer
   - ScanScout

2. What's the problem?
   - Data! Companies are no longer dealing with gigabytes, but rather terabytes of data
     - Large amounts of data to analyze
     - Researchers want to study and understand the data
     - Business folks want to see the data and metrics sliced and diced in various ways
     - Everyone is impatient: give me answers now
   - Hadoop (by itself) helps solve these issues, but let's face it: writing complex map/reduce jobs in Java (or even Hadoop Streaming) can be tedious and error-prone
     - Joining across large datasets is quite tricky

3. Hive: Putting structure on top of Hadoop
   - Originally built by Facebook; used to analyze and query their ~15 TB of incoming log data each day (reporting / ad optimization)
   - Puts a schema/structure on log data housed inside Hadoop/HDFS (via Hive tables)
   - Provides a SQL-like query language for writing concise queries on data in Hive tables
   - The Hive engine compiles queries into efficiently chained map/reduce jobs (in our case, faster than our Java-based map/reduce jobs)
   - Automatically figures out the number of reducers needed per Hive query based on input data size, etc.
   - Results can be pumped back into a Hive table, HDFS, or out to a flat file on disk (see the sketch after this slide)

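For illustration only (not part of the original deck): a minimal sketch of the three output targets mentioned above, written against the product_views table defined on slide 6. The summary table name and output paths are hypothetical, and exact behavior varies across Hive versions.

   -- 1) results back into another Hive table (daily_view_counts is a hypothetical table)
   INSERT OVERWRITE TABLE daily_view_counts
   SELECT dt, COUNT(1) FROM product_views GROUP BY dt;

   -- 2) results into an HDFS directory (path is hypothetical)
   INSERT OVERWRITE DIRECTORY '/output/daily_view_counts'
   SELECT dt, COUNT(1) FROM product_views GROUP BY dt;

   -- 3) results to a flat file on the local disk of the machine running the Hive shell
   INSERT OVERWRITE LOCAL DIRECTORY '/tmp/daily_view_counts'
   SELECT dt, COUNT(1) FROM product_views GROUP BY dt;
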
4. Hive Integration & Workflow

5. Hive Tables
   - Hive has a command-line shell interface (similar to the MySQL shell) where table creation statements and queries can be executed (e.g., SHOW TABLES; see the sketch after this slide)
   - Hive tables consist of primitive and aggregate column data types (INT, STRING, MAP, LIST, etc.) that are delimited by certain characters
   - Tables can be associated with a SerDe (serializer/deserializer) class that is used to interpret/parse data loaded into the Hive table
   - Tables can be partitioned (by date, for example) and also bucketed (which may improve certain queries and joins)

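A minimal sketch, not from the original slides, of the kinds of statements you might run in the Hive shell against the tables created on the next slide; these are standard HiveQL statements, though output formatting differs between Hive versions.

   SHOW TABLES;                            -- list the tables known to the metastore
   DESCRIBE product_views;                 -- column names and types for a table
   SHOW PARTITIONS product_views;          -- partitions that have been loaded (e.g., dt=2009-10-20)
   SELECT * FROM product_views LIMIT 10;   -- peek at a few rows
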
6. Hive Table Creation
   - Example: a high-traffic online store that logs when users view products and purchase products

   Product Views Table:

   CREATE TABLE product_views (
     userid      STRING,
     productid   INT,
     viewtime    STRING,
     country     STRING,
     price       DOUBLE,
     otherparams MAP<STRING,STRING>)
   PARTITIONED BY (dt STRING)
   ROW FORMAT DELIMITED
     FIELDS TERMINATED BY '01'
     COLLECTION ITEMS TERMINATED BY '02'
     MAP KEYS TERMINATED BY '03'
     LINES TERMINATED BY '12'
   STORED AS SEQUENCEFILE;

   Product Purchases Table:

   CREATE TABLE product_purchases (
     userid       STRING,
     productid    INT,
     purchasetime STRING,
     country      STRING,
     otherparams  MAP<STRING,STRING>)
   PARTITIONED BY (dt STRING)
   ROW FORMAT DELIMITED
     FIELDS TERMINATED BY '01'
     COLLECTION ITEMS TERMINATED BY '02'
     MAP KEYS TERMINATED BY '03'
     LINES TERMINATED BY '12'
   STORED AS SEQUENCEFILE;

7. Loading Data into Hive Tables
   - The following statements can be executed in the Hive shell:

   LOAD DATA INPATH '/logs/productviews'
   OVERWRITE INTO TABLE product_views PARTITION (dt='2009-10-20');

   LOAD DATA INPATH '/logs/productpurchases'
   OVERWRITE INTO TABLE product_purchases PARTITION (dt='2009-10-20');

8. Hive Query Language
   - The Hive query language is very similar to standard SQL
     - Supports sub-queries and UNION ALL (see the illustrative query after this slide)
     - Supports map/reduce joins across Hive tables (hard to do with regular Java-based map/reduce jobs)
     - Supports simple functions (CONCAT, SUBSTR, ROUND, FLOOR, etc.)
     - Supports aggregation functions (SUM, COUNT, MAX, etc.)
     - Supports GROUP BY and SORT BY
     - Supports LIKE and RLIKE (regular expression matching)
     - Supports specifying your own mapper or reducer script, referenced in the Hive query via TRANSFORM (a use case for this appears later)
     - Supports user-defined simple functions and aggregate functions

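An illustrative sketch, not from the original slides, that combines a few of the features above against the product_views table from slide 6: a sub-query in the FROM clause, an RLIKE filter, GROUP BY, and SORT BY. Support for sorting by a select alias may vary across early Hive versions.

   -- view counts per product for a single day, restricted by a regex on country
   SELECT t.productid, COUNT(1) AS views
   FROM (
     SELECT productid
     FROM product_views
     WHERE dt = '2009-10-20'
       AND country RLIKE 'US|CA'
   ) t
   GROUP BY t.productid
   SORT BY views;
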
9. Hive Query Example 1
   - Number of unique users (by country) who viewed each product:

   SELECT productid, country, COUNT(DISTINCT userid)
   FROM product_views
   GROUP BY productid, country;

10. Hive Query Example 2
   - For each user, display how many times a purchased product was viewed by that user:

   SELECT pp.userid, pp.productid, COUNT(pv.productid)
   FROM product_views pv
   JOIN product_purchases pp
     ON (pv.userid = pp.userid AND pv.productid = pp.productid)
   GROUP BY pp.userid, pp.productid;

11. Hive Tips & Tricks
   - Create your Hive tables as sequence files and load compressed (gzip) data into them
   - Don't create one big Hive table for all of your data; create multiple tables that are partitioned (e.g., by date) and take advantage of simple Hive JOIN operations between tables
   - Take advantage of multi-table inserts that read from the same source table to avoid redundant full table scans (see the sketch after this slide)
   - If you have STRING columns that contain multiple comma-delimited values (e.g., col1='val1,val2,val3'), use a TRANSFORM with a custom mapper script to break the column up into multiple key/value pairs (also sketched below)

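A hedged sketch of a multi-table insert, not from the original deck: the source table is scanned once and the results feed two target tables. Both target tables (views_by_country, views_by_product) are hypothetical and would need to be created first.

   FROM product_views pv
   INSERT OVERWRITE TABLE views_by_country
     SELECT pv.country, COUNT(1)
     GROUP BY pv.country
   INSERT OVERWRITE TABLE views_by_product
     SELECT pv.productid, COUNT(1)
     GROUP BY pv.productid;

And a sketch of the TRANSFORM trick for multi-valued STRING columns: split_values.py is a hypothetical streaming script (reading tab-separated lines from stdin and emitting one userid/value pair per line), shipped to the cluster with ADD FILE; the 'tags' map key is likewise just for illustration.

   ADD FILE split_values.py;

   SELECT TRANSFORM (userid, otherparams['tags'])
          USING 'python split_values.py'
          AS (userid, tag)
   FROM product_views;
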
12. Helpful Links
   - Hive Wiki (Apache)
   - Hive Introduction Video (Cloudera)
   - Rethinking the Data Warehouse with Hadoop and Hive (Facebook)
   - Hadoop Development at Facebook: Hive and HDFS (Facebook)
