Introduction to Hive for Hadoop



Provides an introduction to Hive. This was given at the 1st Boston Hadoop User Meetup Group on October 28th, 2009.

Published in: Technology


  1. Hive
     - Boston Hadoop Meetup
     - October 2009
     - Ryan LeCompte
     - Sr. Software Engineer
     - ScanScout
  2. What's the problem?
     - Data! Companies are no longer dealing with gigabytes, but rather terabytes of data
       - Large amount of data to analyze
       - Researchers want to study and understand data
       - Business folks want to see the data and metrics sliced and diced in various ways
       - Everyone is impatient: give me answers now
     - Hadoop (by itself) helps solve these issues, but let's face it: writing complex map/reduce jobs in Java (or even Hadoop Streaming) can be tedious and error-prone
       - Joining across large datasets is quite tricky
  3. Hive: Putting structure on top of Hadoop
     - Originally built by Facebook; used to analyze and query their incoming ~15TB of log data each day (reporting / ad optimization)
     - Puts a schema/structure on log data housed inside Hadoop/HDFS (via Hive tables)
     - Provides a SQL-like query language for writing concise queries on data in Hive tables
     - Hive engine compiles the queries into efficiently chained map-reduce jobs (in our case, faster than Java-based map/reduce jobs)
     - Automatically figures out the number of reducers needed per Hive query based on data input size, etc.
     - Results can be pumped back into a Hive table, HDFS, or out to a flat file on disk
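To make the reducer estimate mentioned above concrete, here is a minimal Python sketch of a size-based heuristic. It assumes the estimate is roughly "input size divided by a bytes-per-reducer target, capped at a maximum", which corresponds to the Hive settings `hive.exec.reducers.bytes.per.reducer` and `hive.exec.reducers.max`. The function name and the default values are assumptions for illustration, not taken from the slides, and the real planner considers more than input size.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Rough size-based reducer estimate (defaults are illustrative assumptions)."""
    if input_bytes <= 0:
        # Even an empty input needs one reducer to produce the output.
        return 1
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    # Never go below one reducer or above the configured cap.
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    # A 5 GB input asks for 5 reducers under these assumed defaults.
    print(estimate_reducers(5_000_000_000))
    # A 15 TB input hits the cap.
    print(estimate_reducers(15 * 10**12))
```

Lowering the bytes-per-reducer target raises parallelism for a given input, while the cap keeps very large inputs from requesting more reduce tasks than the cluster can sensibly run.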
  4. Hive Integration & Workflow
  5. Hive Tables
     - Hive has a command-line shell interface (similar to the MySQL shell) where table creation statements and queries can be executed (e.g., SHOW TABLES)
     - Hive tables consist of columns with primitive and complex data types (INT, STRING, MAP, ARRAY, etc.) whose values are delimited by configurable characters
     - Tables can be associated with a SerDe (serialization/deserialization) class that is used to interpret/parse data loaded into the Hive table
     - Tables can be partitioned (by date, for example) and also bucketed (may improve certain queries and joins)
  6. Hive Table Creation
     - Example: A high-traffic online store that logs when users view products and purchase products
     - Product Views Table:
       CREATE TABLE product_views(
         userid STRING, productid INT, viewtime STRING, country STRING,
         price DOUBLE, otherparams MAP<STRING,STRING>)
       PARTITIONED BY(dt STRING)
       ROW FORMAT DELIMITED
         FIELDS TERMINATED BY '\001'
         COLLECTION ITEMS TERMINATED BY '\002'
         MAP KEYS TERMINATED BY '\003'
         LINES TERMINATED BY '\012'
       STORED AS SEQUENCEFILE;
     - Product Purchases Table:
       CREATE TABLE product_purchases(
         userid STRING, productid INT, purchasetime STRING, country STRING,
         otherparams MAP<STRING,STRING>)
       PARTITIONED BY(dt STRING)
       ROW FORMAT DELIMITED
         FIELDS TERMINATED BY '\001'
         COLLECTION ITEMS TERMINATED BY '\002'
         MAP KEYS TERMINATED BY '\003'
         LINES TERMINATED BY '\012'
       STORED AS SEQUENCEFILE;
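The delimiters in the table definitions above are control characters (Ctrl-A, Ctrl-B and Ctrl-C, written in octal as '\001', '\002' and '\003'), which can be hard to picture. The following Python sketch shows how one product_views row would be laid out as delimited text and parsed back. It illustrates the row layout only, not Hive's SerDe code; the sample values and the function name `parse_product_view` are made up. The partition column dt is not part of the row, since Hive derives it from the partition the data was loaded into.

```python
FIELD_SEP = "\x01"  # '\001': separates columns
ITEM_SEP = "\x02"   # '\002': separates entries inside a collection or map
KEY_SEP = "\x03"    # '\003': separates a map key from its value

COLUMNS = ["userid", "productid", "viewtime", "country", "price", "otherparams"]


def parse_product_view(line):
    """Parse one delimited text row into a dict keyed by column name."""
    fields = line.rstrip("\n").split(FIELD_SEP)
    row = dict(zip(COLUMNS, fields))
    row["productid"] = int(row["productid"])
    row["price"] = float(row["price"])
    params = {}
    if row["otherparams"]:
        for item in row["otherparams"].split(ITEM_SEP):
            key, _, value = item.partition(KEY_SEP)
            params[key] = value
    row["otherparams"] = params
    return row


# Hypothetical sample row, built with the same delimiters.
sample = FIELD_SEP.join([
    "u42", "1001", "2009-10-20 13:05:00", "US", "19.99",
    ITEM_SEP.join(["referrer" + KEY_SEP + "search", "browser" + KEY_SEP + "firefox"]),
]) + "\n"

if __name__ == "__main__":
    print(parse_product_view(sample))
```

Control characters are a common choice here because they almost never occur inside log values, so fields do not need quoting or escaping.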
  7. Loading Data into Hive Tables
     - The following statements can be executed in the Hive shell:
       LOAD DATA INPATH '/logs/productviews'
         OVERWRITE INTO TABLE product_views PARTITION(dt='2009-10-20');
       LOAD DATA INPATH '/logs/productpurchases'
         OVERWRITE INTO TABLE product_purchases PARTITION(dt='2009-10-20');
  8. Hive Query Language
     - Hive query language is very similar to standard SQL
       - Supports sub-queries, UNION ALL
       - Supports map/reduce joins across Hive tables (hard to do with regular Java-based map/reduce jobs)
       - Supports simple functions (CONCAT, SUBSTR, ROUND, FLOOR, etc.)
       - Supports aggregation functions (SUM, COUNT, MAX, etc.)
       - Supports GROUP BY and SORT BY
       - Supports LIKE and RLIKE (regular expression matching)
       - Supports specifying your own mapper script or reducer script referenced in the Hive query via TRANSFORM (use case for this appears later)
       - Supports specifying user-defined simple functions and aggregate functions
  9. Hive Query Example 1
     - Number of unique users (by country) who viewed each product:
       SELECT productid, country, COUNT(DISTINCT userid)
       FROM product_views
       GROUP BY productid, country;
  10. Hive Query Example 2
      - For each user, display how many times a purchased product was viewed by the user:
        SELECT pp.userid, pp.productid, COUNT(pv.productid)
        FROM product_views pv
        JOIN product_purchases pp
        ON (pv.userid = pp.userid AND pv.productid = pp.productid)
        GROUP BY pp.userid, pp.productid;
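To show what the join query above spares you from writing by hand, here is a small in-memory Python sketch of a reduce-side join. Rows from both tables are tagged with their source, grouped on the join key, and combined per key. It is a simplified illustration of the general technique, not the plan Hive actually generates, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def map_phase(views, purchases):
    """Emit (join key, source tag) pairs for rows from both tables."""
    for userid, productid in views:
        yield (userid, productid), "view"
    for userid, productid in purchases:
        yield (userid, productid), "purchase"


def reduce_phase(tagged_rows):
    """Group by join key and count the joined rows for each key."""
    groups = defaultdict(lambda: {"view": 0, "purchase": 0})
    for key, tag in tagged_rows:
        groups[key][tag] += 1
    results = {}
    for key, counts in groups.items():
        # Inner join: only keys present in both tables produce output.
        if counts["view"] and counts["purchase"]:
            # Each view row pairs with each purchase row for the same key.
            results[key] = counts["view"] * counts["purchase"]
    return results


# Hypothetical sample data: (userid, productid) pairs.
views = [("u1", 10), ("u1", 10), ("u1", 11), ("u2", 10)]
purchases = [("u1", 10), ("u2", 12)]

if __name__ == "__main__":
    print(reduce_phase(map_phase(views, purchases)))
```

Because the count is taken over joined rows, a user who purchased the same product twice would have each view counted twice. That matches the SQL semantics of the query and is worth keeping in mind when reading the results.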
  11. Hive Tips & Tricks
      - Create your Hive tables as sequence files and load compressed (gzip) data into them
      - Don't create one big Hive table for all of your data; create multiple tables that are partitioned (e.g., by date) and take advantage of simple Hive JOIN operations between tables
      - Take advantage of multi-table inserts (several INSERT clauses that read from the same source table in one query) to avoid redundant full table scans
      - If you have STRING columns that contain multiple comma-delimited values (e.g., col1='val1,val2,val3'), then use a TRANSFORM with a custom mapper to break the column up into multiple key/value pairs
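As an illustration of the last tip, here is a minimal Python sketch of a custom mapper for TRANSFORM. Hive streams the selected columns to the script as tab-separated lines on standard input and reads tab-separated lines back from standard output. The script name split_values.py, the column names and the sample rows are all hypothetical. Such a script might be registered with `ADD FILE split_values.py;` and invoked with something like `SELECT TRANSFORM(userid, col1) USING 'python split_values.py' AS (userid, val) FROM some_table;`.

```python
import io


def explode_values(lines):
    """Turn 'userid<TAB>val1,val2,...' lines into one 'userid<TAB>val' line per value."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        userid, _, packed = line.partition("\t")
        for value in packed.split(","):
            value = value.strip()
            if value:
                yield userid + "\t" + value


def run(stdin, stdout):
    """Stream rows from stdin to stdout, as a TRANSFORM script would."""
    for out in explode_values(stdin):
        stdout.write(out + "\n")


# Demo with in-memory streams; a real script would call run(sys.stdin, sys.stdout).
demo_in = io.StringIO("u1\tval1,val2,val3\nu2\tval4\n")
demo_out = io.StringIO()
run(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Keeping the splitting logic in a generator separate from the I/O makes the mapper easy to test locally before running it against a large table.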
  12. Helpful Links
      - Hive Wiki (Apache)
      - Hive Introduction Video (Cloudera)
      - Rethinking the Data Warehouse with Hadoop and Hive (Facebook)
      - Hadoop Development at Facebook: Hive and HDFS (Facebook)