Introduction to Hive for Hadoop

    Introduction to Hive for Hadoop - Presentation Transcript

    1. Hive
      • Hive
      • Boston Hadoop Meetup
      • October 2009
      • Ryan LeCompte
      • Sr. Software Engineer
      • ScanScout
    2. What's the problem?
      • Data! Companies are no longer dealing with gigabytes, but rather terabytes of data
        • Large amount of data to analyze
        • Researchers want to study and understand data
        • Business folks want to see the data and metrics sliced and diced in various ways
        • Everyone is impatient – give me answers now
      • Hadoop (by itself) helps solve these issues, but let's face it: writing complex map/reduce jobs in Java (or even Hadoop Streaming) can be tedious and error-prone
        • Joining across large datasets is quite tricky
    3. Hive: Putting structure on top of Hadoop
      • Originally built by Facebook; used to analyze and query their incoming ~15TB of log data each day (reporting / ad optimization)
      • Puts a schema/structure to log data housed inside of Hadoop/HDFS (via Hive tables)
      • Provides a SQL-like query language for writing concise queries on data in Hive tables
      • Hive engine compiles the queries into efficiently chained map-reduce jobs (in our case, faster than Java-based map/reduce jobs)
      • Automatically figures out number of reducers needed per Hive query based on data input size, etc.
      • Results can be pumped back into a Hive table, HDFS, or out to a flat file on disk
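To make the reducer estimate concrete, here is a minimal Python sketch of the kind of calculation involved: divide the input size by a bytes-per-reducer target and cap the result. Hive exposes settings named hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max for this; the default values used below are assumptions for illustration and differ between Hive versions, and the function name is hypothetical.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Estimate a reducer count from the input size.

    bytes_per_reducer and max_reducers stand in for the Hive settings
    hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max.
    The defaults here are illustrative assumptions, not authoritative values.
    """
    if input_bytes <= 0:
        return 1
    # Integer ceiling division avoids floating-point rounding on large inputs.
    wanted = -(-input_bytes // bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    # Roughly 15 TB of input, the daily log volume mentioned above.
    print(estimate_reducers(15 * 10**12))
```

With these assumed defaults a small input gets one reducer, while a very large input is capped at the maximum rather than growing without bound.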
    4. Hive Integration & Workflow
    5. Hive Tables
      • Hive has a command-line shell interface (similar to the MySQL shell) where table creation statements and queries can be executed (e.g., SHOW TABLES)
      • Hive tables consist of columns with primitive and complex data types (INT, STRING, MAP, ARRAY, etc.) whose values are separated by configurable delimiter characters
      • Tables can be associated with a SerDe (serialization/deserialization) class that can be used to interpret/parse data that is loaded into the Hive table
      • Tables can be partitioned (by date, for example) and also bucketed (may improve certain queries and joins)
    6. Hive Table Creation
      • Example: A high-traffic online store that logs when users view products and purchase products
      • Product Views Table
      • CREATE TABLE product_views(userid STRING, productid INT, viewtime STRING, country STRING, price DOUBLE, otherparams MAP<STRING,STRING>) PARTITIONED BY(dt STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' LINES TERMINATED BY '\012' STORED AS SEQUENCEFILE;
      • Product Purchases Table
      • CREATE TABLE product_purchases(userid STRING, productid INT, purchasetime STRING, country STRING, otherparams MAP<STRING,STRING>) PARTITIONED BY(dt STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' LINES TERMINATED BY '\012' STORED AS SEQUENCEFILE;
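To show what the delimiter clauses mean for a single record, the following Python sketch parses one product_views row. It assumes the delimiters in the statements above are the octal escapes \001, \002 and \003 (Ctrl-A, Ctrl-B, Ctrl-C); the function name and the sample values are hypothetical. The dt partition column is not part of the row text, because Hive derives it from the partition the data was loaded into.

```python
FIELD_SEP = "\x01"  # FIELDS TERMINATED BY '\001'
ITEM_SEP = "\x02"   # COLLECTION ITEMS TERMINATED BY '\002'
KEY_SEP = "\x03"    # MAP KEYS TERMINATED BY '\003'


def parse_product_view(line):
    """Split one delimited product_views record into typed columns."""
    fields = line.rstrip("\n").split(FIELD_SEP)
    userid, productid, viewtime, country, price, otherparams = fields

    # The MAP column holds key/value entries: entries are separated by
    # ITEM_SEP, and each key is separated from its value by KEY_SEP.
    params = {}
    if otherparams:
        for item in otherparams.split(ITEM_SEP):
            key, _, value = item.partition(KEY_SEP)
            params[key] = value

    return {
        "userid": userid,
        "productid": int(productid),
        "viewtime": viewtime,
        "country": country,
        "price": float(price),
        "otherparams": params,
    }


if __name__ == "__main__":
    # Hypothetical sample record, built with the same delimiters.
    sample = FIELD_SEP.join([
        "u42", "1001", "2009-10-20 13:05:00", "US", "19.99",
        "ref" + KEY_SEP + "email" + ITEM_SEP + "browser" + KEY_SEP + "firefox",
    ]) + "\n"
    print(parse_product_view(sample))
```

Control characters are a common delimiter choice because they rarely occur inside log values, so fields do not need quoting or escaping.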
    7. Loading Data into Hive Tables
      • The following statements can be executed in the Hive shell:
      • LOAD DATA INPATH '/logs/productviews' OVERWRITE INTO TABLE product_views PARTITION(dt='2009-10-20');
      • LOAD DATA INPATH '/logs/productpurchases' OVERWRITE INTO TABLE product_purchases PARTITION(dt='2009-10-20');
    8. Hive Query Language
      • Hive query language is very similar to standard SQL
        • Supports sub-queries, UNION ALL
        • Supports map/reduce joins across Hive tables (hard to do with regular Java-based map/reduce jobs)
        • Supports simple functions (CONCAT, SUBSTR, ROUND, FLOOR, etc.)
        • Supports aggregation functions (SUM, COUNT, MAX, etc.)
        • Supports GROUP BY and SORT BY
        • Supports LIKE and RLIKE (regular expression matching)
        • Supports specifying your own mapper script or reducer script referenced in the Hive query via TRANSFORM (use case for this appears later)
        • Supports specifying user-defined simple functions and aggregate functions
    9. Hive Query Example 1
      • Number of unique users (by country) who viewed each product
      • SELECT productid,country,COUNT(DISTINCT userid) FROM product_views GROUP BY productid,country;
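For readers who want to see what this query computes, here is a small Python sketch of the same grouping done in memory: rows are grouped by (productid, country), which plays the role of the map output key, and distinct user ids are counted per group, which is the reduce step. It illustrates the semantics only and is not how Hive executes the query; the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def unique_viewers(views):
    """Count distinct users per (productid, country).

    views is an iterable of (userid, productid, country) tuples.
    """
    users_by_group = defaultdict(set)
    for userid, productid, country in views:
        # A set keeps each user once per group, like COUNT(DISTINCT userid).
        users_by_group[(productid, country)].add(userid)
    return {group: len(users) for group, users in users_by_group.items()}


if __name__ == "__main__":
    sample_views = [
        ("u1", 1001, "US"),
        ("u1", 1001, "US"),  # repeat view by the same user
        ("u2", 1001, "US"),
        ("u3", 1001, "CA"),
    ]
    for group, count in sorted(unique_viewers(sample_views).items()):
        print(group, count)
```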
    10. Hive Query Example 2
      • For each user, count how many times that user viewed each product they purchased
      • SELECT pp.userid, pp.productid, COUNT(pv.productid)
      • FROM product_views pv
      • JOIN product_purchases pp
      • ON (pv.userid = pp.userid AND pv.productid = pp.productid)
      • GROUP BY pp.userid,pp.productid;
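The join above can be sketched in plain Python to make its result explicit. This in-memory version assumes both inputs fit in memory, which is exactly the assumption that fails at scale and the reason the join is left to Hive; the function name and sample data are hypothetical.

```python
from collections import Counter


def views_of_purchased_products(views, purchases):
    """Mimic the inner join and GROUP BY on (userid, productid).

    views and purchases are iterables of (userid, productid) tuples.
    """
    view_counts = Counter(views)
    result = Counter()
    for key in purchases:
        # Inner join: purchases with no matching view produce no row.
        if key in view_counts:
            # Each purchase row pairs with every matching view row.
            result[key] += view_counts[key]
    return dict(result)


if __name__ == "__main__":
    sample_views = [("u1", 1), ("u1", 1), ("u1", 2), ("u2", 1)]
    sample_purchases = [("u1", 1), ("u2", 3)]
    print(views_of_purchased_products(sample_views, sample_purchases))
```

As with the SQL inner join, a user who purchased the same product more than once has the view count multiplied by the number of purchase rows, so purchases may need to be deduplicated first if that is not the intent.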
    11. Hive Tips & Tricks
      • Create your Hive tables as sequence files and load compressed (gzip) data into them
      • Don't create one big Hive table for all of your data – create multiple tables that are partitioned (e.g., by date) and take advantage of simple Hive JOIN operations between tables
      • Take advantage of multi-table inserts (several INSERT clauses reading from the same source table in one query) to avoid redundant full table scans
      • If you have STRING columns that contain multiple comma-delimited values (e.g., col1='val1,val2,val3') then use a TRANSFORM with a custom mapper for breaking up the column into multiple key/value pairs
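As a sketch of the custom mapper in the last tip: Hive's TRANSFORM passes the selected columns to the script on standard input as tab-separated lines and reads tab-separated lines back from standard output. The script below assumes two input columns, a user id and the comma-delimited column, and emits one row per value. The column layout and the file name split_values.py are assumptions for illustration.

```python
import sys


def split_values(lines):
    """Turn 'userid<TAB>val1,val2,val3' lines into one 'userid<TAB>val' row per value."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip rows that do not have exactly two columns
        userid, col1 = fields
        for value in col1.split(","):
            value = value.strip()
            if value:  # drop empty entries such as those in 'a,,b'
                yield f"{userid}\t{value}"


# Read from stdin only when input is piped in, as it is when Hive runs the script.
if __name__ == "__main__" and not sys.stdin.isatty():
    for row in split_values(sys.stdin):
        print(row)
```

A query would then register the script with ADD FILE and reference it along the lines of SELECT TRANSFORM(userid, col1) USING 'python split_values.py' AS (userid, value) FROM some_table, where some_table and the column names are placeholders.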
    12. Helpful Links
      • Hive Wiki (Apache)
      • Hive Introduction Video (Cloudera)
      • Rethinking the Data Warehouse with Hadoop and Hive (Facebook)
      • Hadoop Development at Facebook: Hive and HDFS (Facebook)
