Introduction to Hive for Hadoop - Presentation Transcript
Hive
Boston Hadoop Meetup
October 2009
Ryan LeCompte
Sr. Software Engineer
ScanScout
What's the problem?
Data! Companies are no longer dealing with gigabytes of data, but with terabytes
Large amounts of data to analyze
Researchers want to study and understand data
Business folks want to see the data and metrics sliced and diced in various ways
Everyone is impatient – give me answers now
Hadoop (by itself) helps solve these issues, but let's face it: writing complex map/reduce jobs in Java (or even Hadoop Streaming) can be tedious and error-prone
Joining across large datasets is quite tricky
Hive: Putting structure on top of Hadoop
Originally built by Facebook, which uses it to analyze and query the ~15 TB of log data it takes in each day (reporting / ad optimization)
Puts a schema/structure to log data housed inside of Hadoop/HDFS (via Hive tables)
Provides a SQL-like query language for writing concise queries on data in Hive tables
The Hive engine compiles queries into efficiently chained map/reduce jobs (in our case, these ran faster than our Java-based map/reduce jobs)
Automatically determines the number of reducers needed per Hive query based on input data size, etc.
Results can be pumped back into a Hive table, HDFS, or out to a flat file on disk
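As a rough illustration of the points on this slide, here is a minimal HiveQL sketch. The tables (page_hits, user_profiles, hits_by_country), the output paths, and the reducer setting value are all hypothetical and are not from the talk; they only show the shape of a concise query, a join, and the three output destinations.

```sql
-- Hypothetical tables, created here only so the sketch is self-contained.
CREATE TABLE IF NOT EXISTS page_hits (userid STRING, url STRING, country STRING);
CREATE TABLE IF NOT EXISTS user_profiles (userid STRING, age INT);
CREATE TABLE IF NOT EXISTS hits_by_country (country STRING, hits BIGINT);

-- Optional: Hive estimates the reducer count from input size;
-- this setting (illustrative value) controls the bytes handled per reducer.
SET hive.exec.reducers.bytes.per.reducer=1000000000;

-- Show the chain of map/reduce stages Hive plans for a query.
EXPLAIN
SELECT h.country, COUNT(1)
FROM page_hits h
GROUP BY h.country;

-- 1) Write results back into a Hive table.
INSERT OVERWRITE TABLE hits_by_country
SELECT h.country, COUNT(1)
FROM page_hits h
GROUP BY h.country;

-- 2) Join two datasets and write the results to an HDFS directory.
INSERT OVERWRITE DIRECTORY '/tmp/hive_example/hits_by_age'
SELECT p.age, COUNT(1)
FROM page_hits h JOIN user_profiles p ON (h.userid = p.userid)
GROUP BY p.age;

-- 3) Write results to flat files on the local disk.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive_example/hits_by_url'
SELECT h.url, COUNT(1)
FROM page_hits h
GROUP BY h.url;
```

The join in step 2 is a single ON clause, versus a custom map/reduce job when written by hand. Comment lines like these are safest in a script run with hive -f, since older interactive shells may not accept them.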
Hive Integration & Workflow
Hive Tables
Hive has a command-line shell interface (similar to the MySQL shell) where table creation statements and queries can be executed (e.g., SHOW TABLES)
Hive table columns have primitive or complex data types (INT, STRING, MAP, ARRAY, etc.); in delimited text storage, fields, collection items, and map keys are separated by configurable delimiter characters
Tables can be associated with a SerDe (serialization/deserialization) class that can be used to interpret/parse data that is loaded into the Hive table
Tables can be partitioned (by date, for example) and also bucketed, which may speed up certain queries and joins
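The table features listed above can be sketched in HiveQL as follows. The table names (store_events, raw_access_log), the columns, the delimiters, the bucket count, and the regex are assumptions chosen for illustration. The RegexSerDe class name shown is the one shipped in more recent Hive releases; older releases packaged it differently, so treat it as a placeholder for whichever SerDe is available.

```sql
-- List the tables known to Hive, as one would in the shell.
SHOW TABLES;

-- A delimited text table with primitive and complex column types,
-- partitioned by date and bucketed by user id (all values illustrative).
CREATE TABLE IF NOT EXISTS store_events (
  userid    STRING,
  productid INT,
  tags      ARRAY<STRING>,
  attrs     MAP<STRING,STRING>
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

-- Inspect the schema and the partitions loaded so far.
DESCRIBE store_events;
SHOW PARTITIONS store_events;

-- Filtering on the partition column lets Hive read only that day's data.
-- Complex columns are indexed by position (ARRAY) or by key (MAP).
SELECT userid, tags[0], attrs['referrer']
FROM store_events
WHERE dt = '2009-10-01';

-- A table whose rows are parsed by a SerDe class instead of delimiters.
-- RegexSerDe requires every column to be a STRING.
CREATE TABLE IF NOT EXISTS raw_access_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
STORED AS TEXTFILE;
```

Partitioning by a column that most queries filter on (here dt) is the usual motivation: each partition is a separate directory in HDFS, so a date filter avoids scanning the rest of the table.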
Hive Table Creation
Example: A high-traffic online store that logs when users view products and purchase products
Product Views Table
CREATE TABLE product_views(userid STRING, productid INT, viewtime STRING, country STRING, price DOUBLE, otherparams MAP<STRING,STRING>) PARTITIONED BY(dt STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '
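For reference, a complete statement of this shape might look like the sketch below. The column list and partition column are those shown above; the three delimiters, the storage format, the product_purchases table, and the 'referrer' map key are assumptions made only so the example runs, not the values used in the talk.

```sql
-- Product views: columns as on the slide; delimiters are placeholders.
CREATE TABLE IF NOT EXISTS product_views (
  userid      STRING,
  productid   INT,
  viewtime    STRING,
  country     STRING,
  price       DOUBLE,
  otherparams MAP<STRING,STRING>
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

-- Hypothetical companion table for the purchase log.
CREATE TABLE IF NOT EXISTS product_purchases (
  userid       STRING,
  productid    INT,
  purchasetime STRING,
  price        DOUBLE
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Views and average viewed price per country for one day.
SELECT country, COUNT(1), AVG(price)
FROM product_views
WHERE dt = '2009-10-01'
GROUP BY country;

-- Views broken down by a key inside the MAP column ('referrer' is made up).
SELECT otherparams['referrer'], COUNT(1)
FROM product_views
WHERE dt = '2009-10-01'
GROUP BY otherparams['referrer'];

-- Distinct users per product who both viewed and purchased it that day.
SELECT v.productid, COUNT(DISTINCT v.userid)
FROM product_views v
JOIN product_purchases p
  ON (v.userid = p.userid AND v.productid = p.productid)
WHERE v.dt = '2009-10-01' AND p.dt = '2009-10-01'
GROUP BY v.productid;
```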