An Introduction to Hive


  • - Quickstart, page 6: If you are not running on HDFS, change hive.metastore.warehouse.dir or create the default directory: sudo mkdir -p -m 1777 /user/hive/warehouse

    - Page 15: 'only outer equi-joins are supported' should be 'only equi-joins are supported' since inner joins are supported (plain 'JOIN' means 'INNER JOIN').

    - Page 17: The AS and USING after the TRANSFORM are reversed (AS should be after USING).
Presentation Transcript

  • An Introduction to Hive: Components and Query Language. Jeff Hammerbacher, Chief Scientist and VP of Product. October 30, 2008
  • Hive Components: A Leaky Database
      ▪ Hadoop
          ▪ HDFS
          ▪ MapReduce (bundles Resource Manager and Job Scheduler)
      ▪ Hive
          ▪ Logical data partitioning
          ▪ Metastore (command line and web interfaces)
          ▪ Query Language
          ▪ Libraries to handle different serialization formats (SerDes)
          ▪ JDBC interface
  • Related Work: Glaringly Incomplete
      ▪ Gamma, Bubba, Volcano, etc.
      ▪ Google: Sawzall
      ▪ Yahoo: Pig
      ▪ IBM Research: JAQL
      ▪ Microsoft: SCOPE
      ▪ Greenplum: YAML MapReduce
      ▪ Aster Data: In-Database MapReduce
      ▪ Business.com: CloudBase
  • Hive Resources
      ▪ Facebook Mirror: http://mirror.facebook.com/facebook/hive
          ▪ Currently the best place to get the Hive distribution
      ▪ Wiki page: http://wiki.apache.org/hadoop/Hive
      ▪ Getting started: http://wiki.apache.org/hadoop/Hive/GettingStarted
      ▪ Query language reference: http://wiki.apache.org/hadoop/Hive/HiveQL
      ▪ Presentations: http://wiki.apache.org/hadoop/Hive/Presentations
      ▪ Roadmap: http://wiki.apache.org/hadoop/Hive/Roadmap
      ▪ Mailing list: hive-users@publists.facebook.com
      ▪ JIRA: https://issues.apache.org/jira/browse/HADOOP/component/12312455
  • Running Hive: Quickstart
      ▪ <install Hadoop>
      ▪ wget http://mirror.facebook.com/facebook/hive/hadoop-0.19/dist.tar.gz
          ▪ (Replace 0.19 with 0.17 if you’re still on 0.17)
      ▪ tar xvzf dist.tar.gz
      ▪ cd dist
      ▪ export HADOOP=<path to bin/hadoop in your Hadoop distribution>
          ▪ Or: edit hadoop.bin.path and hadoop.config.dir in conf/hive-default.xml
      ▪ bin/hive
      ▪ hive>
  • Running Hive: Configuration Details
      ▪ conf/hive-default.xml
          ▪ hadoop.bin.path: Points to bin/hadoop in your Hadoop installation
          ▪ hadoop.config.dir: Points to conf/ in your Hadoop installation
          ▪ hive.exec.scratchdir: HDFS directory where execution information is written
          ▪ hive.metastore.warehouse.dir: HDFS directory managed by Hive
          ▪ The rest of the properties relate to the Metastore
      ▪ conf/hive-log4j.properties
          ▪ Writes the log to /tmp/{user.name}/hive.log by default
      ▪ conf/jpox.properties
          ▪ JPOX is a Java object persistence library used by the Metastore
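To check which values a Hive installation will pick up, the properties named on this slide can be read out of the configuration file with a few lines of Python. This is a minimal sketch that assumes conf/hive-default.xml follows the Hadoop-style `<configuration>`/`<property>` layout; the sample values and the `read_properties` helper are hypothetical placeholders, not Hive defaults.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the Hadoop-style configuration layout.
# The paths are placeholders chosen for illustration only.
SAMPLE_CONFIG = """<configuration>
  <property>
    <name>hadoop.bin.path</name>
    <value>/opt/hadoop/bin/hadoop</value>
  </property>
  <property>
    <name>hadoop.config.dir</name>
    <value>/opt/hadoop/conf</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a dict mapping each property name to its value."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


if __name__ == "__main__":
    for key, value in sorted(read_properties(SAMPLE_CONFIG).items()):
        print(f"{key} = {value}")
```

To inspect a real installation, pass the text of conf/hive-default.xml to `read_properties` instead of the sample string.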
  • Populating Hive: MovieLens Data
      ▪ <cd into your hive directory>
      ▪ wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
      ▪ tar xvzf ml-data.tar__0.gz
      ▪ CREATE TABLE u_data (userid INT, movieid INT, rating INT, unixtime TIMESTAMP) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
          ▪ The first query can take ten seconds or more, as the Metastore needs to be created
      ▪ To confirm our table has been created:
          ▪ SHOW TABLES;
          ▪ DESCRIBE u_data;
      ▪ LOAD DATA LOCAL INPATH 'ml-data/u.data' OVERWRITE INTO TABLE u_data;
      ▪ SELECT COUNT(1) FROM u_data;
          ▪ Should fire off 2 MapReduce jobs and ultimately return a count of 100,000
  • Hive Query Language: Utility Statements
      ▪ SHOW TABLES [table_name | table_name_pattern]
      ▪ DESCRIBE [EXTENDED] table_name [PARTITION (partition_col = partition_col_value, ...)]
      ▪ EXPLAIN [EXTENDED] query_statement
      ▪ SET [EXTENDED]
          ▪ “SET property_name=property_value” to modify a value
  • Hive Query Language: CREATE TABLE Syntax
      ▪ CREATE [EXTERNAL] TABLE table_name
          (col_name data_type [col_comment], ...)
          [PARTITIONED BY (col_name data_type [col_comment], ...)]
          [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS]
          [ROW FORMAT row_format]
          [STORED AS file_format]
          [LOCATION hdfs_path]
      ▪ PARTITION columns are virtual columns; they are not part of the data itself but are derived on load
      ▪ CLUSTERED columns are real columns, hash partitioned into num_buckets folders
      ▪ ROW FORMAT can be used to specify a delimited data set or a custom deserializer
      ▪ Use EXTERNAL with ROW FORMAT, STORED AS, and LOCATION to analyze HDFS files in place
          ▪ “DROP TABLE table_name” can reverse this operation
          ▪ NB: Currently, DROP TABLE will delete both data and metadata
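The idea behind CLUSTERED BY ... INTO num_buckets BUCKETS can be illustrated with a small Python sketch: each row is assigned to a bucket by hashing the clustering column and taking the result modulo the bucket count. The slides do not say which hash function Hive uses, so `zlib.crc32` below is only a stand-in, and the helper names `bucket_for` and `assign_buckets` are hypothetical.

```python
import zlib


def bucket_for(value, num_buckets):
    """Map a column value to a bucket index in the range [0, num_buckets)."""
    # crc32 is a stand-in hash chosen for the sketch; it is not Hive's function.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def assign_buckets(rows, column, num_buckets):
    """Split a list of dict rows into num_buckets lists by hashing one column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return buckets


if __name__ == "__main__":
    rows = [{"userid": uid, "rating": uid % 5 + 1} for uid in range(10)]
    for index, bucket in enumerate(assign_buckets(rows, "userid", 4)):
        print(index, [row["userid"] for row in bucket])
```

Because the bucket depends only on the column value, all rows with the same value land in the same bucket, which is the property that makes bucketing useful for sampling and joins.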
  • Hive Query Language: CREATE TABLE Syntax, Part Two
      ▪ data_type: primitive_type | array_type | map_type
      ▪ primitive_type:
          ▪ TINYINT | INT | BIGINT | BOOLEAN | FLOAT | DOUBLE | STRING
          ▪ DATE | DATETIME | TIMESTAMP
      ▪ array_type: ARRAY < primitive_type >
      ▪ map_type: MAP < primitive_type, primitive_type >
      ▪ row_format:
          ▪ DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
          ▪ SERIALIZER serde_name [WITH PROPERTIES property_name=property_value, property_name=property_value, ...]
      ▪ file_format: SEQUENCEFILE | TEXTFILE
  • Hive Query Language: ALTER TABLE Syntax
      ▪ ALTER TABLE table_name RENAME TO new_table_name;
      ▪ ALTER TABLE table_name ADD COLUMNS (col_name data_type [col_comment], ...);
      ▪ ALTER TABLE table_name DROP partition_spec, partition_spec, ...;
      ▪ Future work:
          ▪ Support for removing or renaming columns
          ▪ Support for altering serialization format
  • Hive Query Language: LOAD DATA Syntax
      ▪ LOAD DATA [LOCAL] INPATH '/path/to/file' [OVERWRITE] INTO TABLE table_name [PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)]
      ▪ You can load data from the local filesystem or anywhere in HDFS (cf. CREATE TABLE EXTERNAL)
      ▪ If you don’t specify OVERWRITE, data will be appended to the existing table
  • Hive Query Language: SELECT Syntax
      ▪ [insert_clause] SELECT [ALL|DISTINCT] select_list
          FROM [table_source|join_source]
          [WHERE where_condition]
          [GROUP BY col_list]
          [ORDER BY col_list]
          [CLUSTER BY col_list]
      ▪ insert_clause: INSERT OVERWRITE destination
      ▪ destination:
          ▪ LOCAL DIRECTORY '/local/path'
          ▪ DIRECTORY '/hdfs/path'
          ▪ TABLE table_name [PARTITION (partition_col = partition_col_value, ...)]
  • Hive Query Language: SELECT Syntax (Joins)
      ▪ join_source: table_source join_clause table_source join_clause table_source ...
      ▪ join_clause:
          ▪ [LEFT OUTER|RIGHT OUTER|FULL OUTER] JOIN ON (equality_expression, equality_expression, ...)
      ▪ Currently, only equi-joins are supported in Hive (a plain JOIN is an inner join).
      ▪ There are two join algorithms
          ▪ Map-side merge join
          ▪ Reduce-side merge join
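As a rough illustration of why only equality conditions fit a reduce-side join, the sketch below imitates the two phases in plain Python: rows from both tables are grouped under their join key, and each group is then combined. It simulates the shuffle with an in-memory dictionary rather than a sort-merge, so it shows the shape of the technique and not Hive's implementation. The `users` table and the function name are hypothetical.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(left, right, left_key, right_key):
    """Inner equi-join of two lists of dict rows, grouped by join key."""
    groups = defaultdict(lambda: ([], []))
    # "Map" phase: emit every row under its join key, tagged by source table
    # (slot 0 holds left rows, slot 1 holds right rows).
    for row in left:
        groups[row[left_key]][0].append(row)
    for row in right:
        groups[row[right_key]][1].append(row)
    # "Reduce" phase: within one key, pair each left row with each right row.
    joined = []
    for key in sorted(groups):
        left_rows, right_rows = groups[key]
        for left_row, right_row in product(left_rows, right_rows):
            # On a column-name collision the right-hand value wins.
            joined.append({**left_row, **right_row})
    return joined


if __name__ == "__main__":
    ratings = [
        {"userid": 1, "movieid": 10},
        {"userid": 1, "movieid": 11},
        {"userid": 2, "movieid": 10},
    ]
    users = [{"userid": 1, "age": 24}, {"userid": 3, "age": 53}]
    for row in reduce_side_join(ratings, users, "userid", "userid"):
        print(row)
```

Keys that appear on only one side produce no output here, which matches inner-join behavior; an outer join would instead emit those rows padded with nulls.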
  • Hive Query Language: Building a Histogram of Review Counts
      ▪ CREATE TABLE review_counts (userid INT, review_count INT);
      ▪ INSERT OVERWRITE TABLE review_counts SELECT a.userid, COUNT(1) AS review_count FROM u_data a GROUP BY a.userid;
      ▪ SELECT b.review_count, COUNT(1) FROM review_counts b GROUP BY b.review_count;
      ▪ Notes:
          ▪ No INSERT OVERWRITE for second query means output is dumped to the shell
          ▪ Hive does not currently support CREATE TABLE AS
              ▪ We have to create the table and then INSERT into it
          ▪ Hive does not currently support subqueries
              ▪ We have to write two queries
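For readers who want to sanity-check the two queries on a small sample, the same two-step aggregation can be sketched in Python with `collections.Counter`. The function name and the sample tuples are made up for illustration; the column order follows the u_data table.

```python
from collections import Counter


def review_count_histogram(ratings):
    """Return {review_count: number_of_users} for (userid, movieid, rating, unixtime) rows."""
    # Step 1, like the INSERT ... GROUP BY a.userid query: reviews per user.
    review_counts = Counter(userid for userid, _movieid, _rating, _unixtime in ratings)
    # Step 2, like the GROUP BY b.review_count query: users per review count.
    return Counter(review_counts.values())


if __name__ == "__main__":
    sample = [
        (1, 10, 5, 0),
        (1, 11, 4, 0),
        (2, 10, 3, 0),
        (3, 12, 2, 0),
        (3, 13, 1, 0),
    ]
    for review_count, users in sorted(review_count_histogram(sample).items()):
        print(review_count, users)
```

In the sample, users 1 and 3 each wrote two reviews and user 2 wrote one, so the histogram has one user at count 1 and two users at count 2.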
  • Hive Query Language: Running Custom MapReduce
      ▪ Put the following into weekday_mapper.py:
          import sys
          import datetime

          # Read tab-delimited rows from stdin and emit comma-delimited rows,
          # replacing the Unix timestamp with the ISO weekday (1 = Monday).
          for line in sys.stdin:
              line = line.strip()
              userid, movieid, rating, unixtime = line.split('\t')
              weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
              print(','.join([userid, movieid, rating, str(weekday)]))
      ▪ CREATE TABLE u_data_new (userid INT, movieid INT, rating INT, weekday INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
      ▪ FROM u_data a INSERT OVERWRITE TABLE u_data_new SELECT TRANSFORM (a.userid, a.movieid, a.rating, a.unixtime) USING 'python /full/path/to/weekday_mapper.py' AS (userid, movieid, rating, weekday);
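A mapper like this is easier to debug before it goes anywhere near a cluster if the per-line logic lives in a function that can be called directly. The sketch below is one such variant, not the script from the slides: the names `transform_line` and `run` are hypothetical, and the weekday is computed in UTC (a deliberate deviation) so the result does not depend on the machine's local time zone.

```python
import io
import sys
from datetime import datetime, timezone


def transform_line(line):
    """Turn 'userid<TAB>movieid<TAB>rating<TAB>unixtime' into 'userid,movieid,rating,weekday'."""
    userid, movieid, rating, unixtime = line.strip().split("\t")
    # Assumption for the sketch: interpret the timestamp in UTC. The slide's
    # script uses local time, so results can differ near midnight.
    moment = datetime.fromtimestamp(float(unixtime), tz=timezone.utc)
    return ",".join([userid, movieid, rating, str(moment.isoweekday())])


def run(stdin, stdout):
    """Apply transform_line to every non-empty input line."""
    for line in stdin:
        if line.strip():
            stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    # Demo on one sample line in the u.data layout. When used as a TRANSFORM
    # script, call run(sys.stdin, sys.stdout) here instead.
    run(io.StringIO("196\t242\t3\t881250949\n"), sys.stdout)
```

The sample timestamp falls on Thursday, 4 December 1997 (UTC), so the demo prints `196,242,3,4`.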
  • Hive Query Language: Programmatic Access
      ▪ The Hive shell can take a file with queries to be executed
          ▪ bin/hive -f /path/to/query/file
      ▪ You can also run a Hive query straight from the command line
          ▪ bin/hive -e 'quoted query string'
      ▪ A simple JDBC interface is available for experimentation as well
          ▪ https://issues.apache.org/jira/browse/HADOOP-4101
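The -e and -f options make it straightforward to drive Hive from a script. Below is a minimal Python sketch that wraps them with `subprocess`; the helper names and the default `bin/hive` path (relative to the unpacked distribution) are assumptions for illustration. Passing the arguments as a list avoids shell quoting problems in the query string.

```python
import subprocess


def build_hive_command(query=None, script_path=None, hive_bin="bin/hive"):
    """Build the argument list for one 'hive -e' or 'hive -f' invocation."""
    if (query is None) == (script_path is None):
        raise ValueError("pass exactly one of query or script_path")
    if query is not None:
        return [hive_bin, "-e", query]
    return [hive_bin, "-f", script_path]


def run_hive(query=None, script_path=None, hive_bin="bin/hive"):
    """Run Hive and return its standard output; raises if Hive exits non-zero."""
    command = build_hive_command(query, script_path, hive_bin)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_hive(...) where Hive is installed.
    print(build_hive_command(query="SELECT COUNT(1) FROM u_data;"))
```

With Hive installed, `run_hive(query="SHOW TABLES;")` would return whatever the shell prints for that statement.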
  • Hive Components: Metastore
      ▪ Currently uses an embedded Derby database for persistence
          ▪ While Derby is in place, you’ll need to put it into Server Mode to have more than one concurrent Hive user
          ▪ See http://wiki.apache.org/hadoop/HiveDerbyServerMode
      ▪ Next release will use MySQL as default persistent data store
      ▪ The goal is to have the persistent store be pluggable
      ▪ You can view the Thrift IDL for the metastore online
          ▪ https://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/hive/metastore/if/hive_metastore.thrift
  • Hive Components: Query Processing
      ▪ Compiler
          ▪ Parser
          ▪ Type Checking
          ▪ Semantic Analysis
          ▪ Plan Generation
          ▪ Task Generation
      ▪ Execution Engine
          ▪ Plan
          ▪ Operators
          ▪ UDFs and UDAFs
  • Future Directions: Query Optimization
      ▪ Support for Statistics
          ▪ These stats are needed to make optimization decisions
      ▪ Join Optimizations
          ▪ Map-side joins, semi-join techniques, etc., to make joins faster
      ▪ Predicate Pushdown Optimizations
          ▪ Pushing predicates just above the table scan for certain situations in joins, as well as ensuring that only required columns are sent across map/reduce boundaries
      ▪ Group By Optimizations
          ▪ Various optimizations to make group by faster
      ▪ Optimizations to reduce the number of map files created by filter operations
          ▪ Filters with a large number of mappers produce a lot of files, which slows down the following operations
  • Future Directions: MapReduce Integration and User Experience
      ▪ MapReduce Integration
          ▪ Schema-less MapReduce
              ▪ TRANSFORM needs a schema while MapReduce is schema-less
          ▪ Improvements to TRANSFORM
              ▪ Make this more intuitive to MapReduce developers: evaluate some other keywords, etc.
      ▪ User Experience
          ▪ Create a web interface
          ▪ Error reporting improvements for parse errors
          ▪ Add “help” command to the CLI
          ▪ JDBC driver to enable traditional database tools to be used with Hive
  • Future Directions: Data Model and Loading
      ▪ Integrating Dynamic SerDe with the DDL
          ▪ This allows users to create typed tables, along with list and map types, from the DDL
      ▪ Transformations in LOAD DATA
          ▪ LOAD DATA currently does not transform the input data if it is not in the format expected by the destination table
      ▪ Explode and Collect Operators
          ▪ Explode and collect operators convert collections to individual items and vice versa
      ▪ Propagating sort properties to destination tables
          ▪ If the query produces sorted output, we want to capture that in the destination table's metadata so that downstream optimizations can be enabled
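The explode and collect operators are only planned at this point, so there is no Hive syntax to show, but the intended behavior is easy to sketch in Python: explode turns one row holding a collection into one row per item, and collect groups items back into a collection per key. Function and column names below are hypothetical.

```python
from collections import defaultdict


def explode(rows, key_col, list_col):
    """Emit one row per item of the collection stored in list_col."""
    exploded = []
    for row in rows:
        for item in row[list_col]:
            exploded.append({key_col: row[key_col], list_col: item})
    return exploded


def collect(rows, key_col, item_col):
    """Group item_col values into one list per key_col value, in first-seen order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key_col]].append(row[item_col])
    return [{key_col: key, item_col: items} for key, items in grouped.items()]


if __name__ == "__main__":
    nested = [{"userid": 1, "movies": [10, 11]}, {"userid": 2, "movies": [12]}]
    flat = explode(nested, "userid", "movies")
    print(flat)
    print(collect(flat, "userid", "movies"))
```

The two functions are inverses only for rows with non-empty collections: a row whose list is empty produces no exploded rows and therefore cannot be recovered by collect.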
  • (c) 2008 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0