SDEC2011 Essentials of Hive
    SDEC2011 Essentials of Hive Presentation Transcript

    • Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.
    • Essentials of Hive: Mastering Hadoop Map-reduce for Data Analysis
      • Shashank Tiwari
      • blog: shanky.org | twitter: @tshanky
      • st@treasuryofideas.com
    • What is Hive?
      • A data warehouse system for Hadoop
      • Facilitates data summarization and ad-hoc queries
      • Allows SQL-like querying using HiveQL, by projecting structure (metadata) onto data stored in HDFS
      • Can also plug in custom mappers and reducers
    • Supported Platforms
      • Linux/Unix and Mac OS X
      • Does not work on Cygwin
    • Required Software
      • Java 1.6.x
      • Hadoop 0.17.x to 0.20.x
    • Download
      • Source: http://hive.apache.org/releases.html
      • Version: hive-0.7.0
      • Both binary and source distributions available
    • Install
      • Extract: tar zxvf hive-0.7.0-bin.tar.gz
      • Move and create a symbolic link: ln -s hive-0.7.0-bin hive
      • Set the environment variable HIVE_HOME to point to the hive directory
      • Add $HIVE_HOME/bin to your PATH environment variable
    • Build From Source
      • $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
      • $ cd hive
      • $ ant clean package
      • The binary distribution is in build/dist
    • Hive Needs Hadoop
      • Add the Hadoop distribution to your path or set HADOOP_HOME
      • Start the Hadoop daemons
        • bin/start-all.sh
    • Configure Hive
      • Create /tmp in HDFS and set appropriate permissions
        • bin/hadoop fs -mkdir /tmp
        • bin/hadoop fs -chmod g+w /tmp
      • Create /user/hive/warehouse and set appropriate permissions
        • bin/hadoop fs -mkdir /user/hive/warehouse
        • bin/hadoop fs -chmod g+w /user/hive/warehouse
    • Default Hive Configuration
      • Default configuration: conf/hive-default.xml
      • Override the defaults by redefining properties in conf/hive-site.xml
      • Set HIVE_CONF_DIR to use a different location for the config files
      • Hive configuration is an overlay on top of the Hadoop configuration
    • Hive Configuration Manipulation
      • Edit conf/hive-site.xml
      • Use the SET command on the Hive cli
      • Pass parameters to Hive
        • bin/hive -hiveconf prop1=val1 -hiveconf prop2=val2
        • set HIVE_OPTS to "-hiveconf prop1=val1 -hiveconf prop2=val2"
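The -hiveconf form above can also be assembled programmatically. A minimal Python sketch, assuming a hypothetical helper named hiveconf_args; prop1 and prop2 are the slide's placeholders, not real Hive properties, and the code only builds the option string without launching Hive:

```python
def hiveconf_args(props):
    """Turn a dict of properties into ['-hiveconf', 'k=v', ...] (hypothetical helper)."""
    args = []
    for key, value in props.items():
        args.extend(["-hiveconf", f"{key}={value}"])
    return args


# prop1/prop2 are the placeholder names from the slide, not real Hive properties.
args = hiveconf_args({"prop1": "val1", "prop2": "val2"})
hive_opts = " ".join(args)  # a value that could be exported as HIVE_OPTS
print(hive_opts)
```

The list form (args) can be appended to a bin/hive command line, while the joined string matches what the slide assigns to HIVE_OPTS.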
    • Hive by Example -- Getting Started
      • Start the cli: bin/hive
      • Basic DDL statements
        • List the existing tables
        • SHOW TABLES;
    • Create Table
      • CREATE TABLE books (isbn INT, title STRING);
      • DESCRIBE books;
        • isbn int
        • title string
      • CREATE TABLE users (id INT, name STRING) PARTITIONED BY (vcol STRING);
        • What is PARTITIONED BY (vcol STRING)?
    • Logical Table Partitions
      • A Hive table can be logically partitioned by a virtual column
      • The virtual column's value is derived from the partition in which the data is stored, not from the data itself
      • A table can have multiple partitions
      • Each partition is uniquely identified by a virtual column value
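One way to picture this: Hive keeps each partition's files in its own subdirectory named column=value under the table's directory, so the virtual column value is read from the path rather than from the file contents. A small Python sketch of that layout, assuming the default warehouse path created on the configuration slide; the helper names and the partition value "us" are hypothetical:

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"  # default warehouse location from the setup slide


def partition_dir(table, column, value, warehouse=WAREHOUSE):
    """Return the HDFS directory holding one partition (hypothetical helper)."""
    return posixpath.join(warehouse, table, f"{column}={value}")


def partition_value(path):
    """Recover (column, value) from a partition directory name."""
    column, _, value = posixpath.basename(path).partition("=")
    return column, value


path = partition_dir("users", "vcol", "us")  # "us" is a made-up partition value
print(path)
print(partition_value(path))
```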
    • Alter Table
      • ALTER TABLE books ADD COLUMNS (author STRING, category STRING);
      • Change Column Property
        • ALTER TABLE table_name CHANGE [COLUMN] old_column_name new_column_name column_type [COMMENT column_comment] [FIRST|AFTER column_name]
    • Alter Table Column Property
      • ALTER TABLE books CHANGE author author ARRAY<STRING> COMMENT "multi-valued";
        • Both the old and the new column name must be specified, even when the name does not change
        • Here only the data type changes
    • Data Types Supported
      • Primitives: INT, STRING, etc.
      • Complex types: maps, arrays, structs
    • Rename Table
      • ALTER TABLE books RENAME TO published_contents;
      • DESCRIBE published_contents;
      • DESCRIBE books; (Execution error!)
    • Drop Tables
      • DROP TABLE published_contents;
      • DROP TABLE users;
    • GroupLens Example -- Getting the Data Set
      • Movie ratings -- 1 million records
      • Available in tar.gz format: million-ml-data.tar__0.gz
      • Extract: tar zxvf million-ml-data.tar__0.gz
    • Loading Rating Data
      • Format of data in ratings.dat:
        • UserID::MovieID::Rating::Timestamp
      • Replace the delimiter ‘::’ with ‘#’ (for example, in vi)
        • :%s/::/#/g
      • Save as ratings.dat.hash_delimited
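The same delimiter replacement can be scripted instead of done in an editor. A minimal Python sketch, assuming a hypothetical helper convert_delimiter and a two-row made-up sample in the UserID::MovieID::Rating::Timestamp layout; latin-1 is an assumed encoding chosen so that arbitrary bytes pass through unchanged:

```python
import os
import tempfile


def convert_delimiter(src_path, dst_path, old="::", new="#"):
    """Copy src_path to dst_path, replacing every `old` delimiter with `new`."""
    count = 0
    with open(src_path, encoding="latin-1") as src, \
         open(dst_path, "w", encoding="latin-1") as dst:
        for line in src:
            dst.write(line.replace(old, new))
            count += 1
    return count  # number of lines written


# Demo on a tiny made-up sample, not the real data set.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "ratings.dat")
dst = src + ".hash_delimited"
with open(src, "w", encoding="latin-1") as f:
    f.write("1::10::5::1000\n2::20::3::2000\n")

lines = convert_delimiter(src, dst)
with open(dst, encoding="latin-1") as f:
    converted = f.read()
print(lines)
print(converted, end="")
```

This is only safe when no field itself contains ‘#’; otherwise a different replacement delimiter would be needed.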
    • Creating Metadata and Loading the File
      • hive> CREATE TABLE ratings (userid INT, movieid INT, rating INT, tstamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;
      • LOAD DATA LOCAL INPATH '<path/to/flat/file>' OVERWRITE INTO TABLE <table name>;
    • File Load Properties
      • No validation: it is the developer’s responsibility to make sure the file matches the table schema
      • Data can be on the local filesystem or on HDFS
      • Data is copied into the Hive HDFS namespace
      • If OVERWRITE is not specified, the data is appended
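Because the load step does no validation, mismatches tend to surface only at query time (typically as NULL values), so a quick pre-load check can help. A hedged Python sketch for the ratings layout (three integer fields followed by a string); the function name find_bad_rows and the sample rows are made up for illustration:

```python
def find_bad_rows(lines, n_int_fields=3, n_fields=4, delimiter="#"):
    """Return (line_number, reason) for rows that do not fit the ratings schema."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != n_fields:
            problems.append((number, f"expected {n_fields} fields, got {len(fields)}"))
            continue
        # userid, movieid and rating are declared INT in the table definition.
        for field in fields[:n_int_fields]:
            try:
                int(field)
            except ValueError:
                problems.append((number, f"not an integer: {field!r}"))
                break
    return problems


sample = ["1#10#5#1000\n", "2#20#x#2000\n", "3#30#4\n"]  # made-up rows
bad = find_bad_rows(sample)
print(bad)
```

For a real file, the function can be fed an open file object, since it only iterates over lines.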
    • Rating Data Load
      • hive> LOAD DATA LOCAL INPATH '/path/to/ratings.dat.hash_delimited' OVERWRITE INTO TABLE ratings;
    • A SQL Style Query
      • SELECT COUNT(*) FROM ratings;
    • Loading movies and users data
      • Now load the movies and users data in the same way as the ratings data.
        • Details on the console...
      • CREATE TABLE users_2 (userid INT, gender STRING, age INT, occupation STRING, zipcode STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;
      • add FILE /Users/tshanky/workspace/hadoop_workspace/hive_workspace/occupation_mapper.py;
      • INSERT OVERWRITE TABLE users_2 SELECT TRANSFORM (userid, gender, age, occupation, zipcode) USING 'python occupation_mapper.py' AS (userid, gender, age, occupation_str, zipcode) FROM users;
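The deck does not show occupation_mapper.py itself. The following is only a sketch of what such a script might look like, assuming Hive's TRANSFORM hands rows to the script as tab-separated lines on standard input and reads tab-separated lines back from standard output. The code-to-label table is a partial, assumed mapping that should be checked against the data set's README, and the demo rows are made up:

```python
import io

# Partial, assumed mapping of occupation codes to labels; verify against the README.
OCCUPATIONS = {"1": "academic/educator", "10": "K-12 student"}


def map_line(line, occupations=OCCUPATIONS):
    """Replace the occupation code (4th field) with a label; keep other fields."""
    userid, gender, age, occupation, zipcode = line.rstrip("\n").split("\t")
    label = occupations.get(occupation, "unknown")
    return "\t".join([userid, gender, age, label, zipcode])


def run(stdin, stdout):
    """Stream rows from stdin to stdout, one transformed row per line."""
    for line in stdin:
        if line.strip():
            stdout.write(map_line(line) + "\n")


# In a real script the last lines would be: import sys; run(sys.stdin, sys.stdout)
demo_in = io.StringIO("1\tF\t1\t10\t48067\n2\tM\t56\t1\t70072\n")  # made-up rows
demo_out = io.StringIO()
run(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Keeping the per-line logic in map_line separates the mapping from the I/O, which makes the script easy to test outside Hive.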
    • Good Old SQL
      • SELECT * FROM movies LIMIT 5;
      • SELECT * FROM ratings WHERE movieid = 1;
      • SELECT COUNT(*) FROM ratings WHERE movieid < 10;
      • SELECT COUNT(*) FROM ratings WHERE movieid = 1 AND rating = 5;
      • SELECT title FROM movies WHERE title RLIKE '^Toy+';
    • More Than Good Old SQL
      • SELECT `.*(id)` FROM ratings WHERE movieid = 1;
        • regular expression based selection on column names
      • SELECT ratings.rating, COUNT(ratings.rating) FROM ratings WHERE movieid = 1 GROUP BY ratings.rating; (group by)
      • SELECT * FROM movies ORDER BY movieid DESC;
      • DISTRIBUTE BY & SORT BY (CLUSTER BY when both use the same columns) -- distribution and ordering per reducer partition
    • JOIN(s) in HiveQL
      • equality joins, outer joins, left semi-joins
        • SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) LIMIT 5;
      • More than 2 tables:
        • SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title, users.gender FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) JOIN users ON (ratings.userid = users.userid) LIMIT 5;
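Conceptually, an equality join like the two-table query can be carried out as a reduce-side join: rows from both tables are grouped by the join key, and rows are paired within each group. The Python below is an in-memory sketch of that idea with made-up rows and placeholder titles; it is not Hive's actual operator code:

```python
from collections import defaultdict


def reduce_side_join(ratings, movies):
    """Inner-join ratings and movies on movieid, reduce-side-join style."""
    # "Map" phase: tag each row with its source table and group by join key.
    groups = defaultdict(lambda: {"ratings": [], "movies": []})
    for row in ratings:
        groups[row["movieid"]]["ratings"].append(row)
    for row in movies:
        groups[row["movieid"]]["movies"].append(row)

    # "Reduce" phase: within each key, pair every rating with every movie.
    joined = []
    for movieid in sorted(groups):
        for r in groups[movieid]["ratings"]:
            for m in groups[movieid]["movies"]:
                joined.append((r["userid"], r["rating"], r["tstamp"], m["title"]))
    return joined


ratings = [
    {"userid": 1, "movieid": 1, "rating": 5, "tstamp": "1000"},
    {"userid": 2, "movieid": 2, "rating": 3, "tstamp": "2000"},
    {"userid": 3, "movieid": 9, "rating": 4, "tstamp": "3000"},  # no matching movie
]
movies = [{"movieid": 1, "title": "Movie A"}, {"movieid": 2, "title": "Movie B"}]
result = reduce_side_join(ratings, movies)
print(result)
```

The rating for movieid 9 is dropped because no movie shares its key, which is the behavior of an inner equality join.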
    • Explain Plan: the MapReduce Under the Hood
      • EXPLAIN SELECT COUNT(*) FROM ratings WHERE movieid = 1 AND rating = 5;
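To give a feel for what a filtered count looks like as MapReduce, here is a conceptual Python sketch: the map step applies the WHERE clause and emits a 1 per surviving row, and the reduce step sums those values. It mirrors the general shape only, not the stages and operators that EXPLAIN actually prints; the rows are made up:

```python
from functools import reduce


def mapper(row):
    """Emit 1 for each row that passes the WHERE clause, nothing otherwise."""
    userid, movieid, rating, tstamp = row
    if movieid == 1 and rating == 5:
        yield 1


def reducer(total, value):
    """Accumulate the count, as a single reducer would."""
    return total + value


# Made-up rows in (userid, movieid, rating, tstamp) order.
rows = [(1, 1, 5, "1000"), (2, 1, 4, "2000"), (3, 2, 5, "3000"), (4, 1, 5, "4000")]
mapped = [value for row in rows for value in mapper(row)]
count = reduce(reducer, mapped, 0)
print(count)
```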
    • Questions?
      • blog: shanky.org | twitter: @tshanky
      • st@treasuryofideas.com