SDEC2011 Essentials of Hive
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
Copyright for all other & referenced work is retained by their respective owners.
Essentials of Hive
Mastering Hadoop Map-reduce for Data Analysis
Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com
What is Hive?
• A data warehouse system for Hadoop
• Facilitates data summarization and ad-hoc queries
• Allows SQL-like querying with HiveQL by projecting metadata (a table schema) onto data stored in HDFS
• Custom mappers and reducers can also be plugged in
Supported Platforms
• Linux/Unix and Mac OS X
• Does not work on Cygwin
Required Software
• Java 1.6.x
• Hadoop 0.17.x to 0.20.x
Download
• Source: http://hive.apache.org/releases.html
• Version: hive-0.7.0
• Both binary and source distributions are available
Install
• Extract: tar zxvf hive-0.7.0-bin.tar.gz
• Move and create a symbolic link: ln -s hive-0.7.0-bin hive
• Set the environment variable HIVE_HOME to point to the hive directory
• Add $HIVE_HOME/bin to your PATH environment variable
Build From Source
• $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
• $ cd hive
• $ ant clean package
• The binary distribution is in build/dist
Hive Needs Hadoop
• Hive requires a working Hadoop installation
• Add the Hadoop distribution to your PATH or set HADOOP_HOME
• Start the Hadoop daemons:
  • bin/start-all.sh
Configure Hive
• Create /tmp in HDFS and set appropriate permissions
  • bin/hadoop fs -mkdir /tmp
  • bin/hadoop fs -chmod g+w /tmp
• Create /user/hive/warehouse and set appropriate permissions
  • bin/hadoop fs -mkdir /user/hive/warehouse
  • bin/hadoop fs -chmod g+w /user/hive/warehouse
Default Hive Configuration
• Default configuration: conf/hive-default.xml
• Override the default configuration by redefining properties in:
  • conf/hive-site.xml
• Set HIVE_CONF_DIR to use a different location for the configuration files
• Hive configuration is an overlay on top of the Hadoop configuration
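The overlay idea can be pictured as layered lookups in which the first layer that defines a property wins. The following is a minimal Python sketch built on `collections.ChainMap`, not Hive's actual configuration loader. The layer contents and the `resolve` helper are assumptions made for illustration, and the `/data/hive/warehouse` override is invented.

```python
from collections import ChainMap

# Illustrative layers only; real values come from the XML files on disk.
hadoop_conf = {"fs.default.name": "hdfs://localhost:9000"}
hive_default = {"hive.metastore.warehouse.dir": "/user/hive/warehouse"}
hive_site = {"hive.metastore.warehouse.dir": "/data/hive/warehouse"}  # hypothetical override

# Earlier maps take precedence: hive-site.xml over hive-default.xml over Hadoop.
effective = ChainMap(hive_site, hive_default, hadoop_conf)


def resolve(name):
    """Return the value from the highest-priority layer that defines `name`."""
    return effective[name]


print(resolve("hive.metastore.warehouse.dir"))
print(resolve("fs.default.name"))
```

A property defined only in the Hadoop layer is still visible through the Hive view, which is what "overlay" means here.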
Hive Configuration Manipulation
• Edit: conf/hive-site.xml
• Use the SET command in the Hive CLI
• Pass parameters to Hive
  • bin/hive -hiveconf prop1=val1 -hiveconf prop2=val2
  • Set HIVE_OPTS to "-hiveconf prop1=val1 -hiveconf prop2=val2"
Hive by Example -- Getting Started
• Start the CLI: bin/hive
• Basic DDL statements
• List the existing tables
  • SHOW TABLES;
Create Table
• CREATE TABLE books (isbn INT, title STRING);
• DESCRIBE books;
  • isbn int
  • title string
• CREATE TABLE users (id INT, name STRING) PARTITIONED BY (vcol STRING);
• What does PARTITIONED BY (vcol STRING) mean?
Logical Table Partitions
• A Hive table can be logically partitioned by a virtual column
• The virtual column's value is derived from the partition in which the data is stored
• A table can have multiple partitions
• Each partition is uniquely identified by a virtual column value
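Hive keeps each partition in its own subdirectory of the table directory, named `column=value`, so the virtual column never has to be stored inside the data files. The sketch below illustrates that mapping in Python. The helper names `partition_path` and `virtual_column` and the partition value `2011` are hypothetical, chosen only for this example.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"  # the warehouse directory used in this deck


def partition_path(table, column, value):
    """Build the directory that holds one partition's files."""
    return posixpath.join(WAREHOUSE, table, f"{column}={value}")


def virtual_column(path):
    """Recover (column, value) from the last component of a partition directory."""
    column, _, value = posixpath.basename(path).partition("=")
    return column, value


path = partition_path("users", "vcol", "2011")  # "2011" is a made-up value
print(path)
print(virtual_column(path))
```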
Alter Table
• ALTER TABLE books ADD COLUMNS (author STRING, category STRING);
• Change a column property:
  • ALTER TABLE table_name CHANGE [COLUMN] old_column_name new_column_name column_type [COMMENT column_comment] [FIRST|AFTER column_name]
Alter Table Column Property
• ALTER TABLE books CHANGE author author ARRAY<STRING> COMMENT "multi-valued";
• Both the old and the new column name must be specified, even when the name stays the same
• Here the data type changes from STRING to ARRAY<STRING>
Data Types Supported
• Primitives: INT, STRING, etc.
• Complex types: map, array, struct
Rename Table
• ALTER TABLE books RENAME TO published_contents;
• DESCRIBE published_contents;
• DESCRIBE books; (execution error: the table no longer exists under the old name)
Drop Tables
• DROP TABLE published_contents;
• DROP TABLE users;
GroupLens Example -- Getting the Data Set
• Movie ratings -- 1 million records
• Available in tar.gz format: million-ml-data.tar__0.gz
• Extract: tar zxvf million-ml-data.tar__0.gz
Loading Rating Data
• Format of the data in ratings.dat:
  • UserID::MovieID::Rating::Timestamp
• Replace the delimiter "::" with "#", for example in vi:
  • :%s/::/#/g
• Save as ratings.dat.hash_delimited
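The same delimiter swap can be scripted, which is handy for a file with a million lines. This is a minimal Python sketch, assuming no field already contains "#" and that the file can be read as Latin-1. The function names and the sample record are illustrative.

```python
def to_hash_delimited(line):
    """Replace the '::' field separator with '#'."""
    return line.replace("::", "#")


def convert_file(src, dst):
    """Write a '#'-delimited copy of src to dst, one record per line."""
    # Latin-1 is an assumption; it accepts any byte sequence without failing.
    with open(src, encoding="latin-1") as fin, \
            open(dst, "w", encoding="latin-1") as fout:
        for line in fin:
            fout.write(to_hash_delimited(line))


# A made-up record in the UserID::MovieID::Rating::Timestamp layout.
print(to_hash_delimited("1::1193::5::978300760"))

# To convert the real file:
# convert_file("ratings.dat", "ratings.dat.hash_delimited")
```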
Creating Metadata and Loading the File
• hive> CREATE TABLE ratings (userid INT, movieid INT, rating INT, tstamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;
• LOAD DATA LOCAL INPATH <'path/to/flat/file'> OVERWRITE INTO TABLE <table name>;
File Load Properties
• No validation: it is the developer's responsibility to make sure the file matches the table schema
• Data can be on the local filesystem or on HDFS
• Data is copied (from the local filesystem) or moved (from HDFS) into the Hive HDFS namespace
• If OVERWRITE is not specified, the data is appended
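Because the load step does not check the file, a quick pre-load check can catch malformed lines early. The sketch below tests a line against the ratings schema (INT, INT, INT, STRING) with "#" as the delimiter. The function names are hypothetical, and the check is an illustration, not a Hive feature.

```python
def check_ratings_line(line, delimiter="#"):
    """Return None if the line fits (INT, INT, INT, STRING), else a reason."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != 4:
        return f"expected 4 fields, found {len(fields)}"
    # Only the first three columns are declared INT; tstamp is a STRING.
    for name, value in zip(("userid", "movieid", "rating"), fields):
        try:
            int(value)
        except ValueError:
            return f"{name} is not an integer: {value!r}"
    return None


def check_file(path):
    """Yield (line_number, reason) for every line that does not fit."""
    with open(path, encoding="latin-1") as handle:
        for number, line in enumerate(handle, start=1):
            reason = check_ratings_line(line)
            if reason:
                yield number, reason


print(check_ratings_line("1#1193#5#978300760"))
print(check_ratings_line("1#1193#five#978300760"))
```

Running `check_file` on the converted file before LOAD DATA lists the offending line numbers, if any.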
Rating Data Load
• hive> LOAD DATA LOCAL INPATH '/path/to/ratings.dat.hash_delimited'
      > OVERWRITE INTO TABLE ratings;
A SQL Style Query
• SELECT COUNT(*) FROM ratings;
Loading movies and users data
• Now load the movies and users data in the same way as the ratings data.
• Details on the console...
• CREATE TABLE users_2 (userid INT, gender STRING, age INT, occupation STRING, zipcode STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;
• ADD FILE /Users/tshanky/workspace/hadoop_workspace/hive_workspace/occupation_mapper.py;
• INSERT OVERWRITE TABLE users_2 SELECT TRANSFORM (userid, gender, age, occupation, zipcode) USING 'python occupation_mapper.py' AS (userid, gender, age, occupation_str, zipcode) FROM users;
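The deck does not show the contents of occupation_mapper.py. By default, TRANSFORM feeds each input row to the script as one tab-separated line on standard input and reads tab-separated lines back, so a mapper of this kind could look roughly like the sketch below. The code-to-label table is a partial, assumed excerpt, and `map_line` and `run` are hypothetical names.

```python
import io

# Partial, assumed code-to-label table; the real script's table is not shown.
OCCUPATIONS = {
    "0": "other",
    "1": "academic/educator",
    "2": "artist",
    "12": "programmer",
}


def map_line(line):
    """Swap the occupation code (4th field) for a label, keep the rest."""
    userid, gender, age, occupation, zipcode = line.rstrip("\n").split("\t")
    label = OCCUPATIONS.get(occupation, "unknown")
    return "\t".join((userid, gender, age, label, zipcode))


def run(source, sink):
    """Transform every line from source and write the result to sink."""
    for line in source:
        sink.write(map_line(line) + "\n")


# Demo on an in-memory row with made-up values. As a Hive script, the call
# would instead be run(sys.stdin, sys.stdout), after importing sys.
out = io.StringIO()
run(io.StringIO("1\tF\t1\t12\t48067\n"), out)
print(out.getvalue(), end="")
```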
Good Old SQL
• SELECT * FROM movies LIMIT 5;
• SELECT * FROM ratings WHERE movieid = 1;
• SELECT COUNT(*) FROM ratings WHERE movieid < 10;
• SELECT COUNT(*) FROM ratings WHERE movieid = 1 AND rating = 5;
• SELECT title FROM movies WHERE title RLIKE '^Toy+';
More Than Good Old SQL
• SELECT `.*(id)` FROM ratings WHERE movieid = 1;
  • regular expression based selection of column names
• SELECT ratings.rating, COUNT(ratings.rating) FROM ratings WHERE movieid = 1 GROUP BY ratings.rating;
  • GROUP BY aggregation
• SELECT * FROM movies ORDER BY movieid DESC;
• DISTRIBUTE BY & SORT BY (CLUSTER BY) -- distribution and sorting per reducer partition
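CLUSTER BY can be read as two steps: rows are first distributed to reducers by a hash of the key, then sorted within each reducer, so the output is ordered per reducer and not globally. The Python sketch below imitates that behaviour on a few made-up rows. Python's built-in `hash` stands in for Hive's partitioner, and `cluster_by` is a hypothetical helper.

```python
from collections import defaultdict


def cluster_by(rows, column, num_reducers):
    """Distribute rows by hash of `column`, then sort each reducer's rows."""
    buckets = defaultdict(list)
    for row in rows:
        # DISTRIBUTE BY: all rows with the same key land on the same reducer.
        buckets[hash(row[column]) % num_reducers].append(row)
    # SORT BY: ordering is guaranteed only within a single reducer.
    return {
        reducer: sorted(group, key=lambda row: row[column])
        for reducer, group in buckets.items()
    }


rows = [
    {"movieid": 3, "rating": 4},
    {"movieid": 1, "rating": 5},
    {"movieid": 2, "rating": 3},
    {"movieid": 1, "rating": 2},
]
result = cluster_by(rows, "movieid", 2)
for reducer, group in sorted(result.items()):
    print(reducer, group)
```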
JOIN(s) in HiveQL
• Equality joins, outer joins, left semi-joins
• SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) LIMIT 5;
• More than 2 tables:
  • SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title, users.gender FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) JOIN users ON (ratings.userid = users.userid) LIMIT 5;
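To make the result of the two-table equality join concrete, the sketch below computes the same rows in plain Python by indexing one side on the join key and probing it with the other. It shows what the query returns, not how Hive executes it (Hive compiles the join into MapReduce jobs). The sample rows and the `equi_join` helper are made up.

```python
def equi_join(ratings, movies):
    """Inner join on movieid: index movies by key, then probe with ratings."""
    titles = {}
    for movie in movies:
        titles.setdefault(movie["movieid"], []).append(movie["title"])
    for rating in ratings:
        # Ratings without a matching movie produce no output row.
        for title in titles.get(rating["movieid"], []):
            yield rating["userid"], rating["rating"], rating["tstamp"], title


ratings = [
    {"userid": 1, "movieid": 1, "rating": 5, "tstamp": "978824268"},
    {"userid": 2, "movieid": 99, "rating": 3, "tstamp": "978300000"},
]
movies = [{"movieid": 1, "title": "Toy Story (1995)"}]

joined = list(equi_join(ratings, movies))
print(joined)
```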
Explain Plan: MapReduce Under the Hood
• EXPLAIN SELECT COUNT(*) FROM ratings WHERE movieid = 1 AND rating = 5;
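Roughly speaking, the plan for this query filters rows and counts them on the map side, then adds up the partial counts on the reduce side. The Python sketch below mirrors that split on a few made-up rows. The function names are illustrative and do not correspond to operators in the EXPLAIN output.

```python
def map_phase(rows):
    """Emit a partial count of 1 for each row that passes the WHERE filter."""
    for row in rows:
        if row["movieid"] == 1 and row["rating"] == 5:
            yield 1


def reduce_phase(partials):
    """Sum the partial counts into the final COUNT(*)."""
    return sum(partials)


rows = [
    {"movieid": 1, "rating": 5},
    {"movieid": 1, "rating": 4},
    {"movieid": 2, "rating": 5},
    {"movieid": 1, "rating": 5},
]
count = reduce_phase(map_phase(rows))
print(count)
```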
Questions?
• blog: shanky.org | twitter: @tshanky
• st@treasuryofideas.com