Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Essentials of Hive
Mastering Hadoop Map-reduce for Data Analysis


Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com




What is Hive?

• A data warehouse system for Hadoop


• Facilitates data summarization and ad-hoc queries


• Allows SQL-like querying using HiveQL, by superimposing metadata onto data
  stored in HDFS


• Can also plug in custom mappers and reducers




Supported Platforms

• Linux/Unix and Mac OS X


• Does not work on Cygwin




Required Software

• Java 1.6.x


• Hadoop 0.17.x to 0.20.x




Download

• Source: http://hive.apache.org/releases.html


• Version:


   • hive-0.7.0


• Both binary and source distributions available




 Install

• Extract: tar zxvf hive-0.7.0-bin.tar.gz


• Move and Create Symbolic Link: ln -s hive-0.7.0-bin hive


• Set environment variable HIVE_HOME to point to the hive directory


• Add $HIVE_HOME/bin to your PATH environment variable




 Build From Source

• $ svn co http://svn.apache.org/repos/asf/hive/trunk hive


• $ cd hive


• $ ant clean package


• The binary distribution is in build/dist




Hive Needs Hadoop

• Needs Hadoop


  • Add Hadoop distribution to your path or set HADOOP_HOME


  • Start Hadoop daemons


    • bin/start-all.sh




Configure Hive

• Create /tmp in HDFS and set appropriate permissions


  • bin/hadoop fs -mkdir /tmp


  • bin/hadoop fs -chmod g+w /tmp


• Create /user/hive/warehouse and set appropriate permissions


  • bin/hadoop fs -mkdir /user/hive/warehouse


  • bin/hadoop fs -chmod g+w /user/hive/warehouse




Default Hive Configuration

• Default configuration: conf/hive-default.xml


• Override default configuration by redefining properties in:


   • conf/hive-site.xml


• Set HIVE_CONF_DIR to point to a different configuration directory


• Hive configuration is an overlay on top of Hadoop configuration
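
The overlay idea can be pictured as layered lookups, where the most specific layer that defines a property wins. The sketch below is only an illustration in Python, not Hive's own loading code; the property values are made up for the example.

```python
from collections import ChainMap

# Illustrative values only; real settings come from the XML files.
hadoop_conf = {"fs.default.name": "hdfs://localhost:9000"}
hive_default = {
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
    "hive.exec.scratchdir": "/tmp/hive",
}
hive_site = {"hive.exec.scratchdir": "/tmp/hive-custom"}

# ChainMap searches its maps left to right, so the most specific
# layer (hive-site) is listed first and the Hadoop layer last.
effective = ChainMap(hive_site, hive_default, hadoop_conf)

print(effective["hive.exec.scratchdir"])          # overridden in hive-site
print(effective["hive.metastore.warehouse.dir"])  # falls back to hive-default
print(effective["fs.default.name"])               # falls back to Hadoop
```

A property redefined in hive-site.xml hides the default, while anything left undefined falls through to the layers beneath it.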




Hive Configuration Manipulation

• Edit: conf/hive-site.xml


• Use the SET command on the Hive CLI


• Pass parameters to Hive


   • bin/hive -hiveconf prop1=val1 -hiveconf prop2=val2


   • set HIVE_OPTS to "-hiveconf prop1=val1 -hiveconf prop2=val2"
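
When Hive is launched from a script, the -hiveconf pairs can be generated from a dictionary instead of being typed by hand. This is a small sketch under that assumption; the helper name hiveconf_args is hypothetical, and prop1/prop2 are the placeholder names from the slide.

```python
import shlex

def hiveconf_args(props):
    """Turn {name: value} into a flat ['-hiveconf', 'name=value', ...] list."""
    args = []
    for name, value in props.items():
        args += ["-hiveconf", "%s=%s" % (name, value)]
    return args

props = {"prop1": "val1", "prop2": "val2"}

# Argument list suitable for subprocess-style invocation of the CLI.
command = ["bin/hive"] + hiveconf_args(props)

# The same arguments as one shell-quoted string, e.g. for HIVE_OPTS.
hive_opts = " ".join(shlex.quote(arg) for arg in hiveconf_args(props))
print(hive_opts)
```

Quoting each argument keeps values that contain spaces intact when the string is later split by the shell.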




Hive by Example -- Getting Started

• Start the cli: bin/hive


• Basic DDL statements


   • List the existing tables


      • SHOW TABLES;




Create Table

• CREATE TABLE books (isbn INT, title STRING);


• DESCRIBE books;


  • isbn	    int	


  • title	   string


• CREATE TABLE users (id INT, name STRING) PARTITIONED BY (vcol
  STRING);


  • What is PARTITIONED BY (vcol STRING)?




Logical Table Partitions

• A Hive table can be logically partitioned by a virtual column


• The virtual column is derived from the partition in which the data is stored


• A table can have multiple partitions


• Each partition is uniquely identified by a virtual column value
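
By default Hive stores each partition in a subdirectory named column=value under the table's directory, which is how the virtual column value can be derived from where the data lives. The sketch below illustrates that derivation in Python; the function name and the example path (including the partition value and file name) are hypothetical.

```python
def partition_values(path):
    """Return {column: value} for every 'column=value' segment in a path."""
    values = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            column, value = segment.split("=", 1)
            values[column] = value
    return values

# Hypothetical file inside a partition of the users table.
example = "/user/hive/warehouse/users/vcol=2011-05/part-00000"
print(partition_values(example))
```

The rows in that file do not contain vcol at all; the value comes purely from the directory name.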




Alter Table

• ALTER TABLE books ADD COLUMNS (author STRING, category STRING);


• Change Column Property


  • ALTER TABLE table_name CHANGE [COLUMN]
    old_column_name new_column_name column_type
    [COMMENT column_comment] [FIRST|AFTER column_name]




Alter Table Column Property

• ALTER TABLE books CHANGE author author ARRAY<STRING> COMMENT
  "multi-valued";


  • Both the old and the new column name need to be specified


  • Data type changed




Data Types Supported

• Primitives: INT, STRING, etc...


• Complex types: maps, array, struct




Rename Table

• ALTER TABLE books RENAME TO published_contents;


• DESCRIBE published_contents;


• DESCRIBE books; (Execution error: the table has been renamed)




Drop Tables

• DROP TABLE published_contents;


• DROP TABLE users;




GroupLens Example -- Getting the Data Set

• Movie ratings -- 1 million records


• Available in tar.gz format: million-ml-data.tar__0.gz


• Extract: tar zxvf million-ml-data.tar__0.gz






Loading Rating Data

• Format of data in ratings.dat:


   • UserID::MovieID::Rating::Timestamp


• Replace the delimiter ‘::’ with the single character ‘#’, for example in vi:


   • :%s/::/#/g


• Save as ratings.dat.hash_delimited
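
For a file of this size, the same substitution can be scripted instead of being done in an editor. The following is a minimal sketch, assuming the input is plain text; the function name convert_delimiter is hypothetical, and latin-1 is chosen here only because it round-trips any byte unchanged.

```python
def convert_delimiter(src_path, dst_path, old="::", new="#"):
    """Copy src_path to dst_path, replacing every old delimiter with new."""
    with open(src_path, encoding="latin-1") as src, \
         open(dst_path, "w", encoding="latin-1") as dst:
        # Stream line by line so the whole file never sits in memory.
        for line in src:
            dst.write(line.replace(old, new))
```

It would be called as convert_delimiter("ratings.dat", "ratings.dat.hash_delimited"). Note that this only works cleanly if ‘#’ never occurs inside a field value.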




Creating Metadata and Loading the File

• hive> CREATE TABLE ratings( userid INT, movieid INT, rating INT, tstamp
  STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED
  AS TEXTFILE;


• LOAD DATA LOCAL INPATH '<path/to/flat/file>' OVERWRITE INTO TABLE
  <table name>;




File Load Properties

• No validation: it is the developer’s responsibility to make sure the file
  matches the table schema.


• Data can be on the local filesystem or on HDFS


• Data is copied into the Hive HDFS namespace


• If OVERWRITE is not specified, the data is appended
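
Because the load itself does not validate anything, a quick check before loading can catch malformed rows early. The sketch below assumes the four-column ratings layout (three integers and a timestamp string) with ‘#’ as delimiter; the function name check_rows and the sample lines are made up for illustration.

```python
def check_rows(lines, delimiter="#", int_columns=(0, 1, 2), width=4):
    """Yield (line_number, reason) for rows that do not fit the schema."""
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != width:
            yield number, "expected %d fields, got %d" % (width, len(fields))
            continue
        for index in int_columns:
            try:
                int(fields[index])
            except ValueError:
                yield number, "column %d is not an integer" % index
                break

# Made-up sample: one good row, one bad rating, one short row.
sample = ["1#1193#5#978300760\n", "1#661#three#978302109\n", "1#914#3\n"]
problems = list(check_rows(sample))
print(problems)
```

Passing an open file object instead of the sample list checks a real file line by line.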




Rating Data Load

• hive> LOAD DATA LOCAL INPATH '/path/to/ratings.dat.hash_delimited'
      > OVERWRITE INTO TABLE ratings;




A SQL Style Query

• SELECT COUNT(*) FROM ratings;




Loading movies and users data

• Now load the movies and users data in the same way as the ratings data.


  • Details on the console...


• CREATE TABLE users_2(userid INT, gender STRING, age INT, occupation
  STRING, zipcode STRING) ROW FORMAT DELIMITED FIELDS TERMINATED
  BY '#' STORED AS TEXTFILE;


• add FILE /Users/tshanky/workspace/hadoop_workspace/hive_workspace/occupation_mapper.py;


• INSERT OVERWRITE TABLE users_2 SELECT TRANSFORM (userid, gender,
  age, occupation, zipcode) USING 'python occupation_mapper.py' AS (userid,
  gender, age, occupation_str, zipcode) FROM users;
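
The contents of occupation_mapper.py are not shown in the slides. A script used with TRANSFORM receives each row on standard input as tab-separated fields and writes tab-separated rows to standard output, so a mapper along these lines would fit the query above. This is a guess at its shape, not the author's script; the two entries in the lookup table are assumed for illustration, and the full code list should come from the data set's README.

```python
import sys

# Assumed, partial code-to-label table (illustration only).
OCCUPATIONS = {"12": "programmer", "15": "scientist"}

def map_line(line):
    """Replace the numeric occupation code in one input row with a label."""
    userid, gender, age, occupation, zipcode = line.rstrip("\n").split("\t")
    occupation_str = OCCUPATIONS.get(occupation, "unknown")
    return "\t".join([userid, gender, age, occupation_str, zipcode])

def main(stream_in, stream_out):
    """Transform every row from stream_in and write it to stream_out."""
    for line in stream_in:
        stream_out.write(map_line(line) + "\n")
```

Saved as a script, it would finish with a call to main(sys.stdin, sys.stdout) so that Hive can stream rows through it.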




Good Old SQL

• SELECT * FROM movies LIMIT 5;


• SELECT * FROM ratings WHERE movieid = 1;


• SELECT COUNT(*) FROM ratings WHERE movieid < 10;


• SELECT COUNT(*) FROM ratings WHERE movieid = 1 AND rating = 5;


• SELECT title FROM movies WHERE title RLIKE '^Toy+';




More Than Good Old SQL

• SELECT `.+(id)` FROM ratings WHERE movieid = 1;


  • regular expression based selection of column names


• SELECT ratings.rating, COUNT(ratings.rating) FROM ratings WHERE movieid
  = 1 GROUP BY ratings.rating; (group by)


• SELECT * FROM movies ORDER BY movieid DESC;


• DISTRIBUTE BY & SORT BY (CLUSTER BY) -- by partition




JOIN(s) in HiveQL

• equality joins, outer joins, left semi-joins


   • SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title FROM
     ratings JOIN movies ON (ratings.movieid = movies.movieid) LIMIT 5;


• More than 2 tables:


   • SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title,
     users.gender FROM ratings JOIN movies ON (ratings.movieid =
     movies.movieid) JOIN users ON (ratings.userid = users.userid) LIMIT 5;




Explain Plan: Under-the-Hood MapReduce

• EXPLAIN SELECT COUNT(*) FROM ratings WHERE movieid = 1 AND rating = 5;
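
Conceptually, a filtered count like this one maps onto a map phase that emits a 1 for each matching row and a reduce phase that sums them. The sketch below imitates that split in plain Python; it is not the plan Hive prints, and the rows are invented for the example.

```python
from itertools import groupby

def map_phase(rows):
    """Emit one (key, 1) pair per row that passes the WHERE filter."""
    for userid, movieid, rating, tstamp in rows:
        if movieid == 1 and rating == 5:
            yield ("count", 1)

def reduce_phase(pairs):
    """Sum the values for each key, as a reducer would after the shuffle."""
    for key, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        yield key, sum(value for _, value in group)

# Invented rows: (userid, movieid, rating, tstamp)
rows = [(1, 1, 5, "t1"), (2, 1, 4, "t2"), (3, 1, 5, "t3"), (4, 2, 5, "t4")]
result = dict(reduce_phase(map_phase(rows)))
print(result)
```

Sorting before grouping stands in for the shuffle step that brings equal keys together.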




Questions?




• blog: shanky.org | twitter: @tshanky


• st@treasuryofideas.com

SDEC2011 Essentials of Hive
