Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
Copyright for all other & referenced work is retained by their respective owners.




Essentials of Hive
Mastering Hadoop Map-reduce for Data Analysis


Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com




What is Hive?

• A data warehouse system for Hadoop
• Facilitates data summarization and ad-hoc queries
• Allows SQL-like querying using HiveQL, by projecting table metadata onto data stored in HDFS
• Custom mappers and reducers can also be plugged in




Supported Platforms

• Linux/Unix and Mac OS X
• Does not work on Cygwin




Required Software

• Java 1.6.x
• Hadoop 0.17.x to 0.20.x




Download

• Source: http://hive.apache.org/releases.html
• Version: hive-0.7.0
• Both binary and source distributions are available




Install

• Extract: tar zxvf hive-0.7.0-bin.tar.gz
• Move the extracted directory if needed and create a symbolic link: ln -s hive-0.7.0-bin hive
• Set the environment variable HIVE_HOME to point to the hive directory
• Add $HIVE_HOME/bin to your PATH environment variable




Build From Source

• $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
• $ cd hive
• $ ant clean package
• The binary distribution is in build/dist




Hive Needs Hadoop

• Add the Hadoop distribution to your PATH or set HADOOP_HOME
• Start the Hadoop daemons
   • bin/start-all.sh




Configure Hive

• Create /tmp in HDFS and set appropriate permissions
   • bin/hadoop fs -mkdir /tmp
   • bin/hadoop fs -chmod g+w /tmp
• Create /user/hive/warehouse and set appropriate permissions
   • bin/hadoop fs -mkdir /user/hive/warehouse
   • bin/hadoop fs -chmod g+w /user/hive/warehouse




Default Hive Configuration

• Default configuration: conf/hive-default.xml
• Override the default configuration by redefining properties in:
   • conf/hive-site.xml
• Set HIVE_CONF_DIR to use a different location for the configuration files
• Hive configuration is an overlay on top of the Hadoop configuration




Hive Configuration Manipulation

• Edit: conf/hive-site.xml
• Use the SET command on the Hive CLI
• Pass parameters to Hive
   • bin/hive -hiveconf prop1=val1 -hiveconf prop2=val2
   • Set HIVE_OPTS to "-hiveconf prop1=val1 -hiveconf prop2=val2"
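
When many properties need to be passed, the repeated -hiveconf flags can be generated rather than typed. The following is a minimal Python sketch, not part of the original deck; the helper names hiveconf_args and hive_opts are hypothetical, and prop1/prop2 are the placeholder property names used above.

```python
import shlex


def hiveconf_args(props):
    """Turn {'name': 'value'} into ['-hiveconf', 'name=value', ...]."""
    args = []
    for name, value in props.items():
        args += ["-hiveconf", f"{name}={value}"]
    return args


def hive_opts(props):
    """Join the arguments into a single string suitable for HIVE_OPTS."""
    return " ".join(shlex.quote(arg) for arg in hiveconf_args(props))


# Placeholder property names, as on the slide.
overrides = {"prop1": "val1", "prop2": "val2"}

# Argument list for launching the CLI, e.g. via subprocess.run(...).
print(["bin/hive"] + hiveconf_args(overrides))

# Value to export as HIVE_OPTS.
print(hive_opts(overrides))
```

The list form is convenient when launching bin/hive from a script, while the joined string is what would be exported as HIVE_OPTS.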




Hive by Example -- Getting Started

• Start the CLI: bin/hive
• Basic DDL statements
   • List the existing tables
      • SHOW TABLES;




Create Table

• CREATE TABLE books (isbn INT, title STRING);
• DESCRIBE books;
   • isbn    int
   • title   string
• CREATE TABLE users (id INT, name STRING) PARTITIONED BY (vcol STRING);
   • What does PARTITIONED BY (vcol STRING) mean?




Logical Table Partitions

• A Hive table can be logically partitioned by a virtual column
• The virtual column's value is derived from the partition in which the data is stored
• A table can have multiple partitions
• Each partition is uniquely identified by a virtual column value
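
One way to picture a partition is as a subdirectory of the table's directory, named after the partition column and its value. The Python sketch below only builds such a path to illustrate the idea; the function name partition_dir and the value "2011" are hypothetical, and the warehouse root is the /user/hive/warehouse directory created earlier.

```python
import posixpath


def partition_dir(warehouse, table, partition):
    """Build the HDFS directory that holds one partition of a table.

    `partition` maps partition column names to values, in declaration order.
    """
    parts = [f"{column}={value}" for column, value in partition.items()]
    return posixpath.join(warehouse, table, *parts)


# Hypothetical partition value for the users table defined above.
print(partition_dir("/user/hive/warehouse", "users", {"vcol": "2011"}))
```

Because the value lives in the directory name rather than in the data files, the column is "virtual": it appears in query results without being stored in each row.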




Alter Table

• ALTER TABLE books ADD COLUMNS (author STRING, category STRING);
• Change Column Property
   • ALTER TABLE table_name CHANGE [COLUMN] old_column_name new_column_name column_type [COMMENT column_comment] [FIRST|AFTER column_name]




Alter Table Column Property

• ALTER TABLE books CHANGE author author ARRAY<STRING> COMMENT "multi-valued";
   • Both the old and the new column name must be specified, even when the name does not change
   • Here only the data type changes (STRING to ARRAY<STRING>)




Data Types Supported

• Primitives: INT, STRING, etc.
• Complex types: MAP, ARRAY, STRUCT




Rename Table

• ALTER TABLE books RENAME TO published_contents;
• DESCRIBE published_contents;
• DESCRIBE books; (execution error: the table no longer exists under the old name)




Drop Tables

• DROP TABLE published_contents;
• DROP TABLE users;




GroupLens Example -- Getting the Data Set

• Movie ratings -- 1 million records
• Available in tar.gz format: million-ml-data.tar__0.gz
• Extract: tar zxvf million-ml-data.tar__0.gz




Loading Rating Data

• Format of data in ratings.dat:
   • UserID::MovieID::Rating::Timestamp
• Replace the delimiter '::' with '#'
   • In vi: :%s/::/#/g
• Save as ratings.dat.hash_delimited
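
For files too large to edit comfortably in an editor, the same substitution can be scripted. This is a minimal Python sketch rather than the method used in the original session; the function names are hypothetical, the sample row is illustrative, and it assumes '#' does not already occur in the data.

```python
import sys


def convert_line(line, old="::", new="#"):
    """Replace every occurrence of the old delimiter in one line."""
    return line.replace(old, new)


def convert_file(src_path, dst_path):
    """Stream src_path to dst_path, rewriting the delimiter line by line."""
    # latin-1 decodes any byte sequence, so odd characters pass through intact.
    with open(src_path, encoding="latin-1") as src, \
            open(dst_path, "w", encoding="latin-1") as dst:
        for line in src:
            dst.write(convert_line(line))


# Illustrative row in the UserID::MovieID::Rating::Timestamp layout.
print(convert_line("1::1193::5::978300760"))

# Command-line use: python convert.py ratings.dat ratings.dat.hash_delimited
if __name__ == "__main__" and len(sys.argv) == 3:
    convert_file(sys.argv[1], sys.argv[2])
```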




Creating Metadata and Loading the File

• hive> CREATE TABLE ratings (userid INT, movieid INT, rating INT, tstamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;
• LOAD DATA LOCAL INPATH '<path/to/flat/file>' OVERWRITE INTO TABLE <table_name>;




File Load Properties

• No validation on load: it is the developer's responsibility to make sure the file matches the table schema
• Data can be on the local filesystem or on HDFS
• Data is copied into the Hive HDFS namespace
• If OVERWRITE is not specified, the data is appended
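
Since the load itself performs no validation, a quick check before loading can catch malformed rows early. The Python sketch below is one possible pre-load check, not a Hive feature; the function name invalid_rows, the schema tuple, and the sample rows are all hypothetical.

```python
def invalid_rows(lines, expected_types, delimiter="#"):
    """Yield (line_number, reason) for rows that do not fit the schema."""
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != len(expected_types):
            yield number, f"expected {len(expected_types)} fields, got {len(fields)}"
            continue
        for value, cast in zip(fields, expected_types):
            try:
                cast(value)
            except ValueError:
                yield number, f"{value!r} is not a valid {cast.__name__}"
                break


# Mirrors the ratings table: userid INT, movieid INT, rating INT, tstamp STRING.
RATINGS_SCHEMA = (int, int, int, str)

# Made-up rows: one valid, one with a bad rating, one with missing fields.
sample = ["1#1193#5#978300760\n", "1#661#three#978302109\n", "1#914\n"]
for problem in invalid_rows(sample, RATINGS_SCHEMA):
    print(problem)
```

Run against the real file, for example with open("ratings.dat.hash_delimited") as the lines argument, an empty result suggests the file is safe to load.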




Rating Data Load

• hive> LOAD DATA LOCAL INPATH '/path/to/ratings.dat.hash_delimited' OVERWRITE INTO TABLE ratings;




A SQL Style Query

• SELECT COUNT(*) FROM ratings;




Loading movies and users data

• Now load the movies and users data in the same way as the ratings data.
   • Details on the console...
• CREATE TABLE users_2 (userid INT, gender STRING, age INT, occupation STRING, zipcode STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;
• add FILE /Users/tshanky/workspace/hadoop_workspace/hive_workspace/occupation_mapper.py;
• INSERT OVERWRITE TABLE users_2 SELECT TRANSFORM (userid, gender, age, occupation, zipcode) USING 'python occupation_mapper.py' AS (userid, gender, age, occupation_str, zipcode) FROM users;
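
The deck does not show occupation_mapper.py itself. The following is a hedged Python sketch of what such a script might look like, assuming Hive's default TRANSFORM behaviour of passing each row to the script as tab-separated text and reading tab-separated rows back. The occupation labels are an illustrative subset and an assumption; the real mapping should come from the data set's documentation.

```python
import io
import sys

# Illustrative subset of occupation codes; the labels are an assumption and
# should be replaced with the mapping documented for the data set.
OCCUPATIONS = {"0": "other", "1": "academic/educator", "12": "programmer"}


def map_row(line):
    """Replace the occupation code in one tab-separated row with its label."""
    userid, gender, age, occupation, zipcode = line.rstrip("\n").split("\t")
    label = OCCUPATIONS.get(occupation, "unknown")
    return "\t".join([userid, gender, age, label, zipcode])


def main(rows, out):
    """Stream rows in (tab-separated) and write the mapped rows back out."""
    for row in rows:
        out.write(map_row(row) + "\n")


# Demonstration with an in-memory stream and a made-up row. In the script
# handed to Hive, the last line would instead be: main(sys.stdin, sys.stdout)
demo_out = io.StringIO()
main(io.StringIO("1\tF\t1\t12\t48067\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Keeping the row logic in map_row makes it testable without Hive; only the final call decides whether the script reads from a test stream or from standard input.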




Good Old SQL

• SELECT * FROM movies LIMIT 5;
• SELECT * FROM ratings WHERE movieid = 1;
• SELECT COUNT(*) FROM ratings WHERE movieid < 10;
• SELECT COUNT(*) FROM ratings WHERE movieid = 1 AND rating = 5;


• SELECT title FROM movies WHERE title RLIKE '^Toy+';




More Than Good Old SQL

• SELECT `.+(id)` FROM ratings WHERE movieid = 1;
   • Regular-expression-based selection of columns by name (here, the columns whose names end in "id")


• SELECT ratings.rating, COUNT(ratings.rating) FROM ratings WHERE movieid = 1 GROUP BY ratings.rating; (group by)
• SELECT * FROM movies ORDER BY movieid DESC;
• DISTRIBUTE BY with SORT BY (CLUSTER BY when both use the same columns) -- controls how rows are partitioned across reducers and sorted within each




JOIN(s) in HiveQL

• Equality joins, outer joins, left semi-joins
   • SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) LIMIT 5;
• More than 2 tables:
   • SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title, users.gender FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) JOIN users ON (ratings.userid = users.userid) LIMIT 5;




Explain Plan: The MapReduce Under the Hood

• EXPLAIN SELECT COUNT(*) FROM ratings WHERE movieid = 1 AND rating = 5;
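
To build intuition for what such a plan describes, the query can be pictured as a map step that filters rows and emits a count, followed by a reduce step that adds the counts up. The Python sketch below is a conceptual illustration only, not the plan that EXPLAIN prints; the function names and the sample rows are made up.

```python
def map_phase(rows):
    """Emit 1 for every row that passes the WHERE clause."""
    for userid, movieid, rating, tstamp in rows:
        if movieid == 1 and rating == 5:
            yield 1


def reduce_phase(counts):
    """Sum the partial counts into the final COUNT(*)."""
    return sum(counts)


# Made-up rows in (userid, movieid, rating, tstamp) order.
rows = [
    (1, 1, 5, "978824268"),
    (6, 1, 4, "978237008"),
    (8, 1, 5, "978233496"),
    (9, 2, 5, "978225952"),
]

print(reduce_phase(map_phase(rows)))  # 2: the first and third rows match
```

On a cluster the map step runs in parallel over blocks of the file, which is why the filter and the partial counting happen before any data is brought together.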




Questions?




• blog: shanky.org | twitter: @tshanky
• st@treasuryofideas.com

Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
Ā 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Ā 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Ā 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Ā 

SDEC2011 Essentials of Hive

  • 1. Essentials of Hive: Mastering Hadoop Map-reduce for Data Analysis
      • Shashank Tiwari
      • blog: shanky.org | twitter: @tshanky
      • st@treasuryofideas.com
      • Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.
  • 2. What is Hive?
      • A data warehouse system for Hadoop
      • Facilitates data summarization and ad-hoc queries
      • Allows SQL-like querying using HiveQL, by transposing metadata onto data stored in HDFS
      • Can also plug in custom mappers and reducers
  • 3. Supported Platforms
      • Linux/Unix and Mac OSX
      • Does not work on Cygwin
  • 4. Required Software
      • Java 1.6.x
      • Hadoop 0.17.x to 0.20.x
  • 5. Download
      • Source: http://hive.apache.org/releases.html
      • Version: hive-0.7.0
      • Both binary and source distributions available
  • 6. Install
      • Extract: tar zxvf hive-0.7.0-bin.tar.gz
      • Move and Create Symbolic Link: ln -s hive-0.7.0-bin hive
      • Set environment variable HIVE_HOME to point to the hive directory
      • Add $HIVE_HOME/bin to your PATH environment variable
  • 7. Build From Source
      • $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
      • $ cd hive
      • $ ant clean package
      • The binary distribution is in build/dist
  • 8. Hive Needs Hadoop
      • Add Hadoop distribution to your path or set HADOOP_HOME
      • Start Hadoop daemons: bin/start-all.sh
  • 9. Configure Hive
      • Create /tmp in HDFS and set appropriate permissions
          • bin/hadoop fs -mkdir /tmp
          • bin/hadoop fs -chmod g+w /tmp
      • Create /user/hive/warehouse and set appropriate permissions
          • bin/hadoop fs -mkdir /user/hive/warehouse
          • bin/hadoop fs -chmod g+w /user/hive/warehouse
  • 10. Default Hive Configuration
      • Default configuration: conf/hive-default.xml
      • Override the default configuration by redefining properties in conf/hive-site.xml
      • Set HIVE_CONF_DIR to set a new location for the config file
      • Hive configuration is an overlay on top of the Hadoop configuration
  • 11. Hive Configuration Manipulation
      • Edit: conf/hive-site.xml
      • Use SET command on the Hive cli
      • Pass parameters to Hive
          • bin/hive -hiveconf prop1=val1 -hiveconf prop2=val2
          • set HIVE_OPTS to "-hiveconf prop1=val1 -hiveconf prop2=val2"
  • 12. Hive by Example -- Getting Started
      • Start the cli: bin/hive
      • Basic DDL statements
      • List the existing tables
          • SHOW TABLES;
  • 13. Create Table
      • CREATE TABLE books (isbn INT, title STRING);
      • DESCRIBE books;
          • isbn int
          • title string
      • CREATE TABLE users (id INT, name STRING) PARTITIONED BY (vcol STRING);
      • What does PARTITIONED BY (vcol STRING) mean?
  • 14. Logical Table Partitions
      • A Hive table can be logically partitioned by a virtual column
      • The virtual column's value is derived from the partition in which the data is stored
      • A table can have multiple partitions
      • Each partition is uniquely identified by a virtual column value
  • 15. Alter Table
      • ALTER TABLE books ADD COLUMNS (author STRING, category STRING);
      • Change Column Property
          • ALTER TABLE table_name CHANGE [COLUMN] old_column_name new_column_name column_type [COMMENT column_comment] [FIRST|AFTER column_name]
  • 16. Alter Table Column Property
      • ALTER TABLE books CHANGE author author ARRAY<STRING> COMMENT "multi-valued";
      • Both the old and the new column name must be specified, even when the name does not change
      • Here only the data type changes
  • 17. Data Types Supported
      • Primitives: INT, STRING, etc.
      • Complex types: map, array, struct
  • 18. Rename Table
      • ALTER TABLE books RENAME TO published_contents;
      • DESCRIBE published_contents;
      • DESCRIBE books; (Execution error!)
  • 19. Drop Tables
      • DROP TABLE published_contents;
      • DROP TABLE users;
  • 20. GroupLens Example -- Getting the Data Set
      • Movie ratings -- 1 million records
      • Available in tar.gz format: million-ml-data.tar__0.gz
      • Extract: tar zxvf million-ml-data.tar__0.gz
  • 21. Loading Rating Data
      • Format of data in ratings.dat:
          • UserID::MovieID::Rating::Timestamp
      • Replace the delimiter '::' with '#', for example in vi:
          • :%s/::/#/g
      • Save as .hash_delimited
  • 22. Creating Metadata and Loading the File
      • hive> CREATE TABLE ratings(userid INT, movieid INT, rating INT, tstamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;
      • LOAD DATA LOCAL INPATH <'path/to/flat/file'> OVERWRITE INTO TABLE <table name>;
  • 23. File Load Properties
      • No validation: it is the developer's responsibility to make sure the file matches the table schema
      • Data can be on the local filesystem or on HDFS
      • Data is copied to the Hive HDFS namespace
      • If OVERWRITE is not specified, the data is appended
  • 24. Rating Data Load
      • hive> LOAD DATA LOCAL INPATH '/path/to/ratings.dat.hash_delimited' OVERWRITE INTO TABLE ratings;
  • 25. A SQL Style Query
      • SELECT COUNT(*) FROM ratings;
  • 26. Loading movies and users data
      • Now load the movies and users data in the same way as the ratings data.
      • Details on the console...
      • CREATE TABLE users_2(userid INT, gender STRING, age INT, occupation STRING, zipcode STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;
      • add FILE /Users/tshanky/workspace/hadoop_workspace/hive_workspace/occupation_mapper.py;
      • INSERT OVERWRITE TABLE users_2 SELECT TRANSFORM (userid, gender, age, occupation, zipcode) USING 'python occupation_mapper.py' AS (userid, gender, age, occupation_str, zipcode) FROM users;
  • 27. Good Old SQL
      • SELECT * FROM movies LIMIT 5;
      • SELECT * FROM ratings WHERE movieid = 1;
      • SELECT COUNT(*) FROM ratings WHERE movieid < 10;
      • SELECT COUNT(*) FROM ratings WHERE movieid = 1 AND rating = 5;
      • SELECT title FROM movies WHERE title RLIKE '^Toy+';
  • 28. More Than Good Old SQL
      • SELECT `.+(id)` FROM ratings WHERE movieid = 1;
          • regular expression based selection of columns by name (here, the columns ending in "id")
      • SELECT ratings.rating, COUNT(ratings.rating) FROM ratings WHERE movieid = 1 GROUP BY ratings.rating; (group by)
      • SELECT * FROM movies ORDER BY movieid DESC;
      • DISTRIBUTE BY & SORT BY (CLUSTER BY) -- by partition
  • 29. JOIN(s) in HiveQL
      • equality joins, outer joins, left semi-joins
      • SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) LIMIT 5;
      • More than 2 tables:
          • SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title, users.gender FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) JOIN users ON (ratings.userid = users.userid) LIMIT 5;
  • 31. Explain Plan: the MapReduce Under the Hood
      • EXPLAIN SELECT COUNT(*) FROM ratings WHERE movieid = 1 AND rating = 5;
  • 32. Questions?
      • blog: shanky.org | twitter: @tshanky
      • st@treasuryofideas.com
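
Two steps in the GroupLens walkthrough happen outside Hive and are not shown in the slides: rewriting the '::' delimiter as '#' (slide 21) and the occupation_mapper.py script called by TRANSFORM (slide 26). The sketch below is one possible shape for both, not the author's script. It relies on Hive's default behaviour of passing TRANSFORM rows to the script as tab-separated lines on standard input and reading tab-separated lines back from standard output. The function names and the small occupation table are assumptions made for illustration; a real mapper would need the full code list from the data set's README.

```python
import sys

# Illustrative subset of occupation codes (assumption, not taken from the slides).
OCCUPATIONS = {
    "0": "other",
    "1": "academic/educator",
    "12": "programmer",
    "20": "writer",
}


def to_hash_delimited(line):
    """Replace the '::' field separator with '#', like :%s/::/#/g in vi."""
    return line.replace("::", "#")


def map_occupation(row):
    """Map one tab-separated input row to an output row with a readable occupation."""
    userid, gender, age, occupation, zipcode = row.rstrip("\n").split("\t")
    # Unknown codes are passed through unchanged rather than dropped.
    label = OCCUPATIONS.get(occupation, occupation)
    return "\t".join([userid, gender, age, label, zipcode])


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream rows from Hive (stdin) and write the mapped rows back (stdout)."""
    for row in stdin:
        stdout.write(map_occupation(row) + "\n")


# When saved as the mapper script, end the file with a call to main().
```

For example, to_hash_delimited("1::1193::5::978300760") gives "1#1193#5#978300760". With a final main() call added, the file could be registered with add FILE and invoked through USING 'python occupation_mapper.py' as on slide 26.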