Welcome to Hive
Hive
Hive - Introduction
● Data warehouse infrastructure tool
● Processes structured data in Hadoop
● Resides on top of Hadoop
● Makes data churning easy
● Provides SQL-like queries
Why Do We Need Hive?
● Developers face problems writing MapReduce logic
● How do we port existing relational databases and SQL infrastructure to Hadoop?
● End users are more familiar with SQL queries than with MapReduce and Pig
● Hive’s SQL-like query language makes data churning easy
Hive - Components
[Diagram: component stack showing Hive, MapReduce, and YARN; HiveServer2 listens on port 10000]
Hive - Limitations
• Does not provide row-level updates (in earlier versions)
• Not suitable for OLTP
• Queries have higher latency
• Start-up overhead for MapReduce jobs
• Works best when a large dataset is maintained and mined
Hive - Data Types - Numeric
● TINYINT (1-byte signed integer)
● SMALLINT (2-byte signed integer)
● INT (4-byte signed integer)
● BIGINT (8-byte signed integer)
● FLOAT (4-byte single precision floating point number)
● DOUBLE (8-byte double precision floating point number)
● DECIMAL (user-defined precision and scale)
Hive - Data Types - Date/Time
● TIMESTAMP (Hive 0.8.0 and later)
● DATE (Hive 0.12.0 and later), format YYYY-MM-DD
Hive - Data Types - String
● STRING
● VARCHAR (Hive 0.12.0 and later)
● CHAR (Hive 0.13.0 and later)
Hive - Data Types - Misc
● BOOLEAN
● BINARY (Hive 0.8.0 and later)
Hive - Data Types - Complex
● arrays: ARRAY<data_type>
● maps: MAP<primitive_type, data_type>
● structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
● union: UNIONTYPE<data_type, data_type, ...> (Hive 0.7.0 and later)
Hive - Data Types - Example
CREATE TABLE employees(
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING,
                 city:STRING,
                 state:STRING,
                 zip:INT>,
  auth UNIONTYPE<INT, INT, STRING>
);
Sample values for one row:
● name: "John"
● salary: 40000.00
● subordinates: ["Michael", "Rumi"]
● deductions: {"Insurance": 500.00, "Charity": 600.00}
● address: {"street": "2711", "city": "Sydney", "state": "Wales", "zip": 560064}
● auth: 168478292, stored as the first alternative (fbid). Hive's UNIONTYPE takes a list of types without field names; fbid, gid, and email are the slide's labels for the three alternatives.
Hive - Metastore
• Stores the metadata of tables in a relational database
• Metadata includes
  ○ Details of databases and tables
  ○ Table definitions: name of table, columns, partitions, etc.
Hive - Warehouse
● Hive tables are stored in the Hive warehouse directory
  ○ By default, /apps/hive/warehouse on HDFS
  ○ Or at the location specified in the table definition
Hive - Getting Started - Command Line
● Log in to the CloudxLab Linux console
● Type “hive” to open the Hive shell
● By default, the database named “default” is selected as the current database for the session
● Type “SHOW DATABASES;” to list all databases
● “SHOW TABLES;” lists the tables in the currently selected database, which is “default” at this point
● Create your own database named after your login
  ○ CREATE DATABASE abhinav9884;
  ○ DESCRIBE DATABASE abhinav9884;
  ○ DROP DATABASE abhinav9884;
● Create the database again, switch to it, and create a table
  ○ CREATE DATABASE abhinav9884;
  ○ USE abhinav9884;
  ○ CREATE TABLE x (a INT);
Hive - Getting Started - Hue
● Log in to Hue
● Click on “Query Editors” and select “Hive”
● Select your database (abhinav9884) from the list
● SELECT * FROM x;
● DESCRIBE x;
● DESCRIBE FORMATTED x;
Hive - Tables
● Managed tables
● External tables
Hive - Managed Tables
● Also known as internal tables
● Lifecycle is managed by Hive
● Data is stored in the warehouse directory
● Dropping the table deletes the data from the warehouse
Hive - External Tables
● Lifecycle is not managed by Hive
● Hive assumes that it does not own the data
● Dropping the table deletes only the metadata, not the underlying data
Hive - Managed Tables - Example
CREATE TABLE nyse(
  exchange1 STRING,
  symbol1 STRING,
  ymd STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
DESCRIBE nyse;
DESCRIBE FORMATTED nyse;
Hive - Loading Data - From Local Directory
● Copy the dataset from HDFS to the local file system: hadoop fs -copyToLocal /data/NYSE_daily
● Launch Hive
● use yourdatabase;
● load data local inpath 'NYSE_daily' overwrite into table nyse;
● This copies the data from the local file system to the warehouse
Hive - Loading Data - From HDFS
CREATE TABLE nyse_hdfs(
  exchange1 STRING,
  symbol1 STRING,
  ymd STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
● Copy /data/NYSE_daily to your home directory in HDFS
● load data inpath 'hdfs:///user/abhinav9884/NYSE_daily' overwrite into table nyse_hdfs;
● This moves the data from the specified location to the warehouse
● Check whether NYSE_daily is still in your home directory in HDFS
Hive - External Tables
CREATE EXTERNAL TABLE nyse_external (
  exchange1 STRING,
  symbol1 STRING,
  ymd STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/abhinav9884/NYSE_daily';
describe formatted nyse_external;
Hive - S3 Based External Table
create external table miniwikistats (
  projcode string,
  pagename string,
  pageviews int,
  bytes int)
partitioned by(dt string)
row format delimited fields terminated by ' '
lines terminated by '\n'
location 's3n://paid/default-datasets/miniwikistats/';
Hive - Select Statements
● Select all columns
SELECT * FROM nyse;
● Select only required columns
SELECT exchange1, symbol1 FROM nyse;
Hive - Aggregations
● Find the average opening price for each stock
SELECT symbol1, AVG(price_open) AS avg_price FROM nyse
GROUP BY symbol1;
● To improve performance, enable partial aggregation in the map phase
SET hive.map.aggr=true;
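The idea behind map-side aggregation for the AVG query can be sketched in Python. This is only an illustration under assumptions: the two input splits and their prices are made up, and the function names are hypothetical, not part of Hive. Each mapper emits one partial (sum, count) pair per symbol instead of every row, and the reducer merges those pairs.

```python
from collections import defaultdict

def map_side_aggregate(rows):
    """Partially aggregate one mapper's rows into {symbol: (sum, count)}."""
    partial = defaultdict(lambda: [0.0, 0])
    for symbol, price_open in rows:
        partial[symbol][0] += price_open
        partial[symbol][1] += 1
    return {symbol: tuple(acc) for symbol, acc in partial.items()}

def reduce_average(partials):
    """Merge the partial (sum, count) pairs from all mappers and compute AVG."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for symbol, (price_sum, count) in partial.items():
            total[symbol][0] += price_sum
            total[symbol][1] += count
    return {symbol: s / c for symbol, (s, c) in total.items()}

# Two hypothetical input splits of (symbol1, price_open) rows.
split_1 = [("CMC", 10.0), ("IBM", 100.0), ("CMC", 20.0)]
split_2 = [("IBM", 110.0), ("CMC", 30.0)]

partials = [map_side_aggregate(split_1), map_side_aggregate(split_2)]
averages = reduce_average(partials)
print(averages)
```

Carrying (sum, count) rather than a per-mapper average is what keeps the final result correct when the splits have different sizes.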
Hive - Saving Data
● In the local file system
insert overwrite local directory '/home/abhinav9884/onlycmc'
select * from nyse where symbol1 = 'CMC';
● In HDFS
insert overwrite directory 'onlycmc'
select * from nyse where symbol1 = 'CMC';
Hive - Tables - DDL - ALTER
● Rename a table
ALTER TABLE x RENAME TO x1;
● Change datatype of column
ALTER TABLE x1 CHANGE a a FLOAT;
● Add columns in existing table
ALTER TABLE x1 ADD COLUMNS (b FLOAT, c INT);
Hive - Partitions
# First name, Department, Year of joining
Mark, Engineering, 2012
Jon, HR, 2012
Monica, Finance, 2015
Steve, Engineering, 2012
Michael, Marketing, 2015
Hive - Partitions - Hands-on
● Data is located at /data/bdhs/employees/ on HDFS
● Copy the data to your home directory in HDFS
hadoop fs -cp /data/bdhs/employees .
● Create the table
CREATE TABLE employees(
  name STRING,
  department STRING,
  somedate DATE
)
PARTITIONED BY(year STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
● Load dataset 2012.csv
load data inpath 'hdfs:///user/sandeepgiri9034/employees/2012.csv' into table employees partition (year=2012);
● Load dataset 2015.csv
load data inpath 'hdfs:///user/sandeepgiri9034/employees/2015.csv' into table employees partition (year=2015);
● SHOW PARTITIONS employees;
● Check the warehouse and the metastore
Hive - Partitions - Summary
• Partitions let queries avoid a full table scan
• The data is stored in separate directories in the warehouse, one per partition
• Define the partitions using PARTITIONED BY in CREATE TABLE
• We can also add a partition later
• A table can be partitioned on multiple columns (year=2012, month=10, day=12)
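To make the partition summary concrete, here is a small Python model of how partitioned data might be laid out and why a filter on the partition column avoids a full scan. It is a sketch under assumptions: the warehouse is a plain dictionary from directory name to rows, and load_into_partition and scan are hypothetical helpers, not Hive APIs.

```python
from collections import defaultdict

def load_into_partition(warehouse, table, partition, rows):
    """Append rows under a directory named after the partition, e.g. employees/year=2012."""
    key, value = partition
    warehouse[f"{table}/{key}={value}"].extend(rows)

def scan(warehouse, table, partition=None):
    """Return (rows, directories_read); a partition filter reads a single directory."""
    if partition is None:
        dirs = [d for d in warehouse if d.startswith(table + "/")]
    else:
        key, value = partition
        dirs = [d for d in [f"{table}/{key}={value}"] if d in warehouse]
    rows = [row for d in dirs for row in warehouse[d]]
    return rows, dirs

warehouse = defaultdict(list)
load_into_partition(warehouse, "employees", ("year", "2012"),
                    [("Mark", "Engineering"), ("Jon", "HR"), ("Steve", "Engineering")])
load_into_partition(warehouse, "employees", ("year", "2015"),
                    [("Monica", "Finance"), ("Michael", "Marketing")])

# Equivalent in spirit to: SELECT * FROM employees WHERE year = '2015';
rows_2015, dirs_read = scan(warehouse, "employees", ("year", "2015"))
print(dirs_read, rows_2015)
```

Only the year=2015 directory is touched for the filtered query, while an unfiltered scan has to read every partition directory.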
Hive - Views
● SELECT * FROM employees where department='Engineering';
● Create a view
CREATE VIEW employees_engineering AS
SELECT * FROM employees where department='Engineering';
● Now query from the view
SELECT * FROM employees_engineering;
Hive - Views - Summary
• Allows a query to be saved and treated like a table
• Logical construct: does not store data
• Hides query complexity
• Divides a long and complicated query into smaller, manageable pieces
• Similar to writing a function in a programming language
Hive - Load JSON Data
• Download the JSON SerDe binaries
• ADD JAR hdfs:///data/serde/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar;
• Create the table
CREATE EXTERNAL TABLE tweets_raw (
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/abhinav9884/senti/upload/data/tweets_raw';
Hive - Sorting & Distributing - Order By
ORDER BY x
● Guarantees global ordering
● All data goes through a single reducer
● This is unacceptable for large datasets because it overloads that reducer
● You end up with one sorted file as output
Hive - Sorting & Distributing - Sort By
SORT BY x
● Orders the data within each of the N reducers
● The number of reducers is about one per 1 GB of input (configurable)
● You end up with N or more sorted files with overlapping ranges
Example with two reducers:
Reducer 1: Aaron, Bain
Reducer 2: Adam, Barbara
Concatenated file: Aaron, Bain, Adam, Barbara
Hive - Sorting & Distributing - Distribute By
DISTRIBUTE BY x
● Ensures that all rows with the same value of x go to the same reducer, so the N reducers get non-overlapping sets of x
● Does not sort the output of each reducer
● You end up with N or more unsorted files with non-overlapping sets of x
Example with two reducers:
Reducer 1: Adam, Aaron
Reducer 2: Barbara, Bain
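Hive - Sorting & Distributing - Cluster By
CLUSTER BY x
● Is the same as DISTRIBUTE BY x combined with SORT BY x
● Each reducer gets a non-overlapping set of x values and sorts its own output
● Ordering is guaranteed within each reducer's output, not across all output files
● A more scalable alternative to ORDER BY when a single globally sorted file is not required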
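The difference between the four clauses can be simulated in a few lines of Python. This is a rough model, not Hive's implementation: reducers are plain lists, SORT BY is assumed to receive rows round-robin, and stable_hash is a stand-in hash chosen only so that runs are repeatable.

```python
import zlib

def stable_hash(value):
    """Deterministic stand-in for a hash function (not Hive's actual hash)."""
    return zlib.crc32(value.encode("utf-8"))

def order_by(rows):
    """ORDER BY: a single reducer sorts everything into one globally sorted output."""
    return [sorted(rows)]

def sort_by(rows, n):
    """SORT BY: rows reach reducers arbitrarily (round-robin here); each reducer sorts its share."""
    reducers = [rows[i::n] for i in range(n)]
    return [sorted(part) for part in reducers]

def distribute_by(rows, n):
    """DISTRIBUTE BY: equal keys land on the same reducer; output is left unsorted."""
    reducers = [[] for _ in range(n)]
    for row in rows:
        reducers[stable_hash(row) % n].append(row)
    return reducers

def cluster_by(rows, n):
    """CLUSTER BY: DISTRIBUTE BY followed by SORT BY on the same key."""
    return [sorted(part) for part in distribute_by(rows, n)]

names = ["Bain", "Barbara", "Aaron", "Adam"]
print(order_by(names))       # one list, globally sorted
print(sort_by(names, 2))     # [['Aaron', 'Bain'], ['Adam', 'Barbara']]
print(distribute_by(names, 2))
print(cluster_by(names, 2))
```

Concatenating the SORT BY outputs gives Aaron, Bain, Adam, Barbara, which is sorted per reducer but not globally. The same holds for CLUSTER BY in this model, since hashing decides which reducer a key goes to.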
Hive - Bucketing
CREATE TABLE page_view(viewTime INT, userid BIGINT,
  page_url STRING, referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User'
)
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
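As a rough illustration of what CLUSTERED BY(userid) INTO 32 BUCKETS implies, the Python sketch below assigns each row to a bucket by hashing the key and taking it modulo the bucket count. The hash used here (the integer itself, masked to be non-negative) is an assumption for the example; Hive's actual hash function depends on the column type and Hive version. The user IDs are made up.

```python
NUM_BUCKETS = 32  # matches CLUSTERED BY(userid) INTO 32 BUCKETS

def bucket_for(userid, num_buckets=NUM_BUCKETS):
    """Illustrative bucket assignment: non-negative hash of the key modulo the bucket count.

    This sketch uses the integer value itself as the hash, which is an assumption.
    """
    return (userid & 0x7FFFFFFF) % num_buckets

# Hypothetical user IDs; rows with the same userid always share a bucket.
buckets = {}
for userid in [1, 33, 65, 2, 34]:
    buckets.setdefault(bucket_for(userid), []).append(userid)

print(buckets)  # {1: [1, 33, 65], 2: [2, 34]}
```

Because a given userid always maps to the same bucket, a query that samples or joins on userid can work with individual buckets instead of the whole partition.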
Hive - ORC Files
● Optimized Row Columnar file format
● Provides a highly efficient way to store Hive data
● Improves performance when
  ○ Reading
  ○ Writing
  ○ Processing
● Has a built-in index, min/max values, and other aggregations
● Proven in large-scale deployments
  ○ Facebook uses the ORC file format for a 300+ PB deployment
Hive - ORC Files - Example
CREATE TABLE orc_table (
  first_name STRING,
  last_name STRING
) STORED AS ORC;
INSERT INTO orc_table VALUES('John', 'Gill');
SELECT * from orc_table;
To know more, visit https://orc.apache.org
Hive - Quick Recap
• Each table has a location
• By default, the table is stored in a directory under /apps/hive/warehouse
• We can override that location with the LOCATION clause in CREATE TABLE
• LOAD DATA copies the data if the source is local
• LOAD DATA moves the data if the source is on HDFS, for both external and managed tables
• Dropping a managed table deletes the data at its location
• Dropping an external table does not delete the data at its location
• The metadata is stored in a relational database, the Hive metastore
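The recap rules can be condensed into a toy Python model. Everything in it is an assumption made for illustration: the file systems and the metastore are dictionaries, MiniWarehouse and its methods are hypothetical names, and /user/someuser is a made-up path. It only mirrors the copy, move, and drop behavior listed above.

```python
class MiniWarehouse:
    """Toy model of the recap rules; each 'file system' maps a path to a list of rows."""

    def __init__(self):
        self.local = {}      # local file system
        self.hdfs = {}       # HDFS
        self.metastore = {}  # table name -> {"location": ..., "external": ...}

    def create_table(self, name, location=None, external=False):
        # Without an explicit location, the table lives under the warehouse directory.
        location = location or f"/apps/hive/warehouse/{name}"
        self.metastore[name] = {"location": location, "external": external}
        self.hdfs.setdefault(location, [])

    def load_local(self, path, table):
        # LOAD DATA LOCAL INPATH copies: the local file stays in place.
        self.hdfs[self.metastore[table]["location"]].extend(self.local[path])

    def load_hdfs(self, path, table):
        # LOAD DATA INPATH moves: the source path disappears from HDFS.
        self.hdfs[self.metastore[table]["location"]].extend(self.hdfs.pop(path))

    def drop_table(self, table):
        meta = self.metastore.pop(table)           # metadata is always removed
        if not meta["external"]:
            self.hdfs.pop(meta["location"], None)  # managed: data is removed too


w = MiniWarehouse()
w.local["NYSE_daily"] = ["row1", "row2"]
w.hdfs["/user/someuser/NYSE_daily"] = ["row3"]
w.create_table("nyse")
w.create_table("nyse_external", location="/user/someuser/external_dir", external=True)
w.load_local("NYSE_daily", "nyse")
w.load_hdfs("/user/someuser/NYSE_daily", "nyse")
w.drop_table("nyse")
w.drop_table("nyse_external")
print(sorted(w.hdfs), sorted(w.local), w.metastore)
```

After the two drops, the local file is still present, the HDFS source that was loaded is gone, the managed table's directory is gone, and the external table's directory remains.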
Hive - Connecting to Tableau
• Tableau is a visualization tool
• Tableau turns data into visually appealing, interactive visualizations called dashboards for quick insight
Hive - Connecting to Tableau - Steps
• Download and install Tableau Desktop from https://www.tableau.com/products/desktop
• Download and install the Hortonworks ODBC driver for Apache Hive for your OS from https://hortonworks.com/downloads/
Hive - Connecting to Tableau - Hands-on
• Visualize the top 10 stocks with the highest opening price on Dec 31, 2009
Hive - Quick Demo
1. Copy data from /data/ml100k/u.data into your HDFS home directory
2. Open Hive in Hue and run the following:
CREATE TABLE u_data(userid INT, movieid INT, rating INT, unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA INPATH '/user/sandeepgiri9034/u.data' overwrite into table u_data;
select * from u_data limit 5;
select movieid, avg(rating) ar from u_data group by movieid order by ar desc;
Hive - Quick Demo
Join with Movie Names
create view top100m as
select movieid, avg(rating) ar from u_data group by movieid order by ar desc;
CREATE TABLE m_data(movieid INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
load data inpath '/user/sandeepgiri9034/u.item' into table m_data;
select * from m_data limit 100;
select * from m_data, top100m where top100m.movieid = m_data.movieid;
Hive - Assignment
1. For each movie, how many users rated it?
2. For movies with more than 30 ratings, what is the average rating?

Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024Stephen Perrenod
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctBrainSell Technologies
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...marcuskenyatta275
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfFIDO Alliance
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!Memoori
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimaginedpanagenda
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentationyogeshlabana357357
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsStefano
 

Recently uploaded (20)

Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 

Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab

  • 2. Hive Hive - Introduction ● Data warehouse infrastructure tool ● Processes structured data in Hadoop ● Resides on top of Hadoop ● Makes data churning easy ● Provides SQL-like queries
  • 3. Hive Why Do We Need Hive? ● Developers find it hard to write MapReduce logic ● How do we port existing ○ relational databases ○ SQL infrastructure to Hadoop? ● End users are more familiar with SQL queries than with MapReduce and Pig ● Hive’s SQL-like query language makes data churning easy
  • 5. Hive Hive - Limitations • Does not provide row-level updates (earlier versions) • Not suitable for OLTP • Queries have higher latency • Start-up overhead for MapReduce jobs • Works best when a large dataset is maintained and mined
  • 6. Hive Hive - Data Types - Numeric ● TINYINT (1-byte signed integer) ● SMALLINT (2-byte signed integer) ● INT (4-byte signed integer) ● BIGINT (8-byte signed integer) ● FLOAT (4-byte single precision floating point number) ● DOUBLE (8-byte double precision floating point number) ● DECIMAL (user-defined precision)
  • 7. Hive Hive - Data Types - Date/Time ● TIMESTAMP ( Hive version >= 0.8.0 ) ● DATE ( Hive version >= 0.12.0 ) - YYYY-MM-DD
  • 8. Hive Hive - Data Types - String ● STRING ● VARCHAR ( Hive version >= 0.12.0 ) ● CHAR ( Hive version >= 0.13.0 )
  • 9. Hive Hive - Data Types - Misc ● BOOLEAN ● BINARY ( Hive version >= 0.8.0 )
  • 10. Hive Hive - Data Types - Complex arrays: ARRAY<data_type> maps: MAP<primitive_type, data_type> structs: STRUCT<col_name : data_type [COMMENT col_comment], ...> union: UNIONTYPE<data_type, data_type, ...> ( Hive version >= 0.7.0 )
  • 11-17. Hive Hive - Data Types - Example CREATE TABLE employees( name STRING, salary FLOAT, subordinates ARRAY<STRING>, deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>, auth UNIONTYPE<INT, INT, STRING> ); Sample value for each column: ● name: “John” ● salary: 40000.00 ● subordinates: [“Michael”, “Rumi”] ● deductions: { “Insurance”: 500.00, “Charity”: 600.00 } ● address: { “street”: “2711”, “city”: “Sydney”, “state”: “Wales”, “zip”: 560064 } ● auth: 168478292 (the fbid alternative; UNIONTYPE members are unnamed, listed here in the order fbid INT, gid INT, email STRING)
  • 18. Hive Hive - Metastore • Stores the metadata of tables in a relational database • Metadata includes • Details of databases and tables • Table definitions: name of table, columns, partitions etc.
  • 19. Hive Hive - Warehouse ● Hive tables are stored in the Hive warehouse directory ● By default: /apps/hive/warehouse on HDFS ● Otherwise: at the location specified in the table definition
  • 20. Hive Hive - Getting Started - Command Line ● Log in to the CloudxLab Linux console ● Type “hive” to access the Hive shell ● By default, the database named “default” is selected as the current database for the session ● Type “SHOW DATABASES;” to list all databases
  • 21. Hive Hive - Getting Started - Command Line ● “SHOW TABLES;” lists the tables in the currently selected database, which is “default” at this point ● Create your own database with your login name ● CREATE DATABASE abhinav9884; ● DESCRIBE DATABASE abhinav9884; ● DROP DATABASE abhinav9884;
  • 22. Hive Hive - Getting Started - Command Line ● CREATE DATABASE abhinav9884; ● USE abhinav9884; ● CREATE TABLE x (a INT);
  • 23. Hive Hive - Getting Started - Hue ● Log in to Hue ● Click on “Query Editors” and select “Hive” ● Select your database (abhinav9884) from the list ● SELECT * FROM x; ● DESCRIBE x; ● DESCRIBE FORMATTED x;
  • 24. Hive Hive - Tables ● Managed tables ● External tables
  • 25. Hive Hive - Managed Tables ● Aka Internal ● Lifecycle managed by Hive ● Data is stored in the warehouse directory ● Dropping the table deletes data from warehouse
  • 26. Hive Hive - External Tables ● The lifecycle is not managed by Hive ● Hive assumes that it does not own the data ● Dropping the table does not delete the underlying data ● Metadata will be deleted
  • 27. Hive Hive - Managed Tables - Example CREATE TABLE nyse( exchange1 STRING, symbol1 STRING, ymd STRING, price_open FLOAT, price_high FLOAT, price_low FLOAT, price_close FLOAT, volume INT, price_adj_close FLOAT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; DESCRIBE nyse; DESCRIBE FORMATTED nyse;
  • 28. Hive Hive - Loading Data - From Local Directory ● hadoop fs -copyToLocal /data/NYSE_daily ● Launch Hive ● use yourdatabase; ● load data local inpath 'NYSE_daily' overwrite into table nyse; ● Copies the data from local file system to warehouse
  • 29. Hive Hive - Loading Data - From HDFS CREATE TABLE nyse_hdfs( exchange1 STRING, symbol1 STRING, ymd STRING, price_open FLOAT, price_high FLOAT, price_low FLOAT, price_close FLOAT, volume INT, price_adj_close FLOAT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  • 30. Hive Hive - Loading Data - From HDFS ● Copy /data/NYSE_daily to your home directory in HDFS ● load data inpath 'hdfs:///user/abhinav9884/NYSE_daily' overwrite into table nyse_hdfs; ● Moves the data from specified location to warehouse ● Check if NYSE_daily is in your home directory in HDFS
  • 31. Hive Hive - External Tables CREATE EXTERNAL TABLE nyse_external ( exchange1 STRING, symbol1 STRING, ymd STRING, price_open FLOAT, price_high FLOAT, price_low FLOAT, price_close FLOAT, volume INT, price_adj_close FLOAT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/user/abhinav9884/NYSE_daily'; describe formatted nyse_external;
  • 32. Hive Hive - S3 Based External Table create external table miniwikistats ( projcode string, pagename string, pageviews int, bytes int) partitioned by(dt string) row format delimited fields terminated by ' ' lines terminated by '\n' location 's3n://paid/default-datasets/miniwikistats/';
  • 33. Hive Hive - Select Statements ● Select all columns SELECT * FROM nyse; ● Select only required columns SELECT exchange1, symbol1 FROM nyse;
  • 34. Hive Hive - Aggregations ● Find the average opening price for each stock SELECT symbol1, AVG(price_open) AS avg_price FROM nyse GROUP BY symbol1; ● To improve performance, enable partial (map-side) aggregation in the map phase SET hive.map.aggr=true;
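
The idea behind map-side aggregation for AVG can be sketched in plain Python. This is only an illustration under assumptions, not Hive code: the rows, the two-mapper split, and the function names `map_side_partial` and `reduce_merge` are all hypothetical. Each mapper emits one partial (sum, count) pair per symbol instead of one record per row, and the reducer merges the partials before dividing.

```python
from collections import defaultdict

def map_side_partial(rows):
    """Partial aggregation inside one mapper: symbol -> (sum, count)."""
    partial = defaultdict(lambda: [0.0, 0])
    for symbol, price_open in rows:
        partial[symbol][0] += price_open
        partial[symbol][1] += 1
    return {s: (total, n) for s, (total, n) in partial.items()}

def reduce_merge(partials):
    """Reducer merges the per-mapper partials, then computes AVG = sum / count."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for symbol, (total, n) in partial.items():
            merged[symbol][0] += total
            merged[symbol][1] += n
    return {s: total / n for s, (total, n) in merged.items()}

# Hypothetical (symbol1, price_open) rows split across two mappers.
mapper1 = [("CMC", 10.0), ("CMC", 20.0), ("IBM", 100.0)]
mapper2 = [("CMC", 30.0), ("IBM", 120.0)]

partials = [map_side_partial(mapper1), map_side_partial(mapper2)]
avg_price = reduce_merge(partials)
print(avg_price)
```

In this toy run, five input rows shrink to four partial records before the shuffle; on real data with many rows per symbol the reduction would be much larger.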
  • 35. Hive Hive - Saving Data ● In local file system insert overwrite local directory '/home/abhinav9884/onlycmc' select * from nyse where symbol1 = 'CMC'; ● In HDFS insert overwrite directory 'onlycmc' select * from nyse where symbol1 = 'CMC';
  • 36. Hive Hive - Tables - DDL - ALTER ● Rename a table ALTER TABLE x RENAME TO x1; ● Change datatype of column ALTER TABLE x1 CHANGE a a FLOAT; ● Add columns in existing table ALTER TABLE x1 ADD COLUMNS (b FLOAT, c INT);
  • 37. Hive Hive - Partitions #First name, Department, Year of joining Mark, Engineering, 2012 Jon, HR, 2012 Monica, Finance, 2015 Steve, Engineering, 2012 Michael, Marketing, 2015
  • 38. Hive Hive - Partitions - Hands-on ● Data is located at /data/bdhs/employees/ on HDFS ● Copy data to your home directory in HDFS hadoop fs -cp /data/bdhs/employees . ● Create table CREATE TABLE employees( name STRING, department STRING, somedate DATE ) PARTITIONED BY(year STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
  • 39. Hive Hive - Partitions - Hands-on ● Load dataset 2012.csv load data inpath 'hdfs:///user/sandeepgiri9034/employees/2012.csv' into table employees partition (year=2012); ● Load dataset 2015.csv load data inpath 'hdfs:///user/sandeepgiri9034/employees/2015.csv' into table employees partition (year=2015); ● SHOW PARTITIONS employees; ● Check warehouse and metastore
  • 40. Hive Hive - Partitions - Summary • Partitions let queries avoid a full table scan • The data is stored in separate directories in the warehouse, one per partition • Define the partitions using “PARTITIONED BY” in “CREATE TABLE” • We can also add a partition later • A table can be partitioned on multiple columns (year=2012, month=10, day=12)
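
A small Python sketch of how partitioning maps rows to directories and why a filter on the partition column avoids a full scan. It is a simulation, not Hive code: it assumes the table lives in the default database directly under /apps/hive/warehouse, and the helper names `partition_path`, `load`, and `scan` are hypothetical.

```python
from collections import defaultdict

WAREHOUSE = "/apps/hive/warehouse"  # warehouse directory named in the slides

def partition_path(table, **parts):
    """Build the directory for one partition, e.g. .../employees/year=2012."""
    suffix = "/".join(f"{key}={value}" for key, value in parts.items())
    return f"{WAREHOUSE}/{table}/{suffix}"

def load(rows, table):
    """Group rows into partition directories keyed by the year column."""
    files = defaultdict(list)
    for name, department, year in rows:
        files[partition_path(table, year=year)].append((name, department))
    return dict(files)

def scan(files, table, year):
    """Partition pruning: read only the directory for the requested year."""
    return files.get(partition_path(table, year=year), [])

rows = [
    ("Mark", "Engineering", 2012),
    ("Jon", "HR", 2012),
    ("Monica", "Finance", 2015),
    ("Steve", "Engineering", 2012),
    ("Michael", "Marketing", 2015),
]

files = load(rows, "employees")
print(sorted(files))
print(scan(files, "employees", 2015))
print(partition_path("logs", year=2012, month=10, day=12))
```

The last call shows the nested layout for a multi-column partition; the table name `logs` is made up for the example.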
  • 41. Hive Hive - Views ● SELECT * FROM employees where department='Engineering'; ● Create a view CREATE VIEW employees_engineering AS SELECT * FROM employees where department='Engineering'; ● Now query from the view SELECT * FROM employees_engineering;
  • 42. Hive Hive - Views - Summary • Allows a query to be saved and treated like a table • Logical construct - does not store data • Hides the query complexity • Divides a long and complicated query into smaller, manageable pieces • Similar to writing a function in a programming language
  • 43. Hive Hive - Load JSON Data • Download JSON-SERDE BINARIES • ADD JAR hdfs:///data/serde/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar; • Create Table CREATE EXTERNAL TABLE tweets_raw ( ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/user/abhinav9884/senti/upload/data/tweets_raw';
  • 44. Hive Hive - Sorting & Distributing - Order By ORDER BY x ● Guarantees global ordering ● Data goes through just one reducer ● This is unacceptable for large datasets as it will overload the reducer ● You end up with one sorted file as output
  • 45. Hive Hive - Sorting & Distributing - Sort By SORT BY x ● Orders data at each of N reducers ● The number of reducers is 1 per 1 GB of input ● You end up with N or more sorted files with overlapping ranges
  • 46. Hive Hive - Sorting & Distributing - Sort By ● Reducer 1 output: Aaron, Bain ● Reducer 2 output: Adam, Barbara ● Concatenated file: Aaron, Bain, Adam, Barbara (sorted within each reducer, not globally)
  • 47. Hive Hive - Sorting & Distributing - Distribute By DISTRIBUTE BY x ● Ensures each of N reducers gets a non-overlapping set of x values (all rows with the same x go to the same reducer) ● But doesn't sort the output of each reducer ● You end up with N or more unsorted files with non-overlapping values of x
  • 48. Hive Hive - Sorting & Distributing - Distribute By ● Reducer 1: Adam, Aaron ● Reducer 2: Barbara, Bain
  • 49. Hive Hive - Sorting & Distributing - Cluster By CLUSTER BY x ● Is the same as (DISTRIBUTE BY x and SORT BY x) ● Each reducer's output is sorted and reducers do not share values of x, but a total order across all output files is not guaranteed ● CLUSTER BY is the more scalable alternative to ORDER BY when a single globally sorted file is not required
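
The four clauses can be compared with a small Python simulation. This is a sketch under assumptions, not Hive's implementation: the number of reducers is fixed at two, rows reach SORT BY reducers round-robin, and `zlib.crc32` stands in for Hive's own hash partitioner. All function names are hypothetical.

```python
import zlib

N = 2  # number of reducers (assumption for this sketch)

def reducer_for(key, n=N):
    """Deterministic stand-in for a hash partitioner: same key, same reducer."""
    return zlib.crc32(key.encode("utf-8")) % n

def order_by(rows):
    """One reducer sorts everything: a single globally sorted output."""
    return [sorted(rows)]

def sort_by(rows, n=N):
    """Rows reach reducers arbitrarily (round-robin here); each sorts its share."""
    return [sorted(rows[i::n]) for i in range(n)]

def distribute_by(rows, n=N):
    """Rows with the same key go to the same reducer; no sorting is done."""
    outputs = [[] for _ in range(n)]
    for row in rows:
        outputs[reducer_for(row, n)].append(row)
    return outputs

def cluster_by(rows, n=N):
    """DISTRIBUTE BY followed by SORT BY on the same key."""
    return [sorted(part) for part in distribute_by(rows, n)]

names = ["Barbara", "Adam", "Bain", "Aaron", "Adam", "Bain"]

for clause in (order_by, sort_by, distribute_by, cluster_by):
    print(clause.__name__, clause(names))
```

With this input, `sort_by` places "Adam" and "Bain" in both reducer outputs, while `distribute_by` and `cluster_by` keep every copy of a name in one output.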
  • 50. Hive Hive - Bucketing CREATE TABLE page_view(viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User' ) COMMENT 'This is the page view table' PARTITIONED BY(dt STRING, country STRING) CLUSTERED BY(userid) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' STORED AS SEQUENCEFILE;
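
How CLUSTERED BY(userid) INTO 32 BUCKETS spreads rows can be sketched in Python. This is a simplification and not Hive's exact hash function, which depends on the column type and Hive version: here the hash of a non-negative integer id is assumed to be the id itself, and the helper names and sample ids are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 32  # matches "INTO 32 BUCKETS" in the page_view table

def bucket_for(userid, num_buckets=NUM_BUCKETS):
    """Simplified bucket rule: assumes hash(userid) == userid for small non-negative ids."""
    return userid % num_buckets

def bucketize(userids, num_buckets=NUM_BUCKETS):
    """Group user ids by the bucket they would fall into."""
    buckets = defaultdict(list)
    for uid in userids:
        buckets[bucket_for(uid, num_buckets)].append(uid)
    return dict(buckets)

# Hypothetical user ids.
buckets = bucketize([1, 33, 65, 2, 34, 31, 32])
print(buckets)
```

Because a given userid always lands in the same bucket, sampling one bucket or matching buckets with the same number in two tables bucketed the same way touches only a fraction of the data.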
  • 51. Hive Hive - ORC Files ● Optimized Row Columnar file format ● Provides a highly efficient way to store Hive data ● Improves performance when ○ Reading ○ Writing ○ Processing ● Has a built-in index, min/max values, and other aggregations ● Proven in large-scale deployments ○ Facebook uses the ORC file format for a 300+ PB deployment
  • 52. Hive Hive - ORC Files - Example CREATE TABLE orc_table ( first_name STRING, last_name STRING ) STORED AS ORC; INSERT INTO orc_table VALUES('John', 'Gill'); SELECT * from orc_table; To know more, please visit https://orc.apache.org
  • 53. Hive Hive - Quick Recap • Each table has got a location • By default the table is in a directory under the location /apps/hive/warehouse • We can override that location by mentioning 'location' in create table clause • Load data copies the data if it is local • Load moves the data if it is on hdfs for both external and managed tables • Dropping managed table deletes the data at the 'location' • Dropping external table does not delete the data at the 'location' • The metadata is stored in the relational database - hive metastore
  • 54. Hive Hive - Connecting to Tableau • Tableau is a visualization tool • Tableau allows for instantaneous insight by transforming data into visually appealing, interactive visualizations called dashboards
  • 55. Hive Hive - Connecting to Tableau - Steps • Download and install Tableau desktop from https://www.tableau.com/products/desktop • Download and install Hortonworks ODBC driver for Apache Hive for your OS https://hortonworks.com/downloads/
  • 56. Hive Hive - Connecting to Tableau - Hands-on Visualize top 10 stocks with highest opening price on Dec 31, 2009
  • 57. Hive Hive - Quick Demo 1. Copy data from /data/ml100k/u.data into our HDFS home 2. Open Hive in Hue and run the following: CREATE TABLE u_data( userid INT, movieid INT, rating INT, unixtime STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE; LOAD DATA INPATH '/user/sandeepgiri9034/u.data' overwrite into table u_data; select * from u_data limit 5; select movieid, avg(rating) ar from u_data group by movieid order by ar desc;
  • 58. Hive Hive - Quick Demo Join with Movie Names create view top100m as select movieid, avg(rating) ar from u_data group by movieid order by ar desc; CREATE TABLE m_data( movieid INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE; load data inpath '/user/sandeepgiri9034/u.item' into table m_data; select * from m_data limit 100; select * from m_data, top100m where top100m.movieid = m_data.movieid;
  • 59. Hive Hive - Assignment 1. For each movie, how many users rated it? 2. For movies with more than 30 ratings, what is the average rating?