Hive

BALA KRISHNA G
Global Big Data Bootcamp – Jan 2014
(http://globalbigdataconference.com)

This presentation is one of my talks at the "Global Big Data Conference", held at the end of January 2014. It is aimed at giving the audience an overview of Hive and hands-on experience with the Hive Query Language. The overview covers the need for Hive, Hive architecture, Hive components, the Hive Query Language, and more.

Hive

  1. Hive. BALA KRISHNA G, Global Big Data Bootcamp – Jan 2014 (http://globalbigdataconference.com), Global Big Data Conference - 2014
  2. My introduction
      – Senior Software and Research Engineer
      – Big data trainer
      – More than 1.5 years of experience with Hadoop and Storm
      – Worked at various big companies: SUN/ORACLE, IBM, etc.
      – www.linkedin.com/in/gbalakrishna/
      – bala.gsbk@outlook.com
  3. Agenda
      Class structure: 1 hour lecture and 1 ½ hour lab
      Lecture:
      – Need for Hive
      – Hive history
      – Hive powered by
      – What is Hive?
      – Hive Architecture
      – Hive Query Life cycle
      – Hive Query Language (HiveQL)
      Lab:
      – Extensive hands-on experience with Hive
      – Derive various insights from a real-world dataset using Hive
  4. Need for Hive
      "Do I need to learn Java?"
      "Don't worry! I am here to rescue you."
  5. Need for Hive contd.
      – In general, one MR (MapReduce) job does not suffice to derive BI (Business Intelligence)
      – Oftentimes a series of complex MR jobs has to be chained together (advanced data processing)
      [Figure: six chained jobs, MR 1 to MR 6, each made up of mapper tasks and reducer tasks]
  6. Need for Hive contd.
      – About 20 lines of Hive code can correspond to ~200 lines of Java code
      – Lowers the development time significantly (~16 times)
      [Figure: two bar charts comparing Hadoop and Pig, one labeled "code" and one labeled "time" (minutes)]
  7. Need for Hive contd.
      – You focus only on the "WHAT" part of your data analysis
      – The "HOW" part is taken care of by the framework
  8. Hive powered by
      – Used for processing large amounts of user data and central to meeting the company's reporting needs
      – Data analytics and data cleaning
      – Ad hoc queries, reporting, and analytics
      – And many more…
      https://cwiki.apache.org/confluence/display/Hive/PoweredBy
  9. What is Hive?
      – Data warehouse built on top of Hadoop
      – Provides an SQL-like interface to analyze data
      – An open source project under Apache
      – Works on the high-throughput, high-latency principle (same as Hadoop)
      – Ability to plug in custom MapReduce programs
      – Mainly targeted at structured data
      – Hides the complexities of MapReduce programs from the end user
  10. Hive Architecture
      [Architecture diagram]
      – HIVE: CLI, Web Interface, Hive Thrift Server (Python, Perl, ODBC clients), Metastore, Driver (Compiler, Optimizer, Plan executor)
      – HADOOP: MapReduce, HDFS
  11. Metastore
      – Stores metadata of tables such as database location, owner, creation time, access attributes, table schema, etc.
      – Comprises two components: 1) service, 2) data storage
      Deployment modes (from the slide diagram):
      – Embedded metastore: driver and metastore service inside the Hive service, backed by Derby
      – Local metastore: driver and metastore service inside the Hive service, backed by MySQL
      – Remote metastore: driver in the Hive service, separate metastore server, backed by MySQL
  12. Hive Query Life cycle: Insight
  13. Hive Query Life cycle contd.
      [Figure: numbered flow (steps 1 to 14) between the Hive Interface, Driver, Compiler (Parser, Semantic Analyzer, Logical plan generator, Optimizer, Physical plan generator), Metastore, Execution Engine, and Hadoop MapReduce]
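One way to see the outcome of the compile stages described in the life cycle slide is Hive's EXPLAIN statement. The sketch below assumes the `sample` table defined later in the deck; the exact shape of the output varies between Hive versions, so none is shown here.

```sql
-- Ask Hive for the plan instead of running the query.
-- The output lists the stage dependencies and the operators in each stage.
EXPLAIN
SELECT state, COUNT(*)
FROM sample
GROUP BY state;
```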
  14. Data Models
      – Database: holds the namespace for tables
      – Table: container of actual data
      Example table "sample" with columns: Id, Name, Age, Sex, State
      In the Hive warehouse, the table is stored as a folder: /user/$USER/warehouse/sample
  15. Data Models contd.
      – Partition: horizontal slice of a table by a partition key
      – Suppose the sample table (Id, Name, Age, Sex, State) is partitioned by the State column
      – Each partition is stored as a subfolder under the sample directory:
      /user/$USER/warehouse/sample/State=AL/
      /user/$USER/warehouse/sample/State=NC/
      /user/$USER/warehouse/sample/State=GA/
      /user/$USER/warehouse/sample/State=ND/
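The partition layout above can be inspected and exploited from HiveQL. The following is a minimal sketch, assuming the `sample` table is partitioned by `state` as in the DDL slide; the partition values are the ones used on this slide.

```sql
-- List the partitions (one per subfolder) that the metastore knows about.
SHOW PARTITIONS sample;

-- A filter on the partition column lets Hive read only the matching subfolder.
SELECT id, name
FROM sample
WHERE state = 'AL';

-- Register a new, empty partition explicitly.
ALTER TABLE sample ADD IF NOT EXISTS PARTITION (state = 'NC');
```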
  16. Data Models contd.
      – Bucket: divides the data into further chunks by another column, for sampling
      – Suppose the sample table is partitioned by the 'State' column and clustered by the 'Age' column into 2 buckets
      – In the warehouse, the data is stored as:
      /user/$USER/warehouse/sample/State=AL/part-00000
      /user/$USER/warehouse/sample/State=AL/part-00001
      /user/$USER/warehouse/sample/State=GA/part-00000
      /user/$USER/warehouse/sample/State=GA/part-00001
      ...
      /user/$USER/warehouse/sample/State=ND/part-00000
      /user/$USER/warehouse/sample/State=ND/part-00001
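As an illustration of the bucketing described above, here is a hedged sketch. The table name `sample_bucketed` is hypothetical, and the `hive.enforce.bucketing` setting is an assumption that applies to older Hive releases (newer releases always enforce bucketing and may reject the property).

```sql
-- Hypothetical bucketed variant of the sample table:
-- partitioned by state, clustered by age into 2 buckets.
CREATE TABLE sample_bucketed (id INT, name STRING, age INT, sex STRING)
PARTITIONED BY (state STRING)
CLUSTERED BY (age) INTO 2 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Assumption: only needed (and only accepted) on older Hive releases.
SET hive.enforce.bucketing = true;

-- Populate one partition; Hive hashes age to pick the bucket file.
INSERT OVERWRITE TABLE sample_bucketed PARTITION (state = 'AL')
SELECT id, name, age, sex
FROM sample
WHERE state = 'AL';

-- Sampling: read only the first of the two buckets.
SELECT *
FROM sample_bucketed TABLESAMPLE (BUCKET 1 OUT OF 2 ON age) s
WHERE s.state = 'AL';
```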
  17. Data Loading Techniques
      Managed table: table managed by the Hive warehouse
      – 1) Copy a file from the local file system into the Hive warehouse (local FS → Hive warehouse)
      – 2) Bring a file from HDFS into the Hive warehouse (HDFS file → Hive warehouse); within HDFS, Hive moves the file rather than copying it
  18. Data Loading Techniques contd.
      External table: table that is just referenced by the Hive warehouse
      – 3) The file is managed directly in HDFS, without copying it into the Hive warehouse (HDFS file ← referenced by Hive warehouse)
  19. Data Loading Techniques contd.
      Discussion: when to go for an external table and when for a managed table?
  20. Question – 01
      In which scenario would you use Hive?
      1. Completely unstructured nasty data
      2. Structured data
      3. Any kind of data
      4. None of the above
  21. Question – 01 answer
      2. Hive is mainly used to analyze structured data. Typically, Hive runs on data generated by a MapReduce job or by Pig.
  22. Question – 02
      Which option is not correct about Metastore?
      1. It stores the table location
      2. It has information about the number of partitions and the number of buckets
      3. It can give you the time at which the table was created
      4. It stores the actual data
  23. Question – 02 answer
      4. Metastore stores only the metadata. Actual data is stored in HDFS.
  24. Question – 03 (last question)
      What is incorrect about Hive?
      1. Hive internally generates MapReduce jobs to serve your query
      2. Hive runs on top of HDFS
      3. Hive is proprietary software
      4. Hive supports multiple interfaces to interact with
  25. Question – 03 answer
      3. Hive is open source, not proprietary software. The Hive community is growing very rapidly.
  26. Hive Query Language (HiveQL)
      – Data types: provide types for variables
      – DDL: provides a way to define databases, tables, etc.
      – DML: provides a way to modify content
      – Query statements: provide a way to retrieve the content
  27. Data types
      Primitive types:
      – Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)
      – Booleans: BOOLEAN (TRUE or FALSE)
      – Floating point numbers: FLOAT (4 bytes), DOUBLE (8 bytes)
      – String: STRING (sequence of characters)
      Usage: variable_name <Data Type>, ex: name STRING
  28. Data types contd.
      Complex types:
      – ARRAY: collection of multiple values of the same data type
        Usage: name ARRAY<type>, ex: marks ARRAY<INT>
      – STRUCT: collection of multiple values of different data types
        Usage: name STRUCT<field1:type1, field2:type2, field3:type3, …>, ex: record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>
      – MAP: collection of (key, value) pairs
        Usage: name MAP<key_type, value_type>, ex: score MAP<STRING, INT>
  29. Data types contd.
      – The key of a MAP must be a primitive type
      – Referencing complex types, using the previous examples:
        marks ARRAY<INT>
        record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>
        score MAP<STRING, INT>
        SELECT marks[0], record.name, score['joe']
      – A complex type inside a complex type is allowed, e.g. an array inside a struct (as seen before)
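To make the referencing rules above concrete, here is a small sketch. The table name `students` is hypothetical; the three columns are the ones from the slide, and the default row format is assumed.

```sql
-- Hypothetical table holding the three complex-typed columns from the slide.
CREATE TABLE students (
  marks  ARRAY<INT>,
  record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>,
  score  MAP<STRING, INT>
);

SELECT
  marks[0],          -- array element by zero-based index
  record.name,       -- struct field by dot notation
  record.marks[1],   -- array nested inside a struct
  score['joe'],      -- map value by key
  size(marks)        -- built-in function: number of elements in the array
FROM students;
```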
  30. DDL
      CREATE TABLE sample(id INT, name STRING, age INT, sex STRING)
      COMMENT 'This is a sample table'
      PARTITIONED BY (state STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
      – Column list: the schema. The partition column state is declared only in PARTITIONED BY, because Hive rejects a column that appears in both places
      – COMMENT: comment for readability
      – PARTITIONED BY: partition data by the state column
      – ROW FORMAT DELIMITED: rows are delimited by '\n'
      – FIELDS TERMINATED BY ',': fields are terminated by ','
      – STORED AS TEXTFILE: store the file as a text file
      The table is created in the warehouse directory and is completely managed by Hive.
      A specific row format and file format can be expressed with a custom SerDe.
  31. SerDe
      SerDe stands for Serializer and Deserializer.
      – Deserializer (read path): HDFS file → InputFileFormat → <Key, Value> → Deserializer → Row
      – Serializer (write path): Row → Serializer → <Key, Value> → OutputFileFormat → HDFS file
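As a sketch of how a SerDe is plugged into a table definition, the example below uses the regex-based SerDe that ships with recent Hive releases. The table name, columns, and regular expression are hypothetical; this SerDe only deserializes (reads) and expects STRING columns, so treat it as an illustration rather than a recommended setup.

```sql
-- Hypothetical table whose rows are three space-separated tokens.
-- Each capture group in input.regex becomes one column, in order.
CREATE TABLE raw_events (
  host   STRING,
  action STRING,
  status STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]+) ([^ ]+) ([^ ]+)"
)
STORED AS TEXTFILE;
```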
  32. DDL contd.
      CREATE EXTERNAL TABLE external_sample(id INT, name STRING, age INT, sex STRING, state STRING)
      LOCATION '/user/department/sample';
      – The table is not created in the warehouse directory; it is just referenced by Hive
      – The referenced data is in HDFS, under the path /user/department/sample
  33. DDL contd.
      DROP TABLE sample;
      – Since the sample table is managed by Hive, this deletes the entire data along with the metadata
      DROP TABLE external_sample;
      – Since the external_sample table is *not* managed by Hive, this deletes only the metadata, leaving the actual data untouched
  34. DML
      Load data into a managed table from the local file system:
      LOAD DATA LOCAL INPATH '/home/hive/sample.txt' INTO TABLE sample;
      – The file '/home/hive/sample.txt' is in the local file system
      – It is copied into the Hive warehouse folder
      Load data into a managed table from HDFS:
      LOAD DATA INPATH '/user/hive/sample.txt' INTO TABLE sample;
      – The file '/user/hive/sample.txt' is in HDFS
      – It is moved (not copied) into the Hive warehouse folder
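If `sample` is partitioned by `state`, as in the DDL slide, LOAD DATA needs to be told which partition the file belongs to. A minimal sketch follows; the file name is hypothetical and is assumed to contain only rows for Alabama, without a state field.

```sql
-- Load a local file into one partition, replacing that partition's contents.
LOAD DATA LOCAL INPATH '/home/hive/sample_AL.txt'
OVERWRITE INTO TABLE sample
PARTITION (state = 'AL');
```

Without OVERWRITE, the file is added alongside whatever the partition already holds.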
  35. DML contd.
      Insert results into an existing table:
      INSERT OVERWRITE TABLE newsample SELECT * FROM sample;
      – The newsample table must be created beforehand
      – The results of the select query are loaded into newsample, overwriting its contents
      Create a new table with an automatically derived schema:
      CREATE TABLE newsample AS SELECT * FROM sample;
      – Creates the newsample table with an automatically derived schema
      – The query results are populated into it
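When the target table is partitioned, INSERT can also create the partitions from the data itself. The sketch below assumes the partitioned `sample` table and the `external_sample` table from the DDL slides; the two SET lines are the usual switches for dynamic partitioning, and whether they are required depends on the Hive version and its defaults.

```sql
-- Allow Hive to derive partition values from the query output.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column (state) must come last in the SELECT list.
INSERT OVERWRITE TABLE sample PARTITION (state)
SELECT id, name, age, sex, state
FROM external_sample;
```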
  36. Query statements
      To list available databases:
      SHOW DATABASES;
      To use a particular database:
      USE <databasename>;
      To list all tables available in a database:
      SHOW TABLES;
  37. Query statements contd.
      Select:
      SELECT * FROM sample;
      Aggregation functions:
      SELECT COUNT(DISTINCT state) FROM sample;
      Group by, Sort by, Order by:
      SELECT COUNT(*) FROM sample GROUP BY state;
      SELECT * FROM sample SORT BY id DESC;
      FROM sample SELECT * ORDER BY id ASC;
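The difference between SORT BY and ORDER BY is easy to miss in the one-line examples above. The following sketch, run against the same `sample` table, spells it out; DISTRIBUTE BY is not on the slide and is added here only to show how SORT BY is commonly paired.

```sql
-- ORDER BY: one total order over the whole result (final sort on a single reducer).
SELECT id, name FROM sample ORDER BY id ASC;

-- SORT BY: each reducer sorts its own output; there is no global order.
SELECT id, name FROM sample SORT BY id DESC;

-- DISTRIBUTE BY + SORT BY: rows of one state go to the same reducer,
-- and each reducer's output is sorted by state, then id.
SELECT id, name, state
FROM sample
DISTRIBUTE BY state
SORT BY state, id;

-- GROUP BY with the grouping key in the output, so each count is labeled.
SELECT state, COUNT(*) AS people
FROM sample
GROUP BY state;
```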
  38. Query statements contd.
      Joins:
      SELECT s.*, o.* FROM sample s JOIN orders o ON (s.id = o.id);
      – Left and right outer joins are also supported
      – Multiple joins in one query are accepted
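The outer-join variants mentioned above follow the same pattern as the inner join. A short sketch using the same two tables (the `orders` table is the one assumed by the slide; its columns beyond `id` are not known here):

```sql
-- Keep every row of sample, with NULLs where no order matches.
SELECT s.*, o.*
FROM sample s
LEFT OUTER JOIN orders o ON (s.id = o.id);

-- Keep every row of orders, with NULLs where no sample row matches.
SELECT s.*, o.*
FROM sample s
RIGHT OUTER JOIN orders o ON (s.id = o.id);
```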
  39. Custom Functions
      UDF:
      – User defined function
      – Complex/additional logic can be expressed
      – Operates row by row
      UDAF:
      – User defined aggregate function
      – Custom aggregation logic can be written
      – Operates on the groups produced by the GROUP BY clause
      UDTF:
      – User defined table-generating function
      – Takes a single input row and can produce multiple output rows
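The three kinds of function can be tried out with Hive's built-ins before writing custom ones. In the sketch below, `upper`, `avg`, `split`, and `explode` are built-in functions; the JAR path, the function name `my_lower`, and the class `com.example.hive.udf.MyLower` are hypothetical placeholders for a custom UDF.

```sql
-- UDF: one output value per input row.
SELECT upper(name) FROM sample;

-- UDAF: one output value per group.
SELECT state, avg(age) FROM sample GROUP BY state;

-- UDTF: one input row can yield several output rows.
-- Here each name is split on spaces and every word becomes its own row.
SELECT name, word
FROM sample
LATERAL VIEW explode(split(name, ' ')) w AS word;

-- Registering a custom UDF (hypothetical JAR and class names).
ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
SELECT my_lower(name) FROM sample;
```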
  40. Hive Limitations
      – Not suitable for unstructured data
      – Suited to OLAP (analysis) systems rather than transactional ones
      – Representing machine learning algorithms can be a challenging task
      – Performance tradeoff compared with hand-written MR programs in various scenarios; the gap is narrowing from release to release
  41. Important practical tips
      – Hive logs: /tmp/$USER/hive.log
      – To list the available functions: SHOW FUNCTIONS
      – To get help on a specific function: DESCRIBE FUNCTION <function_name>
      – Config files are in the /usr/lib/hive/conf folder: hive-site.xml, hive-default.xml
      – A script file, which may include SET commands, can be run with the -f option
      – Parameters can also be set within the Hive session with SET
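The tips above can be tried directly at the Hive prompt. A short sketch follows; `upper` is a built-in function, and `mapred.reduce.tasks` is used only as an example of a session parameter (newer Hadoop versions prefer a different name for it).

```sql
-- List all available functions, then get help on one of them.
SHOW FUNCTIONS;
DESCRIBE FUNCTION upper;
DESCRIBE FUNCTION EXTENDED upper;

-- Print the current value of a parameter, then override it for this session.
SET mapred.reduce.tasks;
SET mapred.reduce.tasks = 4;

-- Print all Hive and Hadoop settings.
SET -v;
```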
  42. References
      – Tom White. Hadoop: The Definitive Guide
      – https://cwiki.apache.org/confluence/display/Hive/Home
      – http://www.sfbayacm.org/wp/wp-content/uploads/2010/01/sig_2010_v21.pdf
      – Jason Venner (2009). Pro Hadoop
      – http://hortonworks.com/big-data-insights/how-facebook-uses-hadoop-and-hive/
  43. Q/A
  45. Backup slides
  46. Schema on Read (?)
      [To do] where to put this slide?
      – Explain what schema on read is
      – Explain what schema on write is
      – Advantages of using schema on read: faster load time; impacts query time
