Hive

BALA KRISHNA G
Global Big Data Bootcamp – Jan 2014
(http://globalbigdataconference.com)

Global Big Data Conference - 2014
My introduction
Senior Software and Research Engineer
Big data trainer
More than 1.5 years of experience with Hadoop and Storm
Worked at several large companies, including Sun/Oracle and IBM

www.linkedin.com/in/gbalakrishna/
bala.gsbk@outlook.com

Speaker : Bala

Agenda
Class structure
– 1 hour lecture and 1 ½ hour lab

Lecture
– Need for Hive
– Hive history
– Hive powered by
– What is Hive?
– Hive Architecture
– Hive Query Life cycle
– Hive Query Language (HiveQL)

Lab:
– Extensive hands-on experience with Hive
– Derive various insights from a real-world dataset using Hive

Need for Hive

Do I need to learn Java?
Don’t worry! I am here to rescue you.

Need for Hive (contd.)
In general, a single MapReduce (MR) job is not sufficient to derive business intelligence (BI)
Advanced data processing often requires a series of complex MR jobs chained together
[Figure: a workflow of six chained jobs, MR 1 to MR 6. Legend: MR = MapReduce, mapper task, reducer task]

Need for Hive (contd.)
About 20 lines of Hive code can replace roughly 200 lines of Java code
Development time drops significantly (roughly 16 times)
[Figure: two bar charts comparing Hadoop and Pig, one for lines of code (axis 0 to 300) and one for development time in minutes (axis 0 to 300)]

Need for Hive (contd.)
You focus only on the “WHAT” part of your data analysis
The “HOW” part is taken care of by the framework

Hive powered by

Used to process large amounts of user data and central to meeting the company's reporting needs

Data analytics and data cleaning

Ad hoc queries, reporting, and analytics
And many more…
https://cwiki.apache.org/confluence/display/Hive/PoweredBy
What is Hive?

A data warehouse built on top of Hadoop
Provides an SQL-like interface for analyzing data
An open source Apache project
Designed for high throughput at the cost of high latency (same as Hadoop)
Custom MapReduce programs can be plugged in
Mainly targeted at structured data
Hides the complexity of MapReduce programs from the end user

Hive Architecture

[Figure: Hive architecture diagram. Components shown: CLI, web interface, Hive Thrift Server (with Python, Perl, and ODBC clients), Driver, Compiler, Optimizer, Plan executor, Metastore, and Hadoop (MapReduce and HDFS)]

Metastore
Stores table metadata such as database location, owner, creation time, access attributes, and table schema
Consists of two components: 1) a service and 2) data storage
Metastore configurations shown in the diagram:
– Embedded metastore: the driver and the metastore service run inside the Hive service, backed by Derby
– Local metastore: the driver and the metastore service run inside the Hive service, backed by MySQL
– Remote metastore: the driver talks to a separate metastore server, backed by MySQL

Hive Query Life cycle Insight

Hive Query Life cycle (contd.)
[Figure: numbered flow (steps 1 to 14) of a query through the Hive interface, Driver, Compiler (Parser, Semantic Analyzer, Logical plan generator, Optimizer, Physical plan generator), Metastore, Execution Engine, and Hadoop MapReduce]

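The output of the compiler stages can be inspected from the Hive prompt. The statements below are a minimal sketch, assuming a table named sample with a state column already exists. EXPLAIN prints the plan without running the query, and the exact output depends on the Hive version.

```sql
-- Print the query plan (stage dependencies and the operators in each stage)
-- without executing the query.
EXPLAIN
SELECT state, COUNT(*) AS people
FROM sample
GROUP BY state;

-- EXTENDED adds more detail about the same plan.
EXPLAIN EXTENDED
SELECT state, COUNT(*) AS people
FROM sample
GROUP BY state;
```

Comparing the plans of two equivalent queries is a simple way to see how a rewrite changes the generated MapReduce stages.
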
Data Models
Database: a namespace for tables
Table: a container for the actual data
Example table “sample” with columns Id, Name, Age, Sex, State
In the Hive warehouse, the table is stored as a folder
/user/$USER/warehouse/sample

Data Models (contd.)
Partition: a horizontal slice of a table, defined by a partition key
Suppose the sample table is partitioned by the State column
[Figure: the sample table (Id, Name, Age, Sex, State) sliced horizontally into Partition 1 and Partition 2]
Each partition is stored as a subfolder under the sample directory
/user/$USER/warehouse/sample/State=AL/
/user/$USER/warehouse/sample/State=NC/
/user/$USER/warehouse/sample/State=GA/
/user/$USER/warehouse/sample/State=ND/

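As a hedged sketch of how partitioning looks in HiveQL: the table name sample_by_state is hypothetical and used only for illustration. Note that the partition column is declared only in PARTITIONED BY, not in the regular column list.

```sql
-- Hypothetical table for illustration: state is a partition column,
-- so it appears only in the PARTITIONED BY clause.
CREATE TABLE sample_by_state (id INT, name STRING, age INT, sex STRING)
PARTITIONED BY (state STRING);

-- List the partitions (one subfolder per distinct value of state).
SHOW PARTITIONS sample_by_state;

-- Filtering on the partition key lets Hive read only the matching subfolder
-- instead of scanning the whole table.
SELECT id, name
FROM sample_by_state
WHERE state = 'AL';
```

Partitioning pays off when queries routinely filter on the partition key. A key with very many distinct values produces many small folders, which tends to hurt rather than help.
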
Data Models (contd.)
Bucket: divides the data into further chunks based on another column, which is useful for sampling
Suppose the sample table is partitioned by the ‘State’ column and clustered by the ‘Age’ column into 2 buckets
In the warehouse, the data is stored as
/user/$USER/warehouse/sample/State=AL/part-00000
/user/$USER/warehouse/sample/State=AL/part-00001
/user/$USER/warehouse/sample/State=GA/part-00000
/user/$USER/warehouse/sample/State=GA/part-00001
.
.
/user/$USER/warehouse/sample/State=ND/part-00000
/user/$USER/warehouse/sample/State=ND/part-00001

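A minimal sketch of bucketing and bucket-based sampling in HiveQL. The table name sample_bucketed is hypothetical; the column names follow the sample table used in these slides.

```sql
-- Hypothetical table: partitioned by state, then clustered by age into 2 buckets.
CREATE TABLE sample_bucketed (id INT, name STRING, age INT, sex STRING)
PARTITIONED BY (state STRING)
CLUSTERED BY (age) INTO 2 BUCKETS;

-- Read only the first of the two buckets (roughly half of the rows)
-- within the AL partition.
SELECT id, name, age
FROM sample_bucketed TABLESAMPLE(BUCKET 1 OUT OF 2 ON age) s
WHERE state = 'AL';
```

Because rows are assigned to buckets by hashing the clustering column, a sample taken this way is repeatable: the same bucket always contains the same rows.
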
Data Loading Techniques
Managed table: the table is managed by the Hive warehouse
– 1) Copy a file from the local file system into the Hive warehouse: local FS file → copy → Hive warehouse (in HDFS)
– 2) Move a file from HDFS into the Hive warehouse: HDFS file → move → Hive warehouse

Data Loading Techniques (contd.)
External table: the table is only referenced by the Hive warehouse
– 3) The file is managed directly in HDFS, without copying it into the Hive warehouse: HDFS file ← referenced by ← Hive warehouse

Data Loading Techniques (contd.)
When should you use an external table, and when a managed table?

Question - 01
In which scenario would you use Hive?
1. Completely unstructured nasty data
2. Structured data
3. Any kind of data
4. None of the above

Question – 01 answer

2. Hive is mainly used to analyze structured data. Typically, Hive runs on data generated by a MapReduce job or by Pig.

Question - 02
Which option is not correct about the Metastore?
1. It stores the table location
2. It has information about the number of partitions and the number of buckets
3. It can give you the time at which the table was created
4. It stores the actual data

Question – 02 answer

4. Metastore stores only the metadata.
Actual data is stored in HDFS.

Question – 03 (last question)
What is incorrect about Hive?
1. Hive internally generates MapReduce jobs to serve your query
2. Hive runs on top of HDFS
3. Hive is proprietary software
4. Hive supports multiple interfaces to interact with

Question – 03 answer

3. Hive is open source, not proprietary software. The Hive community is growing very rapidly.

Hive Query Language (HiveQL)
Data types – provide types for variables
DDL – provides a way to define databases, tables, etc.
DML – provides a way to modify content
Query statements – provide a way to retrieve content

Data types
Primitive Types
Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)
Booleans: BOOLEAN (TRUE or FALSE)
Floating point numbers: FLOAT (4 bytes), DOUBLE (8 bytes)
String: STRING (sequence of characters)
Usage: variable_name <Data Type>
ex: name STRING

Data types (contd.)
Complex Types
ARRAY: collection of multiple values of the same data type
Usage: name ARRAY<type>
ex: marks ARRAY<INT>
STRUCT: collection of multiple values of different data types
Usage: name STRUCT<field1:type1, field2:type2, field3:type3, …>
ex: record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>
MAP: collection of (key, value) pairs
Usage: name MAP<key type, value type>
ex: score MAP<STRING, INT>

Data types (contd.)
The key of a MAP must be a primitive type
Referencing complex types
Previous example:
– marks ARRAY<INT>
– record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>
– score MAP<STRING, INT>
SELECT marks[0], record.name, score['joe'] FROM <tablename>;
A complex type inside a complex type is allowed
– array inside a struct (as seen before)

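The declarations above can be put together in one table definition. This is a minimal sketch: the table name students is hypothetical, and the default row format is used so that no delimiter choices need to be assumed.

```sql
-- Hypothetical table combining the three complex types from the slides.
-- Struct fields are written as name:type.
CREATE TABLE students (
  marks  ARRAY<INT>,
  record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>,
  score  MAP<STRING, INT>
);

-- Array elements are addressed by index, struct fields with a dot,
-- and map values by key.
SELECT marks[0], record.name, score['joe']
FROM students;

-- Built-in helpers for collections: size() and map_keys().
SELECT size(marks), map_keys(score)
FROM students;
```
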
DDL
CREATE TABLE sample(id INT, name STRING, age INT, sex STRING)   -- schema
COMMENT 'This is a sample table'                                 -- comment for readability
PARTITIONED BY (state STRING)                                    -- partition data by the state column
ROW FORMAT DELIMITED                                             -- rows are delimited by '\n'
FIELDS TERMINATED BY ','                                         -- fields are terminated by ','
STORED AS TEXTFILE;                                              -- store the file as a text file
The partition column (state) is declared only in PARTITIONED BY; repeating it in the column list is an error
The table is created in the warehouse directory and is completely managed by Hive
A specific row format and file format can be expressed with a custom SerDe

SerDe
SerDe stands for Serializer and Deserializer
Deserializer (read path): HDFS file → InputFileFormat → <key, value> → Deserializer → row
Serializer (write path): row → Serializer → <key, value> → OutputFileFormat → HDFS file

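As a hedged illustration of plugging in a SerDe, the sketch below uses the RegexSerDe class that ships with Hive. The table name, the column names, and the regular expression are assumptions made for this example; it expects lines of three space-separated fields.

```sql
-- Hypothetical table: each input line holds three space-separated fields.
-- RegexSerDe maps each capture group of input.regex to one column,
-- and it requires every column to be of type STRING.
CREATE TABLE access_log (host STRING, request STRING, status STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*)"
)
STORED AS TEXTFILE;

-- The deserializer is applied when rows are read.
SELECT host, status
FROM access_log;
```
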
DDL (contd.)
CREATE EXTERNAL TABLE external_sample(id INT, name STRING,
age INT, sex STRING, state STRING)
LOCATION '/user/department/sample';

The table is not created in the warehouse directory; Hive only references it
The referenced data is in HDFS, at /user/department/sample

DDL (contd.)
DROP TABLE sample;
Since the sample table is managed by Hive, this deletes all the data along with the metadata
DROP TABLE external_sample;
Since the external_sample table is *not* managed by Hive, this deletes only the metadata and leaves the actual data untouched

DML
Load data into a managed table from the local file system
LOAD DATA LOCAL INPATH '/home/hive/sample.txt' INTO TABLE sample;
– The file ‘/home/hive/sample.txt’ is in the local file system
– It is copied into the Hive warehouse folder

Load data into a managed table from HDFS
LOAD DATA INPATH '/user/hive/sample.txt' INTO TABLE sample;
– The file ‘/user/hive/sample.txt’ is in HDFS
– It is moved (not copied) into the Hive warehouse folder

DML (contd.)
Insert query results into another table
INSERT OVERWRITE TABLE newsample
SELECT * FROM sample;
– The newsample table must be created beforehand
– The query results are loaded into newsample, overwriting its existing contents

Create a new table with an automatically derived schema
CREATE TABLE newsample
AS SELECT * FROM sample;
– Creates the newsample table with an automatically derived schema
– The query results are populated into it

Query statements
To list available databases
SHOW DATABASES;

To use a particular database
USE <databasename>;

To list all tables available in a database
SHOW TABLES;

Query statements (contd.)
Select
SELECT * FROM sample;

Aggregation functions
SELECT COUNT(DISTINCT state) FROM sample;

Group by, Sort by, Order by
SELECT COUNT(*) FROM sample GROUP BY state;
SELECT * FROM sample SORT BY id DESC;
FROM sample SELECT * ORDER BY id ASC;

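SORT BY and ORDER BY look alike but behave differently. The sketch below assumes the sample table from the earlier slides. ORDER BY produces one totally ordered result, while SORT BY orders rows only within each reducer.

```sql
-- ORDER BY: a total order over the whole result; the final sort
-- runs in a single reducer, which can be slow on large data.
SELECT id, name
FROM sample
ORDER BY id ASC;

-- SORT BY: rows are sorted within each reducer only, so the overall
-- output is not globally ordered when several reducers are used.
SELECT id, name, state
FROM sample
SORT BY id DESC;

-- DISTRIBUTE BY sends all rows with the same state to the same reducer;
-- SORT BY then orders the rows inside each reducer.
SELECT id, name, state
FROM sample
DISTRIBUTE BY state
SORT BY state, id;
```
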
Query statements (contd.)
Joins
SELECT s.*, o.*
FROM sample s
JOIN orders o
ON (s.id = o.id);

Left and right outer joins are also supported
Multiple joins in one query are accepted

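A short sketch of the other join forms mentioned above. The sample and orders tables come from the slide; the payments table and the column aliases are hypothetical and used only for illustration.

```sql
-- Left outer join: every row of sample is kept, with NULLs in the
-- orders columns where no matching order exists.
SELECT s.name, o.*
FROM sample s
LEFT OUTER JOIN orders o ON (s.id = o.id);

-- Multiple joins in one query (payments is a hypothetical third table).
SELECT s.name, o.id AS order_id, p.id AS payment_id
FROM sample s
JOIN orders o ON (s.id = o.id)
JOIN payments p ON (o.id = p.id);
```
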
Custom Functions
UDF:
– User-defined function
– Complex or additional logic can be expressed
– Operates row by row

UDAF:
– User-defined aggregate function
– Custom aggregation logic can be written
– Operates on the groups produced by the GROUP BY clause

UDTF:
– User-defined table-generating function
– Takes one input row and can emit multiple output rows

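Built-in functions show the three kinds in action. This is a hedged sketch: the students table (name STRING, marks ARRAY<INT>), the jar path, the function name my_lower, and the class com.example.hive.MyLower are all hypothetical.

```sql
-- UDF: upper() is applied to each row independently.
SELECT upper(name) FROM sample;

-- UDAF: avg() is computed once per group.
SELECT state, avg(age) FROM sample GROUP BY state;

-- UDTF: explode() emits one output row per array element.
-- Assumes a hypothetical table students(name STRING, marks ARRAY<INT>).
SELECT s.name, m.mark
FROM students s
LATERAL VIEW explode(s.marks) m AS mark;

-- Registering a custom UDF for the current session
-- (the jar path and class name are placeholders).
ADD JAR /tmp/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.MyLower';
SELECT my_lower(name) FROM sample;
```
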
Hive Limitations
Not suitable for unstructured data
Suited to OLAP (analysis) systems rather than transactional workloads
Expressing machine learning algorithms can be challenging
In various scenarios, performance is lower than that of hand-written MR programs
– The gap is narrowing from release to release

Important practical tips
Hive logs: /tmp/$USER/hive.log
To list the available functions: SHOW FUNCTIONS
To get help on a specific function: DESCRIBE FUNCTION <function_name>
Config files are in the /usr/lib/hive/conf folder
– hive-site.xml, hive-default.xml
– The -f option of the hive command runs statements from a script file

Parameters can be set within a Hive session with SET

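The tips above can be tried directly at the Hive prompt. The statements below are a minimal sketch; hive.cli.print.header is used only as an example parameter.

```sql
-- List all available functions.
SHOW FUNCTIONS;

-- Short and extended help for one function.
DESCRIBE FUNCTION upper;
DESCRIBE FUNCTION EXTENDED upper;

-- Print the current value of a parameter, then change it for this session.
SET hive.cli.print.header;
SET hive.cli.print.header=true;

-- List the configuration variables overridden in this session.
SET;
```

Values changed with SET last only for the current session; permanent settings belong in hive-site.xml.
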
Q/A

Backup slides

Schema on Read
What is schema on read?
What is schema on write?
Effects of using schema on read
– Faster load time
– Query time is impacted, because the schema is applied when the data is read

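Schema on read can be seen with an external table over files that already exist. In this hedged sketch, the table name raw_events and the HDFS path are hypothetical. No data is parsed or validated when the table is defined; the schema is applied only when a query reads the files.

```sql
-- Defining the table is fast: Hive records the schema and the location
-- in the metastore but does not read or check the files.
CREATE EXTERNAL TABLE raw_events (id INT, name STRING, age INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/department/raw_events';

-- The schema is applied at query time. A field that cannot be parsed
-- as the declared type (for example, text in the age column) is
-- returned as NULL instead of causing the load to fail.
SELECT id, name
FROM raw_events
WHERE age IS NULL;
```
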
