Content
What you need to know about Hive at a very high level:
✦ Architecture
✦ Workflow
✦ Schema on Read Approach
✦ Functions
✦ Join Strategies
HIVE: Architecture
Refer: https://www.tutorialspoint.com/hive/hive_introduction.htm
HIVE: Workflow
Refer: https://www.tutorialspoint.com/hive/hive_introduction.htm
HIVE: Schema on Read Approach
Lets the user redefine tables to match the data without touching the data, unlike MySQL's Schema on Write approach. (Reference: https://www.marklogic.com/blog/schema-on-read-vs-schema-on-write/)
There is no predetermined structure, so the data can be presented in the schema that is most relevant to the task at hand.
The upfront modeling exercise disappears.
Hive uses serialization and deserialization adapters (SerDes) to let the user do this. Because the schema is applied each time the data is read, Hive isn't intended for online tasks requiring heavy read/write traffic.
HIVE: Serialization/Deserialization
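To make the schema-on-read idea concrete, here is a minimal Python sketch of a deserializer that applies a schema only when the data is read. It is not Hive's SerDe API; `read_with_schema`, the sample data, and both schemas are hypothetical names chosen for illustration.

```python
import csv
import io

# Raw data as it sits in storage; it is never rewritten.
RAW = "1,alice,2021-03-04\n2,bob,2021-03-05\n"

def read_with_schema(raw, schema):
    """Deserialize each raw line into a dict using (name, converter) pairs.

    Stands in for a deserializer: the schema is applied only at read time,
    and fields beyond the end of the schema are simply ignored.
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        rows.append({name: conv(value)
                     for (name, conv), value in zip(schema, fields)})
    return rows

# Two hypothetical table definitions over the same bytes.
users_schema = [("id", int), ("name", str), ("signup_date", str)]
ids_only_schema = [("user_id", str)]

users = read_with_schema(RAW, users_schema)
ids_only = read_with_schema(RAW, ids_only_schema)
print(users)
print(ids_only)
```

Both calls read the same unchanged input; only the table definition differs, which is the contrast with a schema-on-write store that validates and converts data at load time.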
HIVE: Functions
Besides built-in functions, Hive has three types of user-defined function APIs, plus macros:
Built-In Functions
UDF (User Defined Function, a normal function): takes one or more columns from a row as arguments and returns a single value or object. Eg: concat(arg1, arg2)
UDTF (User Defined Table Function): takes zero or more inputs and produces multiple columns or rows of output. Eg: explode()
UDAF (User Defined Aggregate Function): takes values from multiple rows and returns a single aggregated value. Eg: sum()
Macros: functions defined in terms of other Hive functions.
Reference:
https://www.qubole.com/resources/hive-function-cheat-sheet/
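The difference between the three function types is mostly about how many rows go in and how many come out. The sketch below shows those shapes in plain Python. These are stand-ins, not Hive's Java UDF interfaces, and all function names and sample rows are hypothetical.

```python
from typing import Iterable, Iterator, List

def concat_udf(a: str, b: str) -> str:
    """UDF shape: column values from one row in, one value out."""
    return a + b

def explode_udtf(items: List[str]) -> Iterator[str]:
    """UDTF shape: one input value in, zero or more output rows out."""
    for item in items:
        yield item

def sum_udaf(values: Iterable[int]) -> int:
    """UDAF shape: values from many rows in, one aggregated value out."""
    total = 0
    for v in values:
        total += v
    return total

rows = [
    {"first": "a", "last": "b", "tags": ["x", "y"], "n": 1},
    {"first": "c", "last": "d", "tags": [], "n": 2},
]

udf_out = [concat_udf(r["first"], r["last"]) for r in rows]    # one value per row
udtf_out = [t for r in rows for t in explode_udtf(r["tags"])]  # rows multiply or vanish
udaf_out = sum_udaf(r["n"] for r in rows)                      # all rows collapse to one
```

Note that the second row has an empty `tags` list, so the table-function stand-in emits no output rows for it.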
HIVE: Join
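Older Hive versions allow only equi-joins: the ON clause can contain only equality conditions (=), combined only with the AND operator.
Since Hive 2.2.0, complex expressions are also accepted in the ON clause.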
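One way to see why equality conditions combined with AND are convenient: they reduce to comparing a single composite key, which can be hashed. The Python sketch below illustrates that idea under assumed sample data; `equi_join` and the table contents are hypothetical and not part of Hive.

```python
from collections import defaultdict

def equi_join(left, right, keys):
    """Inner join on equality of every column in `keys` (conditions ANDed)."""
    # Index the right side by its composite key.
    index = defaultdict(list)
    for r in right:
        index[tuple(r[k] for k in keys)].append(r)
    joined = []
    for l in left:
        # A left row matches only rows whose key tuple is identical.
        for r in index.get(tuple(l[k] for k in keys), []):
            joined.append({**l, **r})
    return joined

orders = [
    {"cust": 1, "region": "eu", "amount": 10},
    {"cust": 2, "region": "us", "amount": 20},
]
customers = [
    {"cust": 1, "region": "eu", "name": "alice"},
    {"cust": 2, "region": "eu", "name": "bob"},
]

# Equivalent in spirit to: ON o.cust = c.cust AND o.region = c.region
result = equi_join(orders, customers, ["cust", "region"])
```

A condition such as `<` or an OR between conditions has no single key to hash on, so it cannot be handled by this kind of lookup.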
HIVE: Join Strategies
Map-Reduce Join
Map Side Join (join during map phase)
Reduce Side Join (join during reduce phase)
Hive Shuffle Join
Hive Map-Side Join (Broadcast Join)
Hive Bucket Join
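As a rough illustration of the reduce-side (shuffle) strategy listed above, the following Python sketch simulates the three phases in memory. It assumes a single join key and uses hypothetical names (`map_phase`, `shuffle`, `reduce_phase`) and sample tables. It is not how Hive is implemented.

```python
from collections import defaultdict
from itertools import product

def map_phase(table_name, rows, key):
    """Emit (join_key, (source_tag, row)) pairs, as a mapper would."""
    for row in rows:
        yield row[key], (table_name, row)

def shuffle(pairs, num_reducers):
    """Route every pair to a reducer by hashing its key, grouping by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, tagged in pairs:
        partitions[hash(key) % num_reducers][key].append(tagged)
    return partitions

def reduce_phase(partition, left_name):
    """Within one reducer, pair left and right rows that share a key."""
    for tagged_rows in partition.values():
        left = [r for tag, r in tagged_rows if tag == left_name]
        right = [r for tag, r in tagged_rows if tag != left_name]
        for l, r in product(left, right):
            yield {**l, **r}

orders = [{"cust": 1, "amount": 10}, {"cust": 2, "amount": 20}, {"cust": 1, "amount": 5}]
customers = [{"cust": 1, "name": "alice"}, {"cust": 3, "name": "carol"}]

pairs = list(map_phase("orders", orders, "cust")) + list(map_phase("customers", customers, "cust"))
joined = [row
          for part in shuffle(pairs, num_reducers=2)
          for row in reduce_phase(part, "orders")]
```

The join itself happens only after all rows with the same key have been moved to the same reducer. A broadcast (map-side) join avoids that movement by giving every mapper a full copy of the small table instead.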
HIVE: Map-Side Join
Joins the records by key while reading the input files
Highly constrained:
Both tables should be sorted on the same join key
Both tables should have the same number of partitions
Usually achieved when both input tables were created by (different) MapReduce jobs using the same number of reducers and the same (join) key.
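The constraints above are what make a merge during the read possible: if partition i of both tables holds the same key range in the same sorted order, each mapper can join its pair of partitions in one forward pass. A minimal Python sketch, with hypothetical function names and sample partitions (split here by `k % 2`):

```python
def merge_join(left, right, key):
    """Join two lists that are already sorted on `key` in one forward pass."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair that run with every left row carrying the same key.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out

def map_side_join(left_parts, right_parts, key):
    """Join partition i of one table with partition i of the other."""
    if len(left_parts) != len(right_parts):
        raise ValueError("both tables need the same number of partitions")
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(merge_join(lp, rp, key))
    return out

left_parts = [[{"k": 2, "a": "l2"}, {"k": 4, "a": "l4"}],
              [{"k": 1, "a": "l1"}, {"k": 3, "a": "l3"}]]
right_parts = [[{"k": 2, "b": "r2"}, {"k": 2, "b": "r2b"}],
               [{"k": 3, "b": "r3"}, {"k": 5, "b": "r5"}]]

joined = map_side_join(left_parts, right_parts, "k")
```

If either table were unsorted or partitioned differently, matching keys could sit in different partitions and this per-partition pass would miss them, which is why the strategy is so constrained.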

Apache Hive
