© Copyright 2015 Hitachi Consulting
Hive with HDInsight
Big Data Warehousing
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK
We Make it Happen. Better.
This is not a tutorial!
The purpose of these slides is to give you an overview of Hive's features and
capabilities, and to get you excited to learn more about it.
Outline
 What is Hive?
 Hive Architecture & Components
 Hive vs. RDBMS
 Getting Started
 Loading Data into Hive (Introducing Sqoop)
 Data Definition Language – Creating Hive Tables
 Data Manipulation Language – Processing Data in Hive
 Data Querying Language – Querying Data in Hive
 ETL and Automation (Introducing Oozie)
Hive fundamentals
What is Hive?
Apache Hive is a data warehouse system for Hadoop: a metadata service that projects
tabular schemas over the data files in HDFS folders, enabling the contents of those
folders to be queried and processed as tables using an SQL-like query language (HiveQL).
 Initially developed by Facebook in 2007 to enable developers to process data in Hadoop using
their SQL scripting skills.
 HiveQL looks a lot like ANSI SQL; if you know Transact-SQL, you'll feel comfortable learning Hive.
 Queries are translated into MapReduce, Tez, or Spark jobs.
 Results look like a standard relational database row set, and various vendors have created ODBC
drivers that interact with Hive results.
Why Hive?
Hadoop is great, but MapReduce is very low level and lacks expressiveness, so
higher-level data processing languages are needed.
Instead of you writing MapReduce code in Java, Hive does this dirty work for you, so
you can focus on the processing logic itself as SQL script.
Hive is best suited for data warehouse applications, where large datasets are
processed in batch mode, rather than for OLTP.
Why Hive?
MapReduce vs Hive – Word Count
Given a set of documents, count the occurrences of
each word in all the documents, and order by word
occurrence frequency
Why Hive?
MapReduce vs Hive – Word Count

MapReduce Java:

import java.io.IOException;        // needed by the map/reduce method signatures
import java.util.StringTokenizer;  // needed to tokenize each input line

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in every input line
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the 1s emitted for each word
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configure and submit the job
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

Hive:

CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
Hive and the Hadoop Ecosystem
Hive and the zoo…
[Diagram: the Hadoop stack. HDFS (a NameNode plus DataNodes 1..N) at the bottom,
YARN (Yet Another Resource Negotiator) above it, and application layers on top:
batch, in-memory, streaming, SQL, NoSQL, machine learning, search, orchestration,
management, and acquisition.]
Hive Architecture and Components
Hive and the zoo…
[Diagram: Hive sits on top of YARN and HDFS (NameNode plus DataNodes 1..N).
Client interfaces: the Command Line Interface (CLI), the Hive Web Interface (HWI),
and JDBC/ODBC via the Thrift server. Core services: the Metastore and the
Compiler, Optimizer, and Executor.]
Hive Architecture and Components
Tell me more…
 Metastore: a traditional RDBMS (MySQL/Postgres in Hadoop, Azure SQL Database in HDInsight)
that stores the system catalog: metadata about databases, tables, columns, partitions, etc.
 Optimizer: performs column pruning, partition elimination, and index utilization.
 Compiler: converts HiveQL into a directed acyclic graph of MapReduce, Tez, or Spark tasks,
based on the execution engine.
 Executor: submits the tasks produced by the compiler to YARN in the proper dependency order,
and interacts with the underlying Hadoop instance.
 HiveServer: provides a Thrift interface and a JDBC/ODBC server, enabling Hive integration
with other applications.
 Client components: the Command Line Interface (CLI), the Web UI, and the JDBC/ODBC driver.
Hive Architecture and Components
Recent Hive releases

Hive 0.14
 Cost-based optimizer for star and bushy join queries
 Temporary tables
 Transactions with ACID semantics
 Spark as an execution engine

Hive 0.13
 Hive on Tez, vectorized query engine, and cost-based optimizer
 Dynamic partition loads and smaller hash tables
 CHAR & DECIMAL data types, subqueries for IN/NOT IN

Hive 0.12
 Vectorized query engine & ORCFile
 Support for VARCHAR and DATE semantics
 Windowing, analytics, and enhanced aggregation functions

The Stinger initiative ("Enterprise SQL at Hadoop Scale"):
Hive 0.10 (BATCH: read-only data, HiveQL, MR) ->
Hive 0.13 (INTERACTIVE: read-only data, HiveQL++, MR or Tez) ->
Hive 0.14 (SUB-SECOND: modify with transactions, HiveQL+++, MR, Tez, or Spark)
Hive Architecture and Components
Hive execution – MapReduce vs Tez
Tez is a framework for building high-performance batch and interactive data processing
applications, coordinated by YARN in Hadoop. It improves on MapReduce's speed while
maintaining its scalability.
[Diagram: in MapReduce, each map/reduce stage writes its intermediate output to HDFS
before the next stage reads it; in Tez, the mapper and reducer stages are chained into
a single DAG, avoiding the intermediate HDFS writes.]
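The engine Hive compiles to is a per-session setting; a minimal HiveQL sketch, using the hive.execution.engine property that also appears in the configuration list near the end of this deck (the InternetSales table is defined later in the DDL section):

SET hive.execution.engine=tez;  -- or mr | spark, where each engine is available
-- subsequent queries in this session are compiled into Tez DAGs
SELECT Country, COUNT(*) FROM InternetSales GROUP BY Country;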
Hive vs RDBMS
The face-off…
 Structure: RDBMS = schema on write; Hive = schema on read
 Underlying data: RDBMS = vendor specific; Hive = delimited text, XML, JSON, Avro,
RC/ORC, sequence files, and more
 Processing: RDBMS = transactional; Hive = batch
 Consistency: RDBMS = strong consistency (ACID); Hive = eventual consistency, plus
limited transaction options (using ZooKeeper)
 Locking: RDBMS = yes; Hive = table and partition level
 Access/results: RDBMS = SQL/relational datasets; Hive = SQL/relational datasets
 Indexes: RDBMS = yes; Hive = yes
 Updates: RDBMS = yes; Hive = yes (new in 0.14)
 Referential integrity: RDBMS = yes; Hive = no
 Stored procedures: RDBMS = yes; Hive = no
 User defined functions: RDBMS = yes; Hive = Java (UDF and UDAF)
Getting Started with Hive on Azure
Getting Started
Creating HDInsight Cluster
[Screenshots: provisioning a new HDInsight cluster in the Azure portal.]

Getting Started
Browsing Hive in HDInsight
[Screenshots: browsing Hive databases and tables from the HDInsight dashboard.]
Getting Started
Creating Hive Project in Visual Studio
 Install the Azure SDK for Visual Studio: https://azure.microsoft.com/en-gb/downloads/
 Create a Hive project
 Submit the Hive script
Loading Data into Hive
Loading Data into Hive
Hive folder structure
 Hive default root directory
   Database directory
     Table directory
       Partition (sub)directory(ies)
         Data files
Loading data into Hive is a matter of moving data files into this folder structure in HDFS!
Loading Data into Hive
Hive folder structure
 Move data into the Hive folder structure in Azure Blob Storage
(e.g., via AzCopy, PowerShell, the WASB APIs, or Azure Data Factory)
 HDFS commands to move data from the local file system to HDFS:
hadoop fs -copyFromLocal <srcLocalFilePath> <dstHdfsHiveDirectory>
 Hive LOAD command to load data from HDFS into the Hive folder structure:
LOAD DATA INPATH '<srcHdfsDirectory>' [OVERWRITE] INTO TABLE <hiveTable>;
 Hive LOAD command to load data from the local file system into the Hive folder structure:
LOAD DATA LOCAL INPATH '<srcLocalFilePath>' INTO TABLE <hiveTable>;
 Hive ODBC (e.g., via SSIS)
 Sqoop
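As a concrete sketch, loading one day's extract into a partition of the InternetSales table defined later in the deck (the path and partition values are illustrative):

-- Moves the files under the partition's folder; nothing is parsed or
-- validated at load time (schema on read)
LOAD DATA INPATH '/data/landing/internetsales/20160401'
OVERWRITE INTO TABLE InternetSales
PARTITION (Year = 2016, Month = 4, Day = 1);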
Loading Data into Hive
Introducing Sqoop: SQL to Hadoop
Sqoop is designed to efficiently transfer bulk data between Apache Hadoop
and structured data stores such as relational databases.
[Diagram: a Sqoop command is translated into parallel map tasks on the Hadoop
cluster, moving data between a relational database and HDFS, HBase, or Hive.]
Loading Data into Hive
Introducing Sqoop: Commands
 sqoop-import
 sqoop-import-all-tables
 sqoop-export
 sqoop-create-hive-table
 sqoop-list-databases
 sqoop-list-tables
 sqoop-help
Loading Data into Hive
Introducing Sqoop: Import
sqoop import
--connect "<jdbc connection string>"
--table <source table>
--query "<source SQL query>"
--where "<condition>"
--split-by <source table column>
--target-dir "<HDFS target directory>"
--compression-codec <Java class for compression>
--as-textfile

sqoop import
--connect "jdbc:sqlserver://HCBI;user=sa;password=password;database=AdventureWorksDW2012"
--query "SELECT * FROM vwInternetSalesFact WHERE YEAR(UpdateDate) = YEAR(GETDATE())
AND MONTH(UpdateDate) = MONTH(GETDATE()) AND DAY(UpdateDate) = DAY(GETDATE())"
--split-by storeId
--target-dir hive/warehouse/adventurework/factInternetSales/20160401
--compression-codec org.apache.hadoop.io.compress.GzipCodec
Data File Compression in Hadoop
 Reduces the space needed to store files
 Speeds up data transfer across the network, and to or from disk

Compression | Extension | Codec | Splittable | Efficacy (compression ratio) | Speed (to compress)
DEFLATE | .deflate | org.apache.hadoop.io.compress.DefaultCodec | No | Medium | Medium
gzip | .gz | org.apache.hadoop.io.compress.GzipCodec | No | Medium | Medium
bzip2 | .bz2 | org.apache.hadoop.io.compress.BZip2Codec | Yes | High | Low
LZO | .lzo | com.hadoop.compression.lzo.LzopCodec | Yes* | Low | High
LZ4 | .lz4 | org.apache.hadoop.io.compress.Lz4Codec | No | Low | High
Snappy | .snappy | org.apache.hadoop.io.compress.SnappyCodec | No | Low | High

 Split the file into chunks at the source (chunk size ≈ HDFS block),
and compress each chunk separately using a medium speed/efficiency compression
 For large files, use splittable compressions
 For non-splittable compressed file formats (Avro, ORC, RCFile, etc.), use high-speed compression
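A minimal sketch of turning compression on from HiveQL, using properties that also appear in the configuration list at the end of this deck (the codec choices are illustrative):

-- Compress intermediate map output: favour a high-speed codec
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Compress the final output written to HDFS: favour a higher compression ratio
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;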
Loading Data into Hive
SSIS with Hive – just a peek
1. Install the Hive ODBC driver: http://www.microsoft.com/en-gb/download/details.aspx?id=40886
2. Create a new data source name (DSN) using the installed Hive ODBC driver
3. Configure the DSN to connect to your Hadoop cluster
4. Create your Data Flow task in SSIS
Hive DDL
HiveQL – Data Definition Language
Show me what you got…
SHOW DATABASES;
SHOW TABLES;
DESCRIBE <TableName>;  lists column names and data types
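For example, a quick exploratory session (the database name is illustrative):

SHOW DATABASES;          -- databases registered in the metastore
USE adventureworks;      -- switch the current database
SHOW TABLES;             -- tables in that database
DESCRIBE InternetSales;  -- column names and data types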
HiveQL – Data Definition Language
Data Types
 Numeric types: { TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL }
 String types: { STRING, VARCHAR(length), CHAR }
 Date/time: { TIMESTAMP, DATE }
 Other: { BOOLEAN, BINARY }
 Complex types:
 ARRAY: list of values (primitive or complex), e.g. ARRAY<STRING>
 MAP: dictionary (key/value list), e.g. MAP<INT,STRING>
 STRUCT: object, e.g. STRUCT<name:STRING,age:INT>
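A minimal sketch of declaring and accessing the complex types (the table and columns are illustrative; arrays and maps are queried in depth in the DQL section):

CREATE TABLE EmployeeProfile (
  Identifier INT,
  Skills ARRAY<STRING>,                  -- e.g. array('sql','ssrs')
  Grades MAP<STRING,FLOAT>,              -- e.g. map('A', 0.3)
  Contact STRUCT<name:STRING, age:INT>
);

SELECT
  Skills[0],     -- arrays are zero-indexed
  Grades['A'],   -- maps are accessed by key
  Contact.name   -- struct fields use dot notation
FROM EmployeeProfile;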
HiveQL – Data Definition Language
Creating tables
The CREATE TABLE statement defines schema metadata to be projected onto the data in
a folder, as well as a file parsing mechanism, when the table is queried, by
specifying the following elements:
 Table type (external or not)
 Table name
 Column names & data types
 Table location (optional)
 Partitioning column (optional)
 Clustering & sorting columns (optional)
 Row format
 File format
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
OrderDate DATE,
SalesValue DOUBLE
);
Very cool indeed!
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (OrderDate DATE);
A sub-folder, inside the table folder, holds the transactional records for each date.
The partition column does not appear in the CREATE TABLE body, but it can be queried.
Partition by -> WHERE-clause column(s)
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT);
A sub-folder hierarchy, inside the table folder, organizes the transactional
records by year/month/day.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS;
Data in each partition folder is split into 32 files (buckets).
Records are clustered (distributed) across HDFS, and eventually across the
MapReduce tasks, by City; records with the same City end up in the same file.
Used to reduce data movement/shuffling in MapReduce jobs.
Cluster by -> join column(s)
Sort by -> group-by column(s)
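Bucketing also enables efficient sampling; a minimal sketch against the table above (reading one bucket scans roughly 1/32 of each partition):

SELECT City, SUM(SalesValue)
FROM InternetSales TABLESAMPLE(BUCKET 1 OUT OF 32 ON City)
GROUP BY City;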
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT DELIMITED;
How each row will be serialized/deserialized (i.e., parsed) for reading/writing.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '^'
MAP KEYS TERMINATED BY '|'
LINES TERMINATED BY '\n';
More details about how the records are parsed.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '^'
MAP KEYS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
How the files are eventually stored in HDFS. In some cases, the STORED AS
type implies the row format.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '^'
MAP KEYS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/data/adventureworks/facts';
An explicit HDFS location in which the data for this table should be stored.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS TEXTFILE
LOCATION '/data/adventureworks/facts';
A pre-built CSV serializer/deserializer.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES (
"ProductSKU"="$.product_sku",
"ProductCategory"="$.product_category",
"City"="$.city",
"Country"="$.country",
"SalesValue"="$.sales_value" )
STORED AS TEXTFILE
LOCATION '/data/adventureworks/facts';
A pre-built JSON serializer/deserializer.
SerDe properties must be supplied to describe the JSON document format.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT SERDE 'hitachi.hive.parser.mainframeparser'
STORED AS TEXTFILE
LOCATION '/data/adventureworks/facts';
Custom Java parsers (SerDes) can be implemented to read/write any custom file format.
HiveQL – Data Definition Language
File Formats
(Size and query-performance rankings from the original table: 1 = best, 6 = worst.)

TEXTFILE: delimited text file; any platform; splittable. Size: 6, query performance: 5.
Syntax: STORED AS TEXTFILE

SEQUENCEFILE: binary key-value pairs; optimized for MapReduce; splittable. Size: 5, query performance: 4.
Syntax: STORED AS SEQUENCEFILE

PARQUET: columnar storage format (compressed, binary); non-splittable. Size: 2, query performance: 6.
Syntax: STORED AS PARQUET

RCFile: columnar storage format (compressed, binary); non-splittable. Size: 3, query performance: 2.
Syntax: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'

ORC: columnar storage format (highly compressed, binary); optimized for distinct selection
and aggregations; vectorization processes batches of up to 1,024 rows, each batch a column
vector; non-splittable. Size: 1, query performance: 1.
Syntax: STORED AS ORC

AVRO: serialization system with evolvable, schema-driven binary data; cross-platform
interoperability; non-splittable. Size: 4, query performance: 3.
Syntax: STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='<schemaFileLocation>')
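For example, a minimal sketch of an ORC version of the sales table ('orc.compress' is a standard ORC table property; the Snappy choice is illustrative):

CREATE TABLE InternetSales_orc
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');  -- codec applied to the ORC stripes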
HiveQL – Data Definition Language
Creating external tables
CREATE EXTERNAL TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
STORED AS SEQUENCEFILE
LOCATION '/data/adventureworks/facts';
 The data is maintained outside the Hive directory.
 The folder is not deleted when executing a DROP TABLE statement (only the metadata is
removed from the metastore).
 Usually used to bring data from a landing area (data lake) into Hive; in this case, the
ROW FORMAT and STORED AS specs should match the data files in the folder.
 Can also be used to spit out data from Hive directories to a shared area with a different
folder structure (as shown in the year/month/day example).
 It is useful to leave the data files where they are and create external tables over them
in Hive if they are going to be manipulated by other tools besides Hive (such as
MapReduce, Pig, etc.).
 LOCATION is mandatory for external tables.
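If such an external table were declared with PARTITIONED BY (Year INT, Month INT, Day INT), each landing folder would also need to be registered in the metastore before it can be queried; a minimal sketch (the path and partition values are illustrative):

-- Register one landing folder as a partition of the external table
ALTER TABLE InternetSales ADD PARTITION (Year = 2016, Month = 4, Day = 1)
LOCATION '/data/adventureworks/facts/2016/04/01';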
HiveQL – Data Definition Language
Creating a table like another table
DESCRIBE InternetSales;
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] InternetSales_Copy
LIKE InternetSales;
Creates a table with the same metadata structure as another table.
HiveQL – Data Definition Language
Bring it down…
ALTER TABLE <TableName> DROP PARTITION (<PartitionSpec>);
 deletes the partition folder and its metadata from the metastore
DROP TABLE <TableName>;
 deletes the table folder and its metadata from the metastore
DROP DATABASE <DatabaseName>;
 deletes the database folder and its metadata from the metastore
Hive DML
HiveQL – Data Manipulation Language
Create Table As Select (CTAS)
 Widely used in MPP systems
 Usually used as a replacement for MERGE statements, to avoid updates
 A common scenario is to truncate/drop and rebuild the information marts
(i.e., facts, dimensions, analytical datasets) each time from the
data warehouse (i.e., 3NF, Data Vault).
 This is where CTAS can be very useful.

DROP TABLE IF EXISTS FactTable;
CREATE TABLE FactTable AS --business logic
SELECT… FROM …JOIN… WHERE… GROUP BY… HAVING… UNION SELECT…
FROM StgTables;
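As a concrete sketch, rebuilding a small mart from the InternetSales table used throughout the deck (the names and logic are illustrative):

DROP TABLE IF EXISTS FactSalesByCountry;
CREATE TABLE FactSalesByCountry AS
SELECT Country, Year, SUM(SalesValue) AS TotalSales
FROM InternetSales
GROUP BY Country, Year;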
HiveQL – Data Manipulation Language
INSERT…
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';

Optimized multi-insert MapReduce translation, often used to populate fact and
dimension tables from one source table:
FROM staged_employees se
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * WHERE se.cnty = 'US' AND se.st = 'OR'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'CA')
SELECT * WHERE se.cnty = 'US' AND se.st = 'CA'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'IL')
SELECT * WHERE se.cnty = 'US' AND se.st = 'IL';

Dynamic partition inserts:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT..., se.cnty, se.st
FROM staged_employees se;
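Static and dynamic partition columns can also be mixed in one statement; a minimal sketch against the same source (static partition columns must precede dynamic ones):

INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state)   -- country fixed, state resolved per row
SELECT..., se.st
FROM staged_employees se
WHERE se.cnty = 'US';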
Hive DQL
HiveQL – Data Querying Language
Querying complex types
CREATE TABLE EmployeeSkills (
Identifier INT,
Skills ARRAY<STRING>
);
INSERT INTO EmployeeSkills VALUES (1, array('sql','ssrs'));
INSERT INTO EmployeeSkills VALUES (2, array('sql','ssrs','ssis'));
INSERT INTO EmployeeSkills VALUES (3, array('ssas','ssis'));

SELECT Identifier, concat_ws('|', skills[0], skills[size(skills)-1]) FROM EmployeeSkills;
 Returns the first and last skill, concatenated by a pipe.

SELECT * FROM EmployeeSkills
WHERE array_contains(skills,'ssrs') AND array_contains(skills,'sql');
 Returns employees having both 'ssrs' and 'sql' in their skills.

SELECT skill, count(skill) FROM
(SELECT explode(skills) AS skill FROM EmployeeSkills) AS skillsList
GROUP BY skill;
 explode converts the skills array into rows, one value each;
the query counts the occurrences of each skill.

SELECT Identifier, SkillList FROM EmployeeSkills
LATERAL VIEW explode(skills) subView AS SkillList;
 Lists the skills (one column, several rows) next to the identifier,
one row per skill.
HiveQL – Data Querying Language
Querying complex types
CREATE TABLE EmployeeProjects (
Identifier INT,
ProjectGrades MAP<STRING,FLOAT>
);
INSERT INTO TABLE EmployeeProjects
SELECT * FROM
(
SELECT 1 AS Identifier, MAP('A', CAST(0.3 AS FLOAT),'B', CAST(0.4 AS FLOAT)) AS ProjectGrade
UNION ALL
SELECT 2 AS Identifier, MAP('A', CAST(0.4 AS FLOAT),'C', CAST(0.5 AS FLOAT),'B', CAST(0.3 AS FLOAT)) AS ProjectGrade
UNION ALL
SELECT 3 AS Identifier, MAP('B', CAST(0.5 AS FLOAT),'C', CAST(0.2 AS FLOAT)) AS ProjectGrade
) AS query;

SELECT * FROM EmployeeProjects
WHERE ProjectGrades['A'] > 0.3;
 Retrieves all the employees that were on project 'A' with a grade greater than 0.3.
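Maps can be exploded just like arrays; a minimal sketch flattening the grades into one row per (employee, project) pair (the alias names are illustrative):

SELECT Identifier, Project, Grade
FROM EmployeeProjects
LATERAL VIEW explode(ProjectGrades) pg AS Project, Grade;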
HiveQL – Data Querying Language
Explain
Hive provides an EXPLAIN command that shows the execution plan for a query:
EXPLAIN
SELECT ProductName, SUM(LineItemAmount)
FROM InternetSales
WHERE Year = 2014
GROUP BY ProductName
HAVING COUNT(OrderId) > 10000
ORDER BY SUM(LineItemAmount);
ETL and Automation
ETL and Automation
A common pattern
 Usually the data is extracted from the sources, using Sqoop, Azure Data Factory,
or any other mechanism, and loaded "as is" into an area of HDFS (the landing area).
 External Hive tables are created to surface this data to the Hive system.
 A series of HiveQL scripts is executed to transform the data in the external Hive
tables and load it into a canonical data model.
 The data in the landing area can then be dropped (if it is not used by other tools).
 This process can be automated using Azure Data Factory, SSIS, Oozie, PowerShell, etc.
[Diagram: source apps land data in HDFS (data lake) landing directories; Hive external
tables sit over the landing directories; HiveQL loads the data warehouse (dimensional
model, Data Vault, etc.); orchestration via Oozie (Sqoop, Pig-Latin, HiveQL scripts),
Azure Data Factory (data copy activities / Hive jobs), or SSIS, PowerShell, and custom code.]
ETL and Automation
Introducing Oozie
 Oozie workflow document: an XML file defining the workflow actions
 Script files: files used by the workflow actions (e.g., HiveQL and Pig-Latin script files)
 The job.properties file: a configuration file setting parameter values
 Oozie jobs are scheduled using a Hadoop agent

Oozie workflow:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="MyWorkflow">
  <start to="FirstAction"/>
  <action name="FirstAction">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <script>CreateTable.hql</script>
      <param>TABLE_NAME=${tableName}</param>
      <param>LOCATION=${tableFolder}</param>
    </hive>
    <ok to="SecondAction"/>
    <error to="fail"/>
  </action>
  <action name="SecondAction">
    …
  </action>
  <kill name="fail">
    <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

HiveQL script (CreateTable.hql):
DROP TABLE IF EXISTS ${TABLE_NAME};
CREATE EXTERNAL TABLE ${TABLE_NAME}
(Col1 STRING,
Col2 FLOAT,
Col3 FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '${LOCATION}';

Job properties (job.properties):
nameNode=wasb://my_container@my_storage_account.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/example/workflow/
tableName=ExampleTable
tableFolder=/example/ExampleTable
ETL and Automation
PowerShell
 The New-AzureHDInsightHiveJobDefinition cmdlet
– Creates a job definition
– Use -Query for explicit HiveQL statements, or -File to reference a saved script
– Run the job with the Start-AzureHDInsightJob cmdlet
 The Invoke-Hive cmdlet
– Simpler syntax to run a HiveQL query and wait for the response
– Use -Query for explicit HiveQL statements, or -File to reference a saved script
ETL and Automation
.NET APIs

ETL and Automation
.NET APIs – Parallel Execution
DW build control table:
Step | Task | Description | HiveQL
1 | 1 | Build Dim 1 | Script 1
1 | 2 | Build Dim 2 | Script 2
1 | 3 | Build Dim 3 | Script 3
2 | 4 | Build Fact 1 | Script 4
2 | 5 | Build Fact 2 | Script 5
3 | 6 | Build Summary | Script 6
 Tasks in the same step can be executed in parallel
 Steps are executed sequentially
 The HiveQL column contains the Hive script that does the ETL
Can be hosted and scheduled using an Azure WebJob!
Other features
 Indexes
 Statistics (ANALYZE)
 Archiving
 UDFs and UDAFs written in Java
 Locks
 Authorization
 Hive transactions
 Streaming
 Accumulo & HBase integration
 HCatalog and WebHCat
Useful Hive Configurations
 SET hive.metastore.warehouse.dir = "<directory>";
 SET hive.cli.print.current.db = true | false;
 SET hive.cli.print.header=true | false;
 SET hive.execution.engine = mr | tez | spark;
 SET hive.exec.dynamic.partition = true | false;
 SET hive.exec.dynamic.partition.mode = strict | nonstrict;
 SET hive.exec.max.dynamic.partitions = <value>;
 SET hive.exec.max.created.files = <value>;
 SET hive.map.aggr=true | false;
 SET hive.auto.convert.join = true | false;
 SET hive.optimize.bucketmapjoin=true | false ;
 SET hive.exec.compress.intermediate=true | false;
 SET mapred.map.output.compression.codec=<compression codec>;
 SET hive.exec.compress.output=true | false;
 SET mapred.output.compression.codec = <compression codec>;
 SET hive.archive.enabled=true | false;
 SET hive.optimize.ppd.storage=true | false;
How to Get Started with Hive?
 Read the slides!
 Coursera – Big Data Specialization
https://www.coursera.org/specializations/big-data
 Azure Documentation – HDInsight Emulator
https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-hadoop-emulator-get-started
 MVA – Big Data Analytics with HDInsight: Hadoop on Azure
https://mva.microsoft.com/en-US/training-courses/big-data-analytics-with-hdinsight-hadoop-on-azure-10551
 MVA – Implementing Big Data Analysis
https://mva.microsoft.com/en-US/training-courses/implementing-big-data-analysis-8311?l=44REr2Yy_5404984382
 Azure Documentation – Getting Started with HDInsight
https://azure.microsoft.com/en-gb/documentation/services/hdinsight/
 Azure Documentation – How to Connect Excel to Windows Azure HDInsight via HiveODBC:
https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-connect-excel-hive-odbc-driver/
 Apache Sqoop
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html
 Apache Hive
https://cwiki.apache.org/confluence/display/Hive/Home
 O'Reilly Books – Programming Hive, 2nd Edition
Coming soon…
 NoSQL on Microsoft Azure
 Introduction to Spark on HDInsight
Stay tuned
My Background
Applying Computational Intelligence in Data Mining
• Honorary Research Fellow, School of Computing, University of Kent.
• Ph.D. Computer Science, University of Kent, Canterbury, UK.
• M.Sc. Computer Science, The American University in Cairo, Egypt.
• 25+ published journal and conference papers, focusing on:
– classification rules induction,
– decision trees construction,
– Bayesian classification modelling,
– data reduction,
– instance-based learning,
– evolving neural networks, and
– data clustering
• Journals: Swarm Intelligence, Swarm & Evolutionary Computation,
Applied Soft Computing, and Memetic Computing.
• Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio,
ECTA, IEEE WCCI and INNS-BigData.
ResearchGate.org
Thank you!
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 

Recently uploaded (20)

9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 

Hive Fundamentals and Architecture for Big Data Warehousing

  • 10. | © Copyright 2015 Hitachi Consulting10 Hadoop is great! MapReduce is very low level (lack of expressiveness). Higher-level data processing languages are needed. Instead of writing MapReduce code in Java, Hive does this dirty work for you, so you can focus on the processing logic itself as an SQL script. Hive is best suited for data warehouse applications, rather than OLTP, where large datasets are processed in batch mode. Why Hive? Apache HIVE is a data warehouse system for Hadoop
  • 11. | © Copyright 2015 Hitachi Consulting11 Why Hive? MapReduce vs Hive – Word Count Given a set of documents, count the occurrences of each word in all the documents, and order by word occurrence frequency
  • 12. | © Copyright 2015 Hitachi Consulting12 Why Hive? MapReduce vs Hive – Word Count import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.*; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; public class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); }}} public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); }} public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); } } MapReduce Java
  • 13. | © Copyright 2015 Hitachi Consulting13 Why Hive? MapReduce vs Hive – Word Count import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.*; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; public class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); }}} public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); }} public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); } } CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w GROUP BY word ORDER BY count DESC; MapReduce Java Hive
  • 14. | © Copyright 2015 Hitachi Consulting14 Hive and the Hadoop Ecosystem – Hive and the zoo… [Diagram: the Hadoop ecosystem – application workloads such as batch, in-memory, stream, SQL, NoSQL, machine learning, and search, plus orchestration, management, and acquisition tools, all running on YARN over HDFS (Name Node and Data Nodes 1..N).]
  • 15. | © Copyright 2015 Hitachi Consulting15 Hive Architecture and Components [Diagram: Hive sits on YARN and HDFS (Name Node, Data Nodes 1..N); its components include the Metastore, the Thrift server with JDBC/ODBC drivers, the Hive Web Interface (HWI), the Command Line Interface (CLI), and the Compiler, Optimizer, and Executor.]
  • 16. | © Copyright 2015 Hitachi Consulting16 Hive Architecture and Components – Tell me more…
    Metastore: a traditional RDBMS (MySQL/Postgres in Hadoop; Azure SQL DB in HDInsight) that stores the system catalog: metadata about databases, tables, columns, partitions, etc.
    Optimizer: performs column pruning, partition elimination, and index utilization
    Compiler: converts HiveQL into a directed acyclic graph of MapReduce, Tez, or Spark tasks, based on the execution engine
    Executor: submits the tasks produced by the compiler to YARN in a proper dependency order; interacts with the underlying Hadoop instance
    HiveServer: provides a Thrift interface and a JDBC/ODBC server; enables Hive integration with other applications
    Client components: Command Line Interface (CLI), Web UI, JDBC/ODBC driver
  • 17. | © Copyright 2015 Hitachi Consulting17 Hive Architecture and Components – Recent Hive releases
    Hive Version | Key Features
    0.14 | Cost-based optimizer for star and bushy join queries; temporary tables; transactions with ACID semantics; Spark as execution engine
    0.13 | Hive on Tez, vectorized query engine & cost-based optimizer; dynamic partition loads and smaller hash tables; CHAR & DECIMAL datatypes, subqueries for IN/NOT IN
    0.12 | Vectorized query engine & ORCFile; support for VARCHAR and DATE semantics; windowing, analytics, and enhanced aggregation functions
    [Diagram: the Stinger initiative – enterprise SQL at Hadoop scale: Hive 0.10 (batch: read-only data, HiveQL, MR) → Hive 0.13 (interactive: HiveQL++, MR/Tez) → Hive 0.14 (sub-second: modify with transactions, HiveQL+++, MR/Tez/Spark).]
  • 18. | © Copyright 2015 Hitachi Consulting18 Hive Architecture and Components Hive execution – MapReduce vs Tez Tez is a framework for building high performance batch and interactive data processing applications, coordinated by YARN in Hadoop. It improves on MapReduce's speed while maintaining its scalability.
  • 19. | © Copyright 2015 Hitachi Consulting19 Hive Architecture and Components Hive execution – MapReduce vs Tez. Tez is a framework for building high performance batch and interactive data processing applications, coordinated by YARN in Hadoop. It improves on MapReduce's speed while maintaining its scalability. [Diagram: a multi-stage MapReduce plan writes intermediate results to HDFS between successive map/reduce phases, whereas Tez runs the same mappers and reducers as a single DAG without the intermediate HDFS writes.]
  • 20. | © Copyright 2015 Hitachi Consulting20 Hive vs RDBMS – The face-off…
    Feature | RDBMS | Hive
    Structure | Schema on write | Schema on read
    Underlying Data | Vendor specific | Delimited text, XML, JSON, Avro, RC/ORC, Sequence files, +++
    Processing | Transactional | Batch
    Consistency | Strong consistency (ACID) | Eventual consistency + limited transaction options (using ZooKeeper)
    Locking | Yes | Table and partition
    Access/Results | SQL/relational datasets | SQL/relational datasets
    Indexes | Yes | Yes
    Updates | Yes | Yes (new in 0.14)
    Referential Integrity | Yes | No
    Stored Procedures | Yes | No
    User Defined Functions | Yes | Java (UDF and UDAF)
  • 21. | © Copyright 2015 Hitachi Consulting21 Getting Started with Hive on Azure
  • 22. | © Copyright 2015 Hitachi Consulting22 Getting Started Creating HDInsight Cluster
  • 23. | © Copyright 2015 Hitachi Consulting23 Getting Started Creating HDInsight Cluster
  • 24. | © Copyright 2015 Hitachi Consulting24 Getting Started Browsing Hive in HDInsight
  • 25. | © Copyright 2015 Hitachi Consulting25 Getting Started Browsing Hive in HDInsight
  • 26. | © Copyright 2015 Hitachi Consulting26 Getting Started Browsing Hive in HDInsight
  • 27. | © Copyright 2015 Hitachi Consulting27 Getting Started Creating Hive Project in Visual Studio  Install Azure SDK for Visual Studio https://azure.microsoft.com/en-gb/downloads/
  • 28. | © Copyright 2015 Hitachi Consulting28 Getting Started Creating Hive Project in Visual Studio  Install Azure SDK for Visual Studio https://azure.microsoft.com/en-gb/downloads/  Create a hive project
  • 29. | © Copyright 2015 Hitachi Consulting29 Getting Started Creating Hive Project in Visual Studio  Install Azure SDK for Visual Studio https://azure.microsoft.com/en-gb/downloads/  Create a hive project  Submit hive script
  • 30. | © Copyright 2015 Hitachi Consulting30 Getting Started Creating Hive Project in Visual Studio  Install Azure SDK for Visual Studio https://azure.microsoft.com/en-gb/downloads/  Create a hive project  Submit hive script
  • 31. | © Copyright 2015 Hitachi Consulting31 Loading Data into Hive
  • 32. | © Copyright 2015 Hitachi Consulting32 Loading Data into Hive – Hive folder structure [Hierarchy: Hive default root directory > database directory > table directory > partition (sub)directory(ies) > data files] Loading data into Hive is a matter of moving data files into this folder structure in HDFS!
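For example, one partition of the partitioned InternetSales table described later would typically live under a path like the following (the names are illustrative; the root is governed by hive.metastore.warehouse.dir):

    /hive/warehouse/adventureworks.db/internetsales/year=2015/month=12/day=01/000000_0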
  • 33. | © Copyright 2015 Hitachi Consulting33 Loading Data into Hive – Hive folder structure
     Move data to the Hive folder structure in Azure Blob Storage (e.g. via AzCopy, PowerShell, WASB APIs or Azure Data Factory)
     HDFS commands to move data from the local file system to HDFS: hadoop fs -copyFromLocal <srcLocalFilePath> <dstHdfsHiveDirectory>
     Hive LOAD command to load data from HDFS into the Hive folder structure: LOAD DATA INPATH '<srcHdfsDirectory>' [OVERWRITE] INTO TABLE <hiveTable>;
     Hive LOAD command to load data from the local file system into the Hive folder structure: LOAD DATA LOCAL INPATH '<srcLocalFilePath>' INTO TABLE <hiveTable>;
     Hive ODBC (e.g. via SSIS)
     Sqoop
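As a minimal sketch of the LOAD path (the table and directory names are hypothetical):

    LOAD DATA INPATH '/staging/adventureworks/internetsales' OVERWRITE INTO TABLE InternetSales;

Note that LOAD DATA INPATH moves (rather than copies) the files under the table's directory, and no parsing or validation happens until the table is queried (schema on read).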
  • 34. | © Copyright 2015 Hitachi Consulting34 Sqoop is designed to efficiently transfer bulk data between Apache Hadoop and structured data stores such as relational databases. Loading Data into Hive – Introducing Sqoop: SQL to Hadoop
  • 35. | © Copyright 2015 Hitachi Consulting35 Sqoop is designed to efficiently transfer bulk data between Apache Hadoop and structured data stores such as relational databases. Loading Data into Hive – Introducing Sqoop: SQL to Hadoop [Diagram: a Sqoop command spawns parallel map tasks in Hadoop that move data between a relational database and HDFS/HBase/Hive.]
  • 36. | © Copyright 2015 Hitachi Consulting36  sqoop-import  sqoop-import-all-tables  sqoop-export  sqoop-create-hive-table  sqoop-list-databases  sqoop-list-tables  sqoop-help Loading Data into Hive Introducing Sqoop: Commands
  • 37. | © Copyright 2015 Hitachi Consulting37 Loading Data into Hive – Introducing Sqoop: Import sqoop import --connect "<JDBC connection string>" --table <source table> --query "<source SQL query>" --where "<condition>" --split-by <source table column> --target-dir "<HDFS target directory>" --compression-codec <Java class for compression> --as-textfile sqoop import --connect "jdbc:sqlserver://HCBI;user=sa;password=password;database=AdventureWorksDW2012" --query "SELECT * FROM vwInternetSalesFact WHERE YEAR(UpdateDate) = YEAR(GETDATE()) AND MONTH(UpdateDate) = MONTH(GETDATE()) AND DAY(UpdateDate) = DAY(GETDATE())" --split-by storeId --target-dir hive/warehouse/adventurework/factInternetSales/20160401 --compression-codec org.apache.hadoop.io.compress.GzipCodec
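Once the Sqoop import lands the files, an external Hive table can be projected over the target directory to make the extract queryable. A sketch, assuming a comma-delimited extract and hypothetical column names:

    CREATE EXTERNAL TABLE StgInternetSales (OrderId INT, StoreId INT, SalesValue DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/hive/warehouse/adventurework/factInternetSales/20160401';

Alternatively, Sqoop's --hive-import option can create and load the Hive table in one step.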
  • 38. | © Copyright 2015 Hitachi Consulting38 Data File Compression in Hadoop
     Reduces space needed to store files
     Speeds up data transfer across the network, or to and from disk
    Compression | Extension | Codec | Splittable | Efficacy (compression ratio) | Speed (to compress)
    DEFLATE | .deflate | org.apache.hadoop.io.compress.DefaultCodec | No | Medium | Medium
    gzip | .gz | org.apache.hadoop.io.compress.GzipCodec | No | Medium | Medium
    bzip2 | .bz2 | org.apache.hadoop.io.compress.BZip2Codec | Yes | High | Low
    LZO | .lzo | com.hadoop.compression.lzo.LzopCodec | Yes* | Low | High
    LZ4 | .lz4 | org.apache.hadoop.io.compress.Lz4Codec | No | Low | High
    Snappy | .snappy | org.apache.hadoop.io.compress.SnappyCodec | No | Low | High
  • 39. | © Copyright 2015 Hitachi Consulting39 Data File Compression in Hadoop (same table as above), plus guidance:
     Split the file into chunks from the source (chunk size ≈ HDFS block), and compress each chunk separately using medium speed/efficiency compression
     For large files, use splittable compressions
     For non-splittable compressed file formats (Avro, ORC, RCFile, etc.), use high speed compression
  • 40. | © Copyright 2015 Hitachi Consulting40 1. Install Hive ODBC driver http://www.microsoft.com/en-gb/download/details.aspx?id=40886 Loading Data into Hive SSIS with Hive – just a peek
  • 41. | © Copyright 2015 Hitachi Consulting41 1. Install Hive ODBC driver http://www.microsoft.com/en-gb/download/details.aspx?id=40886 2. Create a new data source name (DSN) using the installed Hive ODBC driver Loading Data into Hive SSIS with Hive – just a peek
  • 42. | © Copyright 2015 Hitachi Consulting42 1. Install Hive ODBC driver http://www.microsoft.com/en-gb/download/details.aspx?id=40886 2. Create a new data source name (DSN) using the installed Hive ODBC driver 3. Configure the DSN to connect to your Hadoop cluster Loading Data into Hive SSIS with Hive – just a peek
  • 43. | © Copyright 2015 Hitachi Consulting43 1. Install Hive ODBC driver http://www.microsoft.com/en-gb/download/details.aspx?id=40886 2. Create a new data source name (DSN) using the installed Hive ODBC driver 3. Configure the DSN to connect to your Hadoop cluster 4. Create your Data Flow task in SSIS Loading Data into Hive SSIS with Hive – just a peek
  • 44. | © Copyright 2015 Hitachi Consulting44 Hive DDL
  • 45. | © Copyright 2015 Hitachi Consulting45 HiveQL – Data Definition Language Show me what you got… SHOW DATABASES; SHOW TABLES; DESCRIBE <TableName>;  list column names and data types
  • 46. | © Copyright 2015 Hitachi Consulting46 HiveQL – Data Definition Language Data Types  Numeric Types: { TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL }  String Types: { STRING, VARCHAR(LENGTH), CHAR }  Date/Time: { TIMESTAMP, DATE }  Other: { BOOLEAN, BINARY }  Complex Types:  ARRAY: List of values (primitive or complex) – e.g.: ARRAY<STRING>  MAP: Dictionary (key/value list) – e.g.: MAP<INT,STRING>  STRUCT: Object – e.g.: STRUCT<name:STRING,age:INT>
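A small sketch tying the complex types together (the table and field names are hypothetical):

    CREATE TABLE Customers (
      Id INT,
      PhoneNumbers ARRAY<STRING>,
      Preferences MAP<STRING,STRING>,
      Address STRUCT<street:STRING, city:STRING, postcode:STRING>
    );

    -- Index into the array, look up a map key, and dereference a struct field
    SELECT Id, PhoneNumbers[0], Preferences['contact'], Address.city FROM Customers;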
  • 47. | © Copyright 2015 Hitachi Consulting47 HiveQL – Data Definition Language Creating tables CREATE TABLE statement defines schema metadata to be projected onto data in a folder - as well as a file parsing mechanism - when the table is queried, by specifying the following elements:  Table Type (External or not)  Table Name  Column Names & Data Types  Table Location (Optional)  Partitioning Column (Optional)  Clustering & Sorting Columns(Optional)  Row Format  File Format
  • 48. | © Copyright 2015 Hitachi Consulting48 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, OrderDate DATE, SalesValue DOUBLE )
  • 49. | © Copyright 2015 Hitachi Consulting49 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, OrderDate DATE, SalesValue DOUBLE ) Very cool indeed!
  • 50. | © Copyright 2015 Hitachi Consulting50 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (OrderDate DATE) A sub-folder - in the table folder - for the transactional records of each date. The partition column does not appear in the create table body, but can be queried. Partitioned by -> a WHERE clause column
  • 51. | © Copyright 2015 Hitachi Consulting51 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) A sub-folder hierarchy – in the table folder – to organize the transactional records by year/month/day
  • 52. | © Copyright 2015 Hitachi Consulting52 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS Data in each partition folder is split into 32 files (buckets). Records are clustered (distributed) across HDFS, and eventually to the MapReduce tasks, by City; that is, records with the same City will end up in the same file. Used to reduce data movement/shuffling in MapReduce jobs. Cluster by -> join column(s). Sort by -> group by column(s)
  • 53. | © Copyright 2015 Hitachi Consulting53 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT DELIMITED How the row will be serialized/deserialized (i.e., parsed) for reading/writing
  • 54. | © Copyright 2015 Hitachi Consulting54 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '^' MAP KEYS TERMINATED BY '|' LINES TERMINATED BY '\n' More details about how the records are parsed
  • 55. | © Copyright 2015 Hitachi Consulting55 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '^' MAP KEYS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE How the files are eventually stored in HDFS. In some cases, the STORED AS type implies the row format
  • 56. | © Copyright 2015 Hitachi Consulting56 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '^' MAP KEYS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/data/adventureworks/facts' Explicit location in which the data for this table should be stored in HDFS
  • 57. | © Copyright 2015 Hitachi Consulting57 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde' STORED AS TEXTFILE LOCATION '/data/adventureworks/facts' A pre-built CSV serializer/deserializer
  • 58. | © Copyright 2015 Hitachi Consulting58 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' WITH SERDEPROPERTIES ( "ProductSKU"="$.product_sku", "ProductCategory"="$.product_category", "City"="$.city", "Country"="$.country", "SalesValue"="$.sales_value" ) STORED AS TEXTFILE LOCATION '/data/adventureworks/facts' A pre-built JSON serializer/deserializer. SerDe properties have to be supplied to describe the JSON doc format
  • 59. | © Copyright 2015 Hitachi Consulting59 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT SERDE 'hitachi.hive.parser.mainframeparser' STORED AS TEXTFILE LOCATION '/data/adventureworks/facts' Custom Java parsers (SerDes) can be implemented to read/write any custom file format
  • 60. | © Copyright 2015 Hitachi Consulting60 HiveQL – Data Definition Language File Formats
    Format | Description | Size | Query Performance | Syntax
    TEXTFILE | Delimited text file; any platform; split-able | 6 | 5 | STORED AS TEXTFILE
    SEQUENCEFILE | Binary key-value pairs; optimized for MapReduce; split-able | 5 | 4 | STORED AS SEQUENCEFILE
    PARQUET | Columnar storage format (compressed, binary); non-split-able | 2 | 6 | STORED AS PARQUET
    RCFile | Columnar storage format (compressed, binary); non-split-able | 3 | 2 | ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
    ORC | Columnar storage format (highly compressed, binary); optimized for distinct selection and aggregations; vectorization: processes a batch of up to 1,024 rows together, each batch a column vector; non-split-able | 1 | 1 | STORED AS ORC
    AVRO | Serialization system with evolvable schema-driven binary data; cross-platform interoperability; non-split-able | 4 | 3 | STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ('avro.schema.url'='<schemaFileLocation>')
    (Size and Query Performance are rankings; 1 = best.)
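As an illustration of why the format choice matters, converting an existing text-backed table to ORC is a single CTAS statement; a sketch (the target table name is hypothetical, and 'orc.compress'='SNAPPY' assumes the Snappy codec is available on the cluster):

    CREATE TABLE InternetSales_orc
    STORED AS ORC
    TBLPROPERTIES ('orc.compress'='SNAPPY')
    AS SELECT * FROM InternetSales;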
  • 61. | © Copyright 2015 Hitachi Consulting61 HiveQL – Data Definition Language Creating external tables CREATE EXTERNAL TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) STORED AS SEQUENCEFILE LOCATION '/data/adventureworks/facts' Maintained outside the Hive directory. The folder is not deleted when executing a DROP TABLE statement (only metadata is removed from the metastore). Usually used to bring data from a landing area (data lake) into Hive; in this case, the row format and STORED AS specs should match the data files in the folder. Can also be used to spit out data from Hive directories to a shared area with a different folder structure (as shown in the year/month/day example). It is useful to leave the data files where they are and create external tables in Hive if they are going to be manipulated by other tools besides Hive (such as MapReduce, Pig, etc.). Location is mandatory for external tables
  • 62. | © Copyright 2015 Hitachi Consulting62 HiveQL – Data Definition Language Creating a table like another table DESCRIBE InternetSales CREATE [TEMPORARY] TABLE [IF NOT EXISTS] InternetSales_Copy LIKE InternetSales Creates a table with the same metadata structure as another table
  • 63. | © Copyright 2015 Hitachi Consulting63 HiveQL – Data Definition Language Bring it down… ALTER TABLE <TableName> DROP PARTITION (<partitionSpec>);  deletes the partition folder and its metadata from the metastore DROP TABLE <TableName>;  deletes the table folder and its metadata from the metastore DROP DATABASE <DatabaseName>;  deletes the database folder and its metadata from the metastore
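For instance, removing a single day from the partitioned InternetSales table above would look like this (the partition spec is illustrative):

    ALTER TABLE InternetSales DROP IF EXISTS PARTITION (Year=2015, Month=12, Day=1);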
  • 64. | © Copyright 2015 Hitachi Consulting64 Hive DML
  • 65. | © Copyright 2015 Hitachi Consulting65 HiveQL – Data Manipulation Language Create Table As (CTAS)  Widely used in MPP systems  Usually used as a replacement for MERGE statements, to avoid updates  A common scenario is to truncate/drop and re-build the information marts (i.e., facts/dimensions/analytical datasets) each time from the data warehouse (i.e., 3NF, Data Vault).  This is where CTAS can be very useful.
  • 66. | © Copyright 2015 Hitachi Consulting66 HiveQL – Data Manipulation Language Create Table As (CTAS)  Widely used in MPP systems  Usually used as a replacement for MERGE statements, to avoid updates  A common scenario is to truncate/drop and re-build the information marts (i.e., facts/dimensions/analytical datasets) each time from the data warehouse (i.e., 3NF, Data Vault).  This is where CTAS can be very useful. DROP TABLE IF EXISTS FactTable; CREATE TABLE FactTable AS --business logic SELECT… FROM …JOIN… WHERE… GROUP BY… HAVING… UNION SELECT… FROM StgTables
  • 67. | © Copyright 2015 Hitachi Consulting67 HiveQL – Data Manipulation Language Create Table As (CTAS)  Widely used in MPP systems  Usually used as a replacement for MERGE statements, to avoid updates  A common scenario is to truncate/drop and re-build the information marts (i.e., facts/dimensions/analytical datasets) each time from the data warehouse (i.e., 3NF, Data Vault).  This is where CTAS can be very useful. DROP TABLE IF EXISTS FactTable; CREATE TABLE FactTable AS --business logic SELECT… FROM …JOIN… WHERE… GROUP BY… HAVING… UNION SELECT… FROM StgTables
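A concrete minimal sketch of the CTAS pattern (all table and column names are hypothetical):

    DROP TABLE IF EXISTS FactInternetSales;
    CREATE TABLE FactInternetSales
    STORED AS ORC
    AS
    SELECT p.ProductCategory, c.Country, SUM(s.SalesValue) AS SalesValue
    FROM StgSales s
    JOIN StgProducts p ON s.ProductSKU = p.ProductSKU
    JOIN StgCustomers c ON s.CustomerId = c.CustomerId
    GROUP BY p.ProductCategory, c.Country;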
  • 68. | © Copyright 2015 Hitachi Consulting68 HiveQL – Data Manipulation Language INSERT… INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR') SELECT * FROM staged_employees se WHERE se.cnty = 'US' AND se.st = 'OR';
  • 69. | © Copyright 2015 Hitachi Consulting69 HiveQL – Data Manipulation Language INSERT… INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR') SELECT * FROM staged_employees se WHERE se.cnty = 'US' AND se.st = 'OR'; FROM staged_employees se INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR') SELECT * WHERE se.cnty = 'US' AND se.st = 'OR' INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'CA') SELECT * WHERE se.cnty = 'US' AND se.st = 'CA' INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'IL') SELECT * WHERE se.cnty = 'US' AND se.st = 'IL'; Optimized MapReduce translation. Often used to populate fact and dimension tables' data from one source table
  • 70. | © Copyright 2015 Hitachi Consulting70 HiveQL – Data Manipulation Language INSERT… INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR') SELECT * FROM staged_employees se WHERE se.cnty = 'US' AND se.st = 'OR'; FROM staged_employees se INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR') SELECT * WHERE se.cnty = 'US' AND se.st = 'OR' INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'CA') SELECT * WHERE se.cnty = 'US' AND se.st = 'CA' INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'IL') SELECT * WHERE se.cnty = 'US' AND se.st = 'IL'; INSERT OVERWRITE TABLE employees PARTITION (country, state) SELECT..., se.cnty, se.st FROM staged_employees se; Dynamic partition inserts set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict;
  • 71. | © Copyright 2015 Hitachi Consulting71 Hive DQL
  • 72. | © Copyright 2015 Hitachi Consulting72 HiveQL – Data Querying Language Querying complex types CREATE TABLE EmployeeSkills ( Identifier INT, Skills ARRAY<STRING> ) INSERT INTO EmployeeSkills VALUES (1, array('sql','ssrs')); INSERT INTO EmployeeSkills VALUES (2, array('sql','ssrs','ssis')); INSERT INTO EmployeeSkills VALUES (3, array('ssas','ssis'));
  • 73. | © Copyright 2015 Hitachi Consulting73 HiveQL – Data Querying Language Querying complex types CREATE TABLE EmployeeSkills ( Identifier INT, Skills ARRAY<STRING> ) INSERT INTO EmployeeSkills VALUES (1, array('sql','ssrs')); INSERT INTO EmployeeSkills VALUES (2, array('sql','ssrs','ssis')); INSERT INTO EmployeeSkills VALUES (3, array('ssas','ssis')); SELECT Identifier, concat_ws('|', skills[0], skills[size(skills)-1] ) FROM EmployeeSkills Returns the first and last skill concatenated by a pipe
  • 74. | © Copyright 2015 Hitachi Consulting74 HiveQL – Data Querying Language Querying complex types CREATE TABLE EmployeeSkills ( Identifier INT, Skills ARRAY<STRING> ) INSERT INTO EmployeeSkills VALUES (1, array('sql','ssrs')); INSERT INTO EmployeeSkills VALUES (2, array('sql','ssrs','ssis')); INSERT INTO EmployeeSkills VALUES (3, array('ssas','ssis')); SELECT Identifier, concat_ws('|', skills[0], skills[size(skills)-1] ) FROM EmployeeSkills SELECT * FROM EmployeeSkills WHERE array_contains(skills,'ssrs') AND array_contains(skills,'sql'); Returns the first and last skill concatenated by a pipe Returns employees having 'ssrs' and 'sql' in their skills
  • 75. | © Copyright 2015 Hitachi Consulting75 HiveQL – Data Querying Language Querying complex types CREATE TABLE EmployeeSkills ( Identifier INT, Skills ARRAY<STRING> ) INSERT INTO EmployeeSkills VALUES (1, array('sql','ssrs')); INSERT INTO EmployeeSkills VALUES (2, array('sql','ssrs','ssis')); INSERT INTO EmployeeSkills VALUES (3, array('ssas','ssis')); SELECT Identifier, concat_ws('|', skills[0], skills[size(skills)-1] ) FROM EmployeeSkills SELECT * FROM EmployeeSkills WHERE array_contains(skills,'ssrs') AND array_contains(skills,'sql'); SELECT skill, count(skill) FROM (SELECT explode(skills) AS skill FROM EmployeeSkills) skillsList GROUP BY skill Returns the first and last skill concatenated by a pipe Returns employees having 'ssrs' and 'sql' in their skills Explode => converts the skills array into rows, each with one value. Counts the occurrences of each skill
  • 76. | © Copyright 2015 Hitachi Consulting76 HiveQL – Data Querying Language Querying complex types CREATE TABLE EmployeeSkills ( Identifier INT, Skills ARRAY<STRING> ) INSERT INTO EmployeeSkills VALUES (1, array('sql','ssrs')); INSERT INTO EmployeeSkills VALUES (2, array('sql','ssrs','ssis')); INSERT INTO EmployeeSkills VALUES (3, array('ssas','ssis')); SELECT Identifier, concat_ws('|', skills[0], skills[size(skills)-1] ) FROM EmployeeSkills SELECT * FROM EmployeeSkills WHERE array_contains(skills,'ssrs') AND array_contains(skills,'sql'); SELECT skill, count(skill) FROM (SELECT explode(skills) AS skill FROM EmployeeSkills) skillsList GROUP BY skill SELECT Identifier, SkillList FROM EmployeeSkills LATERAL VIEW explode(skills) subView AS SkillList; Returns the first and last skill concatenated by a pipe Returns employees having 'ssrs' and 'sql' in their skills Explode => converts the skills array into rows, each with one value. Counts the occurrences of each skill Lists the skills (one column, several rows) beside the identifier
  • 77. | © Copyright 2015 Hitachi Consulting77 HiveQL – Data Querying Language Querying complex types CREATE TABLE EmployeeProjects ( Identifier INT, ProjectGrades MAP<STRING,FLOAT> ) INSERT INTO TABLE EmployeeProjects SELECT * FROM ( SELECT 1 AS Identifier, MAP('A', CAST(0.3 AS FLOAT),'B', CAST(0.4 AS FLOAT)) AS ProjectGrade UNION ALL SELECT 2 AS Identifier, MAP('A', CAST(0.4 AS FLOAT),'C', CAST(0.5 AS FLOAT),'B', CAST(0.3 AS FLOAT)) AS ProjectGrade UNION ALL SELECT 3 AS Identifier, MAP('B', CAST(0.5 AS FLOAT),'C', CAST(0.2 AS FLOAT)) AS ProjectGrade ) AS query; SELECT * FROM EmployeeProjects WHERE ProjectGrades['A'] > 0.3; Retrieves all the employees that were on project 'A' and had a grade greater than 0.3
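Other built-ins are handy with MAP columns; a short sketch against the same table (map_keys, map_values, and size are standard Hive functions):

    SELECT Identifier, size(ProjectGrades) AS ProjectCount, map_keys(ProjectGrades) AS Projects, map_values(ProjectGrades) AS Grades
    FROM EmployeeProjects;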
  • 78. | © Copyright 2015 Hitachi Consulting78 HiveQL – Data Querying Language Explain Hive provides an EXPLAIN command that shows the execution plan for a query EXPLAIN SELECT ProductName, SUM(LineItemAmount) FROM InternetSales WHERE Year = '2014' GROUP BY ProductName HAVING COUNT(OrderId) > 10000 ORDER BY SUM(LineItemAmount);
  • 79. | © Copyright 2015 Hitachi Consulting79 ETL and Automation
  • 80. | © Copyright 2015 Hitachi Consulting80 ETL and Automation A common pattern  Usually the data is extracted from the sources - using Sqoop, Azure Data Factory, or any other mechanism - and loaded "as is" to an area in HDFS (the landing area). HDFS (Data Lake) Sources Apps Landing Directories
  • 81. | © Copyright 2015 Hitachi Consulting81 ETL and Automation A common pattern  Usually the data is extracted from the sources - using Sqoop, Azure Data Factory, or any other mechanism - and loaded "as is" to an area in HDFS (the landing area).  External Hive tables are created to source this data into the Hive system. HDFS (Data Lake) Sources Apps Landing Directories Hive External Tables
  • 82. | © Copyright 2015 Hitachi Consulting82 ETL and Automation A common pattern  Usually the data is extracted from the sources - using Sqoop, Azure Data Factory, or any other mechanism - and loaded "as is" to an area in HDFS (the landing area).  External Hive tables are created to source this data into the Hive system.  A series of HiveQL scripts are executed to transform and load the data in the external Hive tables into a canonical data model.  The data in the landing area can then be dropped (if it is not used by other tools) HDFS (Data Lake) Sources Apps Landing Directories Hive External Tables Data Warehouse (Dim. model, Data Vault, etc.)
  • 83. | © Copyright 2015 Hitachi Consulting83 ETL and Automation A common pattern  Usually the data is extracted from the sources - using Sqoop, Azure Data Factory, or any other mechanism - and loaded "as is" to an area in HDFS (the landing area).  External Hive tables are created to source this data into the Hive system.  A series of HiveQL scripts are executed to transform and load the data in the external Hive tables into a canonical data model.  The data in the landing area can then be dropped (if it is not used by other tools)  This process can be automated using Azure Data Factory, SSIS, Oozie, PowerShell, etc.; a HiveQL sketch of the landing-to-warehouse step follows below. HDFS (Data Lake) Oozie (Sqoop, Pig-Latin, HiveQL scripts) Azure Data Factory (data copy activities/Hive jobs) SSIS – PowerShell – Custom Sources Apps Landing Directories Hive External Tables Data Warehouse (Dim. model, Data Vault, etc.)
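A minimal HiveQL sketch of the landing-to-warehouse step in this pattern (all table names, columns, and paths are hypothetical, and FactInternetSales is assumed to be partitioned by Year and Month):

    -- Project an external table over the landed extract (schema on read)
    CREATE EXTERNAL TABLE LandingSales (OrderId INT, ProductSKU INT, SalesValue DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/landing/adventureworks/sales';

    -- Transform and load into the warehouse model
    INSERT OVERWRITE TABLE FactInternetSales PARTITION (Year=2015, Month=12)
    SELECT OrderId, ProductSKU, SalesValue
    FROM LandingSales;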
  • 84. | © Copyright 2015 Hitachi Consulting84 ETL and Automation Introducing Oozie  Oozie Workflow Document - XML file defining workflow actions <workflow-app xmlns="uri:oozie:workflow:0.2" name="MyWorkflow"> <start to="FirstAction"/> <action name="FirstAction"> <hive xmlns="uri:oozie:hive-action:0.2"> <script>CreateTable.hql</script> <param>TABLE_NAME=${tableName}</param> <param>LOCATION=${tableFolder}</param> </hive> <ok to="SecondAction"/> <error to="fail"/> </action> <action name="SecondAction"> … </action> <kill name="fail"> <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end"/> </workflow-app> Oozie Workflow
  • 85. | © Copyright 2015 Hitachi Consulting85 ETL and Automation Introducing Oozie  Oozie Workflow Document - XML file defining workflow actions  Script files - Files used by workflow actions (e.g., HiveQL and Pig-Latin script file). <workflow-app xmlns="uri:oozie:workflow:0.2" name="MyWorkflow"> <start to="FirstAction"/> <action name="FirstAction"> <hive xmlns="uri:oozie:hive-action:0.2"> <script>CreateTable.hql</script> <param>TABLE_NAME=${tableName}</param> <param>LOCATION=${tableFolder}</param> </hive> <ok to="SecondAction"/> <error to="fail"/> </action> <action name="SecondAction"> … </action> <kill name="fail"> <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end"/> </workflow-app> DROP TABLE IF EXISTS ${TABLE_NAME}; CREATE EXTERNAL TABLE ${TABLE_NAME} (Col1 STRING, Col2 FLOAT, Col3 FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '${LOCATION}'; Oozie Workflow HiveQL Script
  • 86. | © Copyright 2015 Hitachi Consulting86 ETL and Automation Introducing Oozie  Oozie Workflow Document - XML file defining workflow actions  Script files - Files used by workflow actions (e.g., HiveQL and Pig-Latin script file).  The job.properties file - Configuration file setting parameter values  Oozie Jobs are scheduled using a Hadoop Agent <workflow-app xmlns="uri:oozie:workflow:0.2" name="MyWorkflow"> <start to="FirstAction"/> <action name="FirstAction"> <hive xmlns="uri:oozie:hive-action:0.2"> <script>CreateTable.hql</script> <param>TABLE_NAME=${tableName}</param> <param>LOCATION=${tableFolder}</param> </hive> <ok to="SecondAction"/> <error to="fail"/> </action> <action name="SecondAction"> … </action> <kill name="fail"> <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end"/> </workflow-app> DROP TABLE IF EXISTS ${TABLE_NAME}; CREATE EXTERNAL TABLE ${TABLE_NAME} (Col1 STRING, Col2 FLOAT, Col3 FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '${LOCATION}'; nameNode=wasb://my_container@my_storage_account.blob.core.windows.net jobTracker=jobtrackerhost:9010 queueName=default oozie.use.system.libpath=true oozie.wf.application.path=/example/workflow/ tableName=ExampleTable tableFolder=/example/ExampleTable Oozie Workflow HiveQL Script Job properties
  • 87. | © Copyright 2015 Hitachi Consulting87 ETL and Automation PowerShell  The New-AzureHDInsightHiveJobDefinition cmdlet – Create a job definition – Use Query for explicit HiveQL statements, or File to reference a saved script – Run the job with the Start-AzureHDInsightJob cmdlet  The Invoke-Hive cmdlet – Simpler syntax to run a HiveQL query and wait for the response – Use Query for explicit HiveQL statements, or File to reference a saved script
  • 88. | © Copyright 2015 Hitachi Consulting88 ETL and Automation .NET APIs
  • 89. | © Copyright 2015 Hitachi Consulting89 ETL and Automation .NET APIs – Parallel Execution DW build control table:
    Step | Task | Desc | HiveQL
    1 | 1 | Build Dim 1 | Script 1
    1 | 2 | Build Dim 2 | Script 2
    1 | 3 | Build Dim 3 | Script 3
    2 | 4 | Build Fact 1 | Script 4
    2 | 5 | Build Fact 2 | Script 5
    3 | 6 | Build Summary | Script 6
     Tasks in the same step can be executed in parallel
     Steps are executed sequentially
     The HiveQL column contains the Hive script that does the ETL
    Can be hosted and scheduled using an Azure WebJob!
  • 90. | © Copyright 2015 Hitachi Consulting90 Other features  Indexes  Statistics (ANALYZE)  Archiving  UDFs and UDAFs using Java  Locks  Authorization  Hive transactions  Streaming  Accumulo & HBase integration  HCatalog and WebHCatalog
  • 91. | © Copyright 2015 Hitachi Consulting91 Useful Hive Configurations
     SET hive.metastore.warehouse.dir = "<directory>";
     SET hive.cli.print.current.db = true | false;
     SET hive.cli.print.header=true | false;
     SET hive.execution.engine = mr | tez | spark;
     SET hive.exec.dynamic.partition = true | false;
     SET hive.exec.dynamic.partition.mode = strict | nonstrict;
     SET hive.exec.max.dynamic.partitions = <value>;
     SET hive.exec.max.created.files = <value>;
     SET hive.map.aggr=true | false;
     SET hive.auto.convert.join = true | false;
     SET hive.optimize.bucketmapjoin=true | false;
     SET hive.exec.compress.intermediate=true | false;
     SET mapred.map.output.compression.codec=<compression codec>;
     SET hive.exec.compress.output=true | false;
     SET mapred.output.compression.codec = <compression codec>;
     SET hive.archive.enabled=true | false;
     SET hive.optimize.ppd.storage=true | false;
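For example, a session might switch the execution engine and turn on output compression before running a query; a sketch (engine and codec availability depend on the cluster):

    SET hive.execution.engine=tez;
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    SELECT Country, SUM(SalesValue) FROM InternetSales GROUP BY Country;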
  • 92. | © Copyright 2015 Hitachi Consulting92 How to Get Started with Hive?  Read the slides!  Coursera – Big Data Specialization https://www.coursera.org/specializations/big-data  Azure Documentation – HDInsight Emulator https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-hadoop-emulator-get-started  MVA – Big Data Analytics with HDInsight: Hadoop on Azure https://mva.microsoft.com/en-US/training-courses/big-data-analytics-with-hdinsight-hadoop-on-azure-10551  MVA – Implementing Big Data Analysis https://mva.microsoft.com/en-US/training-courses/implementing-big-data-analysis-8311?l=44REr2Yy_5404984382  Azure Documentation – Getting Started with HDInsight https://azure.microsoft.com/en-gb/documentation/services/hdinsight/  Azure Documentation – How to Connect Excel to Windows Azure HDInsight via HiveODBC: https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-connect-excel-hive-odbc-driver/  Apache Sqoop https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html  Apache Hive https://cwiki.apache.org/confluence/display/Hive/Home  O'Reilly book – Programming Hive, 2nd Edition
  • 93. | © Copyright 2015 Hitachi Consulting93 Coming soon…  NoSQL on Microsoft Azure  Introduction to Spark on HDInsight Stay tuned
  • 94. | © Copyright 2015 Hitachi Consulting94 My Background Applying Computational Intelligence in Data Mining • Honorary Research Fellow, School of Computing, University of Kent. • Ph.D. Computer Science, University of Kent, Canterbury, UK. • M.Sc. Computer Science, The American University in Cairo, Egypt. • 25+ published journal and conference papers, focusing on: – classification rules induction, – decision trees construction, – Bayesian classification modelling, – data reduction, – instance-based learning, – evolving neural networks, and – data clustering • Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing. • Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI and INNS-BigData. ResearchGate.org
  • 95. | © Copyright 2015 Hitachi Consulting95 Thank you!