© Copyright 2015 Hitachi Consulting
Hive with HDInsight
Big Data Warehousing
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK
We Make it Happen. Better.
This is not a tutorial!
The purpose of these slides is to give you an overview of Hive's features and
capabilities, and to get you excited to learn more about it.
Outline
 What is Hive?
 Hive Architecture & Components
 Hive vs. RDBMS
 Getting Started
 Loading Data into Hive (Introducing Sqoop)
 Data Definition Language – Creating Hive Tables
 Data Manipulation Language – Processing Data in Hive
 Data Querying Language – Querying Data in Hive
 ETL and Automation (Introducing Oozie)
Hive fundamentals
What is Hive?
Apache Hive is a data warehouse system for Hadoop: a metadata service that projects
tabular schemas over the data files in HDFS folders, enabling the contents of those
folders to be queried and processed as tables using an SQL-like query language (HiveQL).
 Initially developed by Facebook in 2007 to enable developers to process data in Hadoop using
their SQL scripting skills.
 HiveQL looks a lot like ANSI SQL; if you know Transact-SQL, you'll feel comfortable learning Hive.
 Queries are translated into MapReduce, Tez, or Spark jobs.
 Results look like a standard relational database row set, and various vendors have created ODBC
drivers that interact with Hive results.
Why Hive?
Hadoop is great, but MapReduce is very low level and lacks expressiveness, so
higher-level data processing languages are needed.
Instead of you writing MapReduce code in Java, Hive does this dirty work for you, so
you can focus on the processing logic itself as SQL script.
Hive is best suited for data warehouse applications, where large datasets are
processed in batch mode, rather than for OLTP.
Why Hive?
MapReduce vs Hive – Word Count
Given a set of documents, count the occurrences of
each word in all the documents, and order by word
occurrence frequency
Why Hive?
MapReduce vs Hive – Word Count

MapReduce Java:

import java.io.IOException;        // needed by the map/reduce method signatures
import java.util.StringTokenizer;  // needed to tokenize each input line

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in every input line
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the 1s emitted for each word
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configure and submit the job
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

Hive:

CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
Hive and the Hadoop Ecosystem
Hive and the zoo…
[Diagram: the Hadoop stack. HDFS (a NameNode plus DataNodes 1..N) at the bottom,
YARN (Yet Another Resource Negotiator) above it, and application layers on top:
batch, in-memory, streaming, SQL, NoSQL, machine learning, search, orchestration,
management, and acquisition.]
Hive Architecture and Components
Hive and the zoo…
[Diagram: Hive sits on top of YARN and HDFS (NameNode plus DataNodes 1..N).
Client interfaces: the Command Line Interface (CLI), the Hive Web Interface (HWI),
and JDBC/ODBC via the Thrift server. Core services: the Metastore and the
Compiler, Optimizer, and Executor.]
Hive Architecture and Components
Tell me more…
 Metastore: a traditional RDBMS (MySQL/Postgres in Hadoop, Azure SQL Database in HDInsight)
that stores the system catalog: metadata about databases, tables, columns, partitions, etc.
 Optimizer: performs column pruning, partition elimination, and index utilization.
 Compiler: converts HiveQL into a directed acyclic graph of MapReduce, Tez, or Spark tasks,
based on the execution engine.
 Executor: submits the tasks produced by the compiler to YARN in the proper dependency order,
and interacts with the underlying Hadoop instance.
 HiveServer: provides a Thrift interface and a JDBC/ODBC server, enabling Hive integration
with other applications.
 Client components: the Command Line Interface (CLI), the Web UI, and the JDBC/ODBC driver.
Hive Architecture and Components
Recent Hive releases

Hive 0.14
 Cost-based optimizer for star and bushy join queries
 Temporary tables
 Transactions with ACID semantics
 Spark as an execution engine

Hive 0.13
 Hive on Tez, vectorized query engine, and cost-based optimizer
 Dynamic partition loads and smaller hash tables
 CHAR & DECIMAL data types, subqueries for IN/NOT IN

Hive 0.12
 Vectorized query engine & ORCFile
 Support for VARCHAR and DATE semantics
 Windowing, analytics, and enhanced aggregation functions

The Stinger initiative ("Enterprise SQL at Hadoop Scale"):
Hive 0.10 (BATCH: read-only data, HiveQL, MR) ->
Hive 0.13 (INTERACTIVE: read-only data, HiveQL++, MR or Tez) ->
Hive 0.14 (SUB-SECOND: modify with transactions, HiveQL+++, MR, Tez, or Spark)
Hive Architecture and Components
Hive execution – MapReduce vs Tez
Tez is a framework for building high-performance batch and interactive data processing
applications, coordinated by YARN in Hadoop. It improves on MapReduce's speed while
maintaining its scalability.
[Diagram: in MapReduce, each map/reduce stage writes its intermediate output to HDFS
before the next stage reads it; in Tez, the mapper and reducer stages are chained into
a single DAG, avoiding the intermediate HDFS writes.]
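The engine Hive compiles to is a per-session setting; a minimal HiveQL sketch, using the hive.execution.engine property that also appears in the configuration list near the end of this deck (the InternetSales table is defined later in the DDL section):

SET hive.execution.engine=tez;  -- or mr | spark, where each engine is available
-- subsequent queries in this session are compiled into Tez DAGs
SELECT Country, COUNT(*) FROM InternetSales GROUP BY Country;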
Hive vs RDBMS
The face-off…
 Structure: RDBMS = schema on write; Hive = schema on read
 Underlying data: RDBMS = vendor specific; Hive = delimited text, XML, JSON, Avro,
RC/ORC, sequence files, and more
 Processing: RDBMS = transactional; Hive = batch
 Consistency: RDBMS = strong consistency (ACID); Hive = eventual consistency, plus
limited transaction options (using ZooKeeper)
 Locking: RDBMS = yes; Hive = table and partition level
 Access/results: RDBMS = SQL/relational datasets; Hive = SQL/relational datasets
 Indexes: RDBMS = yes; Hive = yes
 Updates: RDBMS = yes; Hive = yes (new in 0.14)
 Referential integrity: RDBMS = yes; Hive = no
 Stored procedures: RDBMS = yes; Hive = no
 User defined functions: RDBMS = yes; Hive = Java (UDF and UDAF)
Getting Started with Hive on Azure
Getting Started
Creating HDInsight Cluster
[Screenshots: provisioning a new HDInsight cluster in the Azure portal.]

Getting Started
Browsing Hive in HDInsight
[Screenshots: browsing Hive databases and tables from the HDInsight dashboard.]
Getting Started
Creating Hive Project in Visual Studio
 Install the Azure SDK for Visual Studio: https://azure.microsoft.com/en-gb/downloads/
 Create a Hive project
 Submit the Hive script
Loading Data into Hive
Loading Data into Hive
Hive folder structure
 Hive default root directory
   Database directory
     Table directory
       Partition (sub)directory(ies)
         Data files
Loading data into Hive is a matter of moving data files into this folder structure in HDFS!
Loading Data into Hive
Hive folder structure
 Move data into the Hive folder structure in Azure Blob Storage
(e.g., via AzCopy, PowerShell, the WASB APIs, or Azure Data Factory)
 HDFS commands to move data from the local file system to HDFS:
hadoop fs -copyFromLocal <srcLocalFilePath> <dstHdfsHiveDirectory>
 Hive LOAD command to load data from HDFS into the Hive folder structure:
LOAD DATA INPATH '<srcHdfsDirectory>' [OVERWRITE] INTO TABLE <hiveTable>;
 Hive LOAD command to load data from the local file system into the Hive folder structure:
LOAD DATA LOCAL INPATH '<srcLocalFilePath>' INTO TABLE <hiveTable>;
 Hive ODBC (e.g., via SSIS)
 Sqoop
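As a concrete sketch, loading one day's extract into a partition of the InternetSales table defined later in the deck (the path and partition values are illustrative):

-- Moves the files under the partition's folder; nothing is parsed or
-- validated at load time (schema on read)
LOAD DATA INPATH '/data/landing/internetsales/20160401'
OVERWRITE INTO TABLE InternetSales
PARTITION (Year = 2016, Month = 4, Day = 1);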
Loading Data into Hive
Introducing Sqoop: SQL to Hadoop
Sqoop is designed to efficiently transfer bulk data between Apache Hadoop
and structured data stores such as relational databases.
[Diagram: a Sqoop command is translated into parallel map tasks on the Hadoop
cluster, moving data between a relational database and HDFS, HBase, or Hive.]
Loading Data into Hive
Introducing Sqoop: Commands
 sqoop-import
 sqoop-import-all-tables
 sqoop-export
 sqoop-create-hive-table
 sqoop-list-databases
 sqoop-list-tables
 sqoop-help
Loading Data into Hive
Introducing Sqoop: Import
sqoop import
--connect "<jdbc connection string>"
--table <source table>
--query "<source SQL query>"
--where "<condition>"
--split-by <source table column>
--target-dir "<HDFS target directory>"
--compression-codec <Java class for compression>
--as-textfile

sqoop import
--connect "jdbc:sqlserver://HCBI;user=sa;password=password;database=AdventureWorksDW2012"
--query "SELECT * FROM vwInternetSalesFact WHERE YEAR(UpdateDate) = YEAR(GETDATE())
AND MONTH(UpdateDate) = MONTH(GETDATE()) AND DAY(UpdateDate) = DAY(GETDATE())"
--split-by storeId
--target-dir hive/warehouse/adventurework/factInternetSales/20160401
--compression-codec org.apache.hadoop.io.compress.GzipCodec
Data File Compression in Hadoop
 Reduces the space needed to store files
 Speeds up data transfer across the network, and to or from disk

Compression | Extension | Codec | Splittable | Efficacy (compression ratio) | Speed (to compress)
DEFLATE | .deflate | org.apache.hadoop.io.compress.DefaultCodec | No | Medium | Medium
gzip | .gz | org.apache.hadoop.io.compress.GzipCodec | No | Medium | Medium
bzip2 | .bz2 | org.apache.hadoop.io.compress.BZip2Codec | Yes | High | Low
LZO | .lzo | com.hadoop.compression.lzo.LzopCodec | Yes* | Low | High
LZ4 | .lz4 | org.apache.hadoop.io.compress.Lz4Codec | No | Low | High
Snappy | .snappy | org.apache.hadoop.io.compress.SnappyCodec | No | Low | High

 Split the file into chunks at the source (chunk size ≈ HDFS block),
and compress each chunk separately using a medium speed/efficiency compression
 For large files, use splittable compressions
 For non-splittable compressed file formats (Avro, ORC, RCFile, etc.), use high-speed compression
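A minimal sketch of turning compression on from HiveQL, using properties that also appear in the configuration list at the end of this deck (the codec choices are illustrative):

-- Compress intermediate map output: favour a high-speed codec
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Compress the final output written to HDFS: favour a higher compression ratio
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;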
Loading Data into Hive
SSIS with Hive – just a peek
1. Install the Hive ODBC driver: http://www.microsoft.com/en-gb/download/details.aspx?id=40886
2. Create a new data source name (DSN) using the installed Hive ODBC driver
3. Configure the DSN to connect to your Hadoop cluster
4. Create your Data Flow task in SSIS
Hive DDL
HiveQL – Data Definition Language
Show me what you got…
SHOW DATABASES;
SHOW TABLES;
DESCRIBE <TableName>;  lists column names and data types
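For example, a quick exploratory session (the database name is illustrative):

SHOW DATABASES;          -- databases registered in the metastore
USE adventureworks;      -- switch the current database
SHOW TABLES;             -- tables in that database
DESCRIBE InternetSales;  -- column names and data types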
HiveQL – Data Definition Language
Data Types
 Numeric types: { TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL }
 String types: { STRING, VARCHAR(length), CHAR }
 Date/time: { TIMESTAMP, DATE }
 Other: { BOOLEAN, BINARY }
 Complex types:
 ARRAY: list of values (primitive or complex), e.g. ARRAY<STRING>
 MAP: dictionary (key/value list), e.g. MAP<INT,STRING>
 STRUCT: object, e.g. STRUCT<name:STRING,age:INT>
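A minimal sketch of declaring and accessing the complex types (the table and columns are illustrative; arrays and maps are queried in depth in the DQL section):

CREATE TABLE EmployeeProfile (
  Identifier INT,
  Skills ARRAY<STRING>,                  -- e.g. array('sql','ssrs')
  Grades MAP<STRING,FLOAT>,              -- e.g. map('A', 0.3)
  Contact STRUCT<name:STRING, age:INT>
);

SELECT
  Skills[0],     -- arrays are zero-indexed
  Grades['A'],   -- maps are accessed by key
  Contact.name   -- struct fields use dot notation
FROM EmployeeProfile;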
HiveQL – Data Definition Language
Creating tables
The CREATE TABLE statement defines schema metadata to be projected onto the data in
a folder, as well as a file parsing mechanism, when the table is queried, by
specifying the following elements:
 Table type (external or not)
 Table name
 Column names & data types
 Table location (optional)
 Partitioning column (optional)
 Clustering & sorting columns (optional)
 Row format
 File format
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
OrderDate DATE,
SalesValue DOUBLE
);
Very cool indeed!
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (OrderDate DATE);
A sub-folder, inside the table folder, holds the transactional records for each date.
The partition column does not appear in the CREATE TABLE body, but it can be queried.
Partition by -> WHERE-clause column(s)
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT);
A sub-folder hierarchy, inside the table folder, organizes the transactional
records by year/month/day.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS;
Data in each partition folder is split into 32 files (buckets).
Records are clustered (distributed) across HDFS, and eventually across the
MapReduce tasks, by City; records with the same City end up in the same file.
Used to reduce data movement/shuffling in MapReduce jobs.
Cluster by -> join column(s)
Sort by -> group-by column(s)
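Bucketing also enables efficient sampling; a minimal sketch against the table above (reading one bucket scans roughly 1/32 of each partition):

SELECT City, SUM(SalesValue)
FROM InternetSales TABLESAMPLE(BUCKET 1 OUT OF 32 ON City)
GROUP BY City;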
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT DELIMITED;
How each row will be serialized/deserialized (i.e., parsed) for reading/writing.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '^'
MAP KEYS TERMINATED BY '|'
LINES TERMINATED BY '\n';
More details about how the records are parsed.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '^'
MAP KEYS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
How the files are eventually stored in HDFS. In some cases, the STORED AS
type implies the row format.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '^'
MAP KEYS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/data/adventureworks/facts';
An explicit HDFS location in which the data for this table should be stored.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS TEXTFILE
LOCATION '/data/adventureworks/facts';
A pre-built CSV serializer/deserializer.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES (
"ProductSKU"="$.product_sku",
"ProductCategory"="$.product_category",
"City"="$.city",
"Country"="$.country",
"SalesValue"="$.sales_value" )
STORED AS TEXTFILE
LOCATION '/data/adventureworks/facts';
A pre-built JSON serializer/deserializer.
SerDe properties must be supplied to describe the JSON document format.
HiveQL – Data Definition Language
Creating tables
DROP TABLE IF EXISTS InternetSales;
CREATE TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
PARTITIONED BY (Year INT, Month INT, Day INT)
CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS
ROW FORMAT SERDE 'hitachi.hive.parser.mainframeparser'
STORED AS TEXTFILE
LOCATION '/data/adventureworks/facts';
Custom Java parsers (SerDes) can be implemented to read/write any custom file format.
HiveQL – Data Definition Language
File Formats
(Size and query-performance rankings from the original table: 1 = best, 6 = worst.)

TEXTFILE: delimited text file; any platform; splittable. Size: 6, query performance: 5.
Syntax: STORED AS TEXTFILE

SEQUENCEFILE: binary key-value pairs; optimized for MapReduce; splittable. Size: 5, query performance: 4.
Syntax: STORED AS SEQUENCEFILE

PARQUET: columnar storage format (compressed, binary); non-splittable. Size: 2, query performance: 6.
Syntax: STORED AS PARQUET

RCFile: columnar storage format (compressed, binary); non-splittable. Size: 3, query performance: 2.
Syntax: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'

ORC: columnar storage format (highly compressed, binary); optimized for distinct selection
and aggregations; vectorization processes batches of up to 1,024 rows, each batch a column
vector; non-splittable. Size: 1, query performance: 1.
Syntax: STORED AS ORC

AVRO: serialization system with evolvable, schema-driven binary data; cross-platform
interoperability; non-splittable. Size: 4, query performance: 3.
Syntax: STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='<schemaFileLocation>')
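For example, a minimal sketch of an ORC version of the sales table ('orc.compress' is a standard ORC table property; the Snappy choice is illustrative):

CREATE TABLE InternetSales_orc
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');  -- codec applied to the ORC stripes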
HiveQL – Data Definition Language
Creating external tables
CREATE EXTERNAL TABLE InternetSales
(
ProductSKU INT,
ProductCategory STRING,
City STRING,
Country STRING,
SalesValue DOUBLE
)
STORED AS SEQUENCEFILE
LOCATION '/data/adventureworks/facts';
 The data is maintained outside the Hive directory.
 The folder is not deleted when executing a DROP TABLE statement (only the metadata is
removed from the metastore).
 Usually used to bring data from a landing area (data lake) into Hive; in this case, the
ROW FORMAT and STORED AS specs should match the data files in the folder.
 Can also be used to spit out data from Hive directories to a shared area with a different
folder structure (as shown in the year/month/day example).
 It is useful to leave the data files where they are and create external tables over them
in Hive if they are going to be manipulated by other tools besides Hive (such as
MapReduce, Pig, etc.).
 LOCATION is mandatory for external tables.
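If such an external table were declared with PARTITIONED BY (Year INT, Month INT, Day INT), each landing folder would also need to be registered in the metastore before it can be queried; a minimal sketch (the path and partition values are illustrative):

-- Register one landing folder as a partition of the external table
ALTER TABLE InternetSales ADD PARTITION (Year = 2016, Month = 4, Day = 1)
LOCATION '/data/adventureworks/facts/2016/04/01';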
HiveQL – Data Definition Language
Creating a table like another table
DESCRIBE InternetSales;
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] InternetSales_Copy
LIKE InternetSales;
Creates a table with the same metadata structure as another table.
HiveQL – Data Definition Language
Bring it down…
ALTER TABLE <TableName> DROP PARTITION (<PartitionSpec>);
 deletes the partition folder and its metadata from the metastore
DROP TABLE <TableName>;
 deletes the table folder and its metadata from the metastore
DROP DATABASE <DatabaseName>;
 deletes the database folder and its metadata from the metastore
Hive DML
HiveQL – Data Manipulation Language
Create Table As Select (CTAS)
 Widely used in MPP systems
 Usually used as a replacement for MERGE statements, to avoid updates
 A common scenario is to truncate/drop and rebuild the information marts
(i.e., facts, dimensions, analytical datasets) each time from the
data warehouse (i.e., 3NF, Data Vault).
 This is where CTAS can be very useful.

DROP TABLE IF EXISTS FactTable;
CREATE TABLE FactTable AS --business logic
SELECT… FROM …JOIN… WHERE… GROUP BY… HAVING… UNION SELECT…
FROM StgTables;
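As a concrete sketch, rebuilding a small mart from the InternetSales table used throughout the deck (the names and logic are illustrative):

DROP TABLE IF EXISTS FactSalesByCountry;
CREATE TABLE FactSalesByCountry AS
SELECT Country, Year, SUM(SalesValue) AS TotalSales
FROM InternetSales
GROUP BY Country, Year;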
HiveQL – Data Manipulation Language
INSERT…
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';

Optimized multi-insert MapReduce translation, often used to populate fact and
dimension tables from one source table:
FROM staged_employees se
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * WHERE se.cnty = 'US' AND se.st = 'OR'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'CA')
SELECT * WHERE se.cnty = 'US' AND se.st = 'CA'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'IL')
SELECT * WHERE se.cnty = 'US' AND se.st = 'IL';

Dynamic partition inserts:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT..., se.cnty, se.st
FROM staged_employees se;
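Static and dynamic partition columns can also be mixed in one statement; a minimal sketch against the same source (static partition columns must precede dynamic ones):

INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state)   -- country fixed, state resolved per row
SELECT..., se.st
FROM staged_employees se
WHERE se.cnty = 'US';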
Hive DQL
HiveQL – Data Querying Language
Querying complex types
CREATE TABLE EmployeeSkills (
Identifier INT,
Skills ARRAY<STRING>
);
INSERT INTO EmployeeSkills VALUES (1, array('sql','ssrs'));
INSERT INTO EmployeeSkills VALUES (2, array('sql','ssrs','ssis'));
INSERT INTO EmployeeSkills VALUES (3, array('ssas','ssis'));

SELECT Identifier, concat_ws('|', skills[0], skills[size(skills)-1]) FROM EmployeeSkills;
 Returns the first and last skill, concatenated by a pipe.

SELECT * FROM EmployeeSkills
WHERE array_contains(skills,'ssrs') AND array_contains(skills,'sql');
 Returns employees having both 'ssrs' and 'sql' in their skills.

SELECT skill, count(skill) FROM
(SELECT explode(skills) AS skill FROM EmployeeSkills) AS skillsList
GROUP BY skill;
 explode converts the skills array into rows, one value each;
the query counts the occurrences of each skill.

SELECT Identifier, SkillList FROM EmployeeSkills
LATERAL VIEW explode(skills) subView AS SkillList;
 Lists the skills (one column, several rows) next to the identifier,
one row per skill.
HiveQL – Data Querying Language
Querying complex types
CREATE TABLE EmployeeProjects (
Identifier INT,
ProjectGrades MAP<STRING,FLOAT>
);
INSERT INTO TABLE EmployeeProjects
SELECT * FROM
(
SELECT 1 AS Identifier, MAP('A', CAST(0.3 AS FLOAT),'B', CAST(0.4 AS FLOAT)) AS ProjectGrade
UNION ALL
SELECT 2 AS Identifier, MAP('A', CAST(0.4 AS FLOAT),'C', CAST(0.5 AS FLOAT),'B', CAST(0.3 AS FLOAT)) AS ProjectGrade
UNION ALL
SELECT 3 AS Identifier, MAP('B', CAST(0.5 AS FLOAT),'C', CAST(0.2 AS FLOAT)) AS ProjectGrade
) AS query;

SELECT * FROM EmployeeProjects
WHERE ProjectGrades['A'] > 0.3;
 Retrieves all the employees that were on project 'A' with a grade greater than 0.3.
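Maps can be exploded just like arrays; a minimal sketch flattening the grades into one row per (employee, project) pair (the alias names are illustrative):

SELECT Identifier, Project, Grade
FROM EmployeeProjects
LATERAL VIEW explode(ProjectGrades) pg AS Project, Grade;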
HiveQL – Data Querying Language
Explain
Hive provides an EXPLAIN command that shows the execution plan for a query:
EXPLAIN
SELECT ProductName, SUM(LineItemAmount)
FROM InternetSales
WHERE Year = 2014
GROUP BY ProductName
HAVING COUNT(OrderId) > 10000
ORDER BY SUM(LineItemAmount);
ETL and Automation
ETL and Automation
A common pattern
 Usually the data is extracted from the sources, using Sqoop, Azure Data Factory,
or any other mechanism, and loaded "as is" into an area of HDFS (the landing area).
 External Hive tables are created to surface this data to the Hive system.
 A series of HiveQL scripts is executed to transform the data in the external Hive
tables and load it into a canonical data model.
 The data in the landing area can then be dropped (if it is not used by other tools).
 This process can be automated using Azure Data Factory, SSIS, Oozie, PowerShell, etc.
[Diagram: source apps land data in HDFS (data lake) landing directories; Hive external
tables sit over the landing directories; HiveQL loads the data warehouse (dimensional
model, Data Vault, etc.); orchestration via Oozie (Sqoop, Pig-Latin, HiveQL scripts),
Azure Data Factory (data copy activities / Hive jobs), or SSIS, PowerShell, and custom code.]
ETL and Automation
Introducing Oozie
 Oozie workflow document: an XML file defining the workflow actions
 Script files: files used by the workflow actions (e.g., HiveQL and Pig-Latin script files)
 The job.properties file: a configuration file setting parameter values
 Oozie jobs are scheduled using a Hadoop agent

Oozie workflow:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="MyWorkflow">
  <start to="FirstAction"/>
  <action name="FirstAction">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <script>CreateTable.hql</script>
      <param>TABLE_NAME=${tableName}</param>
      <param>LOCATION=${tableFolder}</param>
    </hive>
    <ok to="SecondAction"/>
    <error to="fail"/>
  </action>
  <action name="SecondAction">
    …
  </action>
  <kill name="fail">
    <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

HiveQL script (CreateTable.hql):
DROP TABLE IF EXISTS ${TABLE_NAME};
CREATE EXTERNAL TABLE ${TABLE_NAME}
(Col1 STRING,
Col2 FLOAT,
Col3 FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '${LOCATION}';

Job properties (job.properties):
nameNode=wasb://my_container@my_storage_account.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/example/workflow/
tableName=ExampleTable
tableFolder=/example/ExampleTable
ETL and Automation
PowerShell
 The New-AzureHDInsightHiveJobDefinition cmdlet
– Creates a job definition
– Use -Query for explicit HiveQL statements, or -File to reference a saved script
– Run the job with the Start-AzureHDInsightJob cmdlet
 The Invoke-Hive cmdlet
– Simpler syntax to run a HiveQL query and wait for the response
– Use -Query for explicit HiveQL statements, or -File to reference a saved script
ETL and Automation
.NET APIs

ETL and Automation
.NET APIs – Parallel Execution
DW build control table:
Step | Task | Description | HiveQL
1 | 1 | Build Dim 1 | Script 1
1 | 2 | Build Dim 2 | Script 2
1 | 3 | Build Dim 3 | Script 3
2 | 4 | Build Fact 1 | Script 4
2 | 5 | Build Fact 2 | Script 5
3 | 6 | Build Summary | Script 6
 Tasks in the same step can be executed in parallel
 Steps are executed sequentially
 The HiveQL column contains the Hive script that does the ETL
Can be hosted and scheduled using an Azure WebJob!
Other features
 Indexes
 Statistics (ANALYZE)
 Archiving
 UDFs and UDAFs written in Java
 Locks
 Authorization
 Hive transactions
 Streaming
 Accumulo & HBase integration
 HCatalog and WebHCat
Useful Hive Configurations
 SET hive.metastore.warehouse.dir = "<directory>";
 SET hive.cli.print.current.db = true | false;
 SET hive.cli.print.header=true | false;
 SET hive.execution.engine = mr | tez | spark;
 SET hive.exec.dynamic.partition = true | false;
 SET hive.exec.dynamic.partition.mode = strict | nonstrict;
 SET hive.exec.max.dynamic.partitions = <value>;
 SET hive.exec.max.created.files = <value>;
 SET hive.map.aggr=true | false;
 SET hive.auto.convert.join = true | false;
 SET hive.optimize.bucketmapjoin=true | false ;
 SET hive.exec.compress.intermediate=true | false;
 SET mapred.map.output.compression.codec=<compression codec>;
 SET hive.exec.compress.output=true | false;
 SET mapred.output.compression.codec = <compression codec>;
 SET hive.archive.enabled=true | false;
 SET hive.optimize.ppd.storage=true | false;
How to Get Started with Hive?
 Read the slides!
 Coursera – Big Data Specialization
https://www.coursera.org/specializations/big-data
 Azure Documentation – HDInsight Emulator
https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-hadoop-emulator-get-started
 MVA – Big Data Analytics with HDInsight: Hadoop on Azure
https://mva.microsoft.com/en-US/training-courses/big-data-analytics-with-hdinsight-hadoop-on-azure-10551
 MVA – Implementing Big Data Analysis
https://mva.microsoft.com/en-US/training-courses/implementing-big-data-analysis-8311?l=44REr2Yy_5404984382
 Azure Documentation – Getting Started with HDInsight
https://azure.microsoft.com/en-gb/documentation/services/hdinsight/
 Azure Documentation – How to Connect Excel to Windows Azure HDInsight via HiveODBC:
https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-connect-excel-hive-odbc-driver/
 Apache Sqoop
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html
 Apache Hive
https://cwiki.apache.org/confluence/display/Hive/Home
 O'Reilly Books – Programming Hive, 2nd Edition
Coming soon…
 NoSQL on Microsoft Azure
 Introduction to Spark on HDInsight
Stay tuned
My Background
Applying Computational Intelligence in Data Mining
• Honorary Research Fellow, School of Computing, University of Kent.
• Ph.D. Computer Science, University of Kent, Canterbury, UK.
• M.Sc. Computer Science, The American University in Cairo, Egypt.
• 25+ published journal and conference papers, focusing on:
– classification rules induction,
– decision trees construction,
– Bayesian classification modelling,
– data reduction,
– instance-based learning,
– evolving neural networks, and
– data clustering
• Journals: Swarm Intelligence, Swarm & Evolutionary Computation,
Applied Soft Computing, and Memetic Computing.
• Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio,
ECTA, IEEE WCCI and INNS-BigData.
ResearchGate.org
Thank you!
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 

Recently uploaded (20)

9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 

Hive Fundamentals and Architecture for Big Data Warehousing

  • 10. | © Copyright 2015 Hitachi Consulting10 Hadoop is great! MapReduce is very low level (lack of expressiveness). Higher-level data processing languages are needed. Instead of writing MapReduce code in Java, Hive does this dirty work for you, so you can focus on the processing logic itself as an SQL script. Hive is best suited for data warehouse applications, rather than OLTP, where large datasets are processed in batch mode. Why Hive? Apache HIVE is a data warehouse system for Hadoop
  • 11. | © Copyright 2015 Hitachi Consulting11 Why Hive? MapReduce vs Hive – Word Count Given a set of documents, count the occurrences of each word in all the documents, and order by word occurrence frequency
  • 12. | © Copyright 2015 Hitachi Consulting12 Why Hive? MapReduce vs Hive – Word Count import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.*; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; public class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); }}} public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); }} public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); } } MapReduce Java
  • 13. | © Copyright 2015 Hitachi Consulting13 Why Hive? MapReduce vs Hive – Word Count import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.*; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; public class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); }}} public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); }} public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); } } CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w GROUP BY word ORDER BY count DESC; MapReduce Java Hive
  • 14. | © Copyright 2015 Hitachi Consulting14 Hive and the Hadoop Ecosystem – Hive and the zoo… [Diagram: the Hadoop ecosystem – application workloads such as batch, in-memory, stream, SQL, NoSQL, machine learning, and search, plus orchestration, management, and acquisition tools, all running on YARN over HDFS (Name Node and Data Nodes 1..N).]
  • 15. | © Copyright 2015 Hitachi Consulting15 Hive Architecture and Components [Diagram: Hive sits on YARN and HDFS (Name Node, Data Nodes 1..N); its components include the Metastore, the Thrift server with JDBC/ODBC drivers, the Hive Web Interface (HWI), the Command Line Interface (CLI), and the Compiler, Optimizer, and Executor.]
  • 16. | © Copyright 2015 Hitachi Consulting16 Hive Architecture and Components – Tell me more…
    Metastore: a traditional RDBMS (MySQL/Postgres in Hadoop; Azure SQL DB in HDInsight) that stores the system catalog: metadata about databases, tables, columns, partitions, etc.
    Optimizer: performs column pruning, partition elimination, and index utilization
    Compiler: converts HiveQL into a directed acyclic graph of MapReduce, Tez, or Spark tasks, based on the execution engine
    Executor: submits the tasks produced by the compiler to YARN in a proper dependency order; interacts with the underlying Hadoop instance
    HiveServer: provides a Thrift interface and a JDBC/ODBC server; enables Hive integration with other applications
    Client components: Command Line Interface (CLI), Web UI, JDBC/ODBC driver
  • 17. | © Copyright 2015 Hitachi Consulting17 Hive Architecture and Components – Recent Hive releases
    Hive Version | Key Features
    0.14 | Cost-based optimizer for star and bushy join queries; temporary tables; transactions with ACID semantics; Spark as execution engine
    0.13 | Hive on Tez, vectorized query engine & cost-based optimizer; dynamic partition loads and smaller hash tables; CHAR & DECIMAL datatypes, subqueries for IN/NOT IN
    0.12 | Vectorized query engine & ORCFile; support for VARCHAR and DATE semantics; windowing, analytics, and enhanced aggregation functions
    [Diagram: the Stinger initiative – enterprise SQL at Hadoop scale: Hive 0.10 (batch: read-only data, HiveQL, MR) → Hive 0.13 (interactive: HiveQL++, MR/Tez) → Hive 0.14 (sub-second: modify with transactions, HiveQL+++, MR/Tez/Spark).]
  • 18. | © Copyright 2015 Hitachi Consulting18 Hive Architecture and Components Hive execution – MapReduce vs Tez Tez is a framework for building high performance batch and interactive data processing applications, coordinated by YARN in Hadoop. It improves on MapReduce's speed while maintaining its scalability.
  • 19. | © Copyright 2015 Hitachi Consulting19 Hive Architecture and Components Hive execution – MapReduce vs Tez. Tez is a framework for building high performance batch and interactive data processing applications, coordinated by YARN in Hadoop. It improves on MapReduce's speed while maintaining its scalability. [Diagram: a multi-stage MapReduce plan writes intermediate results to HDFS between successive map/reduce phases, whereas Tez runs the same mappers and reducers as a single DAG without the intermediate HDFS writes.]
  • 20. | © Copyright 2015 Hitachi Consulting20 Hive vs RDBMS – The face-off…
    Feature | RDBMS | Hive
    Structure | Schema on write | Schema on read
    Underlying Data | Vendor specific | Delimited text, XML, JSON, Avro, RC/ORC, Sequence files, +++
    Processing | Transactional | Batch
    Consistency | Strong consistency (ACID) | Eventual consistency + limited transaction options (using ZooKeeper)
    Locking | Yes | Table and partition
    Access/Results | SQL/relational datasets | SQL/relational datasets
    Indexes | Yes | Yes
    Updates | Yes | Yes (new in 0.14)
    Referential Integrity | Yes | No
    Stored Procedures | Yes | No
    User Defined Functions | Yes | Java (UDF and UDAF)
  • 21. | © Copyright 2015 Hitachi Consulting21 Getting Started with Hive on Azure
  • 22. | © Copyright 2015 Hitachi Consulting22 Getting Started Creating HDInsight Cluster
  • 23. | © Copyright 2015 Hitachi Consulting23 Getting Started Creating HDInsight Cluster
  • 24. | © Copyright 2015 Hitachi Consulting24 Getting Started Browsing Hive in HDInsight
  • 25. | © Copyright 2015 Hitachi Consulting25 Getting Started Browsing Hive in HDInsight
  • 26. | © Copyright 2015 Hitachi Consulting26 Getting Started Browsing Hive in HDInsight
  • 27. | © Copyright 2015 Hitachi Consulting27 Getting Started Creating Hive Project in Visual Studio  Install Azure SDK for Visual Studio https://azure.microsoft.com/en-gb/downloads/
  • 28. | © Copyright 2015 Hitachi Consulting28 Getting Started Creating Hive Project in Visual Studio  Install Azure SDK for Visual Studio https://azure.microsoft.com/en-gb/downloads/  Create a hive project
  • 29. | © Copyright 2015 Hitachi Consulting29 Getting Started Creating Hive Project in Visual Studio  Install Azure SDK for Visual Studio https://azure.microsoft.com/en-gb/downloads/  Create a hive project  Submit hive script
  • 30. | © Copyright 2015 Hitachi Consulting30 Getting Started Creating Hive Project in Visual Studio  Install Azure SDK for Visual Studio https://azure.microsoft.com/en-gb/downloads/  Create a hive project  Submit hive script
  • 31. | © Copyright 2015 Hitachi Consulting31 Loading Data into Hive
  • 32. | © Copyright 2015 Hitachi Consulting32 Loading Data into Hive – Hive folder structure [Hierarchy: Hive default root directory > database directory > table directory > partition (sub)directory(ies) > data files] Loading data into Hive is a matter of moving data files into this folder structure in HDFS!
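For example, one partition of the partitioned InternetSales table described later would typically live under a path like the following (the names are illustrative; the root is governed by hive.metastore.warehouse.dir):

    /hive/warehouse/adventureworks.db/internetsales/year=2015/month=12/day=01/000000_0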
  • 33. | © Copyright 2015 Hitachi Consulting33 Loading Data into Hive – Hive folder structure
     Move data to the Hive folder structure in Azure Blob Storage (e.g. via AzCopy, PowerShell, WASB APIs or Azure Data Factory)
     HDFS commands to move data from the local file system to HDFS: hadoop fs -copyFromLocal <srcLocalFilePath> <dstHdfsHiveDirectory>
     Hive LOAD command to load data from HDFS into the Hive folder structure: LOAD DATA INPATH '<srcHdfsDirectory>' [OVERWRITE] INTO TABLE <hiveTable>;
     Hive LOAD command to load data from the local file system into the Hive folder structure: LOAD DATA LOCAL INPATH '<srcLocalFilePath>' INTO TABLE <hiveTable>;
     Hive ODBC (e.g. via SSIS)
     Sqoop
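As a minimal sketch of the LOAD path (the table and directory names are hypothetical):

    LOAD DATA INPATH '/staging/adventureworks/internetsales' OVERWRITE INTO TABLE InternetSales;

Note that LOAD DATA INPATH moves (rather than copies) the files under the table's directory, and no parsing or validation happens until the table is queried (schema on read).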
  • 34. | © Copyright 2015 Hitachi Consulting34 Sqoop is designed to efficiently transfer bulk data between Apache Hadoop and structured data stores such as relational databases. Loading Data into Hive – Introducing Sqoop: SQL to Hadoop
  • 35. | © Copyright 2015 Hitachi Consulting35 Sqoop is designed to efficiently transfer bulk data between Apache Hadoop and structured data stores such as relational databases. Loading Data into Hive – Introducing Sqoop: SQL to Hadoop [Diagram: a Sqoop command spawns parallel map tasks in Hadoop that move data between a relational database and HDFS/HBase/Hive.]
  • 36. | © Copyright 2015 Hitachi Consulting36  sqoop-import  sqoop-import-all-tables  sqoop-export  sqoop-create-hive-table  sqoop-list-databases  sqoop-list-tables  sqoop-help Loading Data into Hive Introducing Sqoop: Commands
  • 37. | © Copyright 2015 Hitachi Consulting37 Loading Data into Hive – Introducing Sqoop: Import sqoop import --connect "<JDBC connection string>" --table <source table> --query "<source SQL query>" --where "<condition>" --split-by <source table column> --target-dir "<HDFS target directory>" --compression-codec <Java class for compression> --as-textfile sqoop import --connect "jdbc:sqlserver://HCBI;user=sa;password=password;database=AdventureWorksDW2012" --query "SELECT * FROM vwInternetSalesFact WHERE YEAR(UpdateDate) = YEAR(GETDATE()) AND MONTH(UpdateDate) = MONTH(GETDATE()) AND DAY(UpdateDate) = DAY(GETDATE())" --split-by storeId --target-dir hive/warehouse/adventurework/factInternetSales/20160401 --compression-codec org.apache.hadoop.io.compress.GzipCodec
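Once the Sqoop import lands the files, an external Hive table can be projected over the target directory to make the extract queryable. A sketch, assuming a comma-delimited extract and hypothetical column names:

    CREATE EXTERNAL TABLE StgInternetSales (OrderId INT, StoreId INT, SalesValue DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/hive/warehouse/adventurework/factInternetSales/20160401';

Alternatively, Sqoop's --hive-import option can create and load the Hive table in one step.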
  • 38. | © Copyright 2015 Hitachi Consulting38 Data File Compression in Hadoop
     Reduces space needed to store files
     Speeds up data transfer across the network, or to and from disk
    Compression | Extension | Codec | Splittable | Efficacy (compression ratio) | Speed (to compress)
    DEFLATE | .deflate | org.apache.hadoop.io.compress.DefaultCodec | No | Medium | Medium
    gzip | .gz | org.apache.hadoop.io.compress.GzipCodec | No | Medium | Medium
    bzip2 | .bz2 | org.apache.hadoop.io.compress.BZip2Codec | Yes | High | Low
    LZO | .lzo | com.hadoop.compression.lzo.LzopCodec | Yes* | Low | High
    LZ4 | .lz4 | org.apache.hadoop.io.compress.Lz4Codec | No | Low | High
    Snappy | .snappy | org.apache.hadoop.io.compress.SnappyCodec | No | Low | High
  • 39. | © Copyright 2015 Hitachi Consulting39 Data File Compression in Hadoop (same table as above), plus guidance:
     Split the file into chunks from the source (chunk size ≈ HDFS block), and compress each chunk separately using medium speed/efficiency compression
     For large files, use splittable compressions
     For non-splittable compressed file formats (Avro, ORC, RCFile, etc.), use high speed compression
  • 40. | © Copyright 2015 Hitachi Consulting40 1. Install Hive ODBC driver http://www.microsoft.com/en-gb/download/details.aspx?id=40886 Loading Data into Hive SSIS with Hive – just a peek
  • 41. | © Copyright 2015 Hitachi Consulting41 1. Install Hive ODBC driver http://www.microsoft.com/en-gb/download/details.aspx?id=40886 2. Create a new data source name (DSN) using the installed Hive ODBC driver Loading Data into Hive SSIS with Hive – just a peek
  • 42. | © Copyright 2015 Hitachi Consulting42 1. Install Hive ODBC driver http://www.microsoft.com/en-gb/download/details.aspx?id=40886 2. Create a new data source name (DSN) using the installed Hive ODBC driver 3. Configure the DSN to connect to your Hadoop cluster Loading Data into Hive SSIS with Hive – just a peek
  • 43. | © Copyright 2015 Hitachi Consulting43 1. Install Hive ODBC driver http://www.microsoft.com/en-gb/download/details.aspx?id=40886 2. Create a new data source name (DSN) using the installed Hive ODBC driver 3. Configure the DSN to connect to your Hadoop cluster 4. Create your Data Flow task in SSIS Loading Data into Hive SSIS with Hive – just a peek
  • 44. | © Copyright 2015 Hitachi Consulting44 Hive DDL
  • 45. | © Copyright 2015 Hitachi Consulting45 HiveQL – Data Definition Language Show me what you got… SHOW DATABASES; SHOW TABLES; DESCRIBE <TableName>;  list column names and data types
  • 46. | © Copyright 2015 Hitachi Consulting46 HiveQL – Data Definition Language Data Types  Numeric Types: { TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL }  String Types: { STRING, VARCHAR(LENGTH), CHAR }  Date/Time: { TIMESTAMP, DATE }  Other: { BOOLEAN, BINARY }  Complex Types:  ARRAY: List of values (primitive or complex) – e.g.: ARRAY<STRING>  MAP: Dictionary (key/value list) – e.g.: MAP<INT,STRING>  STRUCT: Object – e.g.: STRUCT<name:STRING,age:INT>
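A small sketch tying the complex types together (the table and field names are hypothetical):

    CREATE TABLE Customers (
      Id INT,
      PhoneNumbers ARRAY<STRING>,
      Preferences MAP<STRING,STRING>,
      Address STRUCT<street:STRING, city:STRING, postcode:STRING>
    );

    -- Index into the array, look up a map key, and dereference a struct field
    SELECT Id, PhoneNumbers[0], Preferences['contact'], Address.city FROM Customers;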
  • 47. | © Copyright 2015 Hitachi Consulting47 HiveQL – Data Definition Language Creating tables CREATE TABLE statement defines schema metadata to be projected onto data in a folder - as well as a file parsing mechanism - when the table is queried, by specifying the following elements:  Table Type (External or not)  Table Name  Column Names & Data Types  Table Location (Optional)  Partitioning Column (Optional)  Clustering & Sorting Columns(Optional)  Row Format  File Format
  • 48. | © Copyright 2015 Hitachi Consulting48 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, OrderDate DATE, SalesValue DOUBLE )
  • 49. | © Copyright 2015 Hitachi Consulting49 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, OrderDate DATE, SalesValue DOUBLE ) Very cool indeed!
  • 50. | © Copyright 2015 Hitachi Consulting50 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (OrderDate DATE) A sub-folder - in the table folder - for the transactional records of each date. The partition column does not appear in the create table body, but can be queried. Partitioned by -> a WHERE clause column
  • 51. | © Copyright 2015 Hitachi Consulting51 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) A sub-folder hierarchy – in the table folder – to organize the transactional records by year/month/day
  • 52. | © Copyright 2015 Hitachi Consulting52 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS Data in each partition folder is split into 32 files (buckets). Records are clustered (distributed) across HDFS, and eventually to the MapReduce tasks, by City; that is, records with the same City will end up in the same file. Used to reduce data movement/shuffling in MapReduce jobs. Cluster by -> join column(s). Sort by -> group by column(s)
  • 53. | © Copyright 2015 Hitachi Consulting53 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT DELIMITED How the row will be serialized/deserialized (i.e., parsed) for reading/writing
  • 54. | © Copyright 2015 Hitachi Consulting54 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '^' MAP KEYS TERMINATED BY '|' LINES TERMINATED BY '\n' More details about how the records are parsed
  • 55. | © Copyright 2015 Hitachi Consulting55 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '^' MAP KEYS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE How the files are eventually stored in HDFS. In some cases, the STORED AS type implies the row format
  • 56. | © Copyright 2015 Hitachi Consulting56 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '^' MAP KEYS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/data/adventureworks/facts' Explicit location in which the data for this table should be stored in HDFS
  • 57. | © Copyright 2015 Hitachi Consulting57 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde' STORED AS TEXTFILE LOCATION '/data/adventureworks/facts' A pre-built CSV serializer/deserializer
  • 58. | © Copyright 2015 Hitachi Consulting58 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' WITH SERDEPROPERTIES ( "ProductSKU"="$.product_sku", "ProductCategory"="$.product_category", "City"="$.city", "Country"="$.country", "SalesValue"="$.sales_value" ) STORED AS TEXTFILE LOCATION '/data/adventureworks/facts' A pre-built JSON serializer/deserializer. SerDe properties have to be supplied to describe the JSON doc format
  • 59. | © Copyright 2015 Hitachi Consulting59 HiveQL – Data Definition Language Creating tables DROP TABLE IF EXISTS InternetSales; CREATE TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) PARTITIONED BY (Year INT, Month INT, Day INT) CLUSTERED BY (City) SORTED BY (ProductSKU) INTO 32 BUCKETS ROW FORMAT SERDE 'hitachi.hive.parser.mainframeparser' STORED AS TEXTFILE LOCATION '/data/adventureworks/facts' Custom Java parsers (SerDes) can be implemented to read/write any custom file format
  • 60. | © Copyright 2015 Hitachi Consulting60 HiveQL – Data Definition Language File Formats
    Format | Description | Size | Query Performance | Syntax
    TEXTFILE | Delimited text file; any platform; split-able | 6 | 5 | STORED AS TEXTFILE
    SEQUENCEFILE | Binary key-value pairs; optimized for MapReduce; split-able | 5 | 4 | STORED AS SEQUENCEFILE
    PARQUET | Columnar storage format (compressed, binary); non-split-able | 2 | 6 | STORED AS PARQUET
    RCFile | Columnar storage format (compressed, binary); non-split-able | 3 | 2 | ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
    ORC | Columnar storage format (highly compressed, binary); optimized for distinct selection and aggregations; vectorization: processes a batch of up to 1,024 rows together, each batch a column vector; non-split-able | 1 | 1 | STORED AS ORC
    AVRO | Serialization system with evolvable schema-driven binary data; cross-platform interoperability; non-split-able | 4 | 3 | STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ('avro.schema.url'='<schemaFileLocation>')
    (Size and Query Performance are rankings; 1 = best.)
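As an illustration of why the format choice matters, converting an existing text-backed table to ORC is a single CTAS statement; a sketch (the target table name is hypothetical, and 'orc.compress'='SNAPPY' assumes the Snappy codec is available on the cluster):

    CREATE TABLE InternetSales_orc
    STORED AS ORC
    TBLPROPERTIES ('orc.compress'='SNAPPY')
    AS SELECT * FROM InternetSales;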
  • 61. | © Copyright 2015 Hitachi Consulting61 HiveQL – Data Definition Language Creating external tables CREATE EXTERNAL TABLE InternetSales ( ProductSKU INT, ProductCategory STRING, City STRING, Country STRING, SalesValue DOUBLE ) STORED AS SEQUENCEFILE LOCATION '/data/adventureworks/facts' Maintained outside the Hive directory. The folder is not deleted when executing a DROP TABLE statement (only metadata is removed from the metastore). Usually used to bring data from a landing area (data lake) into Hive; in this case, the row format and STORED AS specs should match the data files in the folder. Can also be used to spit out data from Hive directories to a shared area with a different folder structure (as shown in the year/month/day example). It is useful to leave the data files where they are and create external tables in Hive if they are going to be manipulated by other tools besides Hive (such as MapReduce, Pig, etc.). Location is mandatory for external tables
  • 62. | © Copyright 2015 Hitachi Consulting62 HiveQL – Data Definition Language Creating a table like another table DESCRIBE InternetSales CREATE [TEMPORARY] TABLE [IF NOT EXISTS] InternetSales_Copy LIKE InternetSales Creates a table with the same metadata structure as another table
  • 63. | © Copyright 2015 Hitachi Consulting63 HiveQL – Data Definition Language Bring it down… ALTER TABLE <TableName> DROP PARTITION (<partitionSpec>);  deletes the partition folder and its metadata from the metastore DROP TABLE <TableName>;  deletes the table folder and its metadata from the metastore DROP DATABASE <DatabaseName>;  deletes the database folder and its metadata from the metastore
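For instance, removing a single day from the partitioned InternetSales table above would look like this (the partition spec is illustrative):

    ALTER TABLE InternetSales DROP IF EXISTS PARTITION (Year=2015, Month=12, Day=1);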
  • 64. | © Copyright 2015 Hitachi Consulting64 Hive DML
  • 65. | © Copyright 2015 Hitachi Consulting65 HiveQL – Data Manipulation Language Create Table As (CTAS)  Widely used in MPP systems  Usually used as a replacement for MERGE statements, to avoid updates  A common scenario is to truncate/drop and re-build the information marts (i.e., facts/dimensions/analytical datasets) each time from the data warehouse (i.e., 3NF, Data Vault).  This is where CTAS can be very useful.
  • 66. | © Copyright 2015 Hitachi Consulting66 HiveQL – Data Manipulation Language Create Table As (CTAS)  Widely used in MPP systems  Usually used as a replacement for MERGE statements, to avoid updates  A common scenario is to truncate/drop and re-build the information marts (i.e., facts/dimensions/analytical datasets) each time from the data warehouse (i.e., 3NF, Data Vault).  This is where CTAS can be very useful. DROP TABLE IF EXISTS FactTable; CREATE TABLE FactTable AS --business logic SELECT… FROM …JOIN… WHERE… GROUP BY… HAVING… UNION SELECT… FROM StgTables
  • 67. | © Copyright 2015 Hitachi Consulting67 HiveQL – Data Manipulation Language Create Table As (CTAS)  Widely used in MPP systems  Usually used as a replacement for MERGE statements, to avoid updates  A common scenario is to truncate/drop and re-build the information marts (i.e., facts/dimensions/analytical datasets) each time from the data warehouse (i.e., 3NF, Data Vault).  This is where CTAS can be very useful. DROP TABLE IF EXISTS FactTable; CREATE TABLE FactTable AS --business logic SELECT… FROM …JOIN… WHERE… GROUP BY… HAVING… UNION SELECT… FROM StgTables
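A concrete minimal sketch of the CTAS pattern (all table and column names are hypothetical):

    DROP TABLE IF EXISTS FactInternetSales;
    CREATE TABLE FactInternetSales
    STORED AS ORC
    AS
    SELECT p.ProductCategory, c.Country, SUM(s.SalesValue) AS SalesValue
    FROM StgSales s
    JOIN StgProducts p ON s.ProductSKU = p.ProductSKU
    JOIN StgCustomers c ON s.CustomerId = c.CustomerId
    GROUP BY p.ProductCategory, c.Country;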
  • 68. | © Copyright 2015 Hitachi Consulting68 HiveQL – Data Manipulation Language INSERT… INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR') SELECT * FROM staged_employees se WHERE se.cnty = 'US' AND se.st = 'OR';
  • 69. | © Copyright 2015 Hitachi Consulting69 HiveQL – Data Manipulation Language INSERT… INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR') SELECT * FROM staged_employees se WHERE se.cnty = 'US' AND se.st = 'OR'; FROM staged_employees se INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR') SELECT * WHERE se.cnty = 'US' AND se.st = 'OR' INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'CA') SELECT * WHERE se.cnty = 'US' AND se.st = 'CA' INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'IL') SELECT * WHERE se.cnty = 'US' AND se.st = 'IL'; Optimized MapReduce translation. Often used to populate fact and dimension tables' data from one source table
  • 70. | © Copyright 2015 Hitachi Consulting70 HiveQL – Data Manipulation Language INSERT… INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR') SELECT * FROM staged_employees se WHERE se.cnty = 'US' AND se.st = 'OR'; FROM staged_employees se INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR') SELECT * WHERE se.cnty = 'US' AND se.st = 'OR' INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'CA') SELECT * WHERE se.cnty = 'US' AND se.st = 'CA' INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'IL') SELECT * WHERE se.cnty = 'US' AND se.st = 'IL'; INSERT OVERWRITE TABLE employees PARTITION (country, state) SELECT..., se.cnty, se.st FROM staged_employees se; Dynamic partition inserts set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict;
  • 71. | © Copyright 2015 Hitachi Consulting71 Hive DQL
  • 72. | © Copyright 2015 Hitachi Consulting72 HiveQL – Data Querying Language Querying complex types CREATE TABLE EmployeeSkills ( Identifier INT, Skills ARRAY<STRING> ) INSERT INTO EmployeeSkills VALUES (1, array('sql','ssrs')); INSERT INTO EmployeeSkills VALUES (2, array('sql','ssrs','ssis')); INSERT INTO EmployeeSkills VALUES (3, array('ssas','ssis'));
  • 73. | © Copyright 2015 Hitachi Consulting73 HiveQL – Data Querying Language Querying complex types CREATE TABLE EmployeeSkills ( Identifier INT, Skills ARRAY<STRING> ) INSERT INTO EmployeeSkills VALUES (1, array('sql','ssrs')); INSERT INTO EmployeeSkills VALUES (2, array('sql','ssrs','ssis')); INSERT INTO EmployeeSkills VALUES (3, array('ssas','ssis')); SELECT Identifier, concat_ws('|', skills[0], skills[size(skills)-1] ) FROM EmployeeSkills Returns the first and last skill concatenated by a pipe
  • 74. | © Copyright 2015 Hitachi Consulting74 HiveQL – Data Querying Language Querying complex types CREATE TABLE EmployeeSkills ( Identifier INT, Skills ARRAY<STRING> ) INSERT INTO EmployeeSkills VALUES (1, array('sql','ssrs')); INSERT INTO EmployeeSkills VALUES (2, array('sql','ssrs','ssis')); INSERT INTO EmployeeSkills VALUES (3, array('ssas','ssis')); SELECT Identifier, concat_ws('|', skills[0], skills[size(skills)-1] ) FROM EmployeeSkills SELECT * FROM EmployeeSkills WHERE array_contains(skills,'ssrs') AND array_contains(skills,'sql'); Returns the first and last skill concatenated by a pipe Returns employees having 'ssrs' and 'sql' in their skills
  • 75. | © Copyright 2015 Hitachi Consulting75 HiveQL – Data Querying Language Querying complex types CREATE TABLE EmployeeSkills ( Identifier INT, Skills ARRAY<STRING> ) INSERT INTO EmployeeSkills VALUES (1, array('sql','ssrs')); INSERT INTO EmployeeSkills VALUES (2, array('sql','ssrs','ssis')); INSERT INTO EmployeeSkills VALUES (3, array('ssas','ssis')); SELECT Identifier, concat_ws('|', skills[0], skills[size(skills)-1] ) FROM EmployeeSkills SELECT * FROM EmployeeSkills WHERE array_contains(skills,'ssrs') AND array_contains(skills,'sql'); SELECT skill, count(skill) FROM (SELECT explode(skills) AS skill FROM EmployeeSkills) skillsList GROUP BY skill Returns the first and last skill concatenated by a pipe Returns employees having 'ssrs' and 'sql' in their skills Explode => converts the skills array into rows, each with one value. Counts the occurrences of each skill
  • 76. | © Copyright 2015 Hitachi Consulting76 HiveQL – Data Querying Language Querying complex types CREATE TABLE EmployeeSkills ( Identifier INT, Skills ARRAY<STRING> ) INSERT INTO EmployeeSkills VALUES (1, array('sql','ssrs')); INSERT INTO EmployeeSkills VALUES (2, array('sql','ssrs','ssis')); INSERT INTO EmployeeSkills VALUES (3, array('ssas','ssis')); SELECT Identifier, concat_ws('|', skills[0], skills[size(skills)-1] ) FROM EmployeeSkills SELECT * FROM EmployeeSkills WHERE array_contains(skills,'ssrs') AND array_contains(skills,'sql'); SELECT skill, count(skill) FROM (SELECT explode(skills) AS skill FROM EmployeeSkills) skillsList GROUP BY skill SELECT Identifier, SkillList FROM EmployeeSkills LATERAL VIEW explode(skills) subView AS SkillList; Returns the first and last skill concatenated by a pipe Returns employees having 'ssrs' and 'sql' in their skills Explode => converts the skills array into rows, each with one value. Counts the occurrences of each skill Lists the skills (one column, several rows) beside the identifier
  • 77. | © Copyright 2015 Hitachi Consulting77 HiveQL – Data Querying Language Querying complex types CREATE TABLE EmployeeProjects ( Identifier INT, ProjectGrades MAP<STRING,FLOAT> ) INSERT INTO TABLE EmployeeProjects SELECT * FROM ( SELECT 1 AS Identifier, MAP('A', CAST(0.3 AS FLOAT),'B', CAST(0.4 AS FLOAT)) AS ProjectGrade UNION ALL SELECT 2 AS Identifier, MAP('A', CAST(0.4 AS FLOAT),'C', CAST(0.5 AS FLOAT),'B', CAST(0.3 AS FLOAT)) AS ProjectGrade UNION ALL SELECT 3 AS Identifier, MAP('B', CAST(0.5 AS FLOAT),'C', CAST(0.2 AS FLOAT)) AS ProjectGrade ) AS query; SELECT * FROM EmployeeProjects WHERE ProjectGrades['A'] > 0.3; Retrieves all the employees that were on project 'A' and had a grade greater than 0.3
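Other built-ins are handy with MAP columns; a short sketch against the same table (map_keys, map_values, and size are standard Hive functions):

    SELECT Identifier, size(ProjectGrades) AS ProjectCount, map_keys(ProjectGrades) AS Projects, map_values(ProjectGrades) AS Grades
    FROM EmployeeProjects;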
  • 78. | © Copyright 2015 Hitachi Consulting78 HiveQL – Data Querying Language Explain Hive provides an EXPLAIN command that shows the execution plan for a query EXPLAIN SELECT ProductName, SUM(LineItemAmount) FROM InternetSales WHERE Year = '2014' GROUP BY ProductName HAVING COUNT(OrderId) > 10000 ORDER BY SUM(LineItemAmount);
  • 79. | © Copyright 2015 Hitachi Consulting79 ETL and Automation
  • 80. | © Copyright 2015 Hitachi Consulting80 ETL and Automation A common pattern  Usually the data is extracted from the sources - using Sqoop, Azure Data Factory, or any other mechanism - and loaded "as is" to an area in HDFS (the landing area). HDFS (Data Lake) Sources Apps Landing Directories
  • 81. | © Copyright 2015 Hitachi Consulting81 ETL and Automation A common pattern  Usually the data is extracted from the sources - using Sqoop, Azure Data Factory, or any other mechanism - and loaded "as is" to an area in HDFS (the landing area).  External Hive tables are created to source this data into the Hive system. HDFS (Data Lake) Sources Apps Landing Directories Hive External Tables
  • 82. | © Copyright 2015 Hitachi Consulting82 ETL and Automation A common pattern  Usually the data is extracted from the sources - using Sqoop, Azure Data Factory, or any other mechanism - and loaded "as is" to an area in HDFS (the landing area).  External Hive tables are created to source this data into the Hive system.  A series of HiveQL scripts are executed to transform and load the data in the external Hive tables into a canonical data model.  The data in the landing area can then be dropped (if it is not used by other tools) HDFS (Data Lake) Sources Apps Landing Directories Hive External Tables Data Warehouse (Dim. model, Data Vault, etc.)
  • 83. | © Copyright 2015 Hitachi Consulting83 ETL and Automation A common pattern  Usually the data is extracted from the sources - using Sqoop, Azure Data Factory, or any other mechanism - and loaded "as is" to an area in HDFS (the landing area).  External Hive tables are created to source this data into the Hive system.  A series of HiveQL scripts are executed to transform and load the data in the external Hive tables into a canonical data model.  The data in the landing area can then be dropped (if it is not used by other tools)  This process can be automated using Azure Data Factory, SSIS, Oozie, PowerShell, etc.; a HiveQL sketch of the landing-to-warehouse step follows below. HDFS (Data Lake) Oozie (Sqoop, Pig-Latin, HiveQL scripts) Azure Data Factory (data copy activities/Hive jobs) SSIS – PowerShell – Custom Sources Apps Landing Directories Hive External Tables Data Warehouse (Dim. model, Data Vault, etc.)
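A minimal HiveQL sketch of the landing-to-warehouse step in this pattern (all table names, columns, and paths are hypothetical, and FactInternetSales is assumed to be partitioned by Year and Month):

    -- Project an external table over the landed extract (schema on read)
    CREATE EXTERNAL TABLE LandingSales (OrderId INT, ProductSKU INT, SalesValue DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/landing/adventureworks/sales';

    -- Transform and load into the warehouse model
    INSERT OVERWRITE TABLE FactInternetSales PARTITION (Year=2015, Month=12)
    SELECT OrderId, ProductSKU, SalesValue
    FROM LandingSales;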
  • 84. | © Copyright 2015 Hitachi Consulting84 ETL and Automation Introducing Oozie  Oozie Workflow Document - XML file defining workflow actions <workflow-app xmlns="uri:oozie:workflow:0.2" name="MyWorkflow"> <start to="FirstAction"/> <action name="FirstAction"> <hive xmlns="uri:oozie:hive-action:0.2"> <script>CreateTable.hql</script> <param>TABLE_NAME=${tableName}</param> <param>LOCATION=${tableFolder}</param> </hive> <ok to="SecondAction"/> <error to="fail"/> </action> <action name="SecondAction"> … </action> <kill name="fail"> <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end"/> </workflow-app> Oozie Workflow
  • 85. | © Copyright 2015 Hitachi Consulting85 ETL and Automation Introducing Oozie  Oozie Workflow Document - XML file defining workflow actions  Script files - Files used by workflow actions (e.g., HiveQL and Pig-Latin script file). <workflow-app xmlns="uri:oozie:workflow:0.2" name="MyWorkflow"> <start to="FirstAction"/> <action name="FirstAction"> <hive xmlns="uri:oozie:hive-action:0.2"> <script>CreateTable.hql</script> <param>TABLE_NAME=${tableName}</param> <param>LOCATION=${tableFolder}</param> </hive> <ok to="SecondAction"/> <error to="fail"/> </action> <action name="SecondAction"> … </action> <kill name="fail"> <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end"/> </workflow-app> DROP TABLE IF EXISTS ${TABLE_NAME}; CREATE EXTERNAL TABLE ${TABLE_NAME} (Col1 STRING, Col2 FLOAT, Col3 FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '${LOCATION}'; Oozie Workflow HiveQL Script
  • 86. | © Copyright 2015 Hitachi Consulting86 ETL and Automation Introducing Oozie  Oozie Workflow Document - XML file defining workflow actions  Script files - Files used by workflow actions (e.g., HiveQL and Pig-Latin script file).  The job.properties file - Configuration file setting parameter values  Oozie Jobs are scheduled using a Hadoop Agent <workflow-app xmlns="uri:oozie:workflow:0.2" name="MyWorkflow"> <start to="FirstAction"/> <action name="FirstAction"> <hive xmlns="uri:oozie:hive-action:0.2"> <script>CreateTable.hql</script> <param>TABLE_NAME=${tableName}</param> <param>LOCATION=${tableFolder}</param> </hive> <ok to="SecondAction"/> <error to="fail"/> </action> <action name="SecondAction"> … </action> <kill name="fail"> <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end"/> </workflow-app> DROP TABLE IF EXISTS ${TABLE_NAME}; CREATE EXTERNAL TABLE ${TABLE_NAME} (Col1 STRING, Col2 FLOAT, Col3 FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '${LOCATION}'; nameNode=wasb://my_container@my_storage_account.blob.core.windows.net jobTracker=jobtrackerhost:9010 queueName=default oozie.use.system.libpath=true oozie.wf.application.path=/example/workflow/ tableName=ExampleTable tableFolder=/example/ExampleTable Oozie Workflow HiveQL Script Job properties
  • 87. | © Copyright 2015 Hitachi Consulting87 ETL and Automation PowerShell  The New-AzureHDInsightHiveJobDefinition cmdlet – Create a job definition – Use Query for explicit HiveQL statements, or File to reference a saved script – Run the job with the Start-AzureHDInsightJob cmdlet  The Invoke-Hive cmdlet – Simpler syntax to run a HiveQL query and wait for the response – Use Query for explicit HiveQL statements, or File to reference a saved script
  • 88. | © Copyright 2015 Hitachi Consulting88 ETL and Automation .NET APIs
  • 89. | © Copyright 2015 Hitachi Consulting89 ETL and Automation .NET APIs – Parallel Execution DW build control table:
    Step | Task | Desc | HiveQL
    1 | 1 | Build Dim 1 | Script 1
    1 | 2 | Build Dim 2 | Script 2
    1 | 3 | Build Dim 3 | Script 3
    2 | 4 | Build Fact 1 | Script 4
    2 | 5 | Build Fact 2 | Script 5
    3 | 6 | Build Summary | Script 6
     Tasks in the same step can be executed in parallel
     Steps are executed sequentially
     The HiveQL column contains the Hive script that does the ETL
    Can be hosted and scheduled using an Azure WebJob!
  • 90. | © Copyright 2015 Hitachi Consulting90 Other features  Indexes  Statistics (ANALYZE)  Archiving  UDFs and UDAFs using Java  Locks  Authorization  Hive transactions  Streaming  Accumulo & HBase integration  HCatalog and WebHCatalog
  • 91. | © Copyright 2015 Hitachi Consulting91 Useful Hive Configurations
     SET hive.metastore.warehouse.dir = "<directory>";
     SET hive.cli.print.current.db = true | false;
     SET hive.cli.print.header=true | false;
     SET hive.execution.engine = mr | tez | spark;
     SET hive.exec.dynamic.partition = true | false;
     SET hive.exec.dynamic.partition.mode = strict | nonstrict;
     SET hive.exec.max.dynamic.partitions = <value>;
     SET hive.exec.max.created.files = <value>;
     SET hive.map.aggr=true | false;
     SET hive.auto.convert.join = true | false;
     SET hive.optimize.bucketmapjoin=true | false;
     SET hive.exec.compress.intermediate=true | false;
     SET mapred.map.output.compression.codec=<compression codec>;
     SET hive.exec.compress.output=true | false;
     SET mapred.output.compression.codec = <compression codec>;
     SET hive.archive.enabled=true | false;
     SET hive.optimize.ppd.storage=true | false;
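For example, a session might switch the execution engine and turn on output compression before running a query; a sketch (engine and codec availability depend on the cluster):

    SET hive.execution.engine=tez;
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    SELECT Country, SUM(SalesValue) FROM InternetSales GROUP BY Country;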
  • 92. | © Copyright 2015 Hitachi Consulting92 How to Get Started with Hive?  Read the slides!  Coursera – Big Data Specialization https://www.coursera.org/specializations/big-data  Azure Documentation – HDInsight Emulator https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-hadoop-emulator-get-started  MVA – Big Data Analytics with HDInsight: Hadoop on Azure https://mva.microsoft.com/en-US/training-courses/big-data-analytics-with-hdinsight-hadoop-on-azure-10551  MVA – Implementing Big Data Analysis https://mva.microsoft.com/en-US/training-courses/implementing-big-data-analysis-8311?l=44REr2Yy_5404984382  Azure Documentation – Getting Started with HDInsight https://azure.microsoft.com/en-gb/documentation/services/hdinsight/  Azure Documentation – How to Connect Excel to Windows Azure HDInsight via HiveODBC: https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-connect-excel-hive-odbc-driver/  Apache Sqoop https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html  Apache Hive https://cwiki.apache.org/confluence/display/Hive/Home  O'Reilly book – Programming Hive, 2nd Edition
  • 93. | © Copyright 2015 Hitachi Consulting93 Coming soon…  NoSQL on Microsoft Azure  Introduction to Spark on HDInsight Stay tuned
  • 94. | © Copyright 2015 Hitachi Consulting94 My Background Applying Computational Intelligence in Data Mining • Honorary Research Fellow, School of Computing, University of Kent. • Ph.D. Computer Science, University of Kent, Canterbury, UK. • M.Sc. Computer Science, The American University in Cairo, Egypt. • 25+ published journal and conference papers, focusing on: – classification rules induction, – decision trees construction, – Bayesian classification modelling, – data reduction, – instance-based learning, – evolving neural networks, and – data clustering • Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing. • Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI and INNS-BigData. ResearchGate.org
  • 95. | © Copyright 2015 Hitachi Consulting95 Thank you!