Apache Hive
Learning Objectives
By the end of this lesson, you will be able to:
• Define Hive and its architecture
• Create and manage tables using Hue Web UI and Beeline
• Understand various file formats supported in Hive
• Use HiveQL DDL to create tables and execute queries
Hive: SQL over Hadoop MapReduce
Apache Hive
Hive provides a SQL-like interface for users to extract data from the Hadoop system. For example:
SELECT t1.a1 AS c1, t2.b1 AS c2
FROM t1 JOIN t2 ON (t1.a2 = t2.b2);
(Stack diagram: Hive sits above the batch processing, resource management, and storage layers, including HDFS and HBase.)
Features of Hive
• Originally developed by Facebook around 2007
• An open-source Apache project
• A high-level abstraction layer on top of MapReduce and Apache Spark
• Uses HiveQL
• Suitable for structured data
Case Study
A leading online education company uses Hive and Impala to analyze social media coverage.
The organization analyzes positive, negative, and neutral reviews using Hive.
Hive Architecture
The major components of Hive architecture are: Hadoop core components, Metastore, Driver, and Hive clients.
(Architecture diagram: Hive clients (JDBC/ODBC, Hive Web Interface, Hive Thrift Server, and Hive CLI) connect to the Driver, which contains the Parser, Planner, Optimizer, and Execution components; the Driver uses the Metastore, backed by an RDBMS, and runs jobs as MapReduce over HDFS.)
Job Execution Flow in Hive
1. Receive the SQL query
2. Parse the HiveQL
3. Plan the execution
4. Make optimizations
5. Submit job(s) to the cluster
6. Monitor progress
7. Process the data in MapReduce or Apache Spark, with the data stored in HDFS
Hive Metastore
Managing Data with Hive
Default path: /user/hive/warehouse/<table_name>
• A table is an HDFS directory containing zero or more files
• Tables support many formats for data storage and retrieval
• The Metastore stores the metadata for created tables
  o It is contained in an RDBMS such as MySQL
• Hive tables are stored in HDFS, and the relevant metadata is stored in the Metastore
Hive uses the Metastore service to store metadata for Hive tables.
What Is Hive Metastore?
The Metastore is the component that stores the system catalog, which contains metadata about tables, columns, and partitions.
Use of Metastore in Hive
• Hive uses the Metastore to get the table structure and the location of the data (metadata stored in an RDBMS)
• The Hive server then queries the actual data, which is stored in HDFS files
Data Warehouse Directory Structure
• By default, all data gets stored in /user/hive/warehouse
• Each table is a directory within the default location, containing one or more files

Customers table (columns: customer_id, name, country), stored as files under /user/hive/warehouse/customers:
File 01: 001 Alice us; 002 Bob ca; 003 Carlos mx; 004 Dieter de; ...
File 02: 392 Maria it; 393 Nigel uk; 394 Ophelia dk; ...

In HDFS, Hive data can be split across more than one file.
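As an illustrative check (using the customers table above), DESCRIBE FORMATTED reports a table's location in the warehouse directory; the namenode URI in the comment is hypothetical:

DESCRIBE FORMATTED customers;
-- The output includes a Location field, for example:
-- hdfs://namenode:8020/user/hive/warehouse/customers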
Hive DDL and DML
Defining Database and Table
• Databases and tables are created and managed using the DDL (Data Definition Language) of HiveQL
• They are very similar to standard SQL DDL
• For example, Create/Drop/Alter/Use Database
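A minimal sketch of this DDL (the sales database and customers table are illustrative):

CREATE DATABASE IF NOT EXISTS sales;
USE sales;

CREATE TABLE customers (
  customer_id INT,
  name        STRING,
  country     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

DROP TABLE IF EXISTS old_customers;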
Data Types in Hive
The data types in Hive are as follows:

Primitive types:
• Integers: TINYINT, SMALLINT, INT, and BIGINT
• Boolean: BOOLEAN
• Floating-point numbers: FLOAT and DOUBLE
• String: STRING

Complex types:
• Structs: {a INT; b INT}
• Maps: M['group']
• Arrays: ['a', 'b', 'c']; A[1] returns 'b'

User-defined types:
• Structures with attributes
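For illustration, the complex types can appear together in a table definition (the employees table and its columns are hypothetical):

CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, zip:INT>
);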
Validation of Data
• Unlike an RDBMS, Hive does not validate data on insert
• Files are simply moved into place, which makes loading data into tables faster in Hive
• Errors in file formats are discovered when queries are performed
• Missing data is represented as NULL
Hive follows “schema on read.”
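A minimal sketch of schema on read (the path and table are illustrative): loading only moves the file into the table's directory, and malformed or missing fields surface as NULL when the data is queried:

LOAD DATA INPATH '/incoming/customers.csv' INTO TABLE customers;

SELECT * FROM customers LIMIT 10;  -- malformed fields appear as NULL here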
File Format Types
Formats used to create a Hive table in HDFS: text files, sequence files, Avro data files, and Parquet format.
File Format: Text File Format
• The most basic and human-readable file format
• Can be read or written in any programming language
• Fields are delimited by a comma or a tab
• Consumes more space because numeric values are stored as strings
• Difficult to represent binary data
File Format: Sequence File Format
• Stores key-value pairs in a binary container format
• More efficient than a text file
• Not human-readable
File Format: Avro File Format
• Considered the best choice for general-purpose storage in Hadoop
• Embeds schema metadata in the file
• Can be read from and written to in many languages
• Offers efficient storage
• Widely supported inside and outside the Hadoop ecosystem
• Ideal for long-term storage of data
File Format: Parquet File Format
• A columnar format developed by Cloudera and Twitter
• Uses advanced optimizations described in Google’s Dremel paper
• Considered the most efficient format when adding many records at a time
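All of these formats are chosen at table-creation time with a STORED AS clause. A minimal sketch, with illustrative table names:

CREATE TABLE events_text (id INT, msg STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

CREATE TABLE events_parquet (id INT, msg STRING)
  STORED AS PARQUET;

-- STORED AS SEQUENCEFILE and STORED AS AVRO can be used in the same position.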
Data Serialization
What Is Data Serialization?
How do you serialize the number 123456789?
It can be serialized as 4 bytes when stored as a Java int and 9 bytes when stored as a Java String.
Data serialization is a way to represent data in memory or storage as a series of bytes.
Data Serialization Framework
Apache Avro is:
• An efficient data serialization framework
• Widely supported throughout Hadoop and its ecosystem
• A format that supports Remote Procedure Calls (RPC)
• Designed to offer schema compatibility
Data Types Supported in Avro
null: an absence of a value
boolean: a binary value
int: a 32-bit signed integer
long: a 64-bit signed integer
float: a single-precision floating-point value
double: a double-precision floating-point value
bytes: a sequence of 8-bit unsigned bytes
string: a sequence of Unicode characters
Complex Data Types Supported in Avro Schemas
record: a user-defined type composed of one or more named fields
enum: a specified set of values
array: zero or more values of the same type
map: a set of key-value pairs; the key is a string, while the value is of the specified type
union: exactly one value matching a specified set of types
fixed: a fixed number of 8-bit unsigned bytes
Hive Table and Avro Schema
Hive Table - CREATE TABLE orders (id INT, name STRING, title STRING)
Avro Schema
{"namespace":"com.simplilearn",
"type":"record",
"name":"orders",
"fields":[
{"name":"id", "type":"int"},
{"name":"name", "type":"string"},
{"name":"title", "type":"string"}]
}
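As a sketch (assuming Hive 0.14 or later), the same table can be stored as Avro directly, and Hive derives an Avro schema equivalent to the one shown above:

CREATE TABLE orders (id INT, name STRING, title STRING)
  STORED AS AVRO;

-- Alternatively, an explicit schema can be supplied through the
-- 'avro.schema.literal' or 'avro.schema.url' table property.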
Hive Optimization: Partitioning, Bucketing, and Sampling
Data Storage
• Data file partitioning reduces query time.
• Subdirectories are created for each unique value of a partition column.
• All files in a data set are stored in a single Hadoop Distributed File System (HDFS) directory.
(Diagram: HDFS → Hive warehouse directory → database → tables → partitions (subdirectories) → buckets (files))
Example of a Non-Partitioned Table
The customer details need to be partitioned by state for fast retrieval of the subset of data pertaining to a customer category. Without partitioning, Hive needs to read all the files in the table’s data directory, which can be a very slow and expensive process, especially when the tables are large.
Example of a Partitioned Table
Partitions are horizontal slices of data that allow larger sets of data to be separated into more manageable chunks. A partition column is a “virtual column” whose values are not actually stored in the data files. Use partitioning to store data in separate files by state.
Data Insertion
Data insertion into partitioned tables can be done in two modes: static partitioning and dynamic partitioning.
Static Partitioning
• Load input data files individually into the partitioned table.
• Create new partitions, defining them with the ADD PARTITION clause.
• While loading data, specify the partition to store the data in by giving the partition column value.
• Adding a partition to the table moves the file into that partition’s directory.
Example: add a partition for each new day of account data, as in the sketch below.
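A minimal sketch of static partitioning, assuming an accounts table partitioned by day (the table, columns, path, and dates are illustrative):

CREATE TABLE accounts (account_id INT, balance DOUBLE)
  PARTITIONED BY (day STRING);

-- Define the new partition explicitly
ALTER TABLE accounts ADD PARTITION (day = '2024-01-15');

-- Load a file into that specific partition
LOAD DATA INPATH '/incoming/accounts/2024-01-15'
  INTO TABLE accounts PARTITION (day = '2024-01-15');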
Dynamic Partitioning
• Partitions are created automatically at load time.
• New partitions can be created dynamically from existing data.
• Partitions are created based on the value of the last column; if a partition does not already exist, it is created.
• If a partition already exists, its data is overwritten when the OVERWRITE keyword is used.
Dynamic partitioning is suitable when a large amount of data is stored in a table, as in the sketch below.
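A minimal sketch of dynamic partitioning from an existing staging table (the table and column names are illustrative); the two SET statements enable the feature:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE customers_by_state (customer_id INT, name STRING)
  PARTITIONED BY (state STRING);

-- Partitions are created from the value of the last SELECT column (state)
INSERT OVERWRITE TABLE customers_by_state PARTITION (state)
SELECT customer_id, name, state
FROM customers_staging;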
Viewing Partitions
View the partitions of a partitioned table using the SHOW PARTITIONS command. The following commands are supported on Hive partitioned tables for viewing and deleting partitions.
Deleting Partitions
Use the ALTER TABLE command:
• To delete partitions
• To add or change partitions
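For example (the table and partition values are illustrative):

SHOW PARTITIONS customers_by_state;

ALTER TABLE customers_by_state ADD PARTITION (state = 'wa');
ALTER TABLE customers_by_state DROP IF EXISTS PARTITION (state = 'wa');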
When to Use Partitioning
Following are the instances when you should use partitioning for tables:
• When reading the entire data set takes too long
• When queries almost always filter on the partition columns
• When there are a reasonable number of different values for the partition columns
When Not to Use Partitioning
Following are the instances when you should avoid partitioning:
• When a column has too many unique values (each value would create its own partition)
• When individual partitions would be very small (less than roughly 20 KB)
• When dynamic partitioning could lead to a very high number of partitions
Bucketing
Bucketing in Hive
Bucketing is an optimization technique that further divides the data within each partition into buckets (clustered files).
(Diagram: a table partitioned by country and state, with each partition’s data split into buckets stored as files.)
What Do Buckets Do?
Syntax for creating a bucketed table:

CREATE TABLE page_views (user_id INT, session_id BIGINT, url STRING)
  PARTITIONED BY (day INT)
  CLUSTERED BY (user_id) INTO 100 BUCKETS;

Buckets distribute the data load into a user-defined set of clusters (files) by calculating the hash code of the key mentioned in the query. When a query filters on user_id, the processor first calculates the hash of the user_id and looks only in the corresponding bucket.
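Bucketing also enables the sampling mentioned in this lesson’s title; as a sketch, a query can read a single bucket of the page_views table defined above:

SELECT user_id, COUNT(*) AS views
FROM page_views TABLESAMPLE (BUCKET 1 OUT OF 100 ON user_id)
GROUP BY user_id;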
Hive Query Language: Introduction
SELECT
dt,
COUNT (DISTINCT (user_id))
FROM events
GROUP BY dt;
HiveQL is a SQL-like query language for Hive to process and analyze structured data in a Metastore.
HiveQL: Extensibility
An important principle of HiveQL is its extensibility. HiveQL can be extended in multiple ways:
• Pluggable data formats
• Pluggable user-defined types
• Pluggable user-defined functions
• Pluggable MapReduce scripts
Built-in Functions of Hive
You can create your own UDFs by writing functions in Java. Hive also provides built-in functions that can be used to avoid creating your own UDFs:
• Mathematical: round, floor, ceil, rand, and exp
• Collection: size, map_keys, map_values, and array_contains
• Type conversion: cast
• Date: from_unixtime, to_date, year, and datediff
• Conditional: if, case, and coalesce
• String: length, reverse, upper, and trim
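A small sketch combining a few of these built-ins; the orders table and its total and order_ts columns are hypothetical:

SELECT upper(name)                        AS customer,
       round(total, 2)                    AS total_rounded,
       to_date(from_unixtime(order_ts))   AS order_date,
       if(total > 100, 'large', 'small')  AS order_size
FROM orders;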
Other Functions of Hive
Hive also supports two other classes of functions:
• Aggregate (UDAF): takes multiple input rows and creates a single output row; the output is produced once the full set of data has been given.
• Table-generating (UDTF): takes a single input row and produces multiple output rows.

Example: explode() with a lateral view expands an array column into one row per element.

Input table pageAds: pageid (STRING), adid_list (ARRAY<INT>)
"front_page"   [1, 2, 3]
"contact_page" [3, 4, 5]

Result: pageid (STRING), adid (INT)
"front_page"   1
"front_page"   2
……

Lateral view:
SELECT pageid, adid FROM pageAds
LATERAL VIEW explode(adid_list) adTable AS adid;
UDF/UDAF vs. MapReduce Scripts
Attribute          | UDF/UDAF              | MapReduce scripts
Language           | Java                  | Any language
1/1 input/output   | Supported via UDF     | Supported
n/1 input/output   | Supported via UDAF    | Supported
1/n input/output   | Supported via UDTF    | Supported
Speed              | Faster (same process) | Slower (spawns a new process)
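As a hedged sketch of the MapReduce-script route, Hive’s TRANSFORM clause streams rows through an external script in any language; the parse_events.py script and the output columns here are hypothetical:

ADD FILE parse_events.py;

SELECT TRANSFORM (user_id, url)
       USING 'python parse_events.py'
       AS (user_id, domain)
FROM page_views;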
Key Takeaways
You are now able to:
• Define Hive and its architecture
• Create and manage tables using Hue Web UI and Beeline
• Understand various file formats supported in Hive
• Use HiveQL DDL to create tables and execute queries
Thank You