Apache Hive
Big Data Webinar Session 3
Presenter : Amit Khandelwal
Agenda
• Introduction
• Where does Hive fall in the Big Data stack
• Hive Architecture
• Hive Components
• Job Execution Flow
• Different modes of Hive
• HQL
• Hive Data Model
• Tables
• Partitioning
• Bucketing
Introduction
• What is Hive?
• Data warehousing tool built on top of Hadoop.
• Provides a high-level abstraction: users write queries, which Hive turns into
MapReduce, Spark, or Tez jobs.
• It is designed for OLAP.
• It is familiar, fast, scalable, and extensible.
• What Hive is not
• A relational database
• A design for Online Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Where does Hive Fall in the Stack?
[Figure: stack diagram with Data Sources, Ingestion Layer, Data Storage Layer, Data Processing Layer, and Data Query Layer / Consumption Layer (accessed via JDBC and ODBC)]
Hive Architecture
[Figure: architecture diagram. Clients: CLI, Thrift Server, Web Interface, JDBC/ODBC. Driver: Parser, Compiler, Optimizer, Executor. Metastore client and Metastore backed by an RDBMS. Execution: MapReduce, Spark, Tez, on top of HDFS.]
Hive Components
• Hive Client or Shell Interface – CLI (Command Line Interface)
• Driver:
   - Handles sessions, fetch, and execution
   - Parses, plans, and optimizes
• Execution Engine:
   - Query compilation/validation
   - Query planning
   - Optimizing the query plan
   - Running map and reduce tasks
• Metastore database (default is Derby)
Job Execution Flow
Different modes of Hive
Hive can operate in two modes, depending on the number of data nodes in the Hadoop cluster and the size of the data.
These modes are
I. Local mode
II. MapReduce mode
By default, Hive runs in MapReduce mode.
Hive Query Language (HQL)
• Hive provides a SQL dialect known as Hive Query Language (HQL).
• The default database is named “default”.
• Hive stores table metadata in a metastore, which by default is the embedded Derby
database that ships with Hive.
• Example: SELECT * FROM <TableName>;
Hive Data Model
• Tables
• Partitions
• Buckets
Hive Tables
• Analogous to relational database tables.
• Each table has a corresponding directory in HDFS.
• Data is stored as files within that directory.
• Types of Hive tables:
I. Internal Tables
II. External Tables
Partitions
• Partitioning divides a table into parts based on partition keys.
• Partitioning helps reduce the amount of data a query has to read.
• A partition is usually represented as a directory on HDFS.
• Queries that filter on partition keys run faster as a result.
• Example: CREATE TABLE sales (name STRING, totalsales FLOAT)
PARTITIONED BY (country STRING, year INT, month INT);
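The directory-per-partition idea can be illustrated with a minimal Python sketch. It only simulates the layout and the pruning step and does not call Hive. The warehouse root, the helper names partition_path and prune, and the sample partitions are assumptions made for illustration.

```python
from posixpath import join

# Assumed warehouse root, used here only for illustration.
WAREHOUSE = "/user/hive/warehouse"

def partition_path(table, partition):
    """Build the directory for one partition: one key=value level per partition key."""
    levels = [f"{key}={value}" for key, value in partition.items()]
    return join(WAREHOUSE, table, *levels)

def prune(partitions, **filters):
    """Keep only the partitions whose key values match every filter."""
    return [p for p in partitions
            if all(p.get(key) == value for key, value in filters.items())]

# Hypothetical partitions of the sales table from the slide.
partitions = [
    {"country": "US", "year": 2012, "month": 3},
    {"country": "US", "year": 2012, "month": 4},
    {"country": "IN", "year": 2012, "month": 3},
]

# A query filtering on country, year, and month only needs one directory.
selected = prune(partitions, country="US", year=2012, month=3)
paths = [partition_path("sales", p) for p in selected]
print(paths)
```

Only the directories that survive pruning would have to be read, which is why filtering on partition keys reduces the amount of data scanned.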
Partitions - II
Buckets
• Partitions are subdivided into buckets, which give the data extra structure that
can be used for more efficient querying.
• Each row is assigned to a bucket based on the hash of a chosen column of the table.
• set hive.enforce.bucketing=true;
• CREATE TABLE sales
(openingbid FLOAT, finalbid FLOAT, itemtype STRING, days INT)
CLUSTERED BY(openingbid) INTO 5 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
• INSERT OVERWRITE TABLE sales SELECT * from nteg_demo.testing;
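Hash-based bucket assignment can be sketched in a few lines of Python. This is a simplified illustration, not Hive's implementation: it assumes an integer column (days) whose hash is the value itself with the sign bit masked off, whereas the slide clusters by a FLOAT column and actual Hive versions may use a different hash function. The names bucket_for and bucketize are hypothetical.

```python
NUM_BUCKETS = 5  # mirrors INTO 5 BUCKETS in the slide's table definition

def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Map an integer column value to a bucket index.

    Assumption: the hash of an integer is the integer itself; the sign bit
    is masked so the result is never negative.
    """
    hashed = value & 0x7FFFFFFF
    return hashed % num_buckets

def bucketize(rows, column, num_buckets=NUM_BUCKETS):
    """Group rows by the bucket index of the chosen column."""
    buckets = {index: [] for index in range(num_buckets)}
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return buckets

# Hypothetical rows; only the bucketing column matters here.
rows = [{"days": d, "itemtype": "demo"} for d in (3, 5, 7, 10, 12)]
buckets = bucketize(rows, "days")
for index, members in buckets.items():
    print(index, [row["days"] for row in members])
```

Rows with the same column value always land in the same bucket, which is what lets bucketed tables support sampling and more efficient joins on that column.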
Pros
• Provides an easy way to process large scale data
• Distributed data warehouse.
• Supports a SQL-like language called HiveQL (HQL).
• Efficient execution plans for performance
• Interoperability with other databases
Limitations
• Not designed for online transaction processing (OLTP)
• High latency
• Lacks full support for transaction processing
Thank you
Amit Khandelwal


Editor's Notes

  1. Apache Hive is an ETL and data warehousing tool built on top of Hadoop for summarizing, analyzing, and querying large data sets on the open source Hadoop platform. Tables in Hive are similar to tables in a relational database, and data can be organized from larger to more granular units with partitioning and bucketing, which are covered in later slides. Because Hadoop is a batch-oriented system, Hive does not support OLTP (Online Transaction Processing). It is closer to OLAP (Online Analytical Processing), but still not ideal: there is significant latency between issuing a query and receiving a reply, due both to the overhead of MapReduce jobs and to the size of the data sets Hadoop was designed to serve. Hive follows a write-once, read-many model, whereas an RDBMS is designed for many reads and writes.
  2. MapReduce is a programming paradigm for processing data and one of the core components of Hadoop. A MapReduce program is composed of three operations:
     Map: each worker node applies the map function to its local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed.
     Shuffle: worker nodes redistribute data based on the output keys produced by the map function, so that all data belonging to one key ends up on the same worker node.
     Reduce: worker nodes process each group of output data, per key, in parallel and produce the required output.
     Why Hive? Consider analyzing a large distributed data set. If you attended the earlier Big Data webinars, you know how MapReduce works: for every analysis you would have to write a custom MapReduce program in Java, which is lengthy and time consuming. In the early days of big data, every SQL-style query had to be implemented against the MapReduce Java API in order to run over distributed data. Hive changed this. It provides the SQL abstraction itself, turning SQL-like queries (HiveQL) into the underlying jobs so that queries need not be implemented in the low-level Java API. HQL shields the user from the complexity of MapReduce programming and reuses familiar concepts from the relational database world, such as tables, rows, columns, and schemas, for ease of learning.
  4. Hadoop components: the key components of Hadoop are YARN and HDFS. YARN is the resource manager, whereas HDFS is Hadoop's distributed file system.
  5. 1. Most interactions take place over the command line interface (CLI), which Hive provides for writing queries in Hive Query Language (HQL). 2. Driver: it communicates with JDBC, ODBC, and other client-specific applications, and passes their requests on to the metastore and file systems for further processing. 3. Metastore: Hive uses a relational database on the master node to keep track of state. For instance, when you run CREATE TABLE FOO(foo string) LOCATION 'hdfs://tmp/';, the table schema is stored in this database. For a partitioned table, the partitions are stored there too, which lets Hive list partitions without going to the file system to find them. This kind of information is the metadata. The default metastore is Derby, but it can be changed to another RDBMS. Embedded Derby allows only one connection at a time, which is why Derby is not used in production environments. So for single-user metadata storage Hive uses Derby, and for multi-user or shared metadata it is typically configured with an RDBMS such as MySQL. Connecting to Hive via the CLI also establishes a connection to the metastore.
  6. The screenshot shows the job execution flow in Hive with Hadoop:
     1. A query is executed from the UI (user interface).
     2. The driver interacts with the compiler to get the plan. Here, the plan refers to the query execution process and the gathering of its related metadata.
     3. The compiler creates the plan for the job to be executed.
     4. The compiler sends a metadata request to the metastore.
     5. The metastore sends the metadata back to the compiler.
     6. The compiler returns the proposed plan for executing the query to the driver.
     7. The driver sends the execution plan to the execution engine.
     8. The execution engine (EE) acts as a bridge between Hive and Hadoop to process the query. For DFS operations, the EE first contacts the NameNode and then the DataNodes to get the values stored in tables.
     9. The EE fetches the desired records from the DataNodes. The actual table data resides only on the DataNodes; from the NameNode the EE fetches only the metadata for the query.
     10. The EE collects the actual data related to the query from the DataNodes.
     11. The EE communicates bi-directionally with the metastore in Hive to perform DDL (Data Definition Language) operations such as CREATE, DROP, and ALTER on tables and databases.
     12. The metastore stores only information such as database names, table names, and column names, and returns the metadata related to the query.
     13. The EE in turn communicates with Hadoop daemons such as the NameNode, DataNodes, and JobTracker to execute the query on top of the Hadoop file system.
     14. Once the results have been fetched from the DataNodes, the EE sends them back to the driver and on to the UI (front end).
     Hive is continuously in contact with the Hadoop file system and its daemons via the execution engine. The dotted arrow in the job flow diagram shows the execution engine's communication with the Hadoop daemons.
  7. Local mode: processing is very fast on smaller data sets that reside on the local machine. MapReduce mode: the query is executed in parallel, which gives better performance on large data sets. You can also choose the mode in which Hive runs. By default it runs in MapReduce mode; for local mode, set: SET mapred.job.tracker=local; From version 0.7 onward, Hive can also run MapReduce jobs in local mode automatically.
  8. HQL syntax is similar to the SQL syntax that most data analysts are familiar with. Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution. The sample query on the slide displays all the records in the named table. One more term comes up here, HiveServer2 (HS2): a server interface that enables remote clients to execute queries against Hive and retrieve the results. Recent versions are based on Thrift RPC and add features such as multi-client concurrency and authentication.
  9. A Hive table is logically made up of the data being stored. First decide how you want to access the data, then choose partitioning and bucketing accordingly. Tables are stored in HDFS as files. Let me show you how tables exist in HDFS. Data loaded into tables is also stored on HDFS in the Hadoop cluster. Hive supports file formats such as TEXTFILE, SEQUENCEFILE, ORC (Optimized Row Columnar), and RCFILE (Record Columnar File); the binary formats offer a high compression rate. By default Hive creates internal tables and manages their data, which means that Hive moves the data into its warehouse directory. An external table, by contrast, tells Hive to refer to data at an existing location outside the warehouse directory. How does the metastore work here? Metastore works a link
  10. Hive has been one of the preferred tools for querying large data sets, especially when a full table scan is done. Consider a sales table with millions of records and the columns commodity, totalsales, country, year, and month. Suppose we need the total sales of a commodity x in the US in March 2012. What would you expect to happen? How long would it take to scan millions of records? Now consider another scenario, in which the table is partitioned on the country, year, month, and date columns. Partitioning is a way of organizing a big table by dividing it into parts based on partition keys, grouping the same type of data together. A partition is usually represented as a directory on HDFS. Since each partition resides in its own HDFS directory, let's see how the data is grouped. For a table that is not partitioned, all the files in the table's data directory are read and the filters are applied in a subsequent phase, which is slow and expensive for large tables. Partitioning is therefore often used to distribute load horizontally: on a big table, it reduces the amount of data a query has to read.
  11. CREATE TABLE newsales (sale_id INT, amount FLOAT) PARTITIONED BY (country STRING, year INT, month INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY "," STORED AS TEXTFILE;
      LOAD DATA LOCAL INPATH '/hivePath/sales.csv' INTO TABLE newsales PARTITION (country='US', year=2012, month=10);
      CREATE TABLE auctionwithpartition (openingbid FLOAT, finalbid FLOAT, itemtype STRING) PARTITIONED BY (days INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
      LOAD DATA LOCAL INPATH '/hivePath/auctiondata.csv' INTO TABLE auctionwithpartition PARTITION (days = 7);
  12. Data in each partition can be divided into buckets based on a hash of a chosen column. Each bucket is stored as a file in the partition directory.
  13. HBASE
  14. Hive is not designed for OLTP, so it offers no real-time access to data. High latency: Hive loads data quickly because it is schema-on-read, but queries take longer because the data has to be checked against the schema at query time. Hive originally did not support transaction processing because it lacked ACID properties. ACID support was added in Hive 0.14, but it comes with a performance cost.
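
The Map, Shuffle, and Reduce operations described in note 2 can be illustrated with a minimal single-process Python sketch. This is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, run_job) and the word-count task are assumptions chosen for illustration, and everything runs in memory on one machine rather than across worker nodes.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values that share a key, as if routed to one worker."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def run_job(records):
    """Run the three phases in order over a list of input records."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = run_job(["hive on hadoop", "Hive on Tez"])
print(counts)
```

A query such as a GROUP BY with COUNT follows the same pattern conceptually, which is the kind of hand-written job that HiveQL saves the user from writing.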