SlideShare a Scribd company logo
HIVE
Bucharest Big Data Meetup
March 10, 2015
whoami
• Developer with SQL Server team since 2001
• Apache contributor
• Hive
• Hadoop core (security)
• stackoverflow user 105929s
• @rusanu
What is Hive
• Datawarehouse for querying and managing large datasets
• A query engine that use Hadoop MapReduce for execution
• A SQL abstraction for creating MapReduce algorithms
• SQL interface to HDFS data
• Developed at Facebook
VLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce
Framework
• ASF top project since September 2010
How does Hive work
• SQL submitted via CLI or
Hiveserver(2)
• Metadata describing tables
stored in RDBMS
• Driver compiles/optimizes
execution plan
• Plan and execution engine
submitted to Hadoop as job
• MR invokes Hive execution
engine which executes plan
HiveHadoop
Metastore
RDBMS
HCatalog
HDFS
Driver
Compiles, Optimizes
MapReduce
Task
Task
Split
Split
CLI Hiveserver2
ODBC JDBCShell
Job
Tracker
Beeswax
Hive Query execution
• Compilation/Optimization results in an AST containing operators eg:
• FetchOperator: scans source data (the input split)
• SelectOperator: projects column values, computes
• GroupByOperator: aggregate functions (SUM, COUNT etc)
• JoinOperator:joins
• The plan forms a DAG of MR jobs
• The plan tree is serialized (Kryo)
• Hive Driver dispatches jobs
• Multiple stages can result in multiple jobs
• Task execution picks up the plan and start iterating the plan
• MR emits values (rows) into the topmost operator (Fetch)
• Rows propagate down the tree
• ReduceSinkOperator emits map output for shuffle
• Each operator implements both a map side and a reduce side algorithm
• Executes the one appropriate for the current task
• MR does the shuffle, many operators rely on it as part of their algorithm
• Eg. SortOperator, GroupByOperator
• Multi-stage queries create intermediate output and the driver submits new job to continue next stage
• TEZ execution: map-reduce-reduce, usually eliminates multiple stages (more later)
• Vectorized execution mode emits batches of rows (1024 rows)
Hive features
• Data types:
• Numeric: tinyint, smallint, int, bigint, float,
double, decimal(precision, scale)
• Date/Time: timestamp, date
• Character types: string, char(size), varchar(size)
• Misc. types: Boolean, binary
• Complex types: ARRAY<type>, MAP<type, type>,
STRUCT<name:type, name:type>, UNIONTYPE<type,
type, type>
• Storage formats: text, sequencefile, ORC, Parquet, RC,
arbitrary SerDe
• Data Load: INSERT, LOAD, external tables, dynamic
partitioning
• Bucketized tables
• JOIN optimizations: MapJoin, SMB join
• ACID DML (INSERT/UPDATE/DELETE)
• Only supported for ORC storage for now
• Columnar storage, vectorized execution
• Cost based optimizer (new)
• HiveQL: SQL dialect, drives toward ANSI-92 compliance
• Subqueries, joins, common table expressions
• Lateral views (CROSS APPLY)
• SELECT … FROM table
LATERAL VIEW explode(column)
• Windowing and analytical functions
• LEAD, LAG, FIRST_VALUE, LAST_VALUE
• RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK, NTILE
• OVER clause for aggregates
• PARTITION BY, ORDER BY
• WINDOW specification
• SELECT SUM(a) OVER (PARTITION BY b ORDER BY c
ROWS 3 PRECEDING AND 3 FOLLOWING)
• WINDOW clause
• SELECT SUM(b) OVER w FROM t WINDOW w AS (PARTITION
BY b ORDER BY c ROWS BETWEEN CURRENT ROW AND 2
FOLLOWING)
• GROUPING SETs, CUBE, ROLLUP
• XPath, custom UDF
• TRANSFORM: arbitrary map_script, reduce_script
Hive engines
• MapReduce
• Default, widely
available
• Complex queries
require stages ->
stop-and-go
• Always on disk
shuffle
• TEZ
• Generalized MRR
(DAG)
• Pipelining
• Memory shuffle
• JOINs (bipartite)
• Custom sort
• HIVE can optimize
plans for TEZ
• Recommended
engine
• SPARK
• HIVE-7292
• In development
• Not to be confused
with Shark or Spark-
SQL
Hive pros and cons
+ Capable of handling PB scale
+ Decent performance
+ Fairly advanced SQL features
+ Integrates with Hadoop ecosystem
+ Share the data (HDFS)
+ Leverage existing clusters
+ High Availability
+ Disaster Recoverability
+ Partitioning, Clustering, Bucketing
+ Positive momentum, active
development
+ No licensing costs
- Not good for ad-hoc due to high
latency (job submit time)
• Topic is actively pursued by Hive/Tez/Yarn
- ANSI-SQL gaps
- Poor toolset (hivecli, ODBC drivers)
- In-house RDBMS operating expertize
does not translate to Hive
- Outperformed by (costly) high-end
proprietary solutions
How to use Hive
+As a part of the ETL in transforming large data ingress (click-stream,
mobile uploads, access log etc) into query able form
+ Alternative to PIG for those that favor SQL
+Run one-off queries to analyze large unstructured data sets
+ Power of SQL to get insight into ‘collect everything’
+DW/BI
+ Can also be part of the ETL that loads the DW
+ Deploy on TEZ not on M/R
+ Use ORC or as storage Parquet format
+Use recent releases (Hive 0.14 or later)
When to avoid Hive
- Replace RDBMS/OLTP
- Ad-hoc BI (latency still too high, will improve soon)
- When the dataset is small
• 512 GB RAM is cheap
• If it fits in memory, is not Big data
- When data changes frequently
- If you have an infinite budget
Some alternatives to Hive
• Columnar storage in a traditional RDBMS
• MySQL: ICE, InfiniDB
• PostgreSQL: cstore_fwd (Citus)
• SQL Server columnstore
• Amazon Red-shift
• Azure SQL Database v12
• Impala
• Presto
• Spark SQL
• Note that Impala and Spark SQL can share Hive’s metastore
Links
• Hive Language Manual:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• Join strategies in Hive:
https://cwiki.apache.org/confluence/download/attachments/273620
54/Hive+Summit+2011-join.pdf
• Hive on Tez:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez
• Hive on Spark:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark

More Related Content

What's hot

The Meta of Hadoop - COMAD 2012
The Meta of Hadoop - COMAD 2012The Meta of Hadoop - COMAD 2012
The Meta of Hadoop - COMAD 2012
Joydeep Sen Sarma
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)
Joydeep Sen Sarma
 
Nextag talk
Nextag talkNextag talk
Nextag talk
Joydeep Sen Sarma
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
Joydeep Sen Sarma
 
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at Didi
HBaseCon
 
Qubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant ConferenceQubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant Conference
Joydeep Sen Sarma
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
Andrew Brust
 
NoSQL
NoSQLNoSQL
NoSQL
dbulic
 
Building tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systemsBuilding tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systems
Regunath B
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
Amazon Web Services
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابر
datastack
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon Redshift
Kel Graham
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Cloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
Cloud Computing and the Microsoft Developer - A Down-to-Earth AnalysisCloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
Cloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
Andrew Brust
 
ETL Practices for Better or Worse
ETL Practices for Better or WorseETL Practices for Better or Worse
ETL Practices for Better or Worse
Eric Sun
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
Hyunsik Choi
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azure
Mostafa
 
Blue Sky Thinking
Blue Sky ThinkingBlue Sky Thinking
Blue Sky Thinking
Thomas Sykes
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
Rajesh Nadipalli
 
Sql server 2016 Discovery Day
Sql server 2016 Discovery DaySql server 2016 Discovery Day
Sql server 2016 Discovery Day
Thomas Sykes
 

What's hot (20)

The Meta of Hadoop - COMAD 2012
The Meta of Hadoop - COMAD 2012The Meta of Hadoop - COMAD 2012
The Meta of Hadoop - COMAD 2012
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
 
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at Didi
 
Qubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant ConferenceQubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant Conference
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
NoSQL
NoSQLNoSQL
NoSQL
 
Building tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systemsBuilding tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systems
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابر
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon Redshift
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Cloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
Cloud Computing and the Microsoft Developer - A Down-to-Earth AnalysisCloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
Cloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
 
ETL Practices for Better or Worse
ETL Practices for Better or WorseETL Practices for Better or Worse
ETL Practices for Better or Worse
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azure
 
Blue Sky Thinking
Blue Sky ThinkingBlue Sky Thinking
Blue Sky Thinking
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Sql server 2016 Discovery Day
Sql server 2016 Discovery DaySql server 2016 Discovery Day
Sql server 2016 Discovery Day
 

Similar to Hive big-data meetup

Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
Bigdatapump
 
Hadoop intro
Hadoop introHadoop intro
Hadoop intro
Keith Davis
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
gluent.
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010
John Sichi
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
vishwasgarade1
 
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
Mark Rittman
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
hdhappy001
 
Hive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupHive @ Bucharest Java User Group
Hive @ Bucharest Java User Group
Remus Rusanu
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
nvvrajesh
 
מיכאל
מיכאלמיכאל
מיכאל
sqlserver.co.il
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hive
David Kaiser
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
Ayyappan Paramesh
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Lester Martin
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
Sperasoft
 
A Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop EcosystemA Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem
Serendio Inc.
 
A Scalable Data Transformation Framework using Hadoop Ecosystem
A Scalable Data Transformation Framework using Hadoop EcosystemA Scalable Data Transformation Framework using Hadoop Ecosystem
A Scalable Data Transformation Framework using Hadoop Ecosystem
DataWorks Summit
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in Hive
DataWorks Summit
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
alanfgates
 

Similar to Hive big-data meetup (20)

Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Hadoop intro
Hadoop introHadoop intro
Hadoop intro
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
 
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Hive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupHive @ Bucharest Java User Group
Hive @ Bucharest Java User Group
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
מיכאל
מיכאלמיכאל
מיכאל
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hive
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
A Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop EcosystemA Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem
 
A Scalable Data Transformation Framework using Hadoop Ecosystem
A Scalable Data Transformation Framework using Hadoop EcosystemA Scalable Data Transformation Framework using Hadoop Ecosystem
A Scalable Data Transformation Framework using Hadoop Ecosystem
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in Hive
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
 

Recently uploaded

原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
Hornet Dynamics
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 
socradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdfsocradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdf
SOCRadar
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
Grant Fritchey
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
ICS
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptxLORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
lorraineandreiamcidl
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
TheSMSPoint
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Julian Hyde
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
Green Software Development
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 

Recently uploaded (20)

原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 
socradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdfsocradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdf
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptxLORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 

Hive big-data meetup

  • 1. HIVE Bucharest Big Data Meetup March 10, 2015
  • 2. whoami • Developer with SQL Server team since 2001 • Apache contributor • Hive • Hadoop core (security) • stackoverflow user 105929s • @rusanu
  • 3. What is Hive • Datawarehouse for querying and managing large datasets • A query engine that use Hadoop MapReduce for execution • A SQL abstraction for creating MapReduce algorithms • SQL interface to HDFS data • Developed at Facebook VLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce Framework • ASF top project since September 2010
  • 4. How does Hive work • SQL submitted via CLI or Hiveserver(2) • Metadata describing tables stored in RDBMS • Driver compiles/optimizes execution plan • Plan and execution engine submitted to Hadoop as job • MR invokes Hive execution engine which executes plan HiveHadoop Metastore RDBMS HCatalog HDFS Driver Compiles, Optimizes MapReduce Task Task Split Split CLI Hiveserver2 ODBC JDBCShell Job Tracker Beeswax
  • 5. Hive Query execution • Compilation/Optimization results in an AST containing operators eg: • FetchOperator: scans source data (the input split) • SelectOperator: projects column values, computes • GroupByOperator: aggregate functions (SUM, COUNT etc) • JoinOperator:joins • The plan forms a DAG of MR jobs • The plan tree is serialized (Kryo) • Hive Driver dispatches jobs • Multiple stages can result in multiple jobs • Task execution picks up the plan and start iterating the plan • MR emits values (rows) into the topmost operator (Fetch) • Rows propagate down the tree • ReduceSinkOperator emits map output for shuffle • Each operator implements both a map side and a reduce side algorithm • Executes the one appropriate for the current task • MR does the shuffle, many operators rely on it as part of their algorithm • Eg. SortOperator, GroupByOperator • Multi-stage queries create intermediate output and the driver submits new job to continue next stage • TEZ execution: map-reduce-reduce, usually eliminates multiple stages (more later) • Vectorized execution mode emits batches of rows (1024 rows)
  • 6. Hive features • Data types: • Numeric: tinyint, smallint, int, bigint, float, double, decimal(precision, scale) • Date/Time: timestamp, date • Character types: string, char(size), varchar(size) • Misc. types: Boolean, binary • Complex types: ARRAY<type>, MAP<type, type>, STRUCT<name:type, name:type>, UNIONTYPE<type, type, type> • Storage formats: text, sequencefile, ORC, Parquet, RC, arbitrary SerDe • Data Load: INSERT, LOAD, external tables, dynamic partitioning • Bucketized tables • JOIN optimizations: MapJoin, SMB join • ACID DML (INSERT/UPDATE/DELETE) • Only supported for ORC storage for now • Columnar storage, vectorized execution • Cost based optimizer (new) • HiveQL: SQL dialect, drives toward ANSI-92 compliance • Subqueries, joins, common table expressions • Lateral views (CROSS APPLY) • SELECT … FROM table LATERAL VIEW explode(column) • Windowing and analytical functions • LEAD, LAG, FIRST_VALUE, LAST_VALUE • RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK, NTILE • OVER clause for aggregates • PARTITION BY, ORDER BY • WINDOW specification • SELECT SUM(a) OVER (PARTITION BY b ORDER BY c ROWS 3 PRECEDING AND 3 FOLLOWING) • WINDOW clause • SELECT SUM(b) OVER w FROM t WINDOW w AS (PARTITION BY b ORDER BY c ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) • GROUPING SETs, CUBE, ROLLUP • XPath, custom UDF • TRANSFORM: arbitrary map_script, reduce_script
  • 7. Hive engines • MapReduce • Default, widely available • Complex queries require stages -> stop-and-go • Always on disk shuffle • TEZ • Generalized MRR (DAG) • Pipelining • Memory shuffle • JOINs (bipartite) • Custom sort • HIVE can optimize plans for TEZ • Recommended engine • SPARK • HIVE-7292 • In development • Not to be confused with Shark or Spark- SQL
  • 8. Hive pros and cons + Capable of handling PB scale + Decent performance + Fairly advanced SQL features + Integrates with Hadoop ecosystem + Share the data (HDFS) + Leverage existing clusters + High Availability + Disaster Recoverability + Partitioning, Clustering, Bucketing + Positive momentum, active development + No licensing costs - Not good for ad-hoc due to high latency (job submit time) • Topic is actively pursued by Hive/Tez/Yarn - ANSI-SQL gaps - Poor toolset (hivecli, ODBC drivers) - In-house RDBMS operating expertize does not translate to Hive - Outperformed by (costly) high-end proprietary solutions
  • 9. How to use Hive +As a part of the ETL in transforming large data ingress (click-stream, mobile uploads, access log etc) into query able form + Alternative to PIG for those that favor SQL +Run one-off queries to analyze large unstructured data sets + Power of SQL to get insight into ‘collect everything’ +DW/BI + Can also be part of the ETL that loads the DW + Deploy on TEZ not on M/R + Use ORC or as storage Parquet format +Use recent releases (Hive 0.14 or later)
  • 10. When to avoid Hive - Replace RDBMS/OLTP - Ad-hoc BI (latency still too high, will improve soon) - When the dataset is small • 512 GB RAM is cheap • If it fits in memory, is not Big data - When data changes frequently - If you have an infinite budget
  • 11. Some alternatives to Hive • Columnar storage in a traditional RDBMS • MySQL: ICE, InfiniDB • PostgreSQL: cstore_fwd (Citus) • SQL Server columnstore • Amazon Red-shift • Azure SQL Database v12 • Impala • Presto • Spark SQL • Note that Impala and Spark SQL can share Hive’s metastore
  • 12. Links • Hive Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual • Join strategies in Hive: https://cwiki.apache.org/confluence/download/attachments/273620 54/Hive+Summit+2011-join.pdf • Hive on Tez: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez • Hive on Spark: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark