HIVE
Bucharest Big Data Meetup
March 10, 2015
whoami
• Developer with SQL Server team since 2001
• Apache contributor
• Hive
• Hadoop core (security)
• stackoverflow user 105929
• @rusanu
What is Hive
• A data warehouse for querying and managing large datasets
• A query engine that uses Hadoop MapReduce for execution
• A SQL abstraction for creating MapReduce algorithms
• SQL interface to HDFS data
• Developed at Facebook
VLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce
Framework
• ASF top-level project since September 2010
How does Hive work
• SQL submitted via CLI or
Hiveserver(2)
• Metadata describing tables
stored in RDBMS
• Driver compiles/optimizes
execution plan
• Plan and execution engine
submitted to Hadoop as job
• MR invokes Hive execution
engine which executes plan
[Architecture diagram: clients (shell CLI, Beeswax, ODBC/JDBC via Hiveserver2) → Driver (compiles, optimizes) with Metastore (RDBMS, HCatalog) → Hadoop Job Tracker → MapReduce tasks reading HDFS splits]
Hive Query execution
• Compilation/optimization results in a tree (AST) of operators, e.g.:
• FetchOperator: scans source data (the input split)
• SelectOperator: projects column values, computes expressions
• GroupByOperator: computes aggregate functions (SUM, COUNT, etc.)
• JoinOperator: performs joins
• The plan forms a DAG of MR jobs
• The plan tree is serialized (Kryo)
• Hive Driver dispatches jobs
• Multiple stages can result in multiple jobs
• Task execution picks up the plan and starts iterating it
• MR emits values (rows) into the topmost operator (Fetch)
• Rows propagate down the tree
• ReduceSinkOperator emits map output for shuffle
• Each operator implements both a map side and a reduce side algorithm
• Executes the one appropriate for the current task
• MR does the shuffle, many operators rely on it as part of their algorithm
• Eg. SortOperator, GroupByOperator
• Multi-stage queries create intermediate output and the driver submits new job to continue next stage
• TEZ execution: map-reduce-reduce, usually eliminates multiple stages (more later)
• Vectorized execution mode emits batches of rows (1024 rows)
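The operator pipeline described above can be inspected with EXPLAIN; a minimal sketch, assuming a hypothetical `clicks` table:

```sql
-- EXPLAIN prints the compiled plan instead of running the query
EXPLAIN
SELECT page, COUNT(*) AS hits
FROM clicks
GROUP BY page;
-- The output shows a map-side chain (TableScan -> Select -> GroupBy -> ReduceSink)
-- and a reduce-side chain (GroupBy -> FileSink); the shuffle between them is
-- exactly the MR shuffle the GroupByOperator relies on.
```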
Hive features
• Data types:
• Numeric: tinyint, smallint, int, bigint, float,
double, decimal(precision, scale)
• Date/Time: timestamp, date
• Character types: string, char(size), varchar(size)
• Misc. types: boolean, binary
• Complex types: ARRAY<type>, MAP<type, type>,
STRUCT<name:type, name:type>, UNIONTYPE<type,
type, type>
• Storage formats: text, sequencefile, ORC, Parquet, RC,
arbitrary SerDe
• Data Load: INSERT, LOAD, external tables, dynamic
partitioning
• Bucketized tables
• JOIN optimizations: MapJoin, SMB join
• ACID DML (INSERT/UPDATE/DELETE)
• Only supported for ORC storage for now
• Columnar storage, vectorized execution
• Cost based optimizer (new)
• HiveQL: SQL dialect, moving toward ANSI SQL-92 compliance
• Subqueries, joins, common table expressions
• Lateral views (CROSS APPLY)
• SELECT … FROM table
LATERAL VIEW explode(column)
• Windowing and analytical functions
• LEAD, LAG, FIRST_VALUE, LAST_VALUE
• RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK, NTILE
• OVER clause for aggregates
• PARTITION BY, ORDER BY
• WINDOW specification
• SELECT SUM(a) OVER (PARTITION BY b ORDER BY c
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
• WINDOW clause
• SELECT SUM(b) OVER w FROM t WINDOW w AS (PARTITION
BY b ORDER BY c ROWS BETWEEN CURRENT ROW AND 2
FOLLOWING)
• GROUPING SETS, CUBE, ROLLUP
• XPath, custom UDF
• TRANSFORM: arbitrary map_script, reduce_script
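A few of the features above in one place; a sketch using a hypothetical `events` table:

```sql
-- Complex types, partitioning, and ORC storage in one table definition
CREATE TABLE events (
  user_id BIGINT,
  tags    ARRAY<STRING>,
  props   MAP<STRING, STRING>,
  ts      TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- LATERAL VIEW flattens the array: one output row per tag
SELECT user_id, tag
FROM events
LATERAL VIEW explode(tags) t AS tag;

-- Windowed aggregate: running count over each user's last four events
SELECT user_id,
       COUNT(*) OVER (PARTITION BY user_id ORDER BY ts
                      ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS recent
FROM events;
```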
Hive engines
• MapReduce
• Default, widely
available
• Complex queries
require stages ->
stop-and-go
• Always on disk
shuffle
• TEZ
• Generalized MRR
(DAG)
• Pipelining
• Memory shuffle
• JOINs (bipartite)
• Custom sort
• HIVE can optimize
plans for TEZ
• Recommended
engine
• SPARK
• HIVE-7292
• In development
• Not to be confused
with Shark or Spark-
SQL
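Switching engines is a one-line setting per session (it can also go in hive-site.xml); a sketch:

```sql
-- Per-session engine selection; 'mr' is the default, 'tez' is the recommended engine
SET hive.execution.engine=tez;
```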
Hive pros and cons
+ Capable of handling PB scale
+ Decent performance
+ Fairly advanced SQL features
+ Integrates with Hadoop ecosystem
+ Share the data (HDFS)
+ Leverage existing clusters
+ High Availability
+ Disaster Recoverability
+ Partitioning, Clustering, Bucketing
+ Positive momentum, active
development
+ No licensing costs
- Not good for ad-hoc due to high
latency (job submit time)
• Topic is actively pursued by Hive/Tez/Yarn
- ANSI-SQL gaps
- Poor toolset (hivecli, ODBC drivers)
- In-house RDBMS operating expertise
does not translate to Hive
- Outperformed by (costly) high-end
proprietary solutions
How to use Hive
+ As part of the ETL pipeline, transforming large data ingress (click-stream,
mobile uploads, access logs, etc.) into queryable form
+ Alternative to PIG for those who favor SQL
+Run one-off queries to analyze large unstructured data sets
+ Power of SQL to get insight into ‘collect everything’
+DW/BI
+ Can also be part of the ETL that loads the DW
+ Deploy on TEZ, not on M/R
+ Use ORC or Parquet as the storage format
+Use recent releases (Hive 0.14 or later)
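A sketch of the ETL pattern above (hypothetical paths and table names): land raw text via an external table, then rewrite it into ORC for querying:

```sql
-- External table: Hive only records the schema; data stays where it landed
CREATE EXTERNAL TABLE raw_logs (line STRING)
ROW FORMAT DELIMITED
LOCATION '/data/incoming/logs';

-- Managed ORC table: columnar, compressed, vectorization-friendly
CREATE TABLE logs_orc (line STRING)
STORED AS ORC;

-- The transform step runs as a MapReduce (or TEZ) job
INSERT OVERWRITE TABLE logs_orc
SELECT line FROM raw_logs;
```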
When to avoid Hive
- Replace RDBMS/OLTP
- Ad-hoc BI (latency still too high, will improve soon)
- When the dataset is small
• 512 GB RAM is cheap
• If it fits in memory, it is not Big Data
- When data changes frequently
- If you have an infinite budget
Some alternatives to Hive
• Columnar storage in a traditional RDBMS
• MySQL: ICE, InfiniDB
• PostgreSQL: cstore_fdw (Citus)
• SQL Server columnstore
• Amazon Redshift
• Azure SQL Database v12
• Impala
• Presto
• Spark SQL
• Note that Impala and Spark SQL can share Hive’s metastore
Links
• Hive Language Manual:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• Join strategies in Hive:
https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf
• Hive on Tez:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez
• Hive on Spark:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
