October 2014 HUG : Hive On Spark
1. Hive on Spark
Szehon Ho
Software Engineer at Cloudera, Apache Hive Committer
October 2014
2. Background (Hive)
• Apache Hive: a data query and management tool for a
distributed dataset, exposed via a SQL-like query language
called HiveQL
3. Background (Hive)
• 2007-2013: MapReduce was the only distributed processing engine
• Map() and Reduce() primitives, not designed for long data pipelines
• Complex SQL-like queries are expressed inefficiently as many MR
stages:
• Disk I/O between MR jobs
• Shuffle-sort between map and reduce phases
[Diagram: Hive Query run as a chain of Map() -> Red() jobs, with HDFS between jobs]
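As a rough illustration of the cost described above, the following toy sketch runs a two-stage query as two chained map/shuffle-sort/reduce stages. The `run_mr_stage` helper and the `hdfs` dictionary are hypothetical stand-ins for illustration only, not Hadoop APIs. Every stage sorts its map output, and every stage result is written out before the next stage can read it.

```python
from itertools import groupby


def run_mr_stage(records, mapper, reducer):
    """Run one toy MapReduce stage: map, shuffle-sort by key, then reduce."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=lambda kv: kv[0])  # shuffle-sort between map and reduce
    out = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        out.extend(reducer(key, [value for _, value in group]))
    return out


# Hypothetical stand-in for HDFS: each stage output is materialized here.
hdfs = {"input": [("us", 1), ("eu", 1), ("us", 1), ("asia", 1), ("eu", 1), ("us", 1)]}

# Stage 1: count rows per region.
hdfs["stage1"] = run_mr_stage(
    hdfs["input"],
    mapper=lambda rec: [rec],
    reducer=lambda region, ones: [(region, sum(ones))],
)

# Stage 2: re-read the stage 1 output and list the regions for each count.
hdfs["stage2"] = run_mr_stage(
    hdfs["stage1"],
    mapper=lambda rec: [(rec[1], rec[0])],
    reducer=lambda count, regions: [(count, sorted(regions))],
)

# Number of stage outputs written to the stand-in file system.
materialized_outputs = len(hdfs) - 1
```

In this sketch the intermediate `stage1` result exists only to feed `stage2`, which is the kind of extra write and re-read that a longer pipeline multiplies.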
4. Background (Hive)
• 2013 Hive Community started work on Hive on Tez
• Tez DAG execution graph
[Diagram: Hive Query run as one Tez DAG of Map() and Red() stages over HDFS]
5. Background (Spark)
• Generalized distributed processing framework created in ~2011
by UC Berkeley AMPLab
• Many advantages (community, ease of use), on track to succeed
MapReduce
6. Background (Spark)
• Community Momentum:
• Already the most active project in Hadoop ecosystem
• June 2014: 255 contributors from 50 companies
• First half of 2014: ~1200 commits, 250000 LOC changed
• Integration with many Hadoop components, e.g. Pig, Flume,
Mahout, Crunch, Solr, and now Hive.
7. Background (Spark)
• Clean programming abstraction: the Resilient Distributed Dataset
(RDD)
• A fault-tolerant dataset that can serve as a stage in a data pipeline.
• Created from an existing data set such as an HDFS file, or by
transforming another RDD (RDDs can be chained)
• Expressive APIs, much richer than MapReduce
• Transformations: map, filter, groupBy
• Actions: count, collect, save
• => More efficient representation of Hive queries
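The chaining idea above can be sketched with a small single-machine model. `ToyRDD` below is a hypothetical class written for this illustration, not the Spark API: transformations only build up a chain, and nothing is computed until an action (`collect` here) is called. The sample rows are made up.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function that yields the records

    @classmethod
    def from_records(cls, records):
        return cls(lambda: iter(records))

    # Transformations: each returns a new ToyRDD and does no work yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def group_by(self, key_fn):
        def compute():
            groups = {}
            for x in self._compute():
                groups.setdefault(key_fn(x), []).append(x)
            return iter(groups.items())
        return ToyRDD(compute)

    # Action: evaluates the whole chain.
    def collect(self):
        return list(self._compute())


rows = [
    {"region": "us", "ds": "2014-10-01"},
    {"region": "eu", "ds": "2014-10-01"},
    {"region": "us", "ds": "2014-09-30"},
    {"region": "us", "ds": "2014-10-01"},
]

# Filter -> select -> group -> count, expressed as one chain.
counts = (
    ToyRDD.from_records(rows)
    .filter(lambda r: r["ds"] == "2014-10-01")
    .map(lambda r: r["region"])
    .group_by(lambda region: region)
    .map(lambda kv: (kv[0], len(kv[1])))
)
result = sorted(counts.collect())
```

The point of the sketch is that a filter, projection and aggregation form a single pipeline object rather than separate jobs with stored intermediate results.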
8. Hive on Spark
• Shark Project:
• AMPLab github project, fork of Hive
• Not maintained by Hive community, sunset in 2014
• Hive on Spark:
• Done in Hive community
• Architecturally compatible: Hive on Spark keeps the same physical abstraction
as Hive on Tez/MR.
• Easier code maintenance
• Maximizes reuse of common functionality across execution engines
9. Hive on Spark
• Hive on Spark, User Benefits
• Another seamless execution option (MR, Tez, Spark)
• Leverage Spark clusters already coming into use for ML, graph processing,
streaming, etc.
• Continued efficiency and performance improvements from the strong Spark
community.
10. High-Level Design
Common across engines:
• HiveQL syntax
• Tool Integrations (auditing plugins, authorization,
Drivers, Thrift clients, UDF, StorageHandler)
• Logical optimizations
Per-engine TaskCompiler: MapRedCompiler, TezCompiler, SparkCompiler
[Diagram: Hive Query -> Logical Op Tree -> TaskCompiler -> Task (MapRedTask, TezTask, SparkTask), each Task containing Work units (MapRedWork, TezWork, SparkWork)]
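A minimal sketch of this layering, under simplifying assumptions: the same logical operator list is handed to an engine-specific compile step, which wraps it in that engine's Task and Work types. The `Task`, `Work`, `split_at_reduce_sinks` and `compile_task` names are hypothetical and only mirror the boxes in the diagram. They are not Hive's actual classes or signatures, and real Work units are structured differently for each engine.

```python
from dataclasses import dataclass


@dataclass
class Work:
    kind: str        # e.g. "MapRedWork", "TezWork", "SparkWork"
    operators: list


@dataclass
class Task:
    kind: str        # e.g. "MapRedTask", "TezTask", "SparkTask"
    works: list


def split_at_reduce_sinks(op_tree):
    """Cut a linear operator list into segments, ending each at a ReduceSink."""
    segments, current = [], []
    for op in op_tree:
        current.append(op)
        if op == "ReduceSink":
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


# Illustrative engine label -> (Task kind, Work kind).
ENGINES = {
    "mr": ("MapRedTask", "MapRedWork"),
    "tez": ("TezTask", "TezWork"),
    "spark": ("SparkTask", "SparkWork"),
}


def compile_task(op_tree, engine):
    """Engine-specific step: wrap the shared operator segments in Work/Task types."""
    task_kind, work_kind = ENGINES[engine]
    works = [Work(work_kind, ops) for ops in split_at_reduce_sinks(op_tree)]
    return Task(task_kind, works)


op_tree = ["TableScan", "Filter", "Select", "GroupBy", "ReduceSink",
           "GroupBy", "Select", "FileOutput"]
spark_task = compile_task(op_tree, "spark")
mr_task = compile_task(op_tree, "mr")
```

Everything before `compile_task` is shared, which is the design point of the slide: only the last step knows which engine will run the plan.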
11. Simple Example
Hive Query:
SELECT COUNT(*) FROM status_updates WHERE
ds = '2014-10-01' GROUP BY region;
Operator Tree:
TableScan (status_updates) -> Filter (ds='2014-10-01') -> Select (region) -> Group-By (count) -> Select
• The Group-By (GBY) triggers a reduce boundary
12. Simple Example
MapRed Work Tree
• Map->Reduce
Mapper: TableScan -> Filter -> Select -> Group-By -> ReduceSink
ShuffleSort
Reducer: GroupBy -> Select -> FileOutput
13. Simple Example
Spark Work Tree:
• RDD Chain
mapPartition(): TableScan -> Filter -> Select -> Group-By -> ReduceSink
groupBy() (no sorting)
mapPartition(): GroupBy -> Select -> FileOutput
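The "no sorting" point can be illustrated with a small comparison. Both functions below are hypothetical helpers written for this sketch. One groups by sorting all pairs first, as an MR-style shuffle-sort does. The other buckets pairs in a hash map, which is enough for a count and needs no ordering.

```python
from collections import defaultdict
from itertools import groupby


def count_with_shuffle_sort(pairs):
    """MR style: sort all pairs by key, then reduce each run of equal keys."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    }


def count_with_hash_group(pairs):
    """Grouping without a sort: bucket values by key in a hash map."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return {key: sum(values) for key, values in buckets.items()}


# Made-up (region, 1) pairs, as a map side might emit for COUNT(*) ... GROUP BY region.
pairs = [("us", 1), ("eu", 1), ("us", 1), ("asia", 1), ("eu", 1), ("us", 1)]
sorted_counts = count_with_shuffle_sort(pairs)
hashed_counts = count_with_hash_group(pairs)
```

Both approaches give the same counts. The difference is only that the first one pays for a full sort that the aggregation itself does not require.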
14. Join Example
Hive Query:
SELECT * FROM
(SELECT * FROM src WHERE src.key < 10) src1
JOIN
(SELECT * FROM src WHERE src.key < 10) src2
ORDER BY src1.key;
• Operator Tree:
TableScan -> Filter -> Select (src1)
TableScan -> Filter -> Select (src2)
Join -> Select -> Sort -> Select
• Join/Sort trigger Reduce boundary
15. Join Example
MapRed Work Tree
• 2 MapReduce Works
Work 1:
Map: TableScan -> Filter -> Select -> ReduceSink
Map: TableScan -> Filter -> Select -> ReduceSink
ShuffleSort
Reduce: Join -> Select -> FileOutput
Disk IO (HDFS)
Work 2:
Map: TableScan -> ReduceSink (Sort)
ShuffleSort
Reduce: Select -> FileOutput
16. Join Example
Spark Work Tree:
• RDD Transform Chain
mapPartition(): TableScan -> Filter -> Select -> ReduceSink
mapPartition(): TableScan -> Filter -> Select -> ReduceSink
union()
Partition/Sort()
mapPartition(): Join -> Select -> ReduceSink
sortBy()
mapPartition(): Select -> FileOutput
• No spill to disk
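The shape of this chain can be mimicked on one machine. The sketch below is a toy model, not Hive or Spark code: the `src` rows and the `scan_filter` helper are made up, and because the query as written has no ON clause, the sketch assumes a cross join, so every row falls into a single join group.

```python
from collections import defaultdict

# Made-up (key, value) rows standing in for the src table.
src = [(k, f"val_{k}") for k in (12, 3, 8, 3, 15, 0)]


def scan_filter(rows, tag):
    """TableScan -> Filter (key < 10) -> Select, tagging each row with its side."""
    return [(tag, row) for row in rows if row[0] < 10]


# union(): both map-side branches feed the same shuffle.
unioned = scan_filter(src, "src1") + scan_filter(src, "src2")

# Partition step: with no join condition, all rows share one join key (assumption).
groups = defaultdict(lambda: {"src1": [], "src2": []})
for tag, row in unioned:
    groups[None][tag].append(row)

# Join: pair every src1 row with every src2 row inside each group.
joined = [
    (left, right)
    for sides in groups.values()
    for left in sides["src1"]
    for right in sides["src2"]
]

# sortBy(): ORDER BY src1.key.
result = sorted(joined, key=lambda pair: pair[0][0])
```

All intermediate values here stay in memory and flow straight into the next step, which mirrors the chain of transformations in the work tree.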
17. Improvements to Spark
• Reduce-side join: SPARK-2978
• Spark had group() and sort(), but no combined partition+sort like the MR-style shuffle-sort.
• Can help other applications migrate from MapReduce to Spark
• Remote SparkContext (pushed down to the AM)
• Concurrent SparkContexts are not allowed in the client application process.
• SparkContext is heavyweight
• Spark monitoring APIs
• Elastic scaling of Spark application: SPARK-3174
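To make the partition+sort idea from the reduce-side join item concrete, here is a minimal single-machine sketch. The `partition_and_sort` function and its modulo partitioner are hypothetical and assume integer keys. The sketch shows only the semantics of an MR-style shuffle-sort, not how Spark implements it.

```python
def partition_and_sort(pairs, num_partitions):
    """Route each (key, value) pair to a partition by key, then sort each partition by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # Toy partitioner for integer keys: equal keys always share a partition.
        partitions[key % num_partitions].append((key, value))
    return [sorted(part, key=lambda kv: kv[0]) for part in partitions]


pairs = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (2, "b2"), (3, "c")]
parts = partition_and_sort(pairs, 2)
```

Because equal keys land in the same partition and arrive in key order, a downstream join or grouping step can process each partition as a sorted stream.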
18. Community
• Thanks to contributors from many organizations.
• Follow our progress on HIVE-7292
• Thank you!