Hive
Fangdd (房多多) - Resale Housing Business Unit - Product & Technology Center - Data Mining Product Center
李栓柱 Samchu Li
2016-06-12
Hive SQL Parsing
• Hive SQL parsing
• Based on the paper "Hive – A Warehousing Solution over a Map-Reduce Framework"
• The paper is a bit dated by now
Hive Architecture / Exec Flow
Main components: Client, Driver, Compiler, Metastore, Hadoop
- This is the overview.
- Clients are the user interfaces: CLI, WebUI, and APIs such as JDBC and ODBC.
- The Metastore is the system catalog that holds the schema information for Hive tables.
- The Driver manages the lifecycle of a HiveQL statement through compilation, optimization, and execution.
- The Compiler transforms HiveQL into operators, applying a set of optimizers along the way.
Hive Workflow:
- Operators are Hive's minimum processing units.
- Each operator is executed either as an HDFS operation or as part of an M/R job.
- The compiler converts HiveQL into sets of operators.
- The point is: Hive converts our request (HiveQL) into operators that are executed as M/R jobs.
Hive Workflow - Operators
Operators Descriptions
TableScanOperator Scans data from a Hive table
ReduceSinkOperator Builds the <key,value> pairs sent to the reducer side
JoinOperator Joins two data sets
SelectOperator Selects the output columns
FileSinkOperator Builds the result data and writes it to files
FilterOperator Filters the input data
GroupByOperator GROUP BY clause
MapJoinOperator /*+ mapjoin(t) */
LimitOperator LIMIT clause
UnionOperator UNION clause
… …
• For M/R processing, Hive uses ExecMapper and ExecReducer
• Hive’s M/R jobs are carried out by ExecMapper and ExecReducer
• They read the plan and process it dynamically
• Processing happens in one of 2 modes
• Local processing mode
• Distributed processing mode
Hive Workflow
Hive Workflow – 2 modes
• Local mode
• Hive forks the process with the hadoop command
• The plan.xml is generated on just one node, and that single node processes it
• Distributed mode
• Hive submits the job to the existing JobTracker
• The plan information is shipped via the DistributedCache and
• processed on multiple nodes
Hive Workflow - Compiler
• Compiler: How to process HiveQL
“Plumbing” of the HIVE compiler
Parser
• Converts the query into a parse tree representation
Semantic Analyzer
• Converts the parse tree into a block-based internal query representation
• Retrieves schema information for the input tables from the Metastore and verifies column names and so on
Logical Plan Generator
• Converts the internal query representation into a logical plan consisting of a tree of logical operators
“Plumbing” of the HIVE compiler – continued
Logical Optimizer
• Rewrites the plan into a more optimized plan
• The logical optimizer performs multiple passes over the logical plan and rewrites it in several ways. For example, it combines multiple joins that share the same join key into a single multi-way JOIN handled by a single M/R job.
Physical Plan Generator
• Converts the logical plan into physical plans (M/R jobs)
Physical Optimizer
• Chooses the join strategy
Compiler Overview
Hive QL → [Parser] → AST → [Semantic Analyzer] → QB → [Logical Plan Gen.] → Operator Tree → [Logical Optimizer] → Operator Tree → [Physical Plan Gen.] → Task Tree → [Physical Optimizer] → Task Tree
Parser: Hive QL → AST
INSERT OVERWRITE TABLE access_log_temp2
SELECT a.user, a.prono, p.maker, p.price
FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
TOK_QUERY
+ TOK_FROM
+ TOK_JOIN
+ TOK_TABREF
+ TOK_TABNAME
+ "access_log_hbase"
+ a
+ TOK_TABREF
+ TOK_TABNAME
+ "product_hbase"
+ "p"
+ "="
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "access_log_hbase"
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "prono“
+ TOK_INSERT
+ TOK_DESTINATION
+ TOK_TAB
+ TOK_TABNAME
+ "access_log_temp2"
+ TOK_SELECT
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "user"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "prono"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "maker"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "price"
Semantic Analyzer (1/2)
+ TOK_FROM
+ TOK_JOIN
+ TOK_TABREF
+ TOK_TABNAME
+ "access_log_hbase"
+ a
+ TOK_TABREF
+ TOK_TABNAME
+ "product_hbase"
+ "p"
+ "="
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "access_log_hbase"
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "prono“
QB
AST ParseInfo
Join Node
+ TOK_JOIN
+ TOK_TABREF
…
+ TOK_TABREF
…
+ “=”
…
MetaData
Alias To Table Info
“a”=Table Info(“access_log_hbase”)
“p”=Table Info(“product_hbase”)
Semantic Analyzer (2/2)
+ TOK_DESTINATION
+ TOK_TAB
+ TOK_TABNAME
+ "access_log_temp2”
AST
QB
ParseInfo
Name To Destination Node
+ TOK_TAB
+ TOK_TABNAME
+"access_log_temp2”
Semantic Analyzer (2/2)
+ TOK_SELECT
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "user"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "prono"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "maker"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "price"
AST
QB
ParseInfo
Name To Select Node
+ TOK_SELECT
+ TOK_SELEXPR
…
+ TOK_SELEXPR
…
+ TOK_SELEXPR
…
+ TOK_SELEXPR
…
Logical Plan Generator (1/4)
QB
OP
Tree
TableScanOperator(“access_log_hbase”)
TableScanOperator(“product_hbase”)
MetaData
Alias To Table Info
“a”=Table Info(“access_log_hbase”)
“p”=Table Info(“product_hbase”)
Logical Plan Generator (2/4)
QB
ParseInfo
+ TOK_JOIN
+ TOK_TABREF
+ TOK_TABNAME
+ "access_log_hbase"
+ a
+ TOK_TABREF
+ TOK_TABNAME
+ "product_hbase"
+ "p"
+ "="
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "access_log_hbase"
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "prono“
OP
Tree
ReduceSinkOperator(“access_log_hbase”)
ReduceSinkOperator(“product_hbase”)
JoinOperator
Logical Plan Generator (3/4)
OP
Tree
SelectOperator
QB
ParseInfo
Name To Select Node
+ TOK_SELECT
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "user"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "prono"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "maker"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "price"
Logical Plan Generator (4/4)
OP
Tree
FileSinkOperator
QB
MetaData
Name To Destination Table Info
“insclause-0”=
Table Info(“access_log_temp2”)
Logical Plan Generator (result)
TableScanOperator TS_0 → ReduceSinkOperator RS_2
TableScanOperator TS_1 → ReduceSinkOperator RS_3
RS_2 + RS_3 → JoinOperator JOIN_4 → SelectOperator SEL_5 → FileSinkOperator FS_6
Logical Optimizer
Optimizer / Description
LineageGenerator: generates table-to-table lineage information
ColumnPruner: column pruning
PredicatePushDown: predicate pushdown; filter operations that involve only one table are pushed down to just after that table's TableScanOperator
PartitionPruner: partition pruning
PartitionConditionRemover: removes irrelevant condition predicates before partition pruning
SimpleFetchOptimizer: optimizes aggregate queries that have no GROUP BY expression
GroupByOptimizer: map-side aggregation
CorrelationOptimizer: exploits correlations in the query to merge correlated jobs
GroupByOptimizer: GROUP BY optimization
SamplePruner: sample pruning
MapJoinProcessor: if the user specifies a mapjoin hint, converts the ReduceSinkOperator into a MapSinkOperator
BucketMapJoinOptimizer: bucketed map join, extending the range of cases where a map join applies
SortedMergeBucketMapJoinOptimizer: sort-merge join
UnionProcessor: currently only marks the case where both subqueries are map-only tasks
JoinReorder: /*+ STREAMTABLE(A) */
ReduceSinkDeDuplication: if two ReduceSinkOperators share the same partition/sort columns, they are merged
Logical Optimizer (Predicate Push Down)
INSERT OVERWRITE TABLE access_log_temp2
SELECT a.user, a.prono, p.maker, p.price
FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono)
WHERE p.maker = 'honda';
INSERT OVERWRITE TABLE access_log_temp2
SELECT a.user, a.prono, p.maker, p.price
FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
Logical Optimizer (Predicate Push Down)
TableScanOperator
TS_1
TableScanOperator
TS_0
ReduceSinkOperator
RS_2
ReduceSinkOperator
RS_3
JoinOperator
JOIN_4
SelectOperator
SEL_6
FileSinkOperator
FS_7
Logical Optimizer (Predicate Push Down)
TableScanOperator
TS_1
TableScanOperator
TS_0
ReduceSinkOperator
RS_2
ReduceSinkOperator
RS_3
JoinOperator
JOIN_4
FilterOperator
FIL_5
(_col8 = 'honda')
SelectOperator
SEL_6
FileSinkOperator
FS_7
Logical Optimizer (Predicate Push Down)
TableScanOperator
TS_1
TableScanOperator
TS_0
ReduceSinkOperator
RS_2
ReduceSinkOperator
RS_3
JoinOperator
JOIN_4
FilterOperator
FIL_5
(_col8 = 'honda')
SelectOperator
SEL_6
FileSinkOperator
FS_7
FilterOperator
FIL_8
(maker = 'honda')
Physical Plan Generator
The operator tree is split into a tree of tasks:
• MapRedTask (Stage-1, root): TableScanOperator (TS_0, TS_1), ReduceSinkOperator (RS_2, RS_3), JoinOperator (JOIN_4), SelectOperator (SEL_5), FileSinkOperator (FS_6)
• MoveTask (Stage-0): LoadTableDesc, loading the result into the destination table
• StatsTask (Stage-2)
Physical Plan Generator (result)
MapRedTask (Stage-1/root)
Mapper
TableScanOperator
TS_1
TableScanOperator
TS_0
ReduceSinkOperator
RS_2
ReduceSinkOperator
RS_3
Reducer
JoinOperator
JOIN_4
SelectOperator
SEL_5
FileSinkOperator
FS_6
Physical Optimizer
Located under java/org/apache/hadoop/hive/ql/optimizer/physical/
Optimizer / Summary
MapJoinResolver: handles map joins
SkewJoinResolver: handles skewed joins
CommonJoinResolver: handles common joins
Vectorizer: switches Hive from row-at-a-time processing to batch processing, greatly improving instruction pipelining and cache utilization
SortMergeJoinResolver: works together with bucketing, similar to a merge sort
SamplingOptimizer: optimizer for parallel ORDER BY
Physical Optimizer (MapJoinResolver)
MapRedTask (Stage-1)
Mapper
TableScanOperator
TS_1
TableScanOperator
TS_0
MapJoinOperator
MAPJOIN_7
SelectOperator
SEL_5
FileSinkOperator
FS_6
SelectOperator
SEL_8
Physical Optimizer (MapJoinResolver)
MapRedTask (Stage-1)
Mapper
TableScanOperator
TS_1
MapJoinOperator
MAPJOIN_7
SelectOperator
SEL_5
FileSinkOperator
FS_6
SelectOperator
SEL_8
MapredLocalTask (Stage-7)
TableScanOperator
TS_0
HashTableSinkOperator
HASHTABLESINK_11
Join Strategies in Hive
There are generally two ways to handle a distributed join:
• Replication Join: replicate one of the tables to every node, so each node's split of the other table can be joined against a complete copy of that table (map-side join)
• Repartition Join: hash-redistribute both data sets by the join key, so each node handles the join-key values with the same hash, i.e. performs a local join (reduce-side join)
(Related classic techniques: simple hash join, grace hash join, hybrid grace hash join)
1. Common Join
2. Map Join
3. Auto MapJoin
4. Bucket Map Join
5. Bucket Sort Merge Map Join
6. Skew Join
Common Join - Shuffle Join
• Default choice
• Always works
• Worst case scenario
• Each process
• Reads from part of one of the tables
• Buckets and sorts on join key
• Sends one bucket to each reduce
• Works every time.
Map Join
• One table is small (eg. dimension table)
• Fits in memory
• Each process
• Reads the small table into an in-memory hash table
• Streams through part of the big file
• Joins each record against the hash table
• Very fast, but limited
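A minimal sketch of invoking a map join on the running example, assuming product_hbase is the small table (the MAPJOIN hint and the auto-conversion flag are two alternative ways to get there):
set hive.auto.convert.join = true;
SELECT /*+ MAPJOIN(p) */ a.user, a.prono, p.maker, p.price
FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);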
Optimized Map Join – HIVE-1293
Converting a Common Join into a Map Join
• Previous execution flow: Task A → CommonJoinTask → Task C
• Optimized execution flow: Task A → Conditional Task → Task C, where the Conditional Task holds one MapJoinLocalTask + MapJoinTask pair for each candidate small table (a, b, c, …) plus the original CommonJoinTask
Execution Time
SELECT * FROM SRC1 x JOIN SRC2 y ON x.key = y.key;
• At execution time the Conditional Task picks a branch:
• If table X is the big table, run the MapJoinLocalTask + MapJoinTask branch that streams X and hashes the other table
• If both tables are too big for a map join, fall back to the CommonJoinTask
Backup Task
• If the MapJoinLocalTask becomes memory bound while building the hash table, the CommonJoinTask is run as a backup task
Performance Bottleneck
• The Distributed Cache is the potential performance bottleneck
• A large hashtable file slows down propagation through the Distributed Cache
• Mappers sit waiting for the hashtable files from the Distributed Cache
• Remedy: compress and archive all the hashtable files into a tar file
Bucket Map Join
• Why:
• Total table/partition size is big, not good for mapjoin
• How:
• set hive.optimize.bucketmapjoin = true;
• 1. Works together with map join
2. All join tables are bucketized, and the big table's bucket count is a multiple of each small table's bucket count
3. Bucket columns == join columns (a DDL sketch follows below)
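A rough DDL sketch of a table layout that satisfies these conditions, matching the example on the next slide (column names are illustrative):
set hive.optimize.bucketmapjoin = true;
CREATE TABLE b (key INT, val STRING) CLUSTERED BY (key) INTO 2 BUCKETS; -- big table
CREATE TABLE a (key INT, val STRING) CLUSTERED BY (key) INTO 2 BUCKETS; -- small table
CREATE TABLE c (key INT, val STRING) CLUSTERED BY (key) INTO 1 BUCKETS; -- small table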
Bucket Map Join
SELECT /*+ MAPJOIN(a,c) */ a.*, b.*, c.*
FROM a JOIN b ON a.key = b.key
JOIN c ON a.key = c.key;
Tables a, b, and c are all bucketized by ‘key’; a has 2 buckets, b has 2, and c has 1
1. Mappers are spawned based on the big table b: Mapper 1 and Mapper 2 read bucket b1, Mapper 3 reads bucket b2
2. Only the matching buckets of all small tables are replicated onto each mapper (a1 + c1, a1 + c1, and a2 + c1 respectively)
Normally in production, there will be thousands of buckets!
Sort Merge Bucket (SMB) Join
• If both tables are:
• Sorted the same
• Bucketed the same
• And joining on the sort/bucket column
• Each process:
• Reads a bucket from each table
• Process the row with the lowest value
• Very efficient if applicable
Sort Merge Bucket (SMB) Join
• Why:
• No limit on file/partition/table size
• How:
• set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
• 1. Works together with bucket map join
2. Bucket columns == join columns == sort columns (see the DDL sketch below)
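A minimal DDL sketch of a table that qualifies for an SMB join (table and column names are illustrative; both sides of the join need the same bucketing and sort order on the join column):
CREATE TABLE access_log_smb (prono INT, user STRING)
CLUSTERED BY (prono) SORTED BY (prono ASC) INTO 32 BUCKETS;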
Sort Merge Bucket Map Join (Facebook)
(Example figure: sorted tables A, B, C holding rows such as 1/val_1, 3/val_3, 4/val_4, 5/val_5, 20/val_20, 23/val_23, 25/val_25)
• Small tables are read on demand
• Does NOT hold entire small tables in memory
• Can perform outer joins
Skew
• Skew is typical in real datasets
• A user complained that his job was slow
• He had 100 reduces
• 98 of them finished fast
• 2 ran really slow
• The key was a boolean...
Skew Join
• Join bottlenecked on the reducer who gets the skewed key
• set hive.optimize.skewjoin = true;
set hive.skewjoin.key = skew_key_threshold
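A sketch of concrete values for the two settings above (the threshold is the per-key row count beyond which a key is treated as skewed; the number shown is illustrative):
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = 100000;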
Skew Join
• Job 1: Table A join Table B runs as a normal reduce-side join; the non-skewed keys (K2, K3, …) are joined on Reducer 1, Reducer 2, …, while the rows of A and B carrying the skewed key K1 are not joined but written out to HDFS files (a-K1 and b-K1)
• Job 2: a follow-up map join of the spilled files a-K1 and b-K1
• Final results = output of Job 1 + output of Job 2
Skew Group by
• Two parameters can deal with skew caused by GROUP BY.
• The first is hive.map.aggr, whose default is already true: it enables a map-side combiner. So if your GROUP BY query only computes count(*), you will hardly see any skew effect; with count(distinct) some skew is still visible.
• The other parameter is hive.groupby.skewindata. With it, during the Reduce step keys are no longer routed so that all rows with the same value go to one reducer; they are distributed randomly, the reducers do a partial aggregation, and then an extra MR round aggregates those partial results into the final answer. It does roughly the same thing as hive.map.aggr, just on the Reduce side and at the cost of an extra job, so it is not particularly recommended; the improvement is usually small. (A sketch of both settings follows below.)
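A sketch of the two GROUP BY skew settings discussed above:
set hive.map.aggr = true; -- map-side partial aggregation (already the default)
set hive.groupby.skewindata = true; -- random shuffle plus an extra MR round for the final aggregation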
Case study
• Which of the following is faster?
• Select count(distinct col) from Tbl
• Select count(*) from (select distinct col from Tbl) t
- The first case:
- Maps send each value to the
reduce
- Single reduce counts them all
- The second case:
- Maps split up the values to
many reduces
- Each reduce generates its list
- Final job counts the size of
each list
- Singleton reduces are almost
always BAD
• Appendix: What does Explain show?
Appendix: What does Explain show?
hive> explain INSERT OVERWRITE TABLE access_log_temp2
> SELECT a.user, a.prono, p.maker, p.price
> FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM(TOK_JOIN (TOK_TABREF (TOK_TABNAME access_log_hbase) a)
(TOK_TABREF (TOK_TABNAME product_hbase) p) (= (. (TOK_TABLE_OR_COL a) prono) (.
(TOK_TABLE_OR_COL p) prono)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME
access_log_temp2))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) user))
(TOK_SELEXPR (. (TOK_TABLE_OR_COL a) prono)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL p)
maker)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL p) price)))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: prono
type: int
sort order: +
Map-reduce partition columns:
expr: prono
type: int
tag: 0
value expressions:
expr: user
type: string
expr: prono
type: int
p
TableScan
alias: p
Reduce Output Operator
key expressions:
expr: prono
type: int
sort order: +
Map-reduce partition columns:
expr: prono
type: int
tag: 1
value expressions:
expr: maker
type: string
expr: price
type: int
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col2}
1 {VALUE._col1} {VALUE._col2}
handleSkewJoin: false
outputColumnNames: _col0, _col2, _col6, _col7
Select Operator
expressions:
expr: _col0
type: string
expr: _col2
type: int
expr: _col6
type: string
expr: _col7
type: int
outputColumnNames: _col0, _col1, _col2, _col3
File Output Operator
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.access_log_temp2
Stage: Stage-0
Move Operator
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.access_log_temp2
Stage: Stage-2
Stats-Aggr Operator
Time taken: 0.1 seconds
hive>
ABSTRACT SYNTAX TREE:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
TableScan
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Select Operator
File Output Operator
Stage: Stage-0
Move Operator
Stage: Stage-2
Stats-Aggr Operator
Appendix: What does Explain show?
ABSTRACT SYNTAX TREE:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
TableScan
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Select Operator
File Output Operator
Stage: Stage-0
Move Operator
Stage: Stage-2
Stats-Aggr Operator
MapRedTask (Stage-1/root)
Mapper
TableScanOperator
TS_1
TableScanOperator
TS_0
ReduceSinkOperator
RS_2
ReduceSinkOperator
RS_3
Reducer
JoinOperator
JOIN_4
SelectOperator
SEL_5
FileSinkOperator
FS_6
≒
Move Task (Stage-0)
Stats Task (Stage-2)
Explain
• Hive doesn’t tell you what is
wrong
• Expects you to know.
• Explain tool provides query
plan
• Filters on input
• Numbers of jobs
• Numbers of maps and reduces
• What the jobs are sorting by
• What directories are they
reading or writing
Hive SQL Parsing
• Abstract syntax tree: produced by the parse method of org.apache.hadoop.hive.ql.parse.ParseDriver
• Call getToken().getType() on org.apache.hadoop.hive.ql.parse.ASTNode to identify each node, and recurse whenever a TOK_QUERY node is encountered
HiveQL Optimization
• Data Layout
• Data Format
• Joins
• Debugging
Data Layout – HDFS Characteristics
• Provides Distributed File System
• Very high aggregate bandwidth
• Extreme scalability (up to 100 PB)
• Self-healing storage
• Relatively simple to administer
• Limitations
• Can’t modify existing files
• Single writer for each file
• Heavy bias toward large files (> 100 MB)
Choices for Layout
• Partitions
• Top level mechanism for pruning
• Primary unit for updating tables(& schema)
• Directory per value of specified column
• Bucketing
• Hashed into a file, good for sampling
• Controls write parallelism
• Sort order
• The order the data is written within file
Example Hive Layout
• Directory
• Warehouse/$database/$table
• Partitioning
• /part1=$partValue/part2=$partValue
• Bucketing
• /$bucket_$attempt (e.g. 000000_0)
• Sort
• Each file is sorted within the file
Layout Guidelines
• Limit the number of partitions
• 1000 partitions is much faster than 10000
• Nested partitions are almost always wrong
• Gauge the number of buckets
• Calculate file size and keep big (200 ~ 500MB)
• Don’t forget number of files (Buckets * Parts)
• Layout related tables the same way
• Partition
• Bucket and sort order
Data Format
• Serde
• Input/Output (aka File) Format
• Primary Choices
• Text
• Sequence File
• RCFile
• ORC
Text Format
• Critical to pick a Serde
• Default: Ctrl-A (\001) between fields
• JSON – top level JSON record
• CSV
• Slow to read and write
• Can’t split compressed files
• Leads to huge maps
• Need to read/decompress all fields
Sequence File
• Traditional MapReduce binary file format
• Stores keys and values as classes
• Not a good fit for Hive, which has SQL types
• Hive always stores entire row as value
• Splittable but only by searching file
• Default block size is 1 MB
• Need to read and decompress all fields
RCFile
• Columns stored separately
• Read and decompress only needed ones
• Better Compression
• Columns stored as binary blobs
• Depends on metastore to supply types
• Larger blocks
• 4MB by default
• Still search file for split boundary
ORC(Optimized Row Columnar)
• Columns stored separately
• Knows types
• Uses type-specific encoders
• Stores statistics(min, max, sum, count)
• Has light-weight index
• Skip over blocks of rows that don’t matter
• Larger blocks
• 256 MB by default
• Has an index for block boundaries
ORC – File Layout
Comparison
• Compared with the RCFile format, ORC File has the following advantages:
(1) each task outputs only a single file, which reduces the load on the NameNode;
(2) support for various complex data types, e.g. datetime, decimal, and the compound types (struct, list, map, and union);
(3) lightweight index data stored inside the file;
(4) block-mode compression based on the data type: a. run-length encoding for integer columns; b. dictionary encoding for string columns;
(5) multiple independent RecordReaders can read the same file in parallel;
(6) files can be split without scanning for markers;
(7) the memory needed for reading and writing is bounded;
(8) metadata is stored with Protocol Buffers, so columns can be added and removed.
Using ORC
• CREATE TABLE ... STORED AS ORC
• ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
• SET hive.default.fileformat=Orc
• All ORC File parameters appear in the TBLPROPERTIES clause of the Hive QL statement (for example orc.compress, orc.compress.size, orc.stripe.size, orc.row.index.stride, orc.create.index):
Using ORC – Example
create table Addresses (
name string,
street string,
city string,
state string,
zip int
) stored as orc tblproperties ("orc.compress"="NONE");
Vectorized Query Execution
• The Hive query execution engine currently processes one row at a time.
A single row of data goes through all the operators before the next
row can be processed. This mode of processing is very inefficient in
terms of CPU usage.
• This involves long code paths and significant metadata interpretation in
the inner loop of execution. Vectorized query execution streamlines
operations by processing a block of 1024 rows at a time. Within the
block, each column is stored as a vector (an array of a primitive data
type). Simple operations like arithmetic and comparisons are done by
quickly iterating through the vectors in a tight loop, with no or very few
function calls or conditional branches inside the loop. These loops
compile in a streamlined way that uses relatively few instructions and
finishes each instruction in fewer clock cycles, on average, by effectively
using the processor pipeline and cache memory.
Vectorized Query Execution - USAGE
• ORC format
• set hive.vectorized.execution.enabled = true;
• Vectorized execution is off by default, so your queries only utilize
it if this variable is turned on. To disable vectorized execution and
go back to standard execution, do the following:
• set hive.vectorized.execution.enabled = false;
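A minimal end-to-end sketch (the ORC table name is illustrative; vectorized execution applies to the ORC data here):
CREATE TABLE access_log_orc STORED AS ORC AS SELECT * FROM access_log_temp2;
set hive.vectorized.execution.enabled = true;
SELECT maker, COUNT(*) FROM access_log_orc WHERE price > 100 GROUP BY maker;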
Vectorized Query Execution - USAGE
• The following expressions can be vectorized when used on supported types:
• arithmetic: +, -, *, /, %
• AND, OR, NOT
• comparisons <, >, <=, >=, =, !=, BETWEEN, IN ( list-of-constants ) as filters
• Boolean-valued expressions (non-filters) using AND, OR, NOT, <, >, <=, >=, =, !=
• IS [NOT] NULL
• all math functions (SIN, LOG, etc.)
• string functions SUBSTR, CONCAT, TRIM, LTRIM, RTRIM, LOWER, UPPER, LENGTH
• type casts
• Hive user-defined functions, including standard and generic UDFs
• date functions (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, UNIX_TIMESTAMP)
• the IF conditional expression
Vectorized Query Execution – USAGE: UDF support
• User-defined functions are supported using a backward
compatibility bridge, so although they do run vectorized, they
don't run as fast as optimized vector implementations of built-in
operators and functions. Vectorized filter operations are evaluated
left-to-right, so for best performance, put UDFs on the right in an
ANDed list of expressions in the WHERE clause. E.g., use
• column1 = 10 and myUDF(column2) = "x"
Compression
• Need to pick level of compression
• None
• LZO or Snappy – fast but sloppy
• Best for temporary tables
• ZLIB – slow and complete
• Best for long term storage
Query Optimization - Map phase
• Map-phase optimization is mainly about choosing a suitable number of map tasks, which is driven by the split size:
• split_size = max[${mapred.min.split.size}, min(${dfs.block.size}, ${mapred.max.split.size})], and num_map_tasks ≈ total_input_size / split_size
• mapred.min.split.size is the minimum split size.
• mapred.max.split.size is the maximum split size.
• dfs.block.size is the HDFS block size.
• dfs.block.size is usually a fixed, already-configured value, and Hive itself cannot see this parameter.
• In Hive the default for min is 1 B and the default for max is 256 MB.
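A worked example under the stated defaults (numbers are illustrative): with min = 1 B, max = 256 MB and dfs.block.size = 128 MB, split_size = max[1 B, min(128 MB, 256 MB)] = 128 MB, so a 10 GB input is handled by roughly 10 GB / 128 MB = 80 map tasks.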
Query Optimization - Reduce phase
• Reduce-phase optimization is likewise mainly about choosing a suitable number of reduce tasks, similar to the Map-phase optimization above.
• 1. mapred.reduce.tasks: specify the number of reducers directly
• 2. num_reduce_tasks = min[${hive.exec.reducers.max}, (${input.size} / ${hive.exec.reducers.bytes.per.reducer})]
• The number of reducers is derived from the input size: hive.exec.reducers.bytes.per.reducer defaults to 1 GB, and the reducer count cannot exceed an upper-bound parameter whose default is 999. So you can adjust hive.exec.reducers.bytes.per.reducer to control the number of reducers.
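A worked example under the defaults (numbers are illustrative): for a 50 GB input, num_reduce_tasks = min[999, 50 GB / 1 GB] = 50; raising hive.exec.reducers.bytes.per.reducer to 5 GB would cut that to 10 reducers.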
Optimization between Map and Reduce (spill, copy, sort phases)
• Spill and Sort
• In the spill phase, there may not be enough memory to sort all the data in one pass, so partially sorted data has to be written out to disk; this is called a spill, and the spilled files are merged at the end. If spills occur, you can enlarge the mapper output buffer with io.sort.mb to avoid them, and raise io.sort.factor so a single pass can merge more files. When tuning, weigh the time cost of spilling against the time cost of merging, and be careful not to exhaust memory (io.sort.mb is counted inside the map task's memory). The Reduce-side merge can use io.sort.factor in the same way. In general these two parameters rarely need tuning unless you clearly know this is the bottleneck. (See the sketch below.)
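A sketch of the spill/merge knobs mentioned above (values are illustrative, not recommendations):
set io.sort.mb = 200; -- mapper output buffer in MB, counted inside the map task's memory
set io.sort.factor = 50; -- number of spill files merged in one pass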
Optimization between Map and Reduce (spill, copy, sort phases)
• Copy
• The copy phase transfers files from the Map side to the Reduce side. By default the reducers start copying once 5% of the maps have completed, which can waste resources: once started, a reducer occupies its slot and just waits until all maps have finished and all data has been collected before it can do anything further. So it is often better to start the reducers only after more maps have completed; that ratio is set with mapred.reduce.slowstart.completed.maps, whose default is 5%. If this slows down the Reduce-side copy too much, increase the number of copy threads instead: tasktracker.http.threads controls how many server-side threads the Map side uses to serve data, and mapred.reduce.parallel.copies controls how many maps a reducer (the client side) pulls from in parallel. Tune the two together so that the server side can keep up with the client side's requests. (See the sketch below.)
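A sketch of the copy-phase knobs mentioned above (values are illustrative):
set mapred.reduce.slowstart.completed.maps = 0.8; -- start reducers only after 80% of the maps finish
set mapred.reduce.parallel.copies = 10; -- how many maps one reducer pulls from in parallel
-- tasktracker.http.threads is a TaskTracker-level (server-side) setting, normally changed in the cluster configuration rather than per query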
Other file optimizations - the small-files problem
• The small-files problem is already handled fairly well in current Hive environments: with the default configuration, many small input files are automatically merged and handed to one map, and if the output files are small an extra merge round is run on the output.
• Solutions:
• 1. Input merging, i.e. merge small files before the map phase
• 2. Output merging, i.e. merge small files when writing the results
Hive small files – input merging
• -- Maximum input size per map; determines the number of files after merging
• set mapred.max.split.size=256000000;
• -- Minimum split size on a single node; determines whether files on different data nodes need to be merged
• set mapred.min.split.size.per.node=100000000;
• -- Minimum split size within one rack (switch); determines whether files on different racks need to be merged
• set mapred.min.split.size.per.rack=100000000;
• -- Merge small files before the map phase runs
• set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Hive small files – output merging
• hive.merge.mapfiles: merge output files after a map-only job; default true
• hive.merge.mapredfiles: merge output files after a map-reduce job; default false
• hive.merge.size.per.task: size of each file after merging; default 256000000
• hive.merge.smallfiles.avgsize: average output file size below which the merge is triggered; default 16000000
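A sketch of turning on output merging with the default thresholds listed above:
set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true;
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 16000000;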
Handling compressed files
• When the results are stored in a compressed format and you want to fix the small-files problem: merging before the Map input puts no restriction on the output storage format, but output merging only works together with SequenceFile storage, otherwise the files cannot be merged. Example:
• set mapred.output.compression.type=BLOCK;
• set hive.exec.compress.output=true;
• set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
• set hive.merge.smallfiles.avgsize=100000000;
• drop table if exists dw_stage.zj_small;
• create table dw_stage.zj_small
• STORED AS SEQUENCEFILE
• as select *
• from dw_db.dw_soj_imp_dtl
• where log_dt = '2014-04-14'
• and paid like '%baidu%' ;
Using HAR archives
• Hadoop's archive (HAR) file format is another way to tackle the small-files problem, and Hive supports it natively:
• set hive.archive.enabled=true;
• set hive.archive.har.parentdir.settable=true;
• set har.partfile.size=1099511627776;
• ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12');
• ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12');
• If the table is not partitioned, you can create it as an external table and point its location at a har:// URI instead.
Job optimization
• 1. Job execution mode
• 2. JVM reuse
• 3. Indexes
• 4. Join algorithms
• 5. Data skew
Job execution mode
• A Hadoop MapReduce job can run in 3 modes: local, pseudo-distributed, and fully distributed. Local and pseudo-distributed modes are usually presented as something you only use for single-machine development while first learning Hadoop. In practice, though, for jobs that process very little data, launching a real distributed job wastes a lot of resources while the actual compute time is tiny. In such cases you should run the MR job in local mode: no distributed job is launched and execution is much faster. As a rule of thumb, a distributed job rarely finishes in under 20 s no matter how small the data is, while local MR mode can return a result in about 10 s.
• Three parameters control this. The first is hive.exec.mode.local.auto; setting it to true enables automatic local MR mode. That alone is not enough: the number of input files and the input size must also be small, which is governed by hive.exec.mode.local.auto.tasks.max and hive.exec.mode.local.auto.inputbytes.max, with defaults of 4 and 128 MB. In other words, by default local MR mode kicks in when the map phase handles no more than 4 files whose total size is under 128 MB. (See the sketch below.)
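A sketch of the three parameters with their default thresholds:
set hive.exec.mode.local.auto = true;
set hive.exec.mode.local.auto.tasks.max = 4;
set hive.exec.mode.local.auto.inputbytes.max = 134217728; -- 128 MB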
JVM reuse
• Normally a JVM started by MapReduce exits after finishing one task. If the tasks are very short but the JVM has to be started many times (for example, counting over a very large data set), JVM startup becomes a sizeable overhead. In that situation you can use the JVM reuse parameter:
• set mapred.job.reuse.jvm.num.tasks = 5;
• It lets one JVM run several tasks before exiting, which saves a fair amount of JVM startup time.
Indexes
• See the later sections
Joins
• See the HQL compilation walkthrough above
Overall SQL optimization
• 1. Running jobs in parallel
• Inter-job parallelism is enabled with hive.exec.parallel; set it to true. The default degree of parallelism is 8, i.e. at most 8 jobs of one SQL statement run in parallel. For a higher degree, raise hive.exec.parallel.thread.number, but avoid values so large that the query hogs resources.
• 2. Reducing the number of jobs
• Example: count the users of a site who have visited both page a and page b
select count(*)
from
(select distinct user_id
from logs where page_name = 'a') a
join
(select distinct user_id
from logs where page_name = 'b') b
on a.user_id = b.user_id;
Overall SQL optimization – the same question rewritten as a single job:
select count(*)
from (
select user_id
from logs
group by user_id
having (count(case when page_name = 'a' then 1 end) > 0
and count(case when page_name = 'b' then 1 end) > 0)
) t;
Indexed Hive
• Hive Indexing
• Provides key-based data view
• Keys data duplicated
• Storage layout favors search & lookup performance
• Provided better data access for certain operations
• A cheaper alternative to full data scans!
What does the index look like?
• An index is a table with 3 columns
• The data in the index table looks like (indexed key value, _bucketname, _offsets): the HDFS file containing the rows and the offsets of the matching rows within it
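A minimal sketch of creating and building a compact index on the running example table (the index name is illustrative):
CREATE INDEX access_log_prono_idx
ON TABLE access_log_temp2 (prono)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX access_log_prono_idx ON access_log_temp2 REBUILD;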
Hive index in HQL
• SELECT (mapping, projection, association, given key, fetch value)
• WHERE (filters on keys)
• GROUP BY (grouping on keys)
• JOIN (join key as index key)
• Indexes have high potential for accelerating wide range of queries
Hive Index
• Index as Reference
• Index as Data
• Here takes the index as data as the demonstration
• Uses Query Rewrite technique to transform queries on base table to
index table
• Limited applicability currently, but technique itself has wide potential
• Also a very quick way to demonstrate importance of index for
performance
Indexes and Query Rewrites
• GROUP BY, aggregation
• Index as Data
• Group By Key = Index Key
• Query rewritten to use indexes, but still a valid query (nothing special in
it!)
(Example slides, figures omitted: aggregation example, WHERE on Func(key), histogram query, year-on-year query)
Why index performs better?
• Reducing data increases I/O efficiency
• Exploiting storage layout optimization
• e.g. GROUP BY:
• Sort + agg
• Hash & agg
• Sort step already in index
• Parallelization
• Process the index data in the same manner as base table, distribute the
processing across nodes
• Scalable
Index Design
Hive compiler
Query Rewrite Engine
Hive MetaStore ER diagram
BUCKETING_COLS
SD_ID BIGINT(20)
BUCKET_COL_NAME VARCHAR(256)
INTEGER_IDX INT(11)
Indexes
COLUMNS
SD_ID BIGINT(20)
COMMENT VARCHAR(256)
COLUMN_NAME VARCHAR(128)
TYPE_NAME VARCHAR(4000)
INTEGER_IDX INT(11)
Indexes
DATABASE_PARAMS
DB_ID BIGINT(20)
PARAM_KEY VARCHAR(180)
PARAM_VALUE VARCHAR(4000)
Indexes
DBS
DB_ID BIGINT(20)
DESC VARCHAR(4000)
DB_LOCATION_URI VARCHAR(4000)
NAME VARCHAR(128)
Indexes
DB_PRIVS
DB_GRANT_ID BIGINT(20)
CREATE_TIME INT(11)
DB_ID BIGINT(20)
GRANT_OPTION SMALLINT(6)
GRANTOR VARCHAR(128)
GRANTOR_TYPE VARCHAR(128)
PRINCIPAL_NAME VARCHAR(128)
PRINCIPAL_TYPE VARCHAR(128)
DB_PRIV VARCHAR(128)
Indexes
GLOBAL_PRIVS
USER_GRANT_ID BIGINT(20)
CREATE_TIME INT(11)
GRANT_OPTION SMALLINT(6)
GRANTOR VARCHAR(128)
GRANTOR_TYPE VARCHAR(128)
PRINCIPAL_NAME VARCHAR(128)
PRINCIPAL_TYPE VARCHAR(128)
USER_PRIV VARCHAR(128)
Indexes
IDXS
INDEX_ID BIGINT(20)
CREATE_TIME INT(11)
DEFERRED_REBUILD BIT(1)
INDEX_HANDLER_CLASS VARCHAR(4000)
INDEX_NAME VARCHAR(128)
INDEX_TBL_ID BIGINT(20)
LAST_ACCESS_TIME INT(11)
ORIG_TBL_ID BIGINT(20)
SD_ID BIGINT(20)
Indexes
INDEX_PARAMS
INDEX_ID BIGINT(20)
PARAM_KEY VARCHAR(256)
PARAM_VALUE VARCHAR(4000)
Indexes
PARTITION_KEYS
TBL_ID BIGINT(20)
PKEY_COMMENT VARCHAR(4000)
PKEY_NAME VARCHAR(128)
PKEY_TYPE VARCHAR(767)
INTEGER_IDX INT(11)
Indexes
ROLES
ROLE_ID BIGINT(20)
CREATE_TIME INT(11)
OWNER_NAME VARCHAR(128)
ROLE_NAME VARCHAR(128)
Indexes
ROLE_MAP
ROLE_GRANT_ID BIGINT(20)
ADD_TIME INT(11)
GRANT_OPTION SMALLINT(6)
GRANTOR VARCHAR(128)
GRANTOR_TYPE VARCHAR(128)
PRINCIPAL_NAME VARCHAR(128)
PRINCIPAL_TYPE VARCHAR(128)
ROLE_ID BIGINT(20)
Indexes
SDS
SD_ID BIGINT(20)
INPUT_FORMAT VARCHAR(4000)
IS_COMPRESSED BIT(1)
LOCATION VARCHAR(4000)
NUM_BUCKETS INT(11)
OUTPUT_FORMAT VARCHAR(4000)
SERDE_ID BIGINT(20)
Indexes
SD_PARAMS
SD_ID BIGINT(20)
PARAM_KEY VARCHAR(256)
PARAM_VALUE VARCHAR(4000)
Indexes
SEQUENCE_TABLE
SEQUENCE_NAME VARCHAR(255)
NEXT_VAL BIGINT(20)
Indexes
SERDES
SERDE_ID BIGINT(20)
NAME VARCHAR(128)
SLIB VARCHAR(4000)
Indexes
SERDE_PARAMS
SERDE_ID BIGINT(20)
PARAM_KEY VARCHAR(256)
PARAM_VALUE VARCHAR(4000)
Indexes
SORT_COLS
SD_ID BIGINT(20)
COLUMN_NAME VARCHAR(128)
ORDER INT(11)
INTEGER_IDX INT(11)
Indexes
TABLE_PARAMS
TBL_ID BIGINT(20)
PARAM_KEY VARCHAR(256)
PARAM_VALUE VARCHAR(4000)
Indexes
TBLS
TBL_ID BIGINT(20)
CREATE_TIME INT(11)
DB_ID BIGINT(20)
LAST_ACCESS_TIME INT(11)
OWNER VARCHAR(767)
RETENTION INT(11)
SD_ID BIGINT(20)
TBL_NAME VARCHAR(128)
TBL_TYPE VARCHAR(128)
VIEW_EXPANDED_TEXT MEDIUMTEXT
VIEW_ORIGINAL_TEXT MEDIUMTEXT
Indexes
TBL_PRIVS
TBL_GRANT_ID BIGINT(20)
CREATE_TIME INT(11)
GRANT_OPTION SMALLINT(6)
GRANTOR VARCHAR(128)
GRANTOR_TYPE VARCHAR(128)
PRINCIPAL_NAME VARCHAR(128)
PRINCIPAL_TYPE VARCHAR(128)
TBL_PRIV VARCHAR(128)
TBL_ID BIGINT(20)
Indexes
Reference
• https://cwiki.apache.org/confluence/display/Hive/DesignDocs
• Facebook – Hive Join Strategies, Hadoop Summit 2011 (Liyin Tang, Namit Jain)
• Indexed Hive – Prafulla Tekawade, Nikhil Deshpande
• Internal Hive – http://www.slideshare.net/recruitcojp/internal-hive
• Hive SQL compilation process: http://tech.meituan.com/hive-sql-to-mapreduce.html
• MonetDB/X100: Hyper-Pipelining Query Execution. 2005, Peter Boncz, Marcin Zukowski, Niels Nes
• YSmart: Yet Another SQL-to-MapReduce Translator, Rubao Lee, Tian Luo, et al.

More Related Content

What's hot

Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldDataWorks Summit
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview EMC
 
Hp vertica certification guide
Hp vertica certification guideHp vertica certification guide
Hp vertica certification guideneinamat
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0enissoz
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetupt3rmin4t0r
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 
Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Ed Kohlwey
 
Power JSON with PostgreSQL
Power JSON with PostgreSQLPower JSON with PostgreSQL
Power JSON with PostgreSQLEDB
 
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase Cloudera, Inc.
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveWill Du
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handyPraveen Sripati
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperDataWorks Summit
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRPivotalOpenSourceHub
 

What's hot (19)

Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview
 
Hp vertica certification guide
Hp vertica certification guideHp vertica certification guide
Hp vertica certification guide
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0
 
Mar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBaseMar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBase
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetup
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 
Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?
 
Power JSON with PostgreSQL
Power JSON with PostgreSQLPower JSON with PostgreSQL
Power JSON with PostgreSQL
 
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
 

Similar to Hive_p

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
OGSA-DAI DQP: A Developer's View
OGSA-DAI DQP: A Developer's ViewOGSA-DAI DQP: A Developer's View
OGSA-DAI DQP: A Developer's ViewBartosz Dobrzelecki
 
An Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureAn Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureDataWorks Summit
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureDataWorks Summit/Hadoop Summit
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizonArtem Ervits
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
 
The Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseThe Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseDataWorks Summit
 
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteA Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteSalesforce Engineering
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in HiveDataWorks Summit
 
GPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServiceGPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServicePivotalOpenSourceHub
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?DataWorks Summit
 
Presentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cPresentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cRonald Francisco Vargas Quesada
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!HostedbyConfluent
 
MySQL Tech Tour 2015 - 5.7 Whats new
MySQL Tech Tour 2015 - 5.7 Whats newMySQL Tech Tour 2015 - 5.7 Whats new
MySQL Tech Tour 2015 - 5.7 Whats newMark Swarbrick
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAPEDB
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresJitendra Singh
 
Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015Chef
 

Similar to Hive_p (20)

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
OGSA-DAI DQP: A Developer's View
OGSA-DAI DQP: A Developer's ViewOGSA-DAI DQP: A Developer's View
OGSA-DAI DQP: A Developer's View
 
An Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureAn Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, Future
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present Future
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
The Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseThe Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBase
 
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteA Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in Hive
 
GPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServiceGPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a Service
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Presentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cPresentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12c
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!
 
MySQL Tech Tour 2015 - 5.7 Whats new
MySQL Tech Tour 2015 - 5.7 Whats newMySQL Tech Tour 2015 - 5.7 Whats new
MySQL Tech Tour 2015 - 5.7 Whats new
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and Underscores
 
Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015
 

Hive_p

  • 2. HiveSQL解析 • Hive SQL解析 • <<Hive – A warehousing Solution over a Map-Reduce Framework>> • 有点老了
  • 3. Hive Architecture / Exec Flow 6/13/16 HIVE - A warehouse solution over Map Reduce Framework 3 Driver Compiler Hadoop Client Metastore -This is the overview - Clients are User Interfaces both CLI, WebUI and API likes JDBC and ODBC. - Metastore is system catalog which has the schema informaction for hive tables. - Dirver manages the lifecycle of HiveQL for compilation, optimization and execution. - Complier transforms HiveQL to Operators using some optimizers. Hive Workflow: - Hive has the operators which are minimum processing units. - The process of each operator is done with HDFS operation or M/R jobs. - The compiler converts HiveQL to the sets of operators - The point is : Hive converts our order(HiveQL) to operators which are made with M/R jobs.
  • 4. Hive Workflow - Operators Operators Descriptions TableScanOperator 扫描hive表数据 ReduceSinkOperator 创建将发送到Reducer端的<key,value>对 JoinOperator Join两份数据 SelectOperator 选择输出列 FileSinkOperator 建立结果数据,输出至文件 FilterOperator 过滤输入数据 GroupByOperator Group By 语句 MapJoinOperator /*+ mapjoin(t)*/ LimitOperator Limit语句 UnionOperator Union语句 … …
  • 5. • For M/R processing, Hive uses ExecMaper and ExecReducer • Hive’s M/R jobs are done by ExecMaper and ExecReducer • They read plans and process them dynamically • On processing, 2 modes • Local processing mode • Distributed processing mode Hive Workflow Driver Compiler Hadoop Client Metastore
  • 6. Hive Workflow – 2 modes • Local Mode • Hive fork the process with hadoop command • The plan.xml is made just on 1 and the single node process this • Distributed mode • Hive send the process to existing JobTracker • The information is housed on DistributedCache and • Processed on multi-nodes Driver Compiler Hadoop Client Metastore
  • 7. Hive Workflow - Compiler • Compiler: How to process HiveQL Driver Compiler Hadoop Client Metastore
  • 8. “Plumbing”of HIVE compiler Parser • Convert into Parse Tree Representation Semantic Analyzer • Convert into block-base internal query representation • retrieve schema information of the input table from metastore and verifies the column names and so on Logical Plan Generator • Convert into internal query representation to a logic plan consists of a tree of logical operators
  • 9. “Plumbing”of HIVE compiler – continued Logical Optimizer • Rewrite plans into more optimized plans • Logical optimizer perform multiple passes over logical plan and rewrites in several ways. For example, Combine multiple joins which share the join key into a single multi-way JOIN which is done by a single M/R job. Physical Plan Generator •Convert into physical plans(M/R jobs) Physical Optimizer • Adopt join strategy
  • 10. Compiler Overview 10 Semantic Analyzer Logical Plan Gen. Logical Optimizer Physical Plan Gen. Physical Optimizer Parser Hive QL AST Operator Tree QB Operator Tree Task Tree Task Tree
  • 11. Semantic Analyzer Logical Plan Gen. Logical Optimizer Physical Plan Gen. Physical Optimizer Parser Parser Hive QL AST INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono); TOK_QUERY + TOK_FROM + TOK_JOIN + TOK_TABREF + TOK_TABNAME + "access_log_hbase" + a + TOK_TABREF + TOK_TABNAME + "product_hbase" + "p" + "=" + "." + TOK_TABLE_OR_COL + "a" + "access_log_hbase" + "." + TOK_TABLE_OR_COL + "p" + "prono“ Hive QL AST + TOK_INSERT + TOK_DESTINATION + TOK_TAB + TOK_TABNAME + "access_log_temp2" + TOK_SELECT + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "user" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "prono" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "maker" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "price"
  • 12. Semantic Analyzer Logical Plan Gen. Logical Optimizer Physical Plan Gen. Physical Optimizer Parser Parser SQL AST INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono); TOK_QUERY + TOK_FROM + TOK_JOIN + TOK_TABREF + TOK_TABNAME + "access_log_hbase" + a + TOK_TABREF + TOK_TABNAME + "product_hbase" + "p" + "=" + "." + TOK_TABLE_OR_COL + "a" + "access_log_hbase" + "." + TOK_TABLE_OR_COL + "p" + "prono“ + TOK_INSERT + TOK_DESTINATION + TOK_TAB + TOK_TABNAME + "access_log_temp2" + TOK_SELECT + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "user" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "prono" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "maker" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "price" SQL AST 1 2 3
  • 13. Semantic Analyzer (1/2)
  • Subtree 1 (TOK_FROM / TOK_JOIN, with the two TOK_TABREF nodes and the "=" join condition) is recorded in the QB:
  • QB ParseInfo: Join Node = the TOK_JOIN subtree
  • QB MetaData: Alias To Table Info – "a" = Table Info("access_log_hbase"), "p" = Table Info("product_hbase")
  • 14. Semantic Analyzer (2/2)
  • Subtree 2 (TOK_DESTINATION / TOK_TAB / TOK_TABNAME "access_log_temp2") is recorded in the QB:
  • QB ParseInfo: Name To Destination Node = the TOK_TAB subtree for "access_log_temp2"
  • 15. Semantic Analyzer (2/2)
  • Subtree 3 (TOK_SELECT with four TOK_SELEXPR children: a.user, a.prono, p.maker, p.price) is recorded in the QB:
  • QB ParseInfo: Name To Select Node = the TOK_SELECT subtree
  • 16. Logical Plan Generator (1/4)
  • From QB MetaData (Alias To Table Info: "a" = Table Info("access_log_hbase"), "p" = Table Info("product_hbase")) the generator emits the scan operators:
  • TableScanOperator("access_log_hbase")
  • TableScanOperator("product_hbase")
  • 17. Logical Plan Generator (2/4)
  • From the QB ParseInfo Join Node (the TOK_JOIN subtree) it emits:
  • ReduceSinkOperator("access_log_hbase")
  • ReduceSinkOperator("product_hbase")
  • JoinOperator
  • 18. Logical Plan Generator (3/4)
  • From the QB ParseInfo Name To Select Node (the TOK_SELECT subtree) it emits:
  • SelectOperator
  • 19. Logical Plan Generator (4/4)
  • From QB MetaData Name To Destination Table Info ("insclause-0" = Table Info("access_log_temp2")) it emits:
  • FileSinkOperator
  • 20. Logical Plan Generator (result) – operator tree:
  • TableScanOperator TS_0 → ReduceSinkOperator RS_2
  • TableScanOperator TS_1 → ReduceSinkOperator RS_3
  • RS_2 and RS_3 → JoinOperator JOIN_4 → SelectOperator SEL_5 → FileSinkOperator FS_6
  • 21. Logical Optimizer – the main transformations:
  • LineageGenerator – generates table-to-table lineage
  • ColumnPruner – column pruning
  • PredicatePushDown – predicate pushdown: filters that involve only a single table are pushed down to sit right after its TableScanOperator
  • PartitionPruner – partition pruning
  • PartitionConditionRemover – removes irrelevant condition predicates before partition pruning
  • SimpleFetchOptimizer – optimizes aggregate queries with no GROUP BY expression
  • GroupByOptimizer – map-side aggregation / GROUP BY optimization
  • CorrelationOptimizer – exploits correlations within the query to merge correlated jobs
  • SamplePruner – sample pruning
  • MapJoinProcessor – if the user specifies mapjoin, converts the ReduceSinkOperator into a MapSinkOperator
  • BucketMapJoinOptimizer – bucketed map join, widening the applicability of map join
  • SortedMergeBucketMapJoinOptimizer – sort-merge join
  • UnionProcessor – currently only marks the case where both subqueries are map-only tasks
  • JoinReorder – /*+ STREAMTABLE(A) */
  • ReduceSinkDeDuplication – if two ReduceSinkOperators share the same partition/sort columns, they are merged
  • 22. Logical Optimizer (Predicate Push Down) – example query pair:
  • Without a filter:
  INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
  • With a filter on the joined table:
  INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono) WHERE p.maker = 'honda';
  • 23. Logical Optimizer (Predicate Push Down) – plan of the query without the WHERE clause:
  • TableScanOperator TS_0 → ReduceSinkOperator RS_2; TableScanOperator TS_1 → ReduceSinkOperator RS_3; RS_2 and RS_3 → JoinOperator JOIN_4 → SelectOperator SEL_6 → FileSinkOperator FS_7
  • 24. Logical Optimizer (Predicate Push Down) – plan of the query with WHERE p.maker = 'honda', before pushdown:
  • TableScanOperator TS_0 → ReduceSinkOperator RS_2; TableScanOperator TS_1 → ReduceSinkOperator RS_3; RS_2 and RS_3 → JoinOperator JOIN_4 → FilterOperator FIL_5 (_col8 = 'honda') → SelectOperator SEL_6 → FileSinkOperator FS_7
  • 25. Logical Optimizer (Predicate Push Down) – plan after pushdown:
  • TableScanOperator TS_0 → ReduceSinkOperator RS_2; TableScanOperator TS_1 → ReduceSinkOperator RS_3; RS_2 and RS_3 → JoinOperator JOIN_4 → FilterOperator FIL_5 (_col8 = 'honda') → SelectOperator SEL_6 → FileSinkOperator FS_7
  • plus a new FilterOperator FIL_8 (maker = 'honda') pushed down between product_hbase's TableScanOperator and its ReduceSinkOperator
  • 26. Physical Plan Generator – the operator tree is split into tasks:
  • MapRedTask (Stage-1/root): TableScanOperator (TS_0), TableScanOperator (TS_1), ReduceSinkOperator (RS_2), ReduceSinkOperator (RS_3), JoinOperator (JOIN_4), SelectOperator (SEL_5), FileSinkOperator (FS_6)
  • MoveTask (Stage-0): LoadTableDesc
  • StatsTask (Stage-2)
  • 27. Physical Plan Generator (result)
  • MapRedTask (Stage-1/root)
  • Mapper: TableScanOperator TS_0 → ReduceSinkOperator RS_2; TableScanOperator TS_1 → ReduceSinkOperator RS_3
  • Reducer: JoinOperator JOIN_4 → SelectOperator SEL_5 → FileSinkOperator FS_6
  • 28. Physical Optimizer – implemented under java/org/apache/hadoop/hive/ql/optimizer/physical/:
  • MapJoinResolver – handles map joins
  • SkewJoinResolver – handles skewed joins
  • CommonJoinResolver – handles common joins
  • Vectorizer – switches Hive from row-at-a-time processing to batch processing, greatly improving instruction-pipeline and cache utilization
  • SortMergeJoinResolver – works together with bucketing, similar to a merge sort
  • SamplingOptimizer – parallel ORDER BY optimizer
  • 29. Physical Optimizer (MapJoinResolver) – before resolution, a single MapRedTask (Stage-1):
  • Mapper: TableScanOperator TS_0, TableScanOperator TS_1, MapJoinOperator MAPJOIN_7, SelectOperator SEL_8, SelectOperator SEL_5, FileSinkOperator FS_6
  • 30. Physical Optimizer (MapJoinResolver) – after resolution, the small-table side is split out into a local task:
  • MapredLocalTask (Stage-7): TableScanOperator TS_0 → HashTableSinkOperator HASHTABLESINK_11
  • MapRedTask (Stage-1) Mapper: TableScanOperator TS_1, MapJoinOperator MAPJOIN_7, SelectOperator SEL_8, SelectOperator SEL_5, FileSinkOperator FS_6
  • 31. Join Strategies in Hive
  • There are generally two ways to perform a distributed join:
  • Replication join: copy one table to every node, so that the other table's split on each node can be joined against the complete copy (map-side join)
  • Repartition join: hash-redistribute both datasets on the join key, so each node joins only the rows whose join-key hash it owns, i.e. a local join (reduce-side join)
  • Related classic techniques: simple hash join, grace hash join, hybrid grace hash join
  • 1. Common Join 2. Map Join 3. Auto MapJoin 4. Bucket Map Join 5. Bucket Sort Merge Map Join 6. Skew Join
  • 32. Common Join - Shuffle Join • Default choice • Always works • Worst-case scenario • Each process • Reads from part of one of the tables • Buckets and sorts on the join key • Sends one bucket to each reducer • Works every time.
  • 33. Map Join • One table is small (e.g. a dimension table) • Fits in memory • Each process • Reads the small table into an in-memory hash table • Streams through part of the big file • Joins each record against the hash table • Very fast, but limited (see the sketch below)
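  A minimal sketch of invoking a map join on the example tables used earlier in this deck; the hint forces the behavior, while the auto-convert settings let Hive decide. The size threshold shown is the usual default, stated from memory rather than from the slides.
  -- Let Hive convert eligible common joins into map joins automatically
  set hive.auto.convert.join = true;
  set hive.mapjoin.smalltable.filesize = 25000000;   -- small-table size limit in bytes (approximate default)

  -- Or force it explicitly: p is loaded into an in-memory hash table, a is streamed
  SELECT /*+ MAPJOIN(p) */ a.user, a.prono, p.maker, p.price
  FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);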
  • 34. Optimized Map Join – Hive-1293
  • 35. Converting Common Join into Map Join
  • Previous execution flow: Task A → CommonJoinTask → Task C
  • Optimized execution flow: Task A → Conditional Task → Task C, where the Conditional Task holds several candidate branches: a MapJoinLocalTask + MapJoinTask pair for each table that may turn out small enough (a, b, ...), plus the original CommonJoinTask (c) as a fallback
  • 36. Execution Time
  • SELECT * FROM SRC1 x JOIN SRC2 y ON x.key = y.key;
  • At execution time the Conditional Task picks a branch: if table X is the big table, the corresponding MapJoinLocalTask + MapJoinTask branch runs; if both tables are too big for a map join, it falls back to the CommonJoinTask
  • 37. Backup Task
  • Task A → Conditional Task → Task C: if the MapJoinLocalTask becomes memory-bound at runtime, the CommonJoinTask is run as a backup task
  • 38. Performance Bottleneck • Distributed Cache is the potential performance bottleneck • A large hashtable file slows down propagation through the Distributed Cache • Mappers are left waiting for the hashtable files from the Distributed Cache • Mitigation: compress and archive all the hashtable files into a tar file
  • 39. Bucket Map Join • Why: • The total table/partition size is big, not good for a plain map join • How: • set hive.optimize.bucketmapjoin = true; • 1. Works together with map join 2. All join tables are bucketized, and the big table's bucket number is a multiple of each small table's bucket number 3. Bucket columns == join columns
  • 40. Bucket Map Join
  SELECT /*+ MAPJOIN(a,c) */ a.*, b.*, c.*
  FROM a JOIN b ON a.key = b.key JOIN c ON a.key = c.key;
  • Tables a, b, c are all bucketized by 'key'; a has 2 buckets, b has 2, and c has 1
  • 1. Mappers are spawned based on the big table
  • 2. Only the matching buckets of all small tables are replicated onto each mapper
  • Normally in production there will be thousands of buckets! (A DDL sketch of qualifying tables follows below.)
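  A hedged DDL sketch of tables that satisfy the bucket map join conditions above; table and column names are illustrative, not from the slides. The small tables' bucket counts (2 and 1) both divide the big table's bucket count (4).
  -- Big table: 4 buckets on the join column
  CREATE TABLE fact_log (uid STRING, key INT) CLUSTERED BY (key) INTO 4 BUCKETS;
  -- Small tables
  CREATE TABLE dim_a (key INT, attr STRING) CLUSTERED BY (key) INTO 2 BUCKETS;
  CREATE TABLE dim_c (key INT, attr STRING) CLUSTERED BY (key) INTO 1 BUCKETS;

  -- Populate with bucketing enforced, then join with the bucket map join enabled
  set hive.enforce.bucketing = true;
  set hive.optimize.bucketmapjoin = true;
  SELECT /*+ MAPJOIN(a, c) */ f.*, a.*, c.*
  FROM fact_log f JOIN dim_a a ON f.key = a.key JOIN dim_c c ON f.key = c.key;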
  • 41. Sort Merge Bucket (SMB) Join • If both tables are: • Sorted the same • Bucketed the same • And joining on the sort/bucket column • Each process: • Reads a bucket from each table • Process the row with the lowest value • Very efficient if applicable
  • 42. Sort Merge Bucket (SMB) Join • Why: • No limit on file/partition/table size • How: • set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true; set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; • 1. Works together with bucket map join 2. Bucket columns == join columns == sort columns (see the DDL sketch below)
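  A sketch of DDL that makes two tables eligible for an SMB join under the rules above (bucket columns == join columns == sort columns, same bucket count); table names and the bucket count are hypothetical.
  CREATE TABLE orders_smb (key INT, amount DOUBLE)
    CLUSTERED BY (key) SORTED BY (key ASC) INTO 32 BUCKETS;
  CREATE TABLE customers_smb (key INT, name STRING)
    CLUSTERED BY (key) SORTED BY (key ASC) INTO 32 BUCKETS;

  -- When loading data, enforce bucketing and sorting so files really are bucketed and sorted
  set hive.enforce.bucketing = true;
  set hive.enforce.sorting = true;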
  • 43. Sort Merge Bucket Map Join (Facebook)
  • (Diagram: buckets of Table A, Table B, Table C holding sorted key/value rows such as 1,val_1 ... 20,val_20 ... 25,val_25)
  • Small tables are read on demand, NOT held entirely in memory
  • Can perform outer join
  • 44. Skew • Skew is typical in real datasets • A user complained that his job was slow • He had 100 reducers • 98 of them finished fast • 2 ran really slow • The key was a boolean...
  • 45. Skew Join • Join bottlenecked on the reducer who gets the skewed key • set hive.optimize.skewjoin = true; set hive.skewjoin.key = skew_key_threshold
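  For concreteness, a hedged example of the two settings above; the threshold value shown is illustrative (it is commonly the default), not a recommendation from the slides.
  set hive.optimize.skewjoin = true;
  -- keys with more rows than this threshold are treated as skewed and handled by a follow-up map join
  set hive.skewjoin.key = 100000;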
  • 46. Skew Join
  • Job 1: Table A join Table B runs as usual on Reducer 1 / Reducer 2 for the normal keys (K2, K3); rows with the skewed key K1 (a-K1, b-K1) are not joined there but written to HDFS files instead
  • Job 2: the HDFS file of a-K1 is map-joined with the HDFS file of b-K1
  • Final results = results of Job 1 + results of Job 2
  • 47. Skew in Group By
  • Two parameters help with skew caused by GROUP BY (a usage sketch follows below):
  • hive.map.aggr – already defaults to true; it enables a map-side combiner. If your GROUP BY query only does count(*), you will hardly see the skew; with count(distinct), some skew still shows.
  • hive.groupby.skewindata – during the reduce step, rows with the same key are no longer all sent to one reducer but distributed randomly; the reducers pre-aggregate, and an extra MR round then aggregates those partial results into the final answer. It does roughly the same thing as hive.map.aggr, only on the reduce side and at the cost of an extra job, so it is not really recommended; the benefit is usually small.
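  A sketch of how these two parameters are typically applied to a skewed count(distinct)-style aggregation; the query and column names are illustrative (they reuse the logs table from the later example).
  set hive.map.aggr = true;            -- map-side partial aggregation (combiner)
  set hive.groupby.skewindata = true;  -- two-stage MR: random distribution first, final aggregation second

  SELECT page_name, COUNT(DISTINCT user_id)
  FROM logs
  GROUP BY page_name;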
  • 48. Case study • Which of the following is faster?
  • SELECT count(distinct(col)) FROM tbl
  • SELECT count(*) FROM (SELECT distinct(col) FROM tbl) t
  - The first case: maps send each value to the reducer; a single reducer counts them all
  - The second case: maps split the values across many reducers; each reducer generates its distinct list; a final job counts the size of each list
  - Singleton reducers are almost always BAD
  • 49. • Appendix: What does Explain show?
  • 50. Appendix: What does Explain show?
  hive> explain INSERT OVERWRITE TABLE access_log_temp2
      > SELECT a.user, a.prono, p.maker, p.price
      > FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
  OK
  ABSTRACT SYNTAX TREE:
    (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME access_log_hbase) a) (TOK_TABREF (TOK_TABNAME product_hbase) p) (= (. (TOK_TABLE_OR_COL a) prono) (. (TOK_TABLE_OR_COL p) prono)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME access_log_temp2))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) user)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) prono)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL p) maker)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL p) price)))))
  STAGE DEPENDENCIES:
    Stage-1 is a root stage
    Stage-0 depends on stages: Stage-1
    Stage-2 depends on stages: Stage-0
  STAGE PLANS:
    Stage: Stage-1
      Map Reduce
        Alias -> Map Operator Tree:
          a
            TableScan  alias: a
              Reduce Output Operator
                key expressions: expr: prono type: int
                sort order: +
                Map-reduce partition columns: expr: prono type: int
                tag: 0
                value expressions: expr: user type: string, expr: prono type: int
          p
            TableScan  alias: p
              Reduce Output Operator
                key expressions: expr: prono type: int
                sort order: +
                Map-reduce partition columns: expr: prono type: int
                tag: 1
                value expressions: expr: maker type: string, expr: price type: int
        Reduce Operator Tree:
          Join Operator
            condition map: Inner Join 0 to 1
            condition expressions: 0 {VALUE._col0} {VALUE._col2}  1 {VALUE._col1} {VALUE._col2}
            handleSkewJoin: false
            outputColumnNames: _col0, _col2, _col6, _col7
            Select Operator
              expressions: expr: _col0 type: string, expr: _col2 type: int, expr: _col6 type: string, expr: _col7 type: int
              outputColumnNames: _col0, _col1, _col2, _col3
              File Output Operator
                compressed: false
                GlobalTableId: 1
                table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: default.access_log_temp2
    Stage: Stage-0
      Move Operator
        tables:
          replace: true
          table:
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            name: default.access_log_temp2
    Stage: Stage-2
      Stats-Aggr Operator
  Time taken: 0.1 seconds
  • 51. Appendix: What does Explain show? – the same EXPLAIN output as the previous slide, condensed to its skeleton:
  ABSTRACT SYNTAX TREE:
  STAGE DEPENDENCIES:
    Stage-1 is a root stage
    Stage-0 depends on stages: Stage-1
    Stage-2 depends on stages: Stage-0
  STAGE PLANS:
    Stage: Stage-1 (Map Reduce)
      Map Operator Tree: TableScan → Reduce Output Operator; TableScan → Reduce Output Operator
      Reduce Operator Tree: Join Operator → Select Operator → File Output Operator
    Stage: Stage-0 – Move Operator
    Stage: Stage-2 – Stats-Aggr Operator
  • 52. Appendix: What does Explain show? – mapping the condensed EXPLAIN skeleton back onto the plan:
  • Stage-1 ≒ MapRedTask: Mapper (TableScanOperator TS_0 → ReduceSinkOperator RS_2; TableScanOperator TS_1 → ReduceSinkOperator RS_3), Reducer (JoinOperator JOIN_4 → SelectOperator SEL_5 → FileSinkOperator FS_6)
  • Stage-0 ≒ Move Task
  • Stage-2 ≒ Stats Task
  • 53. Explain • Hive doesn't tell you what is wrong • It expects you to know. • The Explain tool provides the query plan • Filters on input • Number of jobs • Number of maps and reducers • What the jobs are sorting by • Which directories they are reading or writing
  • 54. Hive SQL parsing
  • Abstract syntax tree: the parse method of org.apache.hadoop.hive.ql.parse.ParseDriver
  • Use getToken().getType() on org.apache.hadoop.hive.ql.parse.ASTNode to identify each node, and recurse whenever a TOK_QUERY node is encountered
  • 55. HiveQL Optimization • Data Layout • Data Format • Joins • Debugging
  • 56. Data Layout – HDFS Characteristics • Provides a distributed file system • Very high aggregate bandwidth • Extreme scalability (up to 100 PB) • Self-healing storage • Relatively simple to administer • Limitations • Can't modify existing files • Single writer for each file • Heavy bias for large files (> 100 MB)
  • 57. Choices for Layout • Partitions • Top level mechanism for pruning • Primary unit for updating tables(& schema) • Directory per value of specified column • Bucketing • Hashed into a file, good for sampling • Controls write parallelism • Sort order • The order the data is written within file
  • 58. Example Hive Layout • Directory • Warehouse/$database/$table • Partitioning • /part1=$partValue/part2=$partValue • Bucketing • /$bucket_$attempt(eg. 000000_0) • Sort • Each file is sorted within the file
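  A hedged DDL sketch that produces the directory layout described above (partition directories, numbered bucket files, sorted rows within each file); the database, table, and column names are hypothetical.
  CREATE TABLE web_logs (user_id STRING, url STRING, ts BIGINT)
    PARTITIONED BY (dt STRING, country STRING)                  -- .../dt=2016-06-12/country=cn/
    CLUSTERED BY (user_id) SORTED BY (ts ASC) INTO 32 BUCKETS   -- files 000000_0 ... 000031_0
    STORED AS ORC;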
  • 59. Layout Guidelines • Limit the number of partitions • 1000 partitions is much faster than 10000 • Nested partitions are almost always wrong • Gauge the number of buckets • Calculate file size and keep big (200 ~ 500MB) • Don’t forget number of files (Buckets * Parts) • Layout related tables the same way • Partition • Bucket and sort order
  • 60. Data Format • Serde • Input/Output (aka File) Format • Primary Choices • Text • Sequence File • RCFile • ORC
  • 61. Text Format • Critical to pick a Serde • Default – ^A (\001) between fields • JSON – top-level JSON record • CSV • Slow to read and write • Can't split compressed files • Leads to huge maps • Need to read/decompress all fields
  • 62. Sequence File • Traditional MapReduce binary file format • Stores keys and values as classes • Not a good fit for Hive, which has SQL types • Hive always stores the entire row as the value • Splittable, but only by searching the file • Default block size is 1 MB • Need to read and decompress all fields
  • 63. RCFile • Columns stored separately • Read and decompress only needed ones • Better Compression • Columns stored as binary blobs • Depends on metastore to supply types • Larger blocks • 4MB by default • Still search file for split boundary
  • 64. ORC(Optimized Row Columnar) • Columns stored separately • Knows types • Uses type-specific encoders • Stores statistics(min, max, sum, count) • Has light-weight index • Skip over blocks of rows that don‘t matter • Larger blocks • 256 MB by default • Has an index for block boundaries
  • 65. ORC – File Layout
  • 66. Comparison • Compared with the RCFile format, ORC File has the following advantages:
  (1) each task writes a single output file, reducing NameNode load;
  (2) support for rich data types such as datetime and decimal, plus the compound types (struct, list, map, and union);
  (3) light-weight index data stored inside the file;
  (4) type-aware block compression: a) run-length encoding for integer columns; b) dictionary encoding for string columns;
  (5) multiple independent RecordReaders can read the same file in parallel;
  (6) files can be split without scanning for markers;
  (7) the memory required for reading and writing is bounded;
  (8) metadata is stored with Protocol Buffers, so columns can be added and removed.
  • 67. Using ORC • CREATE TABLE ... STORED AS ORC • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC • SET hive.default.fileformat=Orc • All ORCFile parameters appear in the TBLPROPERTIES clause of the Hive QL statement (a few common ones are sketched after the example on the next slide).
  • 68. ORC使用 – 例子 create table Addresses ( name string, street string, city string, state string, zip int ) stored as orc tblproperties ("orc.compress"="NONE");
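  Extending the example above with a few commonly used orc.* table properties; the slide's original parameter table is not reproduced here, and the values shown are the usual defaults stated from memory rather than from the slides.
  create table Addresses_orc (name string, street string, city string, state string, zip int)
  stored as orc
  tblproperties (
    "orc.compress" = "ZLIB",             -- NONE / ZLIB / SNAPPY
    "orc.stripe.size" = "268435456",     -- stripe size in bytes
    "orc.row.index.stride" = "10000",    -- rows between light-weight index entries
    "orc.create.index" = "true"          -- build the light-weight index
  );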
  • 69. Vectorized Query Execution • The Hive query execution engine currently processes one row at a time. A single row of data goes through all the operators before the next row can be processed. This mode of processing is very inefficient in terms of CPU usage. • This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop. These loops compile in a streamlined way that uses relatively few instructions and finishes each instruction in fewer clock cycles, on average, by effectively using the processor pipeline and cache memory.
  • 70. Vectorized Query Execution - USAGE • ORC format • set hive.vectorized.execution.enabled = true; • Vectorized execution is off by default, so your queries only utilize it if this variable is turned on. To disable vectorized execution and go back to standard execution, do the following: • set hive.vectorized.execution.enabled = false;
  • 71. Vectorized Query Execution - USAGE • The following expressions can be vectorized when used on supported types: • arithmetic: +, -, *, /, % • AND, OR, NOT • comparisons <, >, <=, >=, =, !=, BETWEEN, IN ( list-of-constants ) as filters • Boolean-valued expressions (non-filters) using AND, OR, NOT, <, >, <=, >=, =, != • IS [NOT] NULL • all math functions (SIN, LOG, etc.) • string functions SUBSTR, CONCAT, TRIM, LTRIM, RTRIM, LOWER, UPPER, LENGTH • type casts • Hive user-defined functions, including standard and generic UDFs • date functions (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, UNIX_TIMESTAMP) • the IF conditional expression
  • 72. Vectorized Query Execution – USAGE UDF support • User-defined functions are supported using a backward compatibility bridge, so although they do run vectorized, they don't run as fast as optimized vector implementations of built-in operators and functions. Vectorized filter operations are evaluated left-to-right, so for best performance, put UDFs on the right in an ANDed list of expressions in the WHERE clause. E.g., use • column1 = 10 and myUDF(column2) = "x"
  • 73. Compression • Need to pick level of compression • None • LZO or Snappy – fast but sloppy • Best for temporary tables • ZLIB – slow and complete • Best for long term storage
  • 74. Query optimization – Map phase
  • The main task is choosing a suitable number of map tasks, which comes from the split-size formula:
  • split_size = max[${mapred.min.split.size}, min(${dfs.block.size}, ${mapred.max.split.size})]; the number of map tasks is roughly the total input size divided by this split size
  • mapred.min.split.size is the minimum split size
  • mapred.max.split.size is the maximum split size
  • dfs.block.size is the HDFS block size
  • dfs.block.size is generally already fixed cluster-wide, and Hive does not even see this parameter
  • In Hive the default min is 1 B and the default max is 256 MB (see the set example below)
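  Plugging the formula in: with a 128 MB dfs.block.size and the defaults (min 1 B, max 256 MB), each map reads one 128 MB block. A hedged example of nudging the map count in either direction (values illustrative):
  -- fewer, larger maps: raise the minimum split size to ~256 MB
  set mapred.min.split.size = 256000000;
  -- more, smaller maps: cap the maximum split size at ~64 MB
  set mapred.max.split.size = 64000000;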
  • 75. Query optimization – Reduce phase
  • As with the map phase, the main task is choosing a suitable number of reduce tasks.
  • 1. mapred.reduce.tasks – specify the reducer count directly
  • 2. num_Reduce_tasks = min[${hive.exec.reducers.max}, ${input.size} / ${hive.exec.reducers.bytes.per.reducer}]
  • The reducer count is derived from the input size; hive.exec.reducers.bytes.per.reducer defaults to 1 G, and the count is capped by a maximum whose default is 999. So we can tune hive.exec.reducers.bytes.per.reducer to set the number of reducers (sketch below).
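  A hedged example of the two ways to control the reducer count described above; the concrete values are illustrative.
  -- option 1: fix the number of reducers explicitly
  set mapred.reduce.tasks = 50;
  -- option 2: let Hive derive it from the input size (here ~2 GB per reducer, capped at 999)
  set hive.exec.reducers.bytes.per.reducer = 2000000000;
  set hive.exec.reducers.max = 999;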
  • 76. Optimization between Map and Reduce (spill, copy, sort phases)
  • Spill and Sort
  • In the spill phase, if the data cannot all be sorted in memory at once, the partially sorted data is written to disk (a "spill"), and the multiple spill files are merged at the end. If spills happen, increase io.sort.mb to enlarge the mapper output buffer and avoid them; io.sort.factor controls how many files can be merged in one pass. When tuning, weigh the cost of spilling against the cost of merging, and be careful not to blow the memory budget (io.sort.mb is counted inside the map task's memory). The reduce-side merge can likewise use io.sort.factor. In general these two parameters rarely need adjusting unless this step is clearly the bottleneck.
  • 77. Optimization between Map and Reduce (spill, copy, sort phases)
  • Copy
  • The copy phase moves files from the map side to the reduce side. By default reducers start copying once 5% of the maps have finished, which can waste resources: a reducer occupies its slot from launch but cannot proceed until all maps finish and all data is collected. We can therefore delay reducer start until more maps have finished via mapred.reduce.slowstart.completed.maps (default 5%). If that slows the copy phase too much, increase the copy threads instead: tasktracker.http.threads sets the map-side (server) threads serving data, and mapred.reduce.parallel.copies sets how many maps each reducer (client) pulls from in parallel; keep the two coordinated so the server side can handle the client load. (Settings collected in the sketch below.)
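  The parameters discussed in the last two slides, collected into one hedged example; the values are illustrative, not recommendations.
  set io.sort.mb = 200;                               -- mapper output buffer, counted against map memory
  set io.sort.factor = 50;                            -- number of spill files merged in one pass
  set mapred.reduce.slowstart.completed.maps = 0.8;   -- start reducers after 80% of maps finish
  set tasktracker.http.threads = 40;                  -- map-side (server) threads serving shuffle data
  set mapred.reduce.parallel.copies = 10;             -- maps each reducer fetches from in parallel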
  • 78. Other file optimizations – the small-files problem
  • The small-files problem is already handled reasonably well in current Hive: by default many small input files are merged and handed to one map, and if the output files are small a separate merge round runs on the output
  • Approaches: • 1. Input merging, i.e. merge small files before the map • 2. Output merging, i.e. merge small files when writing the results
  • 79. Hive small files – input merging
  • -- Maximum input size per map; determines the number of merged files
  • set mapred.max.split.size=256000000;
  • -- Minimum split size on one node; determines whether files on different datanodes are merged
  • set mapred.min.split.size.per.node=100000000;
  • -- Minimum split size within one rack; determines whether files on different racks are merged
  • set mapred.min.split.size.per.rack=100000000;
  • -- Merge small files before the map runs
  • set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
  • 80. Hive small files – output merging
  • hive.merge.mapfiles – merge files after a map-only job; default true
  • hive.merge.mapredfiles – merge files after a map-reduce job; default false
  • hive.merge.size.per.task – size of each merged file; default 256000000
  • hive.merge.smallfiles.avgsize – average output file size below which the merge is triggered; default 16000000
  • (see the set example below)
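  The same four parameters written as set statements; the sizes shown are simply the defaults quoted above.
  set hive.merge.mapfiles = true;                -- merge after map-only jobs
  set hive.merge.mapredfiles = true;             -- merge after map-reduce jobs
  set hive.merge.size.per.task = 256000000;      -- target size of each merged file
  set hive.merge.smallfiles.avgsize = 16000000;  -- trigger the merge when the average output file is below this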
  • 81. Handling compressed files
  • When the result is stored as compressed files, solving the small-files problem depends on where the merge happens: merging before the map puts no restriction on the output storage format, but output merging only works together with SequenceFile storage, otherwise no merge is possible. Example:
  • set mapred.output.compression.type=BLOCK;
  • set hive.exec.compress.output=true;
  • set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
  • set hive.merge.smallfiles.avgsize=100000000;
  • drop table if exists dw_stage.zj_small;
  • create table dw_stage.zj_small STORED AS SEQUENCEFILE as select * from dw_db.dw_soj_imp_dtl where log_dt = '2014-04-14' and paid like '%baidu%';
  • 82. Using HAR archives
  • Hadoop's archive file format is another way to tackle the small-files problem, and Hive supports it natively:
  • set hive.archive.enabled=true;
  • set hive.archive.har.parentdir.settable=true;
  • set har.partfile.size=1099511627776;
  • ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12');
  • ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12');
  • For non-partitioned data, create an external table and point it at a har:// path.
  • 83. Job optimization • 1. Job execution mode • 2. JVM reuse • 3. Indexes • 4. Join algorithms • 5. Data skew
  • 84. Job execution mode
  • Hadoop MapReduce jobs can run in three modes: local, pseudo-distributed, and fully distributed. Local and pseudo-distributed modes are usually presented as something only used while first learning Hadoop, but for jobs that process very little data, launching a distributed job wastes a lot of resources while the actual compute time is tiny. In such cases running the MR job in local mode avoids starting a distributed job and is much faster: a distributed job rarely finishes in under ~20 s regardless of data size, whereas local MR mode returns in roughly 10 seconds.
  • Three parameters control this. hive.exec.mode.local.auto=true enables automatic local mode, but it only kicks in when the input is small enough: hive.exec.mode.local.auto.tasks.max (default 4) bounds the number of input files and hive.exec.mode.local.auto.inputbytes.max (default 128 MB) bounds their total size, i.e. by default local MR is used when the maps would read at most 4 files totalling under 128 MB. (Sketch below.)
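  The three parameters as a hedged example; the values are the defaults mentioned above.
  set hive.exec.mode.local.auto = true;                       -- allow automatic local mode
  set hive.exec.mode.local.auto.tasks.max = 4;                -- at most 4 input files
  set hive.exec.mode.local.auto.inputbytes.max = 134217728;   -- total input under 128 MB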
  • 88. Whole-query SQL optimization
  • 1. Inter-job parallelism
  • hive.exec.parallel=true lets independent jobs of one SQL statement run in parallel. The default degree of parallelism is 8, i.e. at most 8 of the statement's jobs run concurrently; hive.exec.parallel.thread.number raises it, but avoid setting it so high that it hogs resources.
  • 2. Reducing the number of jobs
  • Example: count the users who visited both page a and page b on a site
  select count(*) from (select distinct user_id from logs where page_name = 'a') a join (select distinct user_id from logs where page_name = 'b') b on a.user_id = b.user_id;
  • 89. Whole-query SQL optimization – the same question answered with fewer jobs, using a single GROUP BY instead of two subqueries plus a join:
  select count(*) from (
    select user_id from logs
    group by user_id
    having count(case when page_name = 'a' then 1 end) > 0
       and count(case when page_name = 'b' then 1 end) > 0
  ) t;
  • 90. Indexed Hive • Hive Indexing • Provides a key-based data view • Key data is duplicated • Storage layout favors search & lookup performance • Provides better data access for certain operations • A cheaper alternative to full data scans!
  • 91. What does the index look like? • An index is a table with 3 columns • The data in the index looks like the sketch below
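  The slide's screenshot of the index data is not reproduced here. As a hedged sketch, a compact index built as below is materialized as a table whose columns are the indexed key plus the HDFS file name and the row offsets within it (commonly `_bucketname` and `_offsets`); the table and index names are hypothetical.
  CREATE INDEX idx_logs_user ON TABLE logs (user_id)
  AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
  WITH DEFERRED REBUILD;

  ALTER INDEX idx_logs_user ON logs REBUILD;

  -- the generated index table roughly looks like: (user_id, _bucketname STRING, _offsets ARRAY<BIGINT>)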
  • 92. Hive index in HQL • SELECT (mapping, projection, association: given a key, fetch the value) • WHERE (filters on keys) • GROUP BY (grouping on keys) • JOIN (join key as index key) • Indexes have high potential for accelerating a wide range of queries
  • 93. Hive Index • Index as Reference • Index as Data • Here the index-as-data approach is used for the demonstration • Uses a query-rewrite technique to transform queries on the base table into queries on the index table • Limited applicability currently, but the technique itself has wide potential • Also a very quick way to demonstrate the importance of indexes for performance
  • 94. Indexes and Query Rewrites • GROUP BY, aggregation • Index as Data • GROUP BY key = index key • The query is rewritten to use the index, but is still a valid query (nothing special in it!) – see the sketch below
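  A hedged illustration of the idea: a GROUP BY whose key equals the index key can be answered from the much smaller (and already sorted) index table instead of the base table. The rewrite below is written out by hand on the hypothetical index from the earlier sketch; the index-table name follows the usual <db>__<table>_<index>__ convention but is an assumption, not something shown on the slides.
  -- original query on the base table
  SELECT user_id, COUNT(*) FROM logs GROUP BY user_id;

  -- equivalent query on the compact index table keyed by user_id:
  -- each index row holds that key's row offsets within one file, so counting = summing array sizes
  SELECT user_id, SUM(size(`_offsets`))
  FROM default__logs_idx_logs_user__
  GROUP BY user_id;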
  • 96. where
  • 99. Year on year query
  • 100. Year on year query
  • 101. Why does the index perform better? • Reducing data increases I/O efficiency • Exploits the storage layout optimization • e.g. GROUP BY: sort + aggregate vs. hash + aggregate – the sort step is already done in the index • Parallelization • The index data is processed in the same manner as the base table, distributed across nodes • Scalable
  • 105. Hive MetaStore ER diagram
  • (ER diagram: the metastore tables include DBS, DATABASE_PARAMS, DB_PRIVS, TBLS, TABLE_PARAMS, TBL_PRIVS, PARTITION_KEYS, COLUMNS, SDS, SD_PARAMS, SORT_COLS, BUCKETING_COLS, SERDES, SERDE_PARAMS, IDXS, INDEX_PARAMS, ROLES, ROLE_MAP, GLOBAL_PRIVS, and SEQUENCE_TABLE, each keyed by a BIGINT(20) id column.)
  • 106. Reference
  • https://cwiki.apache.org/confluence/display/Hive/DesignDocs
  • Facebook Hive Summit 2011 – join: Hive from the 2011 Hadoop Summit (Liyin Tang, Namit Jain)
  • Indexed Hive – Prafulla Tekawade / Nikhil Deshpande
  • Internal Hive – http://www.slideshare.net/recruitcojp/internal-hive
  • The Hive SQL compilation process: http://tech.meituan.com/hive-sql-to-mapreduce.html
  • MonetDB/X100: Hyper-Pipelining Query Execution. 2005, Peter Boncz, Marcin Zukowski, Niels Nes
  • YSmart: Yet Another SQL-to-MapReduce Translator, Rubao Lee, Tian Luo, ...