Hive
Fangdd (房多多) - Resale Housing Business Unit - Product & Technology Center - Data Mining Product Center
李栓柱 Samchu Li
2016-06-12
Hive SQL Parsing
• Hive SQL parsing
• Based on the paper "Hive – A Warehousing Solution over a Map-Reduce Framework"
• The paper is a bit dated by now
Hive Architecture / Exec Flow
Main components: Client, Driver, Compiler, Metastore, Hadoop
- This is the overview.
- Clients are the user interfaces: CLI, WebUI, and APIs such as JDBC and ODBC.
- The Metastore is the system catalog that holds the schema information for Hive tables.
- The Driver manages the lifecycle of a HiveQL statement through compilation, optimization, and execution.
- The Compiler transforms HiveQL into operators, applying a set of optimizers along the way.
Hive Workflow:
- Operators are Hive's minimum processing units.
- Each operator is executed either as an HDFS operation or as part of an M/R job.
- The compiler converts HiveQL into sets of operators.
- The point is: Hive converts our request (HiveQL) into operators that are executed as M/R jobs.
Hive Workflow - Operators
Operators Descriptions
TableScanOperator Scans data from a Hive table
ReduceSinkOperator Builds the <key,value> pairs sent to the reducer side
JoinOperator Joins two data sets
SelectOperator Selects the output columns
FileSinkOperator Builds the result data and writes it to files
FilterOperator Filters the input data
GroupByOperator GROUP BY clause
MapJoinOperator /*+ mapjoin(t) */
LimitOperator LIMIT clause
UnionOperator UNION clause
… …
• For M/R processing, Hive uses ExecMapper and ExecReducer
• Hive’s M/R jobs are carried out by ExecMapper and ExecReducer
• They read the plan and process it dynamically
• Processing happens in one of 2 modes
• Local processing mode
• Distributed processing mode
Hive Workflow
Hive Workflow – 2 modes
• Local mode
• Hive forks the process with the hadoop command
• The plan.xml is generated on just one node, and that single node processes it
• Distributed mode
• Hive submits the job to the existing JobTracker
• The plan information is shipped via the DistributedCache and
• processed on multiple nodes
Hive Workflow - Compiler
• Compiler: How to process HiveQL
“Plumbing” of the HIVE compiler
Parser
• Converts the query into a parse tree representation
Semantic Analyzer
• Converts the parse tree into a block-based internal query representation
• Retrieves schema information for the input tables from the Metastore and verifies column names and so on
Logical Plan Generator
• Converts the internal query representation into a logical plan consisting of a tree of logical operators
“Plumbing” of the HIVE compiler – continued
Logical Optimizer
• Rewrites the plan into a more optimized plan
• The logical optimizer performs multiple passes over the logical plan and rewrites it in several ways. For example, it combines multiple joins that share the same join key into a single multi-way JOIN handled by a single M/R job.
Physical Plan Generator
• Converts the logical plan into physical plans (M/R jobs)
Physical Optimizer
• Chooses the join strategy
Compiler Overview
Hive QL → [Parser] → AST → [Semantic Analyzer] → QB → [Logical Plan Gen.] → Operator Tree → [Logical Optimizer] → Operator Tree → [Physical Plan Gen.] → Task Tree → [Physical Optimizer] → Task Tree
Parser: Hive QL → AST
INSERT OVERWRITE TABLE access_log_temp2
SELECT a.user, a.prono, p.maker, p.price
FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
TOK_QUERY
+ TOK_FROM
+ TOK_JOIN
+ TOK_TABREF
+ TOK_TABNAME
+ "access_log_hbase"
+ a
+ TOK_TABREF
+ TOK_TABNAME
+ "product_hbase"
+ "p"
+ "="
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "access_log_hbase"
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "prono“
+ TOK_INSERT
+ TOK_DESTINATION
+ TOK_TAB
+ TOK_TABNAME
+ "access_log_temp2"
+ TOK_SELECT
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "user"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "prono"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "maker"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "price"
Semantic Analyzer (1/2)
+ TOK_FROM
+ TOK_JOIN
+ TOK_TABREF
+ TOK_TABNAME
+ "access_log_hbase"
+ a
+ TOK_TABREF
+ TOK_TABNAME
+ "product_hbase"
+ "p"
+ "="
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "access_log_hbase"
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "prono“
QB
AST ParseInfo
Join Node
+ TOK_JOIN
+ TOK_TABREF
…
+ TOK_TABREF
…
+ “=”
…
MetaData
Alias To Table Info
“a”=Table Info(“access_log_hbase”)
“p”=Table Info(“product_hbase”)
Semantic Analyzer (2/2)
+ TOK_DESTINATION
+ TOK_TAB
+ TOK_TABNAME
+ "access_log_temp2”
AST
QB
ParseInfo
Name To Destination Node
+ TOK_TAB
+ TOK_TABNAME
+"access_log_temp2”
Semantic Analyzer (2/2)
+ TOK_SELECT
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "user"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "prono"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "maker"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "price"
AST
QB
ParseInfo
Name To Select Node
+ TOK_SELECT
+ TOK_SELEXPR
…
+ TOK_SELEXPR
…
+ TOK_SELEXPR
…
+ TOK_SELEXPR
…
Logical Plan Generator (1/4)
QB
OP
Tree
TableScanOperator(“access_log_hbase”)
TableScanOperator(“product_hbase”)
MetaData
Alias To Table Info
“a”=Table Info(“access_log_hbase”)
“p”=Table Info(“product_hbase”)
Logical Plan Generator (2/4)
QB
ParseInfo
+ TOK_JOIN
+ TOK_TABREF
+ TOK_TABNAME
+ "access_log_hbase"
+ a
+ TOK_TABREF
+ TOK_TABNAME
+ "product_hbase"
+ "p"
+ "="
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "access_log_hbase"
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "prono“
OP
Tree
ReduceSinkOperator(“access_log_hbase”)
ReduceSinkOperator(“product_hbase”)
JoinOperator
Logical Plan Generator (3/4)
OP
Tree
SelectOperator
QB
ParseInfo
Name To Select Node
+ TOK_SELECT
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "user"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "a"
+ "prono"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "maker"
+ TOK_SELEXPR
+ "."
+ TOK_TABLE_OR_COL
+ "p"
+ "price"
Logical Plan Generator (4/4)
OP
Tree
FileSinkOperator
QB
MetaData
Name To Destination Table Info
“insclause-0”=
Table Info(“access_log_temp2”)
Logical Plan Generator (result)
TableScanOperator TS_0 → ReduceSinkOperator RS_2
TableScanOperator TS_1 → ReduceSinkOperator RS_3
RS_2 + RS_3 → JoinOperator JOIN_4 → SelectOperator SEL_5 → FileSinkOperator FS_6
Logical Optimizer
Optimizer / Description
LineageGenerator: generates table-to-table lineage information
ColumnPruner: column pruning
PredicatePushDown: predicate pushdown; filter operations that involve only one table are pushed down to just after that table's TableScanOperator
PartitionPruner: partition pruning
PartitionConditionRemover: removes irrelevant condition predicates before partition pruning
SimpleFetchOptimizer: optimizes aggregate queries that have no GROUP BY expression
GroupByOptimizer: map-side aggregation
CorrelationOptimizer: exploits correlations in the query to merge correlated jobs
GroupByOptimizer: GROUP BY optimization
SamplePruner: sample pruning
MapJoinProcessor: if the user specifies a mapjoin hint, converts the ReduceSinkOperator into a MapSinkOperator
BucketMapJoinOptimizer: bucketed map join, extending the range of cases where a map join applies
SortedMergeBucketMapJoinOptimizer: sort-merge join
UnionProcessor: currently only marks the case where both subqueries are map-only tasks
JoinReorder: /*+ STREAMTABLE(A) */
ReduceSinkDeDuplication: if two ReduceSinkOperators share the same partition/sort columns, they are merged
Logical Optimizer (Predicate Push Down)
INSERT OVERWRITE TABLE access_log_temp2
SELECT a.user, a.prono, p.maker, p.price
FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono)
WHERE p.maker = 'honda';
INSERT OVERWRITE TABLE access_log_temp2
SELECT a.user, a.prono, p.maker, p.price
FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
Logical Optimizer (Predicate Push Down)
TableScanOperator
TS_1
TableScanOperator
TS_0
ReduceSinkOperator
RS_2
ReduceSinkOperator
RS_3
JoinOperator
JOIN_4
SelectOperator
SEL_6
FileSinkOperator
FS_7
Logical Optimizer (Predicate Push Down)
TableScanOperator
TS_1
TableScanOperator
TS_0
ReduceSinkOperator
RS_2
ReduceSinkOperator
RS_3
JoinOperator
JOIN_4
FilterOperator
FIL_5
(_col8 = 'honda')
SelectOperator
SEL_6
FileSinkOperator
FS_7
Logical Optimizer (Predicate Push Down)
TableScanOperator
TS_1
TableScanOperator
TS_0
ReduceSinkOperator
RS_2
ReduceSinkOperator
RS_3
JoinOperator
JOIN_4
FilterOperator
FIL_5
(_col8 = 'honda')
SelectOperator
SEL_6
FileSinkOperator
FS_7
FilterOperator
FIL_8
(maker = 'honda')
Physical Plan Generator
The operator tree is split into a tree of tasks:
• MapRedTask (Stage-1, root): TableScanOperator (TS_0, TS_1), ReduceSinkOperator (RS_2, RS_3), JoinOperator (JOIN_4), SelectOperator (SEL_5), FileSinkOperator (FS_6)
• MoveTask (Stage-0): LoadTableDesc, loading the result into the destination table
• StatsTask (Stage-2)
Physical Plan Generator (result)
MapRedTask (Stage-1/root)
Mapper
TableScanOperator
TS_1
TableScanOperator
TS_0
ReduceSinkOperator
RS_2
ReduceSinkOperator
RS_3
Reducer
JoinOperator
JOIN_4
SelectOperator
SEL_5
FileSinkOperator
FS_6
Physical Optimizer
Located under java/org/apache/hadoop/hive/ql/optimizer/physical/
Optimizer / Summary
MapJoinResolver: handles map joins
SkewJoinResolver: handles skewed joins
CommonJoinResolver: handles common joins
Vectorizer: switches Hive from row-at-a-time processing to batch processing, greatly improving instruction pipelining and cache utilization
SortMergeJoinResolver: works together with bucketing, similar to a merge sort
SamplingOptimizer: optimizer for parallel ORDER BY
Physical Optimizer (MapJoinResolver)
MapRedTask (Stage-1)
Mapper
TableScanOperator
TS_1
TableScanOperator
TS_0
MapJoinOperator
MAPJOIN_7
SelectOperator
SEL_5
FileSinkOperator
FS_6
SelectOperator
SEL_8
Physical Optimizer (MapJoinResolver)
MapRedTask (Stage-1)
Mapper
TableScanOperator
TS_1
MapJoinOperator
MAPJOIN_7
SelectOperator
SEL_5
FileSinkOperator
FS_6
SelectOperator
SEL_8
MapredLocalTask (Stage-7)
TableScanOperator
TS_0
HashTableSinkOperator
HASHTABLESINK_11
Join Strategies in Hive
There are generally two ways to handle a distributed join:
• Replication Join: replicate one of the tables to every node, so each node's split of the other table can be joined against a complete copy of that table (map-side join)
• Repartition Join: hash-redistribute both data sets by the join key, so each node handles the join-key values with the same hash, i.e. performs a local join (reduce-side join)
(Related classic techniques: simple hash join, grace hash join, hybrid grace hash join)
1. Common Join
2. Map Join
3. Auto MapJoin
4. Bucket Map Join
5. Bucket Sort Merge Map Join
6. Skew Join
Common Join - Shuffle Join
• Default choice
• Always works
• Worst case scenario
• Each process
• Reads from part of one of the tables
• Buckets and sorts on join key
• Sends one bucket to each reduce
• Works every time.
Map Join
• One table is small (eg. dimension table)
• Fits in memory
• Each process
• Reads the small table into an in-memory hash table
• Streams through part of the big file
• Joins each record against the hash table
• Very fast, but limited
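A minimal sketch of invoking a map join on the running example, assuming product_hbase is the small table (the MAPJOIN hint and the auto-conversion flag are two alternative ways to get there):
set hive.auto.convert.join = true;
SELECT /*+ MAPJOIN(p) */ a.user, a.prono, p.maker, p.price
FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);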
Optimized Map Join – HIVE-1293
Converting a Common Join into a Map Join
• Previous execution flow: Task A → CommonJoinTask → Task C
• Optimized execution flow: Task A → Conditional Task → Task C, where the Conditional Task holds one MapJoinLocalTask + MapJoinTask pair for each candidate small table (a, b, c, …) plus the original CommonJoinTask
Execution Time
SELECT * FROM SRC1 x JOIN SRC2 y ON x.key = y.key;
• At execution time the Conditional Task picks a branch:
• If table X is the big table, run the MapJoinLocalTask + MapJoinTask branch that streams X and hashes the other table
• If both tables are too big for a map join, fall back to the CommonJoinTask
Backup Task
• If the MapJoinLocalTask becomes memory bound while building the hash table, the CommonJoinTask is run as a backup task
Performance Bottleneck
• The Distributed Cache is the potential performance bottleneck
• A large hashtable file slows down propagation through the Distributed Cache
• Mappers sit waiting for the hashtable files from the Distributed Cache
• Remedy: compress and archive all the hashtable files into a tar file
Bucket Map Join
• Why:
• Total table/partition size is big, not good for mapjoin
• How:
• set hive.optimize.bucketmapjoin = true;
• 1. Works together with map join
2. All join tables are bucketized, and the big table's bucket count is a multiple of each small table's bucket count
3. Bucket columns == join columns (a DDL sketch follows below)
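A rough DDL sketch of a table layout that satisfies these conditions, matching the example on the next slide (column names are illustrative):
set hive.optimize.bucketmapjoin = true;
CREATE TABLE b (key INT, val STRING) CLUSTERED BY (key) INTO 2 BUCKETS; -- big table
CREATE TABLE a (key INT, val STRING) CLUSTERED BY (key) INTO 2 BUCKETS; -- small table
CREATE TABLE c (key INT, val STRING) CLUSTERED BY (key) INTO 1 BUCKETS; -- small table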
Bucket Map Join
SELECT /*+ MAPJOIN(a,c) */ a.*, b.*, c.*
FROM a JOIN b ON a.key = b.key
JOIN c ON a.key = c.key;
Tables a, b, and c are all bucketized by ‘key’; a has 2 buckets, b has 2, and c has 1
1. Mappers are spawned based on the big table b: Mapper 1 and Mapper 2 read bucket b1, Mapper 3 reads bucket b2
2. Only the matching buckets of all small tables are replicated onto each mapper (a1 + c1, a1 + c1, and a2 + c1 respectively)
Normally in production, there will be thousands of buckets!
Sort Merge Bucket (SMB) Join
• If both tables are:
• Sorted the same
• Bucketed the same
• And joining on the sort/bucket column
• Each process:
• Reads a bucket from each table
• Process the row with the lowest value
• Very efficient if applicable
Sort Merge Bucket (SMB) Join
• Why:
• No limit on file/partition/table size
• How:
• set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
• 1. Works together with bucket map join
2. Bucket columns == join columns == sort columns (see the DDL sketch below)
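A minimal DDL sketch of a table that qualifies for an SMB join (table and column names are illustrative; both sides of the join need the same bucketing and sort order on the join column):
CREATE TABLE access_log_smb (prono INT, user STRING)
CLUSTERED BY (prono) SORTED BY (prono ASC) INTO 32 BUCKETS;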
Sort Merge Bucket Map Join (Facebook)
(Example figure: sorted tables A, B, C holding rows such as 1/val_1, 3/val_3, 4/val_4, 5/val_5, 20/val_20, 23/val_23, 25/val_25)
• Small tables are read on demand
• Does NOT hold entire small tables in memory
• Can perform outer joins
Skew
• Skew is typical in real datasets
• A user complained that his job was slow
• He had 100 reduces
• 98 of them finished fast
• 2 ran really slow
• The key was a boolean...
Skew Join
• Join bottlenecked on the reducer who gets the skewed key
• set hive.optimize.skewjoin = true;
set hive.skewjoin.key = skew_key_threshold
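A sketch of concrete values for the two settings above (the threshold is the per-key row count beyond which a key is treated as skewed; the number shown is illustrative):
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = 100000;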
Skew Join
• Job 1: Table A join Table B runs as a normal reduce-side join; the non-skewed keys (K2, K3, …) are joined on Reducer 1, Reducer 2, …, while the rows of A and B carrying the skewed key K1 are not joined but written out to HDFS files (a-K1 and b-K1)
• Job 2: a follow-up map join of the spilled files a-K1 and b-K1
• Final results = output of Job 1 + output of Job 2
Skew Group by
• Two parameters can deal with skew caused by GROUP BY.
• The first is hive.map.aggr, whose default is already true: it enables a map-side combiner. So if your GROUP BY query only computes count(*), you will hardly see any skew effect; with count(distinct) some skew is still visible.
• The other parameter is hive.groupby.skewindata. With it, during the Reduce step keys are no longer routed so that all rows with the same value go to one reducer; they are distributed randomly, the reducers do a partial aggregation, and then an extra MR round aggregates those partial results into the final answer. It does roughly the same thing as hive.map.aggr, just on the Reduce side and at the cost of an extra job, so it is not particularly recommended; the improvement is usually small. (A sketch of both settings follows below.)
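A sketch of the two GROUP BY skew settings discussed above:
set hive.map.aggr = true; -- map-side partial aggregation (already the default)
set hive.groupby.skewindata = true; -- random shuffle plus an extra MR round for the final aggregation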
Case study
• Which of the following is faster?
• Select count(distinct col) from Tbl
• Select count(*) from (select distinct col from Tbl) t
- The first case:
- Maps send each value to the
reduce
- Single reduce counts them all
- The second case:
- Maps split up the values to
many reduces
- Each reduce generates its list
- Final job counts the size of
each list
- Singleton reduces are almost
always BAD
• Appendix: What does Explain show?
Appendix: What does Explain show?
hive> explain INSERT OVERWRITE TABLE access_log_temp2
> SELECT a.user, a.prono, p.maker, p.price
> FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM(TOK_JOIN (TOK_TABREF (TOK_TABNAME access_log_hbase) a)
(TOK_TABREF (TOK_TABNAME product_hbase) p) (= (. (TOK_TABLE_OR_COL a) prono) (.
(TOK_TABLE_OR_COL p) prono)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME
access_log_temp2))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) user))
(TOK_SELEXPR (. (TOK_TABLE_OR_COL a) prono)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL p)
maker)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL p) price)))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: prono
type: int
sort order: +
Map-reduce partition columns:
expr: prono
type: int
tag: 0
value expressions:
expr: user
type: string
expr: prono
type: int
p
TableScan
alias: p
Reduce Output Operator
key expressions:
expr: prono
type: int
sort order: +
Map-reduce partition columns:
expr: prono
type: int
tag: 1
value expressions:
expr: maker
type: string
expr: price
type: int
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col2}
1 {VALUE._col1} {VALUE._col2}
handleSkewJoin: false
outputColumnNames: _col0, _col2, _col6, _col7
Select Operator
expressions:
expr: _col0
type: string
expr: _col2
type: int
expr: _col6
type: string
expr: _col7
type: int
outputColumnNames: _col0, _col1, _col2, _col3
File Output Operator
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.access_log_temp2
Stage: Stage-0
Move Operator
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.access_log_temp2
Stage: Stage-2
Stats-Aggr Operator
Time taken: 0.1 seconds
hive>
ABSTRACT SYNTAX TREE:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
TableScan
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Select Operator
File Output Operator
Stage: Stage-0
Move Operator
Stage: Stage-2
Stats-Aggr Operator
Appendix: What does Explain show?
ABSTRACT SYNTAX TREE:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
TableScan
Reduce Output Operator
Reduce Operator Tree:
Join Operator
Select Operator
File Output Operator
Stage: Stage-0
Move Operator
Stage: Stage-2
Stats-Aggr Operator
MapRedTask (Stage-1/root)
Mapper
TableScanOperator
TS_1
TableScanOperator
TS_0
ReduceSinkOperator
RS_2
ReduceSinkOperator
RS_3
Reducer
JoinOperator
JOIN_4
SelectOperator
SEL_5
FileSinkOperator
FS_6
≒
Move Task (Stage-0)
Stats Task (Stage-2)
Explain
• Hive doesn’t tell you what is
wrong
• Expects you to know.
• Explain tool provides query
plan
• Filters on input
• Numbers of jobs
• Numbers of maps and reduces
• What the jobs are sorting by
• What directories are they
reading or writing
Hive SQL Parsing
• Abstract syntax tree: produced by the parse method of org.apache.hadoop.hive.ql.parse.ParseDriver
• Call getToken().getType() on org.apache.hadoop.hive.ql.parse.ASTNode to identify each node, and recurse whenever a TOK_QUERY node is encountered
HiveQL Optimization
• Data Layout
• Data Format
• Joins
• Debugging
Data Layout – HDFS Characteristics
• Provides Distributed File System
• Very high aggregate bandwidth
• Extreme scalability (up to 100 PB)
• Self-healing storage
• Relatively simple to administer
• Limitations
• Can’t modify existing files
• Single writer for each file
• Heavy bias toward large files (> 100 MB)
Choices for Layout
• Partitions
• Top level mechanism for pruning
• Primary unit for updating tables(& schema)
• Directory per value of specified column
• Bucketing
• Hashed into a file, good for sampling
• Controls write parallelism
• Sort order
• The order the data is written within file
Example Hive Layout
• Directory
• Warehouse/$database/$table
• Partitioning
• /part1=$partValue/part2=$partValue
• Bucketing
• /$bucket_$attempt (e.g. 000000_0)
• Sort
• Each file is sorted within the file
Layout Guidelines
• Limit the number of partitions
• 1000 partitions is much faster than 10000
• Nested partitions are almost always wrong
• Gauge the number of buckets
• Calculate file size and keep big (200 ~ 500MB)
• Don’t forget number of files (Buckets * Parts)
• Layout related tables the same way
• Partition
• Bucket and sort order
Data Format
• Serde
• Input/Output (aka File) Format
• Primary Choices
• Text
• Sequence File
• RCFile
• ORC
Text Format
• Critical to pick a Serde
• Default: Ctrl-A (\001) between fields
• JSON – top level JSON record
• CSV
• Slow to read and write
• Can’t split compressed files
• Leads to huge maps
• Need to read/decompress all fields
Sequence File
• Traditional MapReduce binary file format
• Stores keys and values as classes
• Not a good fit for Hive, which has SQL types
• Hive always stores entire row as value
• Splittable but only by searching file
• Default block size is 1 MB
• Need to read and decompress all fields
RCFile
• Columns stored separately
• Read and decompress only needed ones
• Better Compression
• Columns stored as binary blobs
• Depends on metastore to supply types
• Larger blocks
• 4MB by default
• Still search file for split boundary
ORC(Optimized Row Columnar)
• Columns stored separately
• Knows types
• Uses type-specific encoders
• Stores statistics(min, max, sum, count)
• Has light-weight index
• Skip over blocks of rows that don’t matter
• Larger blocks
• 256 MB by default
• Has an index for block boundaries
ORC – File Layout
Comparison
• Compared with the RCFile format, ORC File has the following advantages:
(1) each task outputs only a single file, which reduces the load on the NameNode;
(2) support for various complex data types, e.g. datetime, decimal, and the compound types (struct, list, map, and union);
(3) lightweight index data stored inside the file;
(4) block-mode compression based on the data type: a. run-length encoding for integer columns; b. dictionary encoding for string columns;
(5) multiple independent RecordReaders can read the same file in parallel;
(6) files can be split without scanning for markers;
(7) the memory needed for reading and writing is bounded;
(8) metadata is stored with Protocol Buffers, so columns can be added and removed.
Using ORC
• CREATE TABLE ... STORED AS ORC
• ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
• SET hive.default.fileformat=Orc
• All ORC File parameters appear in the TBLPROPERTIES clause of the Hive QL statement (for example orc.compress, orc.compress.size, orc.stripe.size, orc.row.index.stride, orc.create.index):
Using ORC – Example
create table Addresses (
name string,
street string,
city string,
state string,
zip int
) stored as orc tblproperties ("orc.compress"="NONE");
Vectorized Query Execution
• The Hive query execution engine currently processes one row at a time.
A single row of data goes through all the operators before the next
row can be processed. This mode of processing is very inefficient in
terms of CPU usage.
• This involves long code paths and significant metadata interpretation in
the inner loop of execution. Vectorized query execution streamlines
operations by processing a block of 1024 rows at a time. Within the
block, each column is stored as a vector (an array of a primitive data
type). Simple operations like arithmetic and comparisons are done by
quickly iterating through the vectors in a tight loop, with no or very few
function calls or conditional branches inside the loop. These loops
compile in a streamlined way that uses relatively few instructions and
finishes each instruction in fewer clock cycles, on average, by effectively
using the processor pipeline and cache memory.
Vectorized Query Execution - USAGE
• ORC format
• set hive.vectorized.execution.enabled = true;
• Vectorized execution is off by default, so your queries only utilize
it if this variable is turned on. To disable vectorized execution and
go back to standard execution, do the following:
• set hive.vectorized.execution.enabled = false;
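A minimal end-to-end sketch (the ORC table name is illustrative; vectorized execution applies to the ORC data here):
CREATE TABLE access_log_orc STORED AS ORC AS SELECT * FROM access_log_temp2;
set hive.vectorized.execution.enabled = true;
SELECT maker, COUNT(*) FROM access_log_orc WHERE price > 100 GROUP BY maker;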
Vectorized Query Execution - USAGE
• The following expressions can be vectorized when used on supported types:
• arithmetic: +, -, *, /, %
• AND, OR, NOT
• comparisons <, >, <=, >=, =, !=, BETWEEN, IN ( list-of-constants ) as filters
• Boolean-valued expressions (non-filters) using AND, OR, NOT, <, >, <=, >=, =, !=
• IS [NOT] NULL
• all math functions (SIN, LOG, etc.)
• string functions SUBSTR, CONCAT, TRIM, LTRIM, RTRIM, LOWER, UPPER, LENGTH
• type casts
• Hive user-defined functions, including standard and generic UDFs
• date functions (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, UNIX_TIMESTAMP)
• the IF conditional expression
Vectorized Query Execution – USAGE: UDF support
• User-defined functions are supported using a backward
compatibility bridge, so although they do run vectorized, they
don't run as fast as optimized vector implementations of built-in
operators and functions. Vectorized filter operations are evaluated
left-to-right, so for best performance, put UDFs on the right in an
ANDed list of expressions in the WHERE clause. E.g., use
• column1 = 10 and myUDF(column2) = "x"
Compression
• Need to pick level of compression
• None
• LZO or Snappy – fast but sloppy
• Best for temporary tables
• ZLIB – slow and complete
• Best for long term storage
Query Optimization - Map phase
• Map-phase optimization is mainly about choosing a suitable number of map tasks, which is driven by the split size:
• split_size = max[${mapred.min.split.size}, min(${dfs.block.size}, ${mapred.max.split.size})], and num_map_tasks ≈ total_input_size / split_size
• mapred.min.split.size is the minimum split size.
• mapred.max.split.size is the maximum split size.
• dfs.block.size is the HDFS block size.
• dfs.block.size is usually a fixed, already-configured value, and Hive itself cannot see this parameter.
• In Hive the default for min is 1 B and the default for max is 256 MB.
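A worked example under the stated defaults (numbers are illustrative): with min = 1 B, max = 256 MB and dfs.block.size = 128 MB, split_size = max[1 B, min(128 MB, 256 MB)] = 128 MB, so a 10 GB input is handled by roughly 10 GB / 128 MB = 80 map tasks.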
Query Optimization - Reduce phase
• Reduce-phase optimization is likewise mainly about choosing a suitable number of reduce tasks, similar to the Map-phase optimization above.
• 1. mapred.reduce.tasks: specify the number of reducers directly
• 2. num_reduce_tasks = min[${hive.exec.reducers.max}, (${input.size} / ${hive.exec.reducers.bytes.per.reducer})]
• The number of reducers is derived from the input size: hive.exec.reducers.bytes.per.reducer defaults to 1 GB, and the reducer count cannot exceed an upper-bound parameter whose default is 999. So you can adjust hive.exec.reducers.bytes.per.reducer to control the number of reducers.
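A worked example under the defaults (numbers are illustrative): for a 50 GB input, num_reduce_tasks = min[999, 50 GB / 1 GB] = 50; raising hive.exec.reducers.bytes.per.reducer to 5 GB would cut that to 10 reducers.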
Optimization between Map and Reduce (spill, copy, sort phases)
• Spill and Sort
• In the spill phase, there may not be enough memory to sort all the data in one pass, so partially sorted data has to be written out to disk; this is called a spill, and the spilled files are merged at the end. If spills occur, you can enlarge the mapper output buffer with io.sort.mb to avoid them, and raise io.sort.factor so a single pass can merge more files. When tuning, weigh the time cost of spilling against the time cost of merging, and be careful not to exhaust memory (io.sort.mb is counted inside the map task's memory). The Reduce-side merge can use io.sort.factor in the same way. In general these two parameters rarely need tuning unless you clearly know this is the bottleneck. (See the sketch below.)
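A sketch of the spill/merge knobs mentioned above (values are illustrative, not recommendations):
set io.sort.mb = 200; -- mapper output buffer in MB, counted inside the map task's memory
set io.sort.factor = 50; -- number of spill files merged in one pass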
Optimization between Map and Reduce (spill, copy, sort phases)
• Copy
• The copy phase transfers files from the Map side to the Reduce side. By default the reducers start copying once 5% of the maps have completed, which can waste resources: once started, a reducer occupies its slot and just waits until all maps have finished and all data has been collected before it can do anything further. So it is often better to start the reducers only after more maps have completed; that ratio is set with mapred.reduce.slowstart.completed.maps, whose default is 5%. If this slows down the Reduce-side copy too much, increase the number of copy threads instead: tasktracker.http.threads controls how many server-side threads the Map side uses to serve data, and mapred.reduce.parallel.copies controls how many maps a reducer (the client side) pulls from in parallel. Tune the two together so that the server side can keep up with the client side's requests. (See the sketch below.)
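A sketch of the copy-phase knobs mentioned above (values are illustrative):
set mapred.reduce.slowstart.completed.maps = 0.8; -- start reducers only after 80% of the maps finish
set mapred.reduce.parallel.copies = 10; -- how many maps one reducer pulls from in parallel
-- tasktracker.http.threads is a TaskTracker-level (server-side) setting, normally changed in the cluster configuration rather than per query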
Other file optimizations - the small-files problem
• The small-files problem is already handled fairly well in current Hive environments: with the default configuration, many small input files are automatically merged and handed to one map, and if the output files are small an extra merge round is run on the output.
• Solutions:
• 1. Input merging, i.e. merge small files before the map phase
• 2. Output merging, i.e. merge small files when writing the results
Hive small files – input merging
• -- Maximum input size per map; determines the number of files after merging
• set mapred.max.split.size=256000000;
• -- Minimum split size on a single node; determines whether files on different data nodes need to be merged
• set mapred.min.split.size.per.node=100000000;
• -- Minimum split size within one rack (switch); determines whether files on different racks need to be merged
• set mapred.min.split.size.per.rack=100000000;
• -- Merge small files before the map phase runs
• set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Hive small files – output merging
• hive.merge.mapfiles: merge output files after a map-only job; default true
• hive.merge.mapredfiles: merge output files after a map-reduce job; default false
• hive.merge.size.per.task: size of each file after merging; default 256000000
• hive.merge.smallfiles.avgsize: average output file size below which the merge is triggered; default 16000000
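A sketch of turning on output merging with the default thresholds listed above:
set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true;
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 16000000;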
Handling compressed files
• When the results are stored in a compressed format and you want to fix the small-files problem: merging before the Map input puts no restriction on the output storage format, but output merging only works together with SequenceFile storage, otherwise the files cannot be merged. Example:
• set mapred.output.compression.type=BLOCK;
• set hive.exec.compress.output=true;
• set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
• set hive.merge.smallfiles.avgsize=100000000;
• drop table if exists dw_stage.zj_small;
• create table dw_stage.zj_small
• STORED AS SEQUENCEFILE
• as select *
• from dw_db.dw_soj_imp_dtl
• where log_dt = '2014-04-14'
• and paid like '%baidu%' ;
Using HAR archives
• Hadoop's archive (HAR) file format is another way to tackle the small-files problem, and Hive supports it natively:
• set hive.archive.enabled=true;
• set hive.archive.har.parentdir.settable=true;
• set har.partfile.size=1099511627776;
• ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12');
• ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12');
• If the table is not partitioned, you can create it as an external table and point its location at a har:// URI instead.
Job optimization
• 1. Job execution mode
• 2. JVM reuse
• 3. Indexes
• 4. Join algorithms
• 5. Data skew
Job execution mode
• A Hadoop MapReduce job can run in 3 modes: local, pseudo-distributed, and fully distributed. Local and pseudo-distributed modes are usually presented as something you only use for single-machine development while first learning Hadoop. In practice, though, for jobs that process very little data, launching a real distributed job wastes a lot of resources while the actual compute time is tiny. In such cases you should run the MR job in local mode: no distributed job is launched and execution is much faster. As a rule of thumb, a distributed job rarely finishes in under 20 s no matter how small the data is, while local MR mode can return a result in about 10 s.
• Three parameters control this. The first is hive.exec.mode.local.auto; setting it to true enables automatic local MR mode. That alone is not enough: the number of input files and the input size must also be small, which is governed by hive.exec.mode.local.auto.tasks.max and hive.exec.mode.local.auto.inputbytes.max, with defaults of 4 and 128 MB. In other words, by default local MR mode kicks in when the map phase handles no more than 4 files whose total size is under 128 MB. (See the sketch below.)
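A sketch of the three parameters with their default thresholds:
set hive.exec.mode.local.auto = true;
set hive.exec.mode.local.auto.tasks.max = 4;
set hive.exec.mode.local.auto.inputbytes.max = 134217728; -- 128 MB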
JVM reuse
• Normally a JVM started by MapReduce exits after finishing one task. If the tasks are very short but the JVM has to be started many times (for example, counting over a very large data set), JVM startup becomes a sizeable overhead. In that situation you can use the JVM reuse parameter:
• set mapred.job.reuse.jvm.num.tasks = 5;
• It lets one JVM run several tasks before exiting, which saves a fair amount of JVM startup time.
Indexes
• See the later sections
Joins
• See the HQL compilation walkthrough above
Overall SQL optimization
• 1. Running jobs in parallel
• Inter-job parallelism is enabled with hive.exec.parallel; set it to true. The default degree of parallelism is 8, i.e. at most 8 jobs of one SQL statement run in parallel. For a higher degree, raise hive.exec.parallel.thread.number, but avoid values so large that the query hogs resources.
• 2. Reducing the number of jobs
• Example: count the users of a site who have visited both page a and page b
select count(*)
from
(select distinct user_id
from logs where page_name = 'a') a
join
(select distinct user_id
from logs where page_name = 'b') b
on a.user_id = b.user_id;
Overall SQL optimization – the same question rewritten as a single job:
select count(*)
from (
select user_id
from logs
group by user_id
having (count(case when page_name = 'a' then 1 end) > 0
and count(case when page_name = 'b' then 1 end) > 0)
) t;
Indexed Hive
• Hive Indexing
• Provides key-based data view
• Keys data duplicated
• Storage layout favors search & lookup performance
• Provided better data access for certain operations
• A cheaper alternative to full data scans!
What does the index look like?
• An index is a table with 3 columns
• The data in the index table looks like (indexed key value, _bucketname, _offsets): the HDFS file containing the rows and the offsets of the matching rows within it
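A minimal sketch of creating and building a compact index on the running example table (the index name is illustrative):
CREATE INDEX access_log_prono_idx
ON TABLE access_log_temp2 (prono)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX access_log_prono_idx ON access_log_temp2 REBUILD;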
Hive index in HQL
• SELECT (mapping, projection, association, given key, fetch value)
• WHERE (filters on keys)
• GROUP BY (grouping on keys)
• JOIN (join key as index key)
• Indexes have high potential for accelerating wide range of queries
Hive Index
• Index as Reference
• Index as Data
• Here takes the index as data as the demonstration
• Uses Query Rewrite technique to transform queries on base table to
index table
• Limited applicability currently, but technique itself has wide potential
• Also a very quick way to demonstrate importance of index for
performance
Indexes and Query Rewrites
• GROUP BY, aggregation
• Index as Data
• Group By Key = Index Key
• Query rewritten to use indexes, but still a valid query (nothing special in
it!)
(Example slides, figures omitted: aggregation example, WHERE on Func(key), histogram query, year-on-year query)
Why index performs better?
• Reducing data increases I/O efficiency
• Exploiting storage layout optimization
• e.g. GROUP BY:
• Sort + agg
• Hash & agg
• Sort step already in index
• Parallelization
• Process the index data in the same manner as base table, distribute the
processing across nodes
• Scalable
Index Design
Hive compiler
Query Rewrite Engine
Hive MetaStore ER diagram
BUCKETING_COLS
SD_ID BIGINT(20)
BUCKET_COL_NAME VARCHAR(256)
INTEGER_IDX INT(11)
Indexes
COLUMNS
SD_ID BIGINT(20)
COMMENT VARCHAR(256)
COLUMN_NAME VARCHAR(128)
TYPE_NAME VARCHAR(4000)
INTEGER_IDX INT(11)
Indexes
DATABASE_PARAMS
DB_ID BIGINT(20)
PARAM_KEY VARCHAR(180)
PARAM_VALUE VARCHAR(4000)
Indexes
DBS
DB_ID BIGINT(20)
DESC VARCHAR(4000)
DB_LOCATION_URI VARCHAR(4000)
NAME VARCHAR(128)
Indexes
DB_PRIVS
DB_GRANT_ID BIGINT(20)
CREATE_TIME INT(11)
DB_ID BIGINT(20)
GRANT_OPTION SMALLINT(6)
GRANTOR VARCHAR(128)
GRANTOR_TYPE VARCHAR(128)
PRINCIPAL_NAME VARCHAR(128)
PRINCIPAL_TYPE VARCHAR(128)
DB_PRIV VARCHAR(128)
Indexes
GLOBAL_PRIVS
USER_GRANT_ID BIGINT(20)
CREATE_TIME INT(11)
GRANT_OPTION SMALLINT(6)
GRANTOR VARCHAR(128)
GRANTOR_TYPE VARCHAR(128)
PRINCIPAL_NAME VARCHAR(128)
PRINCIPAL_TYPE VARCHAR(128)
USER_PRIV VARCHAR(128)
Indexes
IDXS
INDEX_ID BIGINT(20)
CREATE_TIME INT(11)
DEFERRED_REBUILD BIT(1)
INDEX_HANDLER_CLASS VARCHAR(4000)
INDEX_NAME VARCHAR(128)
INDEX_TBL_ID BIGINT(20)
LAST_ACCESS_TIME INT(11)
ORIG_TBL_ID BIGINT(20)
SD_ID BIGINT(20)
Indexes
INDEX_PARAMS
INDEX_ID BIGINT(20)
PARAM_KEY VARCHAR(256)
PARAM_VALUE VARCHAR(4000)
Indexes
PARTITION_KEYS
TBL_ID BIGINT(20)
PKEY_COMMENT VARCHAR(4000)
PKEY_NAME VARCHAR(128)
PKEY_TYPE VARCHAR(767)
INTEGER_IDX INT(11)
Indexes
ROLES
ROLE_ID BIGINT(20)
CREATE_TIME INT(11)
OWNER_NAME VARCHAR(128)
ROLE_NAME VARCHAR(128)
Indexes
ROLE_MAP
ROLE_GRANT_ID BIGINT(20)
ADD_TIME INT(11)
GRANT_OPTION SMALLINT(6)
GRANTOR VARCHAR(128)
GRANTOR_TYPE VARCHAR(128)
PRINCIPAL_NAME VARCHAR(128)
PRINCIPAL_TYPE VARCHAR(128)
ROLE_ID BIGINT(20)
Indexes
SDS
SD_ID BIGINT(20)
INPUT_FORMAT VARCHAR(4000)
IS_COMPRESSED BIT(1)
LOCATION VARCHAR(4000)
NUM_BUCKETS INT(11)
OUTPUT_FORMAT VARCHAR(4000)
SERDE_ID BIGINT(20)
Indexes
SD_PARAMS
SD_ID BIGINT(20)
PARAM_KEY VARCHAR(256)
PARAM_VALUE VARCHAR(4000)
Indexes
SEQUENCE_TABLE
SEQUENCE_NAME VARCHAR(255)
NEXT_VAL BIGINT(20)
Indexes
SERDES
SERDE_ID BIGINT(20)
NAME VARCHAR(128)
SLIB VARCHAR(4000)
Indexes
SERDE_PARAMS
SERDE_ID BIGINT(20)
PARAM_KEY VARCHAR(256)
PARAM_VALUE VARCHAR(4000)
Indexes
SORT_COLS
SD_ID BIGINT(20)
COLUMN_NAME VARCHAR(128)
ORDER INT(11)
INTEGER_IDX INT(11)
Indexes
TABLE_PARAMS
TBL_ID BIGINT(20)
PARAM_KEY VARCHAR(256)
PARAM_VALUE VARCHAR(4000)
Indexes
TBLS
TBL_ID BIGINT(20)
CREATE_TIME INT(11)
DB_ID BIGINT(20)
LAST_ACCESS_TIME INT(11)
OWNER VARCHAR(767)
RETENTION INT(11)
SD_ID BIGINT(20)
TBL_NAME VARCHAR(128)
TBL_TYPE VARCHAR(128)
VIEW_EXPANDED_TEXT MEDIUMTEXT
VIEW_ORIGINAL_TEXT MEDIUMTEXT
Indexes
TBL_PRIVS
TBL_GRANT_ID BIGINT(20)
CREATE_TIME INT(11)
GRANT_OPTION SMALLINT(6)
GRANTOR VARCHAR(128)
GRANTOR_TYPE VARCHAR(128)
PRINCIPAL_NAME VARCHAR(128)
PRINCIPAL_TYPE VARCHAR(128)
TBL_PRIV VARCHAR(128)
TBL_ID BIGINT(20)
Indexes
Reference
• https://cwiki.apache.org/confluence/display/Hive/DesignDocs
• Facebook – Hive Join Strategies, Hadoop Summit 2011 (Liyin Tang, Namit Jain)
• Indexed Hive – Prafulla Tekawade, Nikhil Deshpande
• Internal Hive – http://www.slideshare.net/recruitcojp/internal-hive
• Hive SQL compilation process: http://tech.meituan.com/hive-sql-to-mapreduce.html
• MonetDB/X100: Hyper-Pipelining Query Execution. 2005, Peter Boncz, Marcin Zukowski, Niels Nes
• YSmart: Yet Another SQL-to-MapReduce Translator, Rubao Lee, Tian Luo, et al.

More Related Content

What's hot

Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldDataWorks Summit
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview EMC
 
Hp vertica certification guide
Hp vertica certification guideHp vertica certification guide
Hp vertica certification guideneinamat
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0enissoz
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetupt3rmin4t0r
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 
Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Ed Kohlwey
 
Power JSON with PostgreSQL
Power JSON with PostgreSQLPower JSON with PostgreSQL
Power JSON with PostgreSQLEDB
 
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase Cloudera, Inc.
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveWill Du
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handyPraveen Sripati
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperDataWorks Summit
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRPivotalOpenSourceHub
 

What's hot (19)

Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview
 
Hp vertica certification guide
Hp vertica certification guideHp vertica certification guide
Hp vertica certification guide
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0
 
Mar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBaseMar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBase
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetup
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 
Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?
 
Power JSON with PostgreSQL
Power JSON with PostgreSQLPower JSON with PostgreSQL
Power JSON with PostgreSQL
 
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
 

Similar to Hive_p

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
OGSA-DAI DQP: A Developer's View
OGSA-DAI DQP: A Developer's ViewOGSA-DAI DQP: A Developer's View
OGSA-DAI DQP: A Developer's ViewBartosz Dobrzelecki
 
An Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureAn Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureDataWorks Summit
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureDataWorks Summit/Hadoop Summit
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizonArtem Ervits
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
 
The Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseThe Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseDataWorks Summit
 
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteA Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteSalesforce Engineering
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in HiveDataWorks Summit
 
GPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServiceGPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServicePivotalOpenSourceHub
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?DataWorks Summit
 
Presentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cPresentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cRonald Francisco Vargas Quesada
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!HostedbyConfluent
 
MySQL Tech Tour 2015 - 5.7 Whats new
MySQL Tech Tour 2015 - 5.7 Whats newMySQL Tech Tour 2015 - 5.7 Whats new
MySQL Tech Tour 2015 - 5.7 Whats newMark Swarbrick
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAPEDB
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresJitendra Singh
 
Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015Chef
 

Similar to Hive_p (20)

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
OGSA-DAI DQP: A Developer's View
OGSA-DAI DQP: A Developer's ViewOGSA-DAI DQP: A Developer's View
OGSA-DAI DQP: A Developer's View
 
An Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureAn Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, Future
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present Future
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
The Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseThe Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBase
 
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteA Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in Hive
 
GPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServiceGPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a Service
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Presentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cPresentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12c
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!
 
MySQL Tech Tour 2015 - 5.7 Whats new
MySQL Tech Tour 2015 - 5.7 Whats newMySQL Tech Tour 2015 - 5.7 Whats new
MySQL Tech Tour 2015 - 5.7 Whats new
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and Underscores
 
Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015
 

Hive_p

  • 2. HiveSQL解析 • Hive SQL解析 • <<Hive – A warehousing Solution over a Map-Reduce Framework>> • 有点老了
  • 3. Hive Architecture / Exec Flow 6/13/16 HIVE - A warehouse solution over Map Reduce Framework 3 Driver Compiler Hadoop Client Metastore -This is the overview - Clients are User Interfaces both CLI, WebUI and API likes JDBC and ODBC. - Metastore is system catalog which has the schema informaction for hive tables. - Dirver manages the lifecycle of HiveQL for compilation, optimization and execution. - Complier transforms HiveQL to Operators using some optimizers. Hive Workflow: - Hive has the operators which are minimum processing units. - The process of each operator is done with HDFS operation or M/R jobs. - The compiler converts HiveQL to the sets of operators - The point is : Hive converts our order(HiveQL) to operators which are made with M/R jobs.
  • 4. Hive Workflow - Operators Operators Descriptions TableScanOperator 扫描hive表数据 ReduceSinkOperator 创建将发送到Reducer端的<key,value>对 JoinOperator Join两份数据 SelectOperator 选择输出列 FileSinkOperator 建立结果数据,输出至文件 FilterOperator 过滤输入数据 GroupByOperator Group By 语句 MapJoinOperator /*+ mapjoin(t)*/ LimitOperator Limit语句 UnionOperator Union语句 … …
  • 5. • For M/R processing, Hive uses ExecMaper and ExecReducer • Hive’s M/R jobs are done by ExecMaper and ExecReducer • They read plans and process them dynamically • On processing, 2 modes • Local processing mode • Distributed processing mode Hive Workflow Driver Compiler Hadoop Client Metastore
  • 6. Hive Workflow – 2 modes • Local Mode • Hive fork the process with hadoop command • The plan.xml is made just on 1 and the single node process this • Distributed mode • Hive send the process to existing JobTracker • The information is housed on DistributedCache and • Processed on multi-nodes Driver Compiler Hadoop Client Metastore
  • 7. Hive Workflow - Compiler • Compiler: How to process HiveQL Driver Compiler Hadoop Client Metastore
  • 8. “Plumbing”of HIVE compiler Parser • Convert into Parse Tree Representation Semantic Analyzer • Convert into block-base internal query representation • retrieve schema information of the input table from metastore and verifies the column names and so on Logical Plan Generator • Convert into internal query representation to a logic plan consists of a tree of logical operators
  • 9. “Plumbing”of HIVE compiler – continued Logical Optimizer • Rewrite plans into more optimized plans • Logical optimizer perform multiple passes over logical plan and rewrites in several ways. For example, Combine multiple joins which share the join key into a single multi-way JOIN which is done by a single M/R job. Physical Plan Generator •Convert into physical plans(M/R jobs) Physical Optimizer • Adopt join strategy
  • 10. Compiler Overview 10 Semantic Analyzer Logical Plan Gen. Logical Optimizer Physical Plan Gen. Physical Optimizer Parser Hive QL AST Operator Tree QB Operator Tree Task Tree Task Tree
  • 11. Semantic Analyzer Logical Plan Gen. Logical Optimizer Physical Plan Gen. Physical Optimizer Parser Parser Hive QL AST INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono); TOK_QUERY + TOK_FROM + TOK_JOIN + TOK_TABREF + TOK_TABNAME + "access_log_hbase" + a + TOK_TABREF + TOK_TABNAME + "product_hbase" + "p" + "=" + "." + TOK_TABLE_OR_COL + "a" + "access_log_hbase" + "." + TOK_TABLE_OR_COL + "p" + "prono“ Hive QL AST + TOK_INSERT + TOK_DESTINATION + TOK_TAB + TOK_TABNAME + "access_log_temp2" + TOK_SELECT + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "user" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "prono" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "maker" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "price"
  • 12. Semantic Analyzer Logical Plan Gen. Logical Optimizer Physical Plan Gen. Physical Optimizer Parser Parser SQL AST INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono); TOK_QUERY + TOK_FROM + TOK_JOIN + TOK_TABREF + TOK_TABNAME + "access_log_hbase" + a + TOK_TABREF + TOK_TABNAME + "product_hbase" + "p" + "=" + "." + TOK_TABLE_OR_COL + "a" + "access_log_hbase" + "." + TOK_TABLE_OR_COL + "p" + "prono“ + TOK_INSERT + TOK_DESTINATION + TOK_TAB + TOK_TABNAME + "access_log_temp2" + TOK_SELECT + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "user" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "prono" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "maker" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "price" SQL AST 1 2 3
  • 13. Semantic Analyzer (1/2)
  • Subtree 1 (TOK_FROM / TOK_JOIN, with the two TOK_TABREF nodes and the "=" join condition) is recorded in the QB:
  • QB ParseInfo: Join Node = the TOK_JOIN subtree
  • QB MetaData: Alias To Table Info – "a" = Table Info("access_log_hbase"), "p" = Table Info("product_hbase")
  • 14. Semantic Analyzer (2/2)
  • Subtree 2 (TOK_DESTINATION / TOK_TAB / TOK_TABNAME "access_log_temp2") is recorded in the QB:
  • QB ParseInfo: Name To Destination Node = the TOK_TAB subtree for "access_log_temp2"
  • 15. Semantic Analyzer (2/2)
  • Subtree 3 (TOK_SELECT with four TOK_SELEXPR children: a.user, a.prono, p.maker, p.price) is recorded in the QB:
  • QB ParseInfo: Name To Select Node = the TOK_SELECT subtree
  • 16. Logical Plan Generator (1/4)
  • From QB MetaData (Alias To Table Info: "a" = Table Info("access_log_hbase"), "p" = Table Info("product_hbase")) the generator emits the scan operators:
  • TableScanOperator("access_log_hbase")
  • TableScanOperator("product_hbase")
  • 17. Logical Plan Generator (2/4)
  • From the QB ParseInfo Join Node (the TOK_JOIN subtree) it emits:
  • ReduceSinkOperator("access_log_hbase")
  • ReduceSinkOperator("product_hbase")
  • JoinOperator
  • 18. Logical Plan Generator (3/4)
  • From the QB ParseInfo Name To Select Node (the TOK_SELECT subtree) it emits:
  • SelectOperator
  • 19. Logical Plan Generator (4/4)
  • From QB MetaData Name To Destination Table Info ("insclause-0" = Table Info("access_log_temp2")) it emits:
  • FileSinkOperator
  • 20. Logical Plan Generator (result) – operator tree:
  • TableScanOperator TS_0 → ReduceSinkOperator RS_2
  • TableScanOperator TS_1 → ReduceSinkOperator RS_3
  • RS_2 and RS_3 → JoinOperator JOIN_4 → SelectOperator SEL_5 → FileSinkOperator FS_6
  • 21. Logical Optimizer – the main transformations:
  • LineageGenerator – generates table-to-table lineage
  • ColumnPruner – column pruning
  • PredicatePushDown – predicate pushdown: filters that involve only a single table are pushed down to sit right after its TableScanOperator
  • PartitionPruner – partition pruning
  • PartitionConditionRemover – removes irrelevant condition predicates before partition pruning
  • SimpleFetchOptimizer – optimizes aggregate queries with no GROUP BY expression
  • GroupByOptimizer – map-side aggregation / GROUP BY optimization
  • CorrelationOptimizer – exploits correlations within the query to merge correlated jobs
  • SamplePruner – sample pruning
  • MapJoinProcessor – if the user specifies mapjoin, converts the ReduceSinkOperator into a MapSinkOperator
  • BucketMapJoinOptimizer – bucketed map join, widening the applicability of map join
  • SortedMergeBucketMapJoinOptimizer – sort-merge join
  • UnionProcessor – currently only marks the case where both subqueries are map-only tasks
  • JoinReorder – /*+ STREAMTABLE(A) */
  • ReduceSinkDeDuplication – if two ReduceSinkOperators share the same partition/sort columns, they are merged
  • 22. Logical Optimizer (Predicate Push Down) – example query pair:
  • Without a filter:
  INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
  • With a filter on the joined table:
  INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono) WHERE p.maker = 'honda';
  • 23. Logical Optimizer (Predicate Push Down) – plan of the query without the WHERE clause:
  • TableScanOperator TS_0 → ReduceSinkOperator RS_2; TableScanOperator TS_1 → ReduceSinkOperator RS_3; RS_2 and RS_3 → JoinOperator JOIN_4 → SelectOperator SEL_6 → FileSinkOperator FS_7
  • 24. Logical Optimizer (Predicate Push Down) – plan of the query with WHERE p.maker = 'honda', before pushdown:
  • TableScanOperator TS_0 → ReduceSinkOperator RS_2; TableScanOperator TS_1 → ReduceSinkOperator RS_3; RS_2 and RS_3 → JoinOperator JOIN_4 → FilterOperator FIL_5 (_col8 = 'honda') → SelectOperator SEL_6 → FileSinkOperator FS_7
  • 25. Logical Optimizer (Predicate Push Down) – plan after pushdown:
  • TableScanOperator TS_0 → ReduceSinkOperator RS_2; TableScanOperator TS_1 → ReduceSinkOperator RS_3; RS_2 and RS_3 → JoinOperator JOIN_4 → FilterOperator FIL_5 (_col8 = 'honda') → SelectOperator SEL_6 → FileSinkOperator FS_7
  • plus a new FilterOperator FIL_8 (maker = 'honda') pushed down between product_hbase's TableScanOperator and its ReduceSinkOperator
  • 26. Physical Plan Generator – the operator tree is split into tasks:
  • MapRedTask (Stage-1/root): TableScanOperator (TS_0), TableScanOperator (TS_1), ReduceSinkOperator (RS_2), ReduceSinkOperator (RS_3), JoinOperator (JOIN_4), SelectOperator (SEL_5), FileSinkOperator (FS_6)
  • MoveTask (Stage-0): LoadTableDesc
  • StatsTask (Stage-2)
  • 27. Physical Plan Generator (result)
  • MapRedTask (Stage-1/root)
  • Mapper: TableScanOperator TS_0 → ReduceSinkOperator RS_2; TableScanOperator TS_1 → ReduceSinkOperator RS_3
  • Reducer: JoinOperator JOIN_4 → SelectOperator SEL_5 → FileSinkOperator FS_6
  • 28. Physical Optimizer – implemented under java/org/apache/hadoop/hive/ql/optimizer/physical/:
  • MapJoinResolver – handles map joins
  • SkewJoinResolver – handles skewed joins
  • CommonJoinResolver – handles common joins
  • Vectorizer – switches Hive from row-at-a-time processing to batch processing, greatly improving instruction-pipeline and cache utilization
  • SortMergeJoinResolver – works together with bucketing, similar to a merge sort
  • SamplingOptimizer – parallel ORDER BY optimizer
  • 29. Physical Optimizer (MapJoinResolver) – before resolution, a single MapRedTask (Stage-1):
  • Mapper: TableScanOperator TS_0, TableScanOperator TS_1, MapJoinOperator MAPJOIN_7, SelectOperator SEL_8, SelectOperator SEL_5, FileSinkOperator FS_6
  • 30. Physical Optimizer (MapJoinResolver) – after resolution, the small-table side is split out into a local task:
  • MapredLocalTask (Stage-7): TableScanOperator TS_0 → HashTableSinkOperator HASHTABLESINK_11
  • MapRedTask (Stage-1) Mapper: TableScanOperator TS_1, MapJoinOperator MAPJOIN_7, SelectOperator SEL_8, SelectOperator SEL_5, FileSinkOperator FS_6
  • 31. Join Strategies in Hive
  • There are generally two ways to perform a distributed join:
  • Replication join: copy one table to every node, so that the other table's split on each node can be joined against the complete copy (map-side join)
  • Repartition join: hash-redistribute both datasets on the join key, so each node joins only the rows whose join-key hash it owns, i.e. a local join (reduce-side join)
  • Related classic techniques: simple hash join, grace hash join, hybrid grace hash join
  • 1. Common Join 2. Map Join 3. Auto MapJoin 4. Bucket Map Join 5. Bucket Sort Merge Map Join 6. Skew Join
  • 32. Common Join - Shuffle Join • Default choice • Always works • Worst-case scenario • Each process • Reads from part of one of the tables • Buckets and sorts on the join key • Sends one bucket to each reducer • Works every time.
  • 33. Map Join • One table is small (e.g. a dimension table) • Fits in memory • Each process • Reads the small table into an in-memory hash table • Streams through part of the big file • Joins each record against the hash table • Very fast, but limited (see the sketch below)
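  A minimal sketch of invoking a map join on the example tables used earlier in this deck; the hint forces the behavior, while the auto-convert settings let Hive decide. The size threshold shown is the usual default, stated from memory rather than from the slides.
  -- Let Hive convert eligible common joins into map joins automatically
  set hive.auto.convert.join = true;
  set hive.mapjoin.smalltable.filesize = 25000000;   -- small-table size limit in bytes (approximate default)

  -- Or force it explicitly: p is loaded into an in-memory hash table, a is streamed
  SELECT /*+ MAPJOIN(p) */ a.user, a.prono, p.maker, p.price
  FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);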
  • 34. Optimized Map Join – Hive-1293
  • 35. Converting Common Join into Map Join
  • Previous execution flow: Task A → CommonJoinTask → Task C
  • Optimized execution flow: Task A → Conditional Task → Task C, where the Conditional Task holds several candidate branches: a MapJoinLocalTask + MapJoinTask pair for each table that may turn out small enough (a, b, ...), plus the original CommonJoinTask (c) as a fallback
  • 36. Execution Time
  • SELECT * FROM SRC1 x JOIN SRC2 y ON x.key = y.key;
  • At execution time the Conditional Task picks a branch: if table X is the big table, the corresponding MapJoinLocalTask + MapJoinTask branch runs; if both tables are too big for a map join, it falls back to the CommonJoinTask
  • 37. Backup Task
  • Task A → Conditional Task → Task C: if the MapJoinLocalTask becomes memory-bound at runtime, the CommonJoinTask is run as a backup task
  • 38. Performance Bottleneck • Distributed Cache is the potential performance bottleneck • A large hashtable file slows down propagation through the Distributed Cache • Mappers are left waiting for the hashtable files from the Distributed Cache • Mitigation: compress and archive all the hashtable files into a tar file
  • 39. Bucket Map Join • Why: • The total table/partition size is big, not good for a plain map join • How: • set hive.optimize.bucketmapjoin = true; • 1. Works together with map join 2. All join tables are bucketized, and the big table's bucket number is a multiple of each small table's bucket number 3. Bucket columns == join columns
  • 40. Bucket Map Join
  SELECT /*+ MAPJOIN(a,c) */ a.*, b.*, c.*
  FROM a JOIN b ON a.key = b.key JOIN c ON a.key = c.key;
  • Tables a, b, c are all bucketized by 'key'; a has 2 buckets, b has 2, and c has 1
  • 1. Mappers are spawned based on the big table
  • 2. Only the matching buckets of all small tables are replicated onto each mapper
  • Normally in production there will be thousands of buckets! (A DDL sketch of qualifying tables follows below.)
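  A hedged DDL sketch of tables that satisfy the bucket map join conditions above; table and column names are illustrative, not from the slides. The small tables' bucket counts (2 and 1) both divide the big table's bucket count (4).
  -- Big table: 4 buckets on the join column
  CREATE TABLE fact_log (uid STRING, key INT) CLUSTERED BY (key) INTO 4 BUCKETS;
  -- Small tables
  CREATE TABLE dim_a (key INT, attr STRING) CLUSTERED BY (key) INTO 2 BUCKETS;
  CREATE TABLE dim_c (key INT, attr STRING) CLUSTERED BY (key) INTO 1 BUCKETS;

  -- Populate with bucketing enforced, then join with the bucket map join enabled
  set hive.enforce.bucketing = true;
  set hive.optimize.bucketmapjoin = true;
  SELECT /*+ MAPJOIN(a, c) */ f.*, a.*, c.*
  FROM fact_log f JOIN dim_a a ON f.key = a.key JOIN dim_c c ON f.key = c.key;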
  • 41. Sort Merge Bucket (SMB) Join • If both tables are: • Sorted the same • Bucketed the same • And joining on the sort/bucket column • Each process: • Reads a bucket from each table • Process the row with the lowest value • Very efficient if applicable
  • 42. Sort Merge Bucket (SMB) Join • Why: • No limit on file/partition/table size • How: • set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true; set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; • 1. Works together with bucket map join 2. Bucket columns == join columns == sort columns (see the DDL sketch below)
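  A sketch of DDL that makes two tables eligible for an SMB join under the rules above (bucket columns == join columns == sort columns, same bucket count); table names and the bucket count are hypothetical.
  CREATE TABLE orders_smb (key INT, amount DOUBLE)
    CLUSTERED BY (key) SORTED BY (key ASC) INTO 32 BUCKETS;
  CREATE TABLE customers_smb (key INT, name STRING)
    CLUSTERED BY (key) SORTED BY (key ASC) INTO 32 BUCKETS;

  -- When loading data, enforce bucketing and sorting so files really are bucketed and sorted
  set hive.enforce.bucketing = true;
  set hive.enforce.sorting = true;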
  • 43. Sort Merge Bucket Map Join (Facebook)
  • (Diagram: buckets of Table A, Table B, Table C holding sorted key/value rows such as 1,val_1 ... 20,val_20 ... 25,val_25)
  • Small tables are read on demand, NOT held entirely in memory
  • Can perform outer join
  • 44. Skew • Skew is typical in real datasets • A user complained that his job was slow • He had 100 reducers • 98 of them finished fast • 2 ran really slow • The key was a boolean...
  • 45. Skew Join • Join bottlenecked on the reducer who gets the skewed key • set hive.optimize.skewjoin = true; set hive.skewjoin.key = skew_key_threshold
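  For concreteness, a hedged example of the two settings above; the threshold value shown is illustrative (it is commonly the default), not a recommendation from the slides.
  set hive.optimize.skewjoin = true;
  -- keys with more rows than this threshold are treated as skewed and handled by a follow-up map join
  set hive.skewjoin.key = 100000;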
  • 46. Skew Join
  • Job 1: Table A join Table B runs as usual on Reducer 1 / Reducer 2 for the normal keys (K2, K3); rows with the skewed key K1 (a-K1, b-K1) are not joined there but written to HDFS files instead
  • Job 2: the HDFS file of a-K1 is map-joined with the HDFS file of b-K1
  • Final results = results of Job 1 + results of Job 2
  • 47. Skew in Group By
  • Two parameters help with skew caused by GROUP BY (a usage sketch follows below):
  • hive.map.aggr – already defaults to true; it enables a map-side combiner. If your GROUP BY query only does count(*), you will hardly see the skew; with count(distinct), some skew still shows.
  • hive.groupby.skewindata – during the reduce step, rows with the same key are no longer all sent to one reducer but distributed randomly; the reducers pre-aggregate, and an extra MR round then aggregates those partial results into the final answer. It does roughly the same thing as hive.map.aggr, only on the reduce side and at the cost of an extra job, so it is not really recommended; the benefit is usually small.
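  A sketch of how these two parameters are typically applied to a skewed count(distinct)-style aggregation; the query and column names are illustrative (they reuse the logs table from the later example).
  set hive.map.aggr = true;            -- map-side partial aggregation (combiner)
  set hive.groupby.skewindata = true;  -- two-stage MR: random distribution first, final aggregation second

  SELECT page_name, COUNT(DISTINCT user_id)
  FROM logs
  GROUP BY page_name;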
  • 48. Case study • Which of the following is faster?
  • SELECT count(distinct(col)) FROM tbl
  • SELECT count(*) FROM (SELECT distinct(col) FROM tbl) t
  - The first case: maps send each value to the reducer; a single reducer counts them all
  - The second case: maps split the values across many reducers; each reducer generates its distinct list; a final job counts the size of each list
  - Singleton reducers are almost always BAD
  • 49. • Appendix: What does Explain show?
  • 50. Appendix: What does Explain show?
  hive> explain INSERT OVERWRITE TABLE access_log_temp2
      > SELECT a.user, a.prono, p.maker, p.price
      > FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
  OK
  ABSTRACT SYNTAX TREE:
    (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME access_log_hbase) a) (TOK_TABREF (TOK_TABNAME product_hbase) p) (= (. (TOK_TABLE_OR_COL a) prono) (. (TOK_TABLE_OR_COL p) prono)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME access_log_temp2))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) user)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) prono)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL p) maker)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL p) price)))))
  STAGE DEPENDENCIES:
    Stage-1 is a root stage
    Stage-0 depends on stages: Stage-1
    Stage-2 depends on stages: Stage-0
  STAGE PLANS:
    Stage: Stage-1
      Map Reduce
        Alias -> Map Operator Tree:
          a
            TableScan  alias: a
              Reduce Output Operator
                key expressions: expr: prono type: int
                sort order: +
                Map-reduce partition columns: expr: prono type: int
                tag: 0
                value expressions: expr: user type: string, expr: prono type: int
          p
            TableScan  alias: p
              Reduce Output Operator
                key expressions: expr: prono type: int
                sort order: +
                Map-reduce partition columns: expr: prono type: int
                tag: 1
                value expressions: expr: maker type: string, expr: price type: int
        Reduce Operator Tree:
          Join Operator
            condition map: Inner Join 0 to 1
            condition expressions: 0 {VALUE._col0} {VALUE._col2}  1 {VALUE._col1} {VALUE._col2}
            handleSkewJoin: false
            outputColumnNames: _col0, _col2, _col6, _col7
            Select Operator
              expressions: expr: _col0 type: string, expr: _col2 type: int, expr: _col6 type: string, expr: _col7 type: int
              outputColumnNames: _col0, _col1, _col2, _col3
              File Output Operator
                compressed: false
                GlobalTableId: 1
                table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: default.access_log_temp2
    Stage: Stage-0
      Move Operator
        tables:
          replace: true
          table:
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            name: default.access_log_temp2
    Stage: Stage-2
      Stats-Aggr Operator
  Time taken: 0.1 seconds
  • 51. Appendix: What does Explain show? – the same EXPLAIN output as the previous slide, condensed to its skeleton:
  ABSTRACT SYNTAX TREE:
  STAGE DEPENDENCIES:
    Stage-1 is a root stage
    Stage-0 depends on stages: Stage-1
    Stage-2 depends on stages: Stage-0
  STAGE PLANS:
    Stage: Stage-1 (Map Reduce)
      Map Operator Tree: TableScan → Reduce Output Operator; TableScan → Reduce Output Operator
      Reduce Operator Tree: Join Operator → Select Operator → File Output Operator
    Stage: Stage-0 – Move Operator
    Stage: Stage-2 – Stats-Aggr Operator
  • 52. Appendix: What does Explain show? – mapping the condensed EXPLAIN skeleton back onto the plan:
  • Stage-1 ≒ MapRedTask: Mapper (TableScanOperator TS_0 → ReduceSinkOperator RS_2; TableScanOperator TS_1 → ReduceSinkOperator RS_3), Reducer (JoinOperator JOIN_4 → SelectOperator SEL_5 → FileSinkOperator FS_6)
  • Stage-0 ≒ Move Task
  • Stage-2 ≒ Stats Task
  • 53. Explain • Hive doesn't tell you what is wrong • It expects you to know. • The Explain tool provides the query plan • Filters on input • Number of jobs • Number of maps and reducers • What the jobs are sorting by • Which directories they are reading or writing
  • 54. Hive SQL parsing
  • Abstract syntax tree: the parse method of org.apache.hadoop.hive.ql.parse.ParseDriver
  • Use getToken().getType() on org.apache.hadoop.hive.ql.parse.ASTNode to identify each node, and recurse whenever a TOK_QUERY node is encountered
  • 55. HiveQL Optimization • Data Layout • Data Format • Joins • Debugging
  • 56. Data Layout – HDFS Characteristics • Provides a distributed file system • Very high aggregate bandwidth • Extreme scalability (up to 100 PB) • Self-healing storage • Relatively simple to administer • Limitations • Can't modify existing files • Single writer for each file • Heavy bias for large files (> 100 MB)
  • 57. Choices for Layout • Partitions • Top level mechanism for pruning • Primary unit for updating tables(& schema) • Directory per value of specified column • Bucketing • Hashed into a file, good for sampling • Controls write parallelism • Sort order • The order the data is written within file
  • 58. Example Hive Layout • Directory • Warehouse/$database/$table • Partitioning • /part1=$partValue/part2=$partValue • Bucketing • /$bucket_$attempt(eg. 000000_0) • Sort • Each file is sorted within the file
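  A hedged DDL sketch that produces the directory layout described above (partition directories, numbered bucket files, sorted rows within each file); the database, table, and column names are hypothetical.
  CREATE TABLE web_logs (user_id STRING, url STRING, ts BIGINT)
    PARTITIONED BY (dt STRING, country STRING)                  -- .../dt=2016-06-12/country=cn/
    CLUSTERED BY (user_id) SORTED BY (ts ASC) INTO 32 BUCKETS   -- files 000000_0 ... 000031_0
    STORED AS ORC;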
  • 59. Layout Guidelines • Limit the number of partitions • 1000 partitions is much faster than 10000 • Nested partitions are almost always wrong • Gauge the number of buckets • Calculate file size and keep big (200 ~ 500MB) • Don’t forget number of files (Buckets * Parts) • Layout related tables the same way • Partition • Bucket and sort order
  • 60. Data Format • Serde • Input/Output (aka File) Format • Primary Choices • Text • Sequence File • RCFile • ORC
  • 61. Text Format • Critical to pick a Serde • Default – ^A (\001) between fields • JSON – top-level JSON record • CSV • Slow to read and write • Can't split compressed files • Leads to huge maps • Need to read/decompress all fields
  • 62. Sequence File • Traditional MapReduce binary file format • Stores keys and values as classes • Not a good fit for Hive, which has SQL types • Hive always stores the entire row as the value • Splittable, but only by searching the file • Default block size is 1 MB • Need to read and decompress all fields
  • 63. RCFile • Columns stored separately • Read and decompress only needed ones • Better Compression • Columns stored as binary blobs • Depends on metastore to supply types • Larger blocks • 4MB by default • Still search file for split boundary
  • 64. ORC(Optimized Row Columnar) • Columns stored separately • Knows types • Uses type-specific encoders • Stores statistics(min, max, sum, count) • Has light-weight index • Skip over blocks of rows that don‘t matter • Larger blocks • 256 MB by default • Has an index for block boundaries
  • 65. ORC – File Layout
  • 66. Comparison • Compared with the RCFile format, ORC File has the following advantages:
  (1) each task writes a single output file, reducing NameNode load;
  (2) support for rich data types such as datetime and decimal, plus the compound types (struct, list, map, and union);
  (3) light-weight index data stored inside the file;
  (4) type-aware block compression: a) run-length encoding for integer columns; b) dictionary encoding for string columns;
  (5) multiple independent RecordReaders can read the same file in parallel;
  (6) files can be split without scanning for markers;
  (7) the memory required for reading and writing is bounded;
  (8) metadata is stored with Protocol Buffers, so columns can be added and removed.
  • 67. Using ORC • CREATE TABLE ... STORED AS ORC • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC • SET hive.default.fileformat=Orc • All ORCFile parameters appear in the TBLPROPERTIES clause of the Hive QL statement (a few common ones are sketched after the example on the next slide).
  • 68. ORC使用 – 例子 create table Addresses ( name string, street string, city string, state string, zip int ) stored as orc tblproperties ("orc.compress"="NONE");
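  Extending the example above with a few commonly used orc.* table properties; the slide's original parameter table is not reproduced here, and the values shown are the usual defaults stated from memory rather than from the slides.
  create table Addresses_orc (name string, street string, city string, state string, zip int)
  stored as orc
  tblproperties (
    "orc.compress" = "ZLIB",             -- NONE / ZLIB / SNAPPY
    "orc.stripe.size" = "268435456",     -- stripe size in bytes
    "orc.row.index.stride" = "10000",    -- rows between light-weight index entries
    "orc.create.index" = "true"          -- build the light-weight index
  );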
  • 69. Vectorized Query Execution • The Hive query execution engine currently processes one row at a time. A single row of data goes through all the operators before the next row can be processed. This mode of processing is very inefficient in terms of CPU usage. • This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop. These loops compile in a streamlined way that uses relatively few instructions and finishes each instruction in fewer clock cycles, on average, by effectively using the processor pipeline and cache memory.
  • 70. Vectorized Query Execution - USAGE • ORC format • set hive.vectorized.execution.enabled = true; • Vectorized execution is off by default, so your queries only utilize it if this variable is turned on. To disable vectorized execution and go back to standard execution, do the following: • set hive.vectorized.execution.enabled = false;
  • 71. Vectorized Query Execution - USAGE • The following expressions can be vectorized when used on supported types: • arithmetic: +, -, *, /, % • AND, OR, NOT • comparisons <, >, <=, >=, =, !=, BETWEEN, IN ( list-of-constants ) as filters • Boolean-valued expressions (non-filters) using AND, OR, NOT, <, >, <=, >=, =, != • IS [NOT] NULL • all math functions (SIN, LOG, etc.) • string functions SUBSTR, CONCAT, TRIM, LTRIM, RTRIM, LOWER, UPPER, LENGTH • type casts • Hive user-defined functions, including standard and generic UDFs • date functions (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, UNIX_TIMESTAMP) • the IF conditional expression
  • 72. Vectorized Query Execution – USAGE UDF support • User-defined functions are supported using a backward compatibility bridge, so although they do run vectorized, they don't run as fast as optimized vector implementations of built-in operators and functions. Vectorized filter operations are evaluated left-to-right, so for best performance, put UDFs on the right in an ANDed list of expressions in the WHERE clause. E.g., use • column1 = 10 and myUDF(column2) = "x"
  • 73. Compression • Need to pick level of compression • None • LZO or Snappy – fast but sloppy • Best for temporary tables • ZLIB – slow and complete • Best for long term storage
  • 74. Query optimization – Map phase
  • The main task is choosing a suitable number of map tasks, which comes from the split-size formula:
  • split_size = max[${mapred.min.split.size}, min(${dfs.block.size}, ${mapred.max.split.size})]; the number of map tasks is roughly the total input size divided by this split size
  • mapred.min.split.size is the minimum split size
  • mapred.max.split.size is the maximum split size
  • dfs.block.size is the HDFS block size
  • dfs.block.size is generally already fixed cluster-wide, and Hive does not even see this parameter
  • In Hive the default min is 1 B and the default max is 256 MB (see the set example below)
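  Plugging the formula in: with a 128 MB dfs.block.size and the defaults (min 1 B, max 256 MB), each map reads one 128 MB block. A hedged example of nudging the map count in either direction (values illustrative):
  -- fewer, larger maps: raise the minimum split size to ~256 MB
  set mapred.min.split.size = 256000000;
  -- more, smaller maps: cap the maximum split size at ~64 MB
  set mapred.max.split.size = 64000000;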
  • 75. Query optimization – Reduce phase
  • As with the map phase, the main task is choosing a suitable number of reduce tasks.
  • 1. mapred.reduce.tasks – specify the reducer count directly
  • 2. num_Reduce_tasks = min[${hive.exec.reducers.max}, ${input.size} / ${hive.exec.reducers.bytes.per.reducer}]
  • The reducer count is derived from the input size; hive.exec.reducers.bytes.per.reducer defaults to 1 G, and the count is capped by a maximum whose default is 999. So we can tune hive.exec.reducers.bytes.per.reducer to set the number of reducers (sketch below).
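  A hedged example of the two ways to control the reducer count described above; the concrete values are illustrative.
  -- option 1: fix the number of reducers explicitly
  set mapred.reduce.tasks = 50;
  -- option 2: let Hive derive it from the input size (here ~2 GB per reducer, capped at 999)
  set hive.exec.reducers.bytes.per.reducer = 2000000000;
  set hive.exec.reducers.max = 999;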
  • 76. Optimization between Map and Reduce (spill, copy, sort phases)
  • Spill and Sort
  • In the spill phase, if the data cannot all be sorted in memory at once, the partially sorted data is written to disk (a "spill"), and the multiple spill files are merged at the end. If spills happen, increase io.sort.mb to enlarge the mapper output buffer and avoid them; io.sort.factor controls how many files can be merged in one pass. When tuning, weigh the cost of spilling against the cost of merging, and be careful not to blow the memory budget (io.sort.mb is counted inside the map task's memory). The reduce-side merge can likewise use io.sort.factor. In general these two parameters rarely need adjusting unless this step is clearly the bottleneck.
  • 77. Optimization between Map and Reduce (spill, copy, sort phases)
  • Copy
  • The copy phase moves files from the map side to the reduce side. By default reducers start copying once 5% of the maps have finished, which can waste resources: a reducer occupies its slot from launch but cannot proceed until all maps finish and all data is collected. We can therefore delay reducer start until more maps have finished via mapred.reduce.slowstart.completed.maps (default 5%). If that slows the copy phase too much, increase the copy threads instead: tasktracker.http.threads sets the map-side (server) threads serving data, and mapred.reduce.parallel.copies sets how many maps each reducer (client) pulls from in parallel; keep the two coordinated so the server side can handle the client load. (Settings collected in the sketch below.)
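  The parameters discussed in the last two slides, collected into one hedged example; the values are illustrative, not recommendations.
  set io.sort.mb = 200;                               -- mapper output buffer, counted against map memory
  set io.sort.factor = 50;                            -- number of spill files merged in one pass
  set mapred.reduce.slowstart.completed.maps = 0.8;   -- start reducers after 80% of maps finish
  set tasktracker.http.threads = 40;                  -- map-side (server) threads serving shuffle data
  set mapred.reduce.parallel.copies = 10;             -- maps each reducer fetches from in parallel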
  • 78. Other file optimizations – the small-files problem
  • The small-files problem is already handled reasonably well in current Hive: by default many small input files are merged and handed to one map, and if the output files are small a separate merge round runs on the output
  • Approaches: • 1. Input merging, i.e. merge small files before the map • 2. Output merging, i.e. merge small files when writing the results
  • 79. Hive small files – input merging
  • -- Maximum input size per map; determines the number of merged files
  • set mapred.max.split.size=256000000;
  • -- Minimum split size on one node; determines whether files on different datanodes are merged
  • set mapred.min.split.size.per.node=100000000;
  • -- Minimum split size within one rack; determines whether files on different racks are merged
  • set mapred.min.split.size.per.rack=100000000;
  • -- Merge small files before the map runs
  • set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
  • 80. Hive small files – output merging
  • hive.merge.mapfiles – merge files after a map-only job; default true
  • hive.merge.mapredfiles – merge files after a map-reduce job; default false
  • hive.merge.size.per.task – size of each merged file; default 256000000
  • hive.merge.smallfiles.avgsize – average output file size below which the merge is triggered; default 16000000
  • (see the set example below)
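  The same four parameters written as set statements; the sizes shown are simply the defaults quoted above.
  set hive.merge.mapfiles = true;                -- merge after map-only jobs
  set hive.merge.mapredfiles = true;             -- merge after map-reduce jobs
  set hive.merge.size.per.task = 256000000;      -- target size of each merged file
  set hive.merge.smallfiles.avgsize = 16000000;  -- trigger the merge when the average output file is below this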
  • 81. Handling compressed files
  • When the result is stored as compressed files, solving the small-files problem depends on where the merge happens: merging before the map puts no restriction on the output storage format, but output merging only works together with SequenceFile storage, otherwise no merge is possible. Example:
  • set mapred.output.compression.type=BLOCK;
  • set hive.exec.compress.output=true;
  • set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
  • set hive.merge.smallfiles.avgsize=100000000;
  • drop table if exists dw_stage.zj_small;
  • create table dw_stage.zj_small STORED AS SEQUENCEFILE as select * from dw_db.dw_soj_imp_dtl where log_dt = '2014-04-14' and paid like '%baidu%';
  • 82. Using HAR archives
  • Hadoop's archive file format is another way to tackle the small-files problem, and Hive supports it natively:
  • set hive.archive.enabled=true;
  • set hive.archive.har.parentdir.settable=true;
  • set har.partfile.size=1099511627776;
  • ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12');
  • ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12');
  • For non-partitioned data, create an external table and point it at a har:// path.
  • 83. Job optimization • 1. Job execution mode • 2. JVM reuse • 3. Indexes • 4. Join algorithms • 5. Data skew
  • 84. Job execution mode
  • Hadoop MapReduce jobs can run in three modes: local, pseudo-distributed, and fully distributed. Local and pseudo-distributed modes are usually presented as something only used while first learning Hadoop, but for jobs that process very little data, launching a distributed job wastes a lot of resources while the actual compute time is tiny. In such cases running the MR job in local mode avoids starting a distributed job and is much faster: a distributed job rarely finishes in under ~20 s regardless of data size, whereas local MR mode returns in roughly 10 seconds.
  • Three parameters control this. hive.exec.mode.local.auto=true enables automatic local mode, but it only kicks in when the input is small enough: hive.exec.mode.local.auto.tasks.max (default 4) bounds the number of input files and hive.exec.mode.local.auto.inputbytes.max (default 128 MB) bounds their total size, i.e. by default local MR is used when the maps would read at most 4 files totalling under 128 MB. (Sketch below.)
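  The three parameters as a hedged example; the values are the defaults mentioned above.
  set hive.exec.mode.local.auto = true;                       -- allow automatic local mode
  set hive.exec.mode.local.auto.tasks.max = 4;                -- at most 4 input files
  set hive.exec.mode.local.auto.inputbytes.max = 134217728;   -- total input under 128 MB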
  • 88. Whole-query SQL optimization
  • 1. Inter-job parallelism
  • hive.exec.parallel=true lets independent jobs of one SQL statement run in parallel. The default degree of parallelism is 8, i.e. at most 8 of the statement's jobs run concurrently; hive.exec.parallel.thread.number raises it, but avoid setting it so high that it hogs resources.
  • 2. Reducing the number of jobs
  • Example: count the users who visited both page a and page b on a site
  select count(*) from (select distinct user_id from logs where page_name = 'a') a join (select distinct user_id from logs where page_name = 'b') b on a.user_id = b.user_id;
  • 89. Whole-query SQL optimization – the same question answered with fewer jobs, using a single GROUP BY instead of two subqueries plus a join:
  select count(*) from (
    select user_id from logs
    group by user_id
    having count(case when page_name = 'a' then 1 end) > 0
       and count(case when page_name = 'b' then 1 end) > 0
  ) t;
  • 90. Indexed Hive • Hive Indexing • Provides a key-based data view • Key data is duplicated • Storage layout favors search & lookup performance • Provides better data access for certain operations • A cheaper alternative to full data scans!
  • 91. What does the index look like? • An index is a table with 3 columns • The data in the index looks like the sketch below
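  The slide's screenshot of the index data is not reproduced here. As a hedged sketch, a compact index built as below is materialized as a table whose columns are the indexed key plus the HDFS file name and the row offsets within it (commonly `_bucketname` and `_offsets`); the table and index names are hypothetical.
  CREATE INDEX idx_logs_user ON TABLE logs (user_id)
  AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
  WITH DEFERRED REBUILD;

  ALTER INDEX idx_logs_user ON logs REBUILD;

  -- the generated index table roughly looks like: (user_id, _bucketname STRING, _offsets ARRAY<BIGINT>)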
  • 92. Hive index in HQL • SELECT (mapping, projection, association: given a key, fetch the value) • WHERE (filters on keys) • GROUP BY (grouping on keys) • JOIN (join key as index key) • Indexes have high potential for accelerating a wide range of queries
  • 93. Hive Index • Index as Reference • Index as Data • Here the index-as-data approach is used for the demonstration • Uses a query-rewrite technique to transform queries on the base table into queries on the index table • Limited applicability currently, but the technique itself has wide potential • Also a very quick way to demonstrate the importance of indexes for performance
  • 94. Indexes and Query Rewrites • GROUP BY, aggregation • Index as Data • GROUP BY key = index key • The query is rewritten to use the index, but is still a valid query (nothing special in it!) – see the sketch below
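  A hedged illustration of the idea: a GROUP BY whose key equals the index key can be answered from the much smaller (and already sorted) index table instead of the base table. The rewrite below is written out by hand on the hypothetical index from the earlier sketch; the index-table name follows the usual <db>__<table>_<index>__ convention but is an assumption, not something shown on the slides.
  -- original query on the base table
  SELECT user_id, COUNT(*) FROM logs GROUP BY user_id;

  -- equivalent query on the compact index table keyed by user_id:
  -- each index row holds that key's row offsets within one file, so counting = summing array sizes
  SELECT user_id, SUM(size(`_offsets`))
  FROM default__logs_idx_logs_user__
  GROUP BY user_id;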
  • 96. where
  • 99. Year on year query
  • 100. Year on year query
  • 101. Why does the index perform better? • Reducing data increases I/O efficiency • Exploits the storage layout optimization • e.g. GROUP BY: sort + aggregate vs. hash + aggregate – the sort step is already done in the index • Parallelization • The index data is processed in the same manner as the base table, distributed across nodes • Scalable
  • 105. Hive MetaStore ER diagram
  • (ER diagram: the metastore tables include DBS, DATABASE_PARAMS, DB_PRIVS, TBLS, TABLE_PARAMS, TBL_PRIVS, PARTITION_KEYS, COLUMNS, SDS, SD_PARAMS, SORT_COLS, BUCKETING_COLS, SERDES, SERDE_PARAMS, IDXS, INDEX_PARAMS, ROLES, ROLE_MAP, GLOBAL_PRIVS, and SEQUENCE_TABLE, each keyed by a BIGINT(20) id column.)
  • 106. Reference
  • https://cwiki.apache.org/confluence/display/Hive/DesignDocs
  • Facebook Hive Summit 2011 – join: Hive from the 2011 Hadoop Summit (Liyin Tang, Namit Jain)
  • Indexed Hive – Prafulla Tekawade / Nikhil Deshpande
  • Internal Hive – http://www.slideshare.net/recruitcojp/internal-hive
  • The Hive SQL compilation process: http://tech.meituan.com/hive-sql-to-mapreduce.html
  • MonetDB/X100: Hyper-Pipelining Query Execution. 2005, Peter Boncz, Marcin Zukowski, Niels Nes
  • YSmart: Yet Another SQL-to-MapReduce Translator, Rubao Lee, Tian Luo, ...