Accelerating distributed joins in Apache Hive: Runtime filtering enhancements

Accelerating distributed joins in
Apache Hive:
Runtime ﬁltering enhancements
ApacheCon 2020
29th September 2020

© 2020 Cloudera, Inc. All rights reserved. 2
About us
Panagiotis (Panos) Garefalakis @pgaref
Software Engineer @Cloudera, Hive runtime team
Working on Apache Hive, ORC, Tez
PhD in Distributed Systems, Imperial College London
Stamatis Zampetakis
Software Engineer @Cloudera, Hive query optimizer team
Working on Apache Hive, Calcite
PhD in Data Management, INRIA & Paris-Sud University

Apache Hive
Hive 1.0 Hive 2.0 Hive 3.0 Hive 4.0
GB to PB and low-latency multi-tenant
HiveQL
MapReduce
Read only
ETL
SQL
Tez & Spark
LLAP
Vectorization
BI Tools
Mat. Views
AntiJoin
Semi-join Red.
Lazy Mat.
Ad-Hoc
Batch SQL Analytics SQL Interactive SQL

Hive users

Hive with LLAP

LLAP daemon anatomy
Query
Coordinator
Query
Coordinator
Query
Coordinator

Outline
Data pruning
• Optimizations
• Benefit
Semijoin reduction
• Less data transfer
• Hive semijoin reduction
Lazy materialization
• ORC file format
• Row-level filtering
• Hive Probe-decode

Data pruning
HOW? WHY?
Reduce memory footprint
Reduce disk & network I/O
Save CPU cycles
Semijoin reduction
Late materialization
... and many more

Data pruning
Baseline
SELECT sr_net_loss as net_loss
FROM store_returns
INNER JOIN reason
ON sr_reason_sk = r_reason_sk
WHERE
r_reason_desc =
'Found a better price in a store'
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss

Data pruning
Baseline
FROM store_returns
INNER JOIN reason
WHERE
r_reason_desc =
Scan
store_returns
InnerJoin
Scan
reason
Filter
Project
sr_net_loss

Data pruning
Filter pushdown
Scan
store_returns
InnerJoin
Scan
reason
Project
sr_net_loss
Filter
BEFORE AFTER
Scan
store_returns
InnerJoin
Scan
reason
Filter
Project
sr_net_loss

Data pruning
Filter pushdown
BEFORE AFTER

Data pruning
Project pushdown
BEFORE AFTER
Scan
store_returns
InnerJoin
Scan
reason
Project
sr_net_loss
Filter
Scan
store_returns
InnerJoin
Scan
reason
Filter
Project
sr_net_loss
Project
sr_reason_sk,
sr_net_loss
Project
r_reason_sk

Data pruning
Project pushdown
BEFORE AFTER

Data pruning
Semijoin reduction
BEFORE AFTER
Scan
store_returns
InnerJoin
Scan
reason
Filter
Project
sr_net_loss
Project
sr_reason_sk,
sr_net_loss
Project
r_reason_sk
Scan
store_returns
SemiJoin
Scan
reason
Filter
Project
sr_net_loss
Project
sr_reason_sk,
sr_net_loss
Project
r_reason_sk

Data pruning
Semijoin reduction
BEFORE AFTER

Data pruning
Late materialization
BEFORE AFTER
CE CE CE
CE
COMPRESSED
ENCODED (CE)

Data pruning
Beneﬁt
BASELINE FILTER PUSHDOWN PROJECT PUSHDOWN SEMIJOIN REDUCTION &
LATE MATERIALIZATION
CE CE CE
CE

Data pruning
Beneﬁt
BASELINE FILTER PUSHDOWN PROJECT PUSHDOWN SEMIJOIN REDUCTION &
LATE MATERIALIZATION
CE CE
CE
CE

Common names
Semijoin reduction
Invisible join
Selective join pushdown
Sideways information passing

Beneﬁt
Operators between the semijoin & and the reduced join treat less data:
➔ Reduce disk & network I/O
➔ Save CPU cycles
➔ Decrease memory footprint

TPC-DS Query Example
Find the ten items that were returned due to a more competitive price
SELECT ss_item_sk, sum(sr_net_loss) as net_loss, avg(ss_sales_price) as price
FROM store_sales
INNER JOIN store_returns
ON ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number
INNER JOIN reason
WHERE r_reason_desc = 'Found a better price in a store'
GROUP BY ss_item_sk
ORDER BY net_loss DESC NULLS LAST, price DESC NULLS LAST
LIMIT 10;

InnerJoin
ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number
Limit
10
Sort
net_loss DESC, price DESC
Scan
store_sales
Scan
store_returns
Aggregate
group=ss_item_sk, agg0=avg(ss_sales_price), agg1=(sr_net_loss)
InnerJoin
Scan
reason
Filter

InnerJoin
Limit
10
Scan
store_sales
Scan
store_returns
Aggregate
InnerJoin
Scan
reason
Filter
r_reason_desc=”Found...” A
Sort

InnerJoin
Limit
10
Scan
store_sales
Aggregate
A
Sort

InnerJoin
Limit
10
Scan
store_sales
Aggregate
A
B
Sort

InnerJoin
Scan
store_sales A
B

InnerJoin
Scan
store_sales A
B
2880404|46MB 8000|128KB
8000|160K

InnerJoin
Scan
store_sales A
B
N1 N2
N3
Node

N3
N2N1
46MB 128KB
160KB
Node

InnerJoin
Scan
store_sales A
B

InnerJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
SemiJoin

InnerJoin
Scan
store_sales A
B
SemiJoin
2880404|46MB 8000|128KB
8000|160K
8000|128K

InnerJoin
Scan
store_sales A
B
SemiJoin
N3
N2N1
N1
Node

N3
N2N1
128KB 128KB
160KB
64KB
Node

N3
N2N1
128KB 128KB
160KB
64KB
N3
N2N1
46MB 128KB
160KB
With semijoin
Total I/O=480KB
Without semijoin
Total I/O=46MB

InnerJoin
Scan
store_sales A
B
SemiJoin
Semijoin reduction in Hive

InnerJoin
Scan
store_sales A
B
Filterss_item_sk BETWEEN :min_sr_item_sk AND :max_sr_item
ss_ticket_number BETWEEN :min_sr_ticket_number AND :max_sr_ticket_number
IN_BLOOM_FILTER(HASH(ss_item_sk, ss_ticket_number), :bloom_sr)
Aggregatemin(sr_item_sk)
max(sr_item_sk)
min(sr_ticket_number)
max(sr_ticket_number)
bloom_ﬁlter(hash(sr_item_sk,sr_ticket_number))

InnerJoin
Scan
store_sales A
B
Aggregate
MIN(sr_item_sk)
MAX(sr_item_sk)
MIN(sr_ticket_number)
MAX(sr_ticket_number)
BLOOM_FILTER(HASH(sr_item_sk,sr_ticket_number))

InnerJoin
Scan
store_sales A
B
max(sr_item_sk)

InnerJoin
Scan
store_sales A
B
Filter
ss_item_sk BETWEEN :min_sr_item_sk AND :max_sr_item
max(sr_item_sk)

N1
64KB
N3
N2N1
46MB 128KB
160KB
With semijoin
Total I/O=480KB
Without semijoin
Total I/O=46MB
N3
N2
128KB 128KB
160KB
N1
9KB
With bloom semijoin
Total I/O=425KB
N3
N2
128KB 128KB
160KB

Four phase overview
Create synthetic predicate (SyntheticJoinPredicate)
Expand to semijoin reducers (DynamicPartitionPruningOptimization)
Merge to multi column semijoin reducers (SemiJoinReductionMerge)
• HIVE-21196
• HIVE-23976
Remove conditionally semijoin reducers (SemiJoinRemoval & TezCompiler)

Common names
Row-level ﬁltering
Lazy decoding
Probe-decode

Beneﬁt
Avoid materializing data, not needed for the evaluation of the remaining of a
query
➔ Save CPU cycles
➔ Reduce memory allocations
➔ Eliminate data and ﬁlter operators from the pipeline

ORC row group selection
• Data organized in stripes
of row groups (row x 10K)
• A single matched key of a
row group leads to Ks of
decoded rows
• Even for low selectivity
queries, most of the
row-groups might be read
and decoded

ORC row group decoding

ORC file structure
The file footer contains:
• Metadata
• Stripe information
• Postscript with compression details
ORC file data divided into stripes:
• Self contained sets of rows (organized by columns)
• Smallest unit of work for tasks
• ~64MB by default, often larger/smaller (16MB for Hive ACID deltas)

ORC ﬁle structure
Stripe footer:
• List of col streams and locations
• Column encoding
Index data:
• Offset to seek to right row group
• Stats (min, max, sum) per column
• Orc.row.index.stride 10,000
Row data:
• Index stream
• Metadata stream
• Data stream
StripeNStripe1

ORC ﬁle structure
Stripe footer:
• Column encoding
Index data:
Row data:
• Index stream
• Metadata stream
• Data stream
•
StripeNStripe1

ORC ﬁle structure
Stripe footer:
• Column encoding
Index data:
Row data:
• Index stream
• Metadata stream
• Data stream
StripeNStripe1

Hive can pass Search Arguments (SArgs) filter to the Reader:
• Limited set of operations (=, <=>, <, <=, in, between, is null)
• Compare one column to literal(s) e.g., “age > 100”
Limitations:
• Can’t support filters with: inequalities, UDFs, or columns with relationships
• Can only eliminate entire row groups, stripes, or files
SELECT x,y,z FROM foo WHERE x <> 1;
ORC search arguments

SArgs filter RGs ONLY if they can guarantee that none of the rows can pass the
filter.
ORC row-level filteringRowgroup1
Row data
Stripe

filter. [ORC-577] Finer grained row-level filtering!
Read a subset of the columns and apply more detailed row level filter.
Row data
Stripe
Early
Expand
1

Row data
Stripe
Early
Expand
1 2
Apply
Filter
Selected rows

Row data
Stripe
Early
Expand
1 2
Apply
Filter
Selected rows
Read
Remaining
3

ORC row-level ﬁltering evaluation
[ORC-597] benchmark ORC Reader for complex types such as Decimals,
Doubles and real datasets (Taxi, Sales, GitHub)

[HIVE-22959] introduce FilterContext as part of VectorRowBatch
[ORC-577] extend ORC Reader with a new option:
setRowFilter(String[] columnNames,Consumer<VectorizedRowBatch> filter)
Consumer callback to implement any kind of filtering logic
New ORC reader logic:
• Early expand the given column(s)
• Evaluate the ﬁlter callback (implemented on the consumer side)
• Decode only the selected rows from the remaining columns
ORC row-level ﬁltering implementation

Probe-decode optimization
Leveraging ORC row-level ﬁltering
SELECT *
FROM catalog_sales
INNER JOIN date_dim
ON cs_sold_date_sk = d_date_sk
WHERE d_year = 2000 and d_qoy = 2
LIMIT 10
Scan
date_dim
Filter
dyear=2000 AND dqoy=2
Scan
catalog_sales
InnerJoin
cs_sold_date_sk=d_date_sk
Limit
10
Decode
d_date_sk, dyear, dqoy
2_220_880_404 1_000_000
8_000
1_000_000
8_000

Static row-level ﬁltering
SELECT *
FROM catalog_sales
INNER JOIN date_dim
LIMIT 10
Scan
date_dim
Filter
Scan
catalog_sales
InnerJoin
Limit
10
ProbeDecode
d_date_sk
2_220_880_404 1_000_000
8_000
8_000
8_000

Dynamic MJ row-level ﬁltering
SELECT *
FROM catalog_sales
INNER JOIN date_dim
LIMIT 10
Scan
date_dim
Filter
Scan
catalog_sales
InnerJoin
Limit
10
SemiJoin
ProbeDecode
cs_item_sk
420_504
2_220_880_404 1_000_000
8_000
8_000
8_000ProbeDecode
d_date_sk

Probe-decode plan
Map 1 <- Map 4 (BROADCAST_EDGE), Map 5 (BROADCAST_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: store_sales
probeDecodeDetails: cacheKey:HASH_MAP_MAPJOIN_48_container, bigKeyColName:ss_item_sk, smallTablePos:1, keyRatio:1.6
Statistics: Num rows: 82510879939 Data size: 10343396725952 Basic stats: COMPLETE Column stats: COMPLETE
TableScan Vectorization:
native: true
Select Operator
expressions: ss_item_sk (type: bigint), ss_ext_sales_price (type: decimal(7,2)), ss_sold_date_sk (type: bigint)
outputColumnNames: _col0, _col1, _col2
Select Vectorization:
className: VectorSelectOperator
native: true
projectedOutputColumnNums: [1, 14, 22]
Statistics: Num rows: 82510879939 Data size: 10343396725952 Basic stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:"
0 _col2 (type: bigint)
1 _col0 (type: bigint)
Map Join Vectorization:
bigTableKeyColumns: 22:bigint
className: VectorMapJoinInnerBigOnlyLongOperator
input vertices:
1 Map 4
SET hive.optimize.scan.probedecode=true

⬢
–
–
–
–
–
DATA PLANE SERVICES
EDWS Data Analytics Studio (DAS) Management Services
COMPUTE CLUSTER
SHARED
SERVICES
Ranger
Atlas
Metastore
Tiller API Server
DAS Web Service
Query Coordinators
Query Executors
Registry
Blobstore
Indexer
RDBMS
Hive Server
Long-running kubernetes cluster
Inter-cluster communication Intra-cluster communication
Ingress Controller or
Load Balancer
Internal Service Endpoint for
ReplicaSet or StatefulSet
Ephemeral kubernetes cluster

Data pruning
• Prune early → increase query eﬃciency
Semijoin reduction
• Reduce disk & network I/O
Lazy materialization
• Save CPU cycles and memory allocations
Conclusions

References
1. Stocker, Konrad, et al. "Integrating semi-join-reducers into state-of-the-art
query processors." ICDE 2001
2. Abadi, Daniel J., Samuel R. Madden, and Nabil Hachem. "Column-stores vs.
row-stores: how different are they really?." SIGMOD 2008
3. Lang, Harald, et al. "Performance-optimal ﬁltering: Bloom overtakes cuckoo at
high throughput." VLDB 2019

Accelerating distributed joins in Apache Hive: Runtime filtering enhancements

Recommended

Recommended

More Related Content

Similar to Accelerating distributed joins in Apache Hive: Runtime filtering enhancements

Similar to Accelerating distributed joins in Apache Hive: Runtime filtering enhancements (20)

More from Panagiotis Garefalakis

More from Panagiotis Garefalakis (8)

Recently uploaded

Recently uploaded (20)

Accelerating distributed joins in Apache Hive: Runtime filtering enhancements