SlideShare a Scribd company logo
Accelerating distributed joins in
Apache Hive:
Runtime filtering enhancements
ApacheCon 2020
29th September 2020
© 2020 Cloudera, Inc. All rights reserved. 2
About us
Panagiotis (Panos) Garefalakis @pgaref
Software Engineer @Cloudera, Hive runtime team
Working on Apache Hive, ORC, Tez
PhD in Distributed Systems, Imperial College London
Stamatis Zampetakis
Software Engineer @Cloudera, Hive query optimizer team
Working on Apache Hive, Calcite
PhD in Data Management, INRIA & Paris-Sud University
© 2020 Cloudera, Inc. All rights reserved. 3
Apache Hive
Hive 1.0 Hive 2.0 Hive 3.0 Hive 4.0
GB to PB and low-latency multi-tenant
HiveQL
MapReduce
Read only
ETL
SQL
Tez & Spark
LLAP
Vectorization
BI Tools
Mat. Views
AntiJoin
Semi-join Red.
Lazy Mat.
Ad-Hoc
Batch SQL Analytics SQL Interactive SQL
© 2020 Cloudera, Inc. All rights reserved. 4
Hive users
© 2020 Cloudera, Inc. All rights reserved. 5
Hive with LLAP
© 2020 Cloudera, Inc. All rights reserved. 6
LLAP daemon anatomy
Query
Coordinator
Query
Coordinator
Query
Coordinator
© 2020 Cloudera, Inc. All rights reserved. 7
LLAP daemon anatomy
Query
Coordinator
Query
Coordinator
Query
Coordinator
© 2020 Cloudera, Inc. All rights reserved. 8
Outline
Data pruning
• Optimizations
• Benefit
Semijoin reduction
• Less data transfer
• Hive semijoin reduction
Lazy materialization
• ORC file format
• Row-level filtering
• Hive Probe-decode
© 2020 Cloudera, Inc. All rights reserved. 9
Data pruning
HOW? WHY?
Reduce memory footprint
Reduce disk & network I/O
Save CPU cycles
Semijoin reduction
Late materialization
... and many more
© 2020 Cloudera, Inc. All rights reserved. 10
Data pruning
Baseline
SELECT sr_net_loss as net_loss
FROM store_returns
INNER JOIN reason
ON sr_reason_sk = r_reason_sk
WHERE
r_reason_desc =
'Found a better price in a store'
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
© 2020 Cloudera, Inc. All rights reserved. 11
Data pruning
Baseline
SELECT sr_net_loss as net_loss
FROM store_returns
INNER JOIN reason
ON sr_reason_sk = r_reason_sk
WHERE
r_reason_desc =
'Found a better price in a store'
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
© 2020 Cloudera, Inc. All rights reserved. 12
Data pruning
Baseline
SELECT sr_net_loss as net_loss
FROM store_returns
INNER JOIN reason
ON sr_reason_sk = r_reason_sk
WHERE
r_reason_desc =
'Found a better price in a store'
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
© 2020 Cloudera, Inc. All rights reserved. 13
Data pruning
Baseline
SELECT sr_net_loss as net_loss
FROM store_returns
INNER JOIN reason
ON sr_reason_sk = r_reason_sk
WHERE
r_reason_desc =
'Found a better price in a store'
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
© 2020 Cloudera, Inc. All rights reserved. 14
Data pruning
Baseline
SELECT sr_net_loss as net_loss
FROM store_returns
INNER JOIN reason
ON sr_reason_sk = r_reason_sk
WHERE
r_reason_desc =
'Found a better price in a store'
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
© 2020 Cloudera, Inc. All rights reserved. 15
Data pruning
Baseline
SELECT sr_net_loss as net_loss
FROM store_returns
INNER JOIN reason
ON sr_reason_sk = r_reason_sk
WHERE
r_reason_desc =
'Found a better price in a store'
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
© 2020 Cloudera, Inc. All rights reserved. 16
Data pruning
Baseline
SELECT sr_net_loss as net_loss
FROM store_returns
INNER JOIN reason
ON sr_reason_sk = r_reason_sk
WHERE
r_reason_desc =
'Found a better price in a store'
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
© 2020 Cloudera, Inc. All rights reserved. 17
Data pruning
Baseline
SELECT sr_net_loss as net_loss
FROM store_returns
INNER JOIN reason
ON sr_reason_sk = r_reason_sk
WHERE
r_reason_desc =
'Found a better price in a store'
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
© 2020 Cloudera, Inc. All rights reserved. 18
Data pruning
Baseline
SELECT sr_net_loss as net_loss
FROM store_returns
INNER JOIN reason
ON sr_reason_sk = r_reason_sk
WHERE
r_reason_desc =
'Found a better price in a store'
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
© 2020 Cloudera, Inc. All rights reserved. 19
Data pruning
Filter pushdown
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Project
sr_net_loss
Filter
r_reason_desc=”Found...”
BEFORE AFTER
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
© 2020 Cloudera, Inc. All rights reserved. 20
Data pruning
Filter pushdown
BEFORE AFTER
© 2020 Cloudera, Inc. All rights reserved. 21
Data pruning
Project pushdown
BEFORE AFTER
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Project
sr_net_loss
Filter
r_reason_desc=”Found...”
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
Project
sr_reason_sk,
sr_net_loss
Project
r_reason_sk
© 2020 Cloudera, Inc. All rights reserved. 22
Data pruning
Project pushdown
BEFORE AFTER
© 2020 Cloudera, Inc. All rights reserved. 23
Data pruning
Semijoin reduction
BEFORE AFTER
Scan
store_returns
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
Project
sr_reason_sk,
sr_net_loss
Project
r_reason_sk
Scan
store_returns
SemiJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
Project
sr_net_loss
Project
sr_reason_sk,
sr_net_loss
Project
r_reason_sk
© 2020 Cloudera, Inc. All rights reserved. 24
Data pruning
Semijoin reduction
BEFORE AFTER
© 2020 Cloudera, Inc. All rights reserved. 25
Data pruning
Late materialization
BEFORE AFTER
CE CE CE
CE
COMPRESSED
ENCODED (CE)
© 2020 Cloudera, Inc. All rights reserved. 26
Data pruning
Benefit
BASELINE FILTER PUSHDOWN PROJECT PUSHDOWN SEMIJOIN REDUCTION &
LATE MATERIALIZATION
CE CE CE
CE
© 2020 Cloudera, Inc. All rights reserved. 27
Data pruning
Benefit
BASELINE FILTER PUSHDOWN PROJECT PUSHDOWN SEMIJOIN REDUCTION &
LATE MATERIALIZATION
CE CE
CE
CE
Semijoin reduction
© 2020 Cloudera, Inc. All rights reserved. 29
Common names
Semijoin reduction
Invisible join
Selective join pushdown
Sideways information passing
© 2020 Cloudera, Inc. All rights reserved. 30
Benefit
Operators between the semijoin & and the reduced join treat less data:
➔ Reduce disk & network I/O
➔ Save CPU cycles
➔ Decrease memory footprint
© 2020 Cloudera, Inc. All rights reserved. 31
TPC-DS Query Example
Find the ten items that were returned due to a more competitive price
SELECT ss_item_sk, sum(sr_net_loss) as net_loss, avg(ss_sales_price) as price
FROM store_sales
INNER JOIN store_returns
ON ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number
INNER JOIN reason
ON sr_reason_sk = r_reason_sk
WHERE r_reason_desc = 'Found a better price in a store'
GROUP BY ss_item_sk
ORDER BY net_loss DESC NULLS LAST, price DESC NULLS LAST
LIMIT 10;
© 2020 Cloudera, Inc. All rights reserved. 32
InnerJoin
ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number
Limit
10
Sort
net_loss DESC, price DESC
Scan
store_sales
Scan
store_returns
Aggregate
group=ss_item_sk, agg0=avg(ss_sales_price), agg1=(sr_net_loss)
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...”
© 2020 Cloudera, Inc. All rights reserved. 33
InnerJoin
ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number
Limit
10
Scan
store_sales
Scan
store_returns
Aggregate
group=ss_item_sk, agg0=avg(ss_sales_price), agg1=(sr_net_loss)
InnerJoin
sr_reason_sk = r_reason_sk
Scan
reason
Filter
r_reason_desc=”Found...” A
Sort
net_loss DESC, price DESC
© 2020 Cloudera, Inc. All rights reserved. 34
InnerJoin
ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number
Limit
10
Scan
store_sales
Aggregate
group=ss_item_sk, agg0=avg(ss_sales_price), agg1=(sr_net_loss)
A
Sort
net_loss DESC, price DESC
© 2020 Cloudera, Inc. All rights reserved. 35
InnerJoin
ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number
Limit
10
Scan
store_sales
Aggregate
group=ss_item_sk, agg0=avg(ss_sales_price), agg1=(sr_net_loss)
A
B
Sort
net_loss DESC, price DESC
© 2020 Cloudera, Inc. All rights reserved. 36
InnerJoin
ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
© 2020 Cloudera, Inc. All rights reserved. 37
InnerJoin
ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
2880404|46MB 8000|128KB
8000|160K
© 2020 Cloudera, Inc. All rights reserved. 38
InnerJoin
ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
N1 N2
N3
Node
© 2020 Cloudera, Inc. All rights reserved. 39
N3
N2N1
46MB 128KB
160KB
Node
© 2020 Cloudera, Inc. All rights reserved. 40
InnerJoin
ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
© 2020 Cloudera, Inc. All rights reserved. 41
InnerJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
SemiJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
© 2020 Cloudera, Inc. All rights reserved. 42
InnerJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
SemiJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
2880404|46MB 8000|128KB
8000|160K
8000|128K
© 2020 Cloudera, Inc. All rights reserved. 43
InnerJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
SemiJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
N3
N2N1
N1
Node
© 2020 Cloudera, Inc. All rights reserved. 44
N3
N2N1
128KB 128KB
160KB
64KB
Node
© 2020 Cloudera, Inc. All rights reserved. 45
N3
N2N1
128KB 128KB
160KB
64KB
N3
N2N1
46MB 128KB
160KB
With semijoin
Total I/O=480KB
Without semijoin
Total I/O=46MB
© 2020 Cloudera, Inc. All rights reserved. 46
InnerJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
SemiJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
Semijoin reduction in Hive
© 2020 Cloudera, Inc. All rights reserved. 47
InnerJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
Filterss_item_sk BETWEEN :min_sr_item_sk AND :max_sr_item
ss_ticket_number BETWEEN :min_sr_ticket_number AND :max_sr_ticket_number
IN_BLOOM_FILTER(HASH(ss_item_sk, ss_ticket_number), :bloom_sr)
Aggregatemin(sr_item_sk)
max(sr_item_sk)
min(sr_ticket_number)
max(sr_ticket_number)
bloom_filter(hash(sr_item_sk,sr_ticket_number))
Semijoin reduction in Hive
© 2020 Cloudera, Inc. All rights reserved. 48
InnerJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
Filterss_item_sk BETWEEN :min_sr_item_sk AND :max_sr_item
ss_ticket_number BETWEEN :min_sr_ticket_number AND :max_sr_ticket_number
IN_BLOOM_FILTER(HASH(ss_item_sk, ss_ticket_number), :bloom_sr)
Aggregate
MIN(sr_item_sk)
MAX(sr_item_sk)
MIN(sr_ticket_number)
MAX(sr_ticket_number)
BLOOM_FILTER(HASH(sr_item_sk,sr_ticket_number))
Semijoin reduction in Hive
© 2020 Cloudera, Inc. All rights reserved. 49
InnerJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
Filterss_item_sk BETWEEN :min_sr_item_sk AND :max_sr_item
ss_ticket_number BETWEEN :min_sr_ticket_number AND :max_sr_ticket_number
IN_BLOOM_FILTER(HASH(ss_item_sk, ss_ticket_number), :bloom_sr)
Aggregatemin(sr_item_sk)
max(sr_item_sk)
min(sr_ticket_number)
max(sr_ticket_number)
bloom_filter(hash(sr_item_sk,sr_ticket_number))
Semijoin reduction in Hive
© 2020 Cloudera, Inc. All rights reserved. 50
InnerJoin
ss_item_sk=sr_item_sk
AND ss_ticket_number=sr_ticket_number
Scan
store_sales A
B
Filter
ss_item_sk BETWEEN :min_sr_item_sk AND :max_sr_item
ss_ticket_number BETWEEN :min_sr_ticket_number AND :max_sr_ticket_number
IN_BLOOM_FILTER(HASH(ss_item_sk, ss_ticket_number), :bloom_sr)
Aggregatemin(sr_item_sk)
max(sr_item_sk)
min(sr_ticket_number)
max(sr_ticket_number)
bloom_filter(hash(sr_item_sk,sr_ticket_number))
Semijoin reduction in Hive
© 2020 Cloudera, Inc. All rights reserved. 51
N1
64KB
N3
N2N1
46MB 128KB
160KB
With semijoin
Total I/O=480KB
Without semijoin
Total I/O=46MB
N3
N2
128KB 128KB
160KB
N1
9KB
With bloom semijoin
Total I/O=425KB
N3
N2
128KB 128KB
160KB
© 2020 Cloudera, Inc. All rights reserved. 52
Semijoin reduction in Hive
Four phase overview
Create synthetic predicate (SyntheticJoinPredicate)
Expand to semijoin reducers (DynamicPartitionPruningOptimization)
Merge to multi column semijoin reducers (SemiJoinReductionMerge)
• HIVE-21196
• HIVE-23976
Remove conditionally semijoin reducers (SemiJoinRemoval & TezCompiler)
Lazy materialization
© 2020 Cloudera, Inc. All rights reserved. 54
Common names
Row-level filtering
Lazy decoding
Probe-decode
© 2020 Cloudera, Inc. All rights reserved. 55
Benefit
Avoid materializing data, not needed for the evaluation of the remaining of a
query
➔ Save CPU cycles
➔ Reduce memory allocations
➔ Eliminate data and filter operators from the pipeline
© 2020 Cloudera, Inc. All rights reserved. 56
ORC row group selection
• Data organized in stripes
of row groups (row x 10K)
• A single matched key of a
row group leads to Ks of
decoded rows
• Even for low selectivity
queries, most of the
row-groups might be read
and decoded
© 2020 Cloudera, Inc. All rights reserved. 57
ORC row group decoding
© 2020 Cloudera, Inc. All rights reserved. 58
ORC file structure
The file footer contains:
• Metadata
• Stripe information
• Postscript with compression details
ORC file data divided into stripes:
• Self contained sets of rows (organized by columns)
• Smallest unit of work for tasks
• ~64MB by default, often larger/smaller (16MB for Hive ACID deltas)
© 2020 Cloudera, Inc. All rights reserved. 59
ORC file structure
Stripe footer:
• List of col streams and locations
• Column encoding
Index data:
• Offset to seek to right row group
• Stats (min, max, sum) per column
• Orc.row.index.stride 10,000
Row data:
• Index stream
• Metadata stream
• Data stream
StripeNStripe1
© 2020 Cloudera, Inc. All rights reserved. 60
ORC file structure
Stripe footer:
• List of col streams and locations
• Column encoding
Index data:
• Offset to seek to right row group
• Stats (min, max, sum) per column
• Orc.row.index.stride 10,000
Row data:
• Index stream
• Metadata stream
• Data stream
•
StripeNStripe1
© 2020 Cloudera, Inc. All rights reserved. 61
ORC file structure
Stripe footer:
• List of col streams and locations
• Column encoding
Index data:
• Offset to seek to right row group
• Stats (min, max, sum) per column
• Orc.row.index.stride 10,000
Row data:
• Index stream
• Metadata stream
• Data stream
StripeNStripe1
© 2020 Cloudera, Inc. All rights reserved. 62
Hive can pass Search Arguments (SArgs) filter to the Reader:
• Limited set of operations (=, <=>, <, <=, in, between, is null)
• Compare one column to literal(s) e.g., “age > 100”
Limitations:
• Can’t support filters with: inequalities, UDFs, or columns with relationships
• Can only eliminate entire row groups, stripes, or files
SELECT x,y,z FROM foo WHERE x <> 1;
ORC search arguments
© 2020 Cloudera, Inc. All rights reserved. 63
SArgs filter RGs ONLY if they can guarantee that none of the rows can pass the
filter.
ORC row-level filteringRowgroup1
Row data
Stripe
SELECT x,y,z FROM foo WHERE x <> 1;
© 2020 Cloudera, Inc. All rights reserved. 64
SArgs filter RGs ONLY if they can guarantee that none of the rows can pass the
filter. [ORC-577] Finer grained row-level filtering!
Read a subset of the columns and apply more detailed row level filter.
ORC row-level filteringRowgroup1
Row data
Stripe
Early
Expand
1
SELECT x,y,z FROM foo WHERE x <> 1;
© 2020 Cloudera, Inc. All rights reserved. 65
SArgs filter RGs ONLY if they can guarantee that none of the rows can pass the
filter. [ORC-577] Finer grained row-level filtering!
Read a subset of the columns and apply more detailed row level filter.
ORC row-level filteringRowgroup1
Row data
Stripe
Early
Expand
1 2
Apply
Filter
Selected rows
SELECT x,y,z FROM foo WHERE x <> 1;
© 2020 Cloudera, Inc. All rights reserved. 66
SArgs filter RGs ONLY if they can guarantee that none of the rows can pass the
filter. [ORC-577] Finer grained row-level filtering!
Read a subset of the columns and apply more detailed row level filter.
ORC row-level filteringRowgroup1
Row data
Stripe
Early
Expand
1 2
Apply
Filter
Selected rows
Read
Remaining
3
SELECT x,y,z FROM foo WHERE x <> 1;
© 2020 Cloudera, Inc. All rights reserved. 67
ORC row-level filtering evaluation
[ORC-597] benchmark ORC Reader for complex types such as Decimals,
Doubles and real datasets (Taxi, Sales, GitHub)
© 2020 Cloudera, Inc. All rights reserved. 68
[HIVE-22959] introduce FilterContext as part of VectorRowBatch
[ORC-577] extend ORC Reader with a new option:
setRowFilter(String[] columnNames,Consumer<VectorizedRowBatch> filter)
Consumer callback to implement any kind of filtering logic
New ORC reader logic:
• Early expand the given column(s)
• Evaluate the filter callback (implemented on the consumer side)
• Decode only the selected rows from the remaining columns
ORC row-level filtering implementation
© 2020 Cloudera, Inc. All rights reserved. 69
Probe-decode optimization
Leveraging ORC row-level filtering
SELECT *
FROM catalog_sales
INNER JOIN date_dim
ON cs_sold_date_sk = d_date_sk
WHERE d_year = 2000 and d_qoy = 2
LIMIT 10
Scan
date_dim
Filter
dyear=2000 AND dqoy=2
Scan
catalog_sales
InnerJoin
cs_sold_date_sk=d_date_sk
Limit
10
Decode
d_date_sk, dyear, dqoy
2_220_880_404 1_000_000
8_000
1_000_000
8_000
© 2020 Cloudera, Inc. All rights reserved. 70
Probe-decode optimization
Static row-level filtering
SELECT *
FROM catalog_sales
INNER JOIN date_dim
ON cs_sold_date_sk = d_date_sk
WHERE d_year = 2000 and d_qoy = 2
LIMIT 10
Scan
date_dim
Filter
dyear=2000 AND dqoy=2
Scan
catalog_sales
InnerJoin
cs_sold_date_sk=d_date_sk
Limit
10
ProbeDecode
d_date_sk
2_220_880_404 1_000_000
8_000
8_000
8_000
© 2020 Cloudera, Inc. All rights reserved. 71
Probe-decode optimization
Dynamic MJ row-level filtering
SELECT *
FROM catalog_sales
INNER JOIN date_dim
ON cs_sold_date_sk = d_date_sk
WHERE d_year = 2000 and d_qoy = 2
LIMIT 10
Scan
date_dim
Filter
dyear=2000 AND dqoy=2
Scan
catalog_sales
InnerJoin
cs_sold_date_sk=d_date_sk
Limit
10
SemiJoin
cs_sold_date_sk=d_date_sk
ProbeDecode
cs_item_sk
420_504
2_220_880_404 1_000_000
8_000
8_000
8_000ProbeDecode
d_date_sk
© 2020 Cloudera, Inc. All rights reserved. 72
Probe-decode plan
Map 1 <- Map 4 (BROADCAST_EDGE), Map 5 (BROADCAST_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: store_sales
probeDecodeDetails: cacheKey:HASH_MAP_MAPJOIN_48_container, bigKeyColName:ss_item_sk, smallTablePos:1, keyRatio:1.6
Statistics: Num rows: 82510879939 Data size: 10343396725952 Basic stats: COMPLETE Column stats: COMPLETE
TableScan Vectorization:
native: true
Select Operator
expressions: ss_item_sk (type: bigint), ss_ext_sales_price (type: decimal(7,2)), ss_sold_date_sk (type: bigint)
outputColumnNames: _col0, _col1, _col2
Select Vectorization:
className: VectorSelectOperator
native: true
projectedOutputColumnNums: [1, 14, 22]
Statistics: Num rows: 82510879939 Data size: 10343396725952 Basic stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:"
0 _col2 (type: bigint)
1 _col0 (type: bigint)
Map Join Vectorization:
bigTableKeyColumns: 22:bigint
className: VectorMapJoinInnerBigOnlyLongOperator
input vertices:
1 Map 4
SET hive.optimize.scan.probedecode=true
© 2020 Cloudera, Inc. All rights reserved. 73
⬢
–
–
–
–
–
DATA PLANE SERVICES
EDWS Data Analytics Studio (DAS) Management Services
COMPUTE CLUSTER
SHARED
SERVICES
Ranger
Atlas
Metastore
Tiller API Server
DAS Web Service
Query Coordinators
Query Executors
Registry
Blobstore
Indexer
RDBMS
Hive Server
Long-running kubernetes cluster
Inter-cluster communication Intra-cluster communication
Ingress Controller or
Load Balancer
Internal Service Endpoint for
ReplicaSet or StatefulSet
Ephemeral kubernetes cluster
© 2020 Cloudera, Inc. All rights reserved. 74
Data pruning
• Prune early → increase query efficiency
Semijoin reduction
• Reduce disk & network I/O
Lazy materialization
• Save CPU cycles and memory allocations
Conclusions
© 2020 Cloudera, Inc. All rights reserved. 76
References
1. Stocker, Konrad, et al. "Integrating semi-join-reducers into state-of-the-art
query processors." ICDE 2001
2. Abadi, Daniel J., Samuel R. Madden, and Nabil Hachem. "Column-stores vs.
row-stores: how different are they really?." SIGMOD 2008
3. Lang, Harald, et al. "Performance-optimal filtering: Bloom overtakes cuckoo at
high throughput." VLDB 2019

More Related Content

Similar to Accelerating distributed joins in Apache Hive: Runtime filtering enhancements

Accelerating distributed joins in Apache Hive: Runtime filtering enhancements
Accelerating distributed joins in Apache Hive: Runtime filtering enhancementsAccelerating distributed joins in Apache Hive: Runtime filtering enhancements
Accelerating distributed joins in Apache Hive: Runtime filtering enhancements
Stamatis Zampetakis
 
Salesforce meetup | Lightning Web Component
Salesforce meetup | Lightning Web ComponentSalesforce meetup | Lightning Web Component
Salesforce meetup | Lightning Web Component
Accenture Hungary
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance Update
Cloudera, Inc.
 
Hug meetup impala 2.5 performance overview
Hug meetup impala 2.5 performance overviewHug meetup impala 2.5 performance overview
Hug meetup impala 2.5 performance overview
Mostafa Mokhtar
 
ReadingSEO - Technical SEO at Scale
ReadingSEO - Technical SEO at ScaleReadingSEO - Technical SEO at Scale
ReadingSEO - Technical SEO at Scale
Hayden Roche
 
PHPUnit Episode iv.iii: Return of the tests
PHPUnit Episode iv.iii: Return of the testsPHPUnit Episode iv.iii: Return of the tests
PHPUnit Episode iv.iii: Return of the tests
Michelangelo van Dam
 
SenchaCon 2016: Handle Real-World Data with Confidence - Fredric Berling
SenchaCon 2016: Handle Real-World Data with Confidence - Fredric Berling SenchaCon 2016: Handle Real-World Data with Confidence - Fredric Berling
SenchaCon 2016: Handle Real-World Data with Confidence - Fredric Berling
Sencha
 
Wide Column Store NoSQL vs SQL Data Modeling
Wide Column Store NoSQL vs SQL Data ModelingWide Column Store NoSQL vs SQL Data Modeling
Wide Column Store NoSQL vs SQL Data Modeling
ScyllaDB
 
2019 WIA - Data-Driven Product Improvements
2019 WIA - Data-Driven Product Improvements2019 WIA - Data-Driven Product Improvements
2019 WIA - Data-Driven Product Improvements
Women in Analytics Conference
 
Introducing Amazon Machine Learning
Introducing Amazon Machine LearningIntroducing Amazon Machine Learning
Introducing Amazon Machine Learning
Amazon Web Services
 
How to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeHow to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on Snowflake
AtScale
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
Stamatis Zampetakis
 
Hands on SPA development
Hands on SPA developmentHands on SPA development
Hands on SPA development
Shawn Constance
 
Data Transformation Patterns in AWS - AWS Online Tech Talks
Data Transformation Patterns in AWS - AWS Online Tech TalksData Transformation Patterns in AWS - AWS Online Tech Talks
Data Transformation Patterns in AWS - AWS Online Tech Talks
Amazon Web Services
 
Red Team vs. Blue Team on AWS ~ re:Invent 2018
Red Team vs. Blue Team on AWS ~ re:Invent 2018Red Team vs. Blue Team on AWS ~ re:Invent 2018
Red Team vs. Blue Team on AWS ~ re:Invent 2018
Teri Radichel
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
Chetan Khatri
 
What's New in ZF 1.10
What's New in ZF 1.10What's New in ZF 1.10
What's New in ZF 1.10
Ralph Schindler
 
Apache Beam in Production
Apache Beam in ProductionApache Beam in Production
Apache Beam in Production
Ferran Fernández Garrido
 
Predictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkPredictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySpark
Russell Jurney
 
Mysql python
Mysql pythonMysql python
Mysql python
Janu Jahnavi
 

Similar to Accelerating distributed joins in Apache Hive: Runtime filtering enhancements (20)

Accelerating distributed joins in Apache Hive: Runtime filtering enhancements
Accelerating distributed joins in Apache Hive: Runtime filtering enhancementsAccelerating distributed joins in Apache Hive: Runtime filtering enhancements
Accelerating distributed joins in Apache Hive: Runtime filtering enhancements
 
Salesforce meetup | Lightning Web Component
Salesforce meetup | Lightning Web ComponentSalesforce meetup | Lightning Web Component
Salesforce meetup | Lightning Web Component
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance Update
 
Hug meetup impala 2.5 performance overview
Hug meetup impala 2.5 performance overviewHug meetup impala 2.5 performance overview
Hug meetup impala 2.5 performance overview
 
ReadingSEO - Technical SEO at Scale
ReadingSEO - Technical SEO at ScaleReadingSEO - Technical SEO at Scale
ReadingSEO - Technical SEO at Scale
 
PHPUnit Episode iv.iii: Return of the tests
PHPUnit Episode iv.iii: Return of the testsPHPUnit Episode iv.iii: Return of the tests
PHPUnit Episode iv.iii: Return of the tests
 
SenchaCon 2016: Handle Real-World Data with Confidence - Fredric Berling
SenchaCon 2016: Handle Real-World Data with Confidence - Fredric Berling SenchaCon 2016: Handle Real-World Data with Confidence - Fredric Berling
SenchaCon 2016: Handle Real-World Data with Confidence - Fredric Berling
 
Wide Column Store NoSQL vs SQL Data Modeling
Wide Column Store NoSQL vs SQL Data ModelingWide Column Store NoSQL vs SQL Data Modeling
Wide Column Store NoSQL vs SQL Data Modeling
 
2019 WIA - Data-Driven Product Improvements
2019 WIA - Data-Driven Product Improvements2019 WIA - Data-Driven Product Improvements
2019 WIA - Data-Driven Product Improvements
 
Introducing Amazon Machine Learning
Introducing Amazon Machine LearningIntroducing Amazon Machine Learning
Introducing Amazon Machine Learning
 
How to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeHow to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on Snowflake
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
Hands on SPA development
Hands on SPA developmentHands on SPA development
Hands on SPA development
 
Data Transformation Patterns in AWS - AWS Online Tech Talks
Data Transformation Patterns in AWS - AWS Online Tech TalksData Transformation Patterns in AWS - AWS Online Tech Talks
Data Transformation Patterns in AWS - AWS Online Tech Talks
 
Red Team vs. Blue Team on AWS ~ re:Invent 2018
Red Team vs. Blue Team on AWS ~ re:Invent 2018Red Team vs. Blue Team on AWS ~ re:Invent 2018
Red Team vs. Blue Team on AWS ~ re:Invent 2018
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
What's New in ZF 1.10
What's New in ZF 1.10What's New in ZF 1.10
What's New in ZF 1.10
 
Apache Beam in Production
Apache Beam in ProductionApache Beam in Production
Apache Beam in Production
 
Predictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkPredictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySpark
 
Mysql python
Mysql pythonMysql python
Mysql python
 

More from Panagiotis Garefalakis

Neptune: Scheduling Suspendable Tasks for Unified Stream/Batch Applications
Neptune: Scheduling Suspendable Tasks for Unified Stream/Batch ApplicationsNeptune: Scheduling Suspendable Tasks for Unified Stream/Batch Applications
Neptune: Scheduling Suspendable Tasks for Unified Stream/Batch Applications
Panagiotis Garefalakis
 
Medea: Scheduling of Long Running Applications in Shared Production Clusters
Medea: Scheduling of Long Running Applications in Shared Production ClustersMedea: Scheduling of Long Running Applications in Shared Production Clusters
Medea: Scheduling of Long Running Applications in Shared Production Clusters
Panagiotis Garefalakis
 
Mres presentation
Mres presentationMres presentation
Mres presentation
Panagiotis Garefalakis
 
Dais 2013 2 6 june
Dais 2013 2 6 juneDais 2013 2 6 june
Dais 2013 2 6 june
Panagiotis Garefalakis
 
Master presentation-21-7-2014
Master presentation-21-7-2014Master presentation-21-7-2014
Master presentation-21-7-2014
Panagiotis Garefalakis
 
Pgaref Piccolo Building Fast, Distributed Programs with Partitioned Tables
Pgaref   Piccolo Building Fast, Distributed Programs with Partitioned TablesPgaref   Piccolo Building Fast, Distributed Programs with Partitioned Tables
Pgaref Piccolo Building Fast, Distributed Programs with Partitioned Tables
Panagiotis Garefalakis
 
Storage managment using nagios
Storage managment using nagiosStorage managment using nagios
Storage managment using nagios
Panagiotis Garefalakis
 
Ithings2012 20nov
Ithings2012 20novIthings2012 20nov
Ithings2012 20nov
Panagiotis Garefalakis
 

More from Panagiotis Garefalakis (8)

Neptune: Scheduling Suspendable Tasks for Unified Stream/Batch Applications
Neptune: Scheduling Suspendable Tasks for Unified Stream/Batch ApplicationsNeptune: Scheduling Suspendable Tasks for Unified Stream/Batch Applications
Neptune: Scheduling Suspendable Tasks for Unified Stream/Batch Applications
 
Medea: Scheduling of Long Running Applications in Shared Production Clusters
Medea: Scheduling of Long Running Applications in Shared Production ClustersMedea: Scheduling of Long Running Applications in Shared Production Clusters
Medea: Scheduling of Long Running Applications in Shared Production Clusters
 
Mres presentation
Mres presentationMres presentation
Mres presentation
 
Dais 2013 2 6 june
Dais 2013 2 6 juneDais 2013 2 6 june
Dais 2013 2 6 june
 
Master presentation-21-7-2014
Master presentation-21-7-2014Master presentation-21-7-2014
Master presentation-21-7-2014
 
Pgaref Piccolo Building Fast, Distributed Programs with Partitioned Tables
Pgaref   Piccolo Building Fast, Distributed Programs with Partitioned TablesPgaref   Piccolo Building Fast, Distributed Programs with Partitioned Tables
Pgaref Piccolo Building Fast, Distributed Programs with Partitioned Tables
 
Storage managment using nagios
Storage managment using nagiosStorage managment using nagios
Storage managment using nagios
 
Ithings2012 20nov
Ithings2012 20novIthings2012 20nov
Ithings2012 20nov
 

Recently uploaded

Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
Intelisync
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
flufftailshop
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
saastr
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 

Recently uploaded (20)

Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 

Accelerating distributed joins in Apache Hive: Runtime filtering enhancements

  • 1. Accelerating distributed joins in Apache Hive: Runtime filtering enhancements ApacheCon 2020 29th September 2020
  • 2. © 2020 Cloudera, Inc. All rights reserved. 2 About us Panagiotis (Panos) Garefalakis @pgaref Software Engineer @Cloudera, Hive runtime team Working on Apache Hive, ORC, Tez PhD in Distributed Systems, Imperial College London Stamatis Zampetakis Software Engineer @Cloudera, Hive query optimizer team Working on Apache Hive, Calcite PhD in Data Management, INRIA & Paris-Sud University
  • 3. © 2020 Cloudera, Inc. All rights reserved. 3 Apache Hive Hive 1.0 Hive 2.0 Hive 3.0 Hive 4.0 GB to PB and low-latency multi-tenant HiveQL MapReduce Read only ETL SQL Tez & Spark LLAP Vectorization BI Tools Mat. Views AntiJoin Semi-join Red. Lazy Mat. Ad-Hoc Batch SQL Analytics SQL Interactive SQL
  • 4. © 2020 Cloudera, Inc. All rights reserved. 4 Hive users
  • 5. © 2020 Cloudera, Inc. All rights reserved. 5 Hive with LLAP
  • 6. © 2020 Cloudera, Inc. All rights reserved. 6 LLAP daemon anatomy Query Coordinator Query Coordinator Query Coordinator
  • 7. © 2020 Cloudera, Inc. All rights reserved. 7 LLAP daemon anatomy Query Coordinator Query Coordinator Query Coordinator
  • 8. © 2020 Cloudera, Inc. All rights reserved. 8 Outline Data pruning • Optimizations • Benefit Semijoin reduction • Less data transfer • Hive semijoin reduction Lazy materialization • ORC file format • Row-level filtering • Hive Probe-decode
  • 9. © 2020 Cloudera, Inc. All rights reserved. 9 Data pruning HOW? WHY? Reduce memory footprint Reduce disk & network I/O Save CPU cycles Semijoin reduction Late materialization ... and many more
  • 10. © 2020 Cloudera, Inc. All rights reserved. 10 Data pruning Baseline SELECT sr_net_loss as net_loss FROM store_returns INNER JOIN reason ON sr_reason_sk = r_reason_sk WHERE r_reason_desc = 'Found a better price in a store' Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss
  • 11. © 2020 Cloudera, Inc. All rights reserved. 11 Data pruning Baseline SELECT sr_net_loss as net_loss FROM store_returns INNER JOIN reason ON sr_reason_sk = r_reason_sk WHERE r_reason_desc = 'Found a better price in a store' Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss
  • 12. © 2020 Cloudera, Inc. All rights reserved. 12 Data pruning Baseline SELECT sr_net_loss as net_loss FROM store_returns INNER JOIN reason ON sr_reason_sk = r_reason_sk WHERE r_reason_desc = 'Found a better price in a store' Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss
  • 13. © 2020 Cloudera, Inc. All rights reserved. 13 Data pruning Baseline SELECT sr_net_loss as net_loss FROM store_returns INNER JOIN reason ON sr_reason_sk = r_reason_sk WHERE r_reason_desc = 'Found a better price in a store' Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss
  • 14. © 2020 Cloudera, Inc. All rights reserved. 14 Data pruning Baseline SELECT sr_net_loss as net_loss FROM store_returns INNER JOIN reason ON sr_reason_sk = r_reason_sk WHERE r_reason_desc = 'Found a better price in a store' Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss
  • 15. © 2020 Cloudera, Inc. All rights reserved. 15 Data pruning Baseline SELECT sr_net_loss as net_loss FROM store_returns INNER JOIN reason ON sr_reason_sk = r_reason_sk WHERE r_reason_desc = 'Found a better price in a store' Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss
  • 16. © 2020 Cloudera, Inc. All rights reserved. 16 Data pruning Baseline SELECT sr_net_loss as net_loss FROM store_returns INNER JOIN reason ON sr_reason_sk = r_reason_sk WHERE r_reason_desc = 'Found a better price in a store' Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss
  • 17. © 2020 Cloudera, Inc. All rights reserved. 17 Data pruning Baseline SELECT sr_net_loss as net_loss FROM store_returns INNER JOIN reason ON sr_reason_sk = r_reason_sk WHERE r_reason_desc = 'Found a better price in a store' Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss
  • 18. © 2020 Cloudera, Inc. All rights reserved. 18 Data pruning Baseline SELECT sr_net_loss as net_loss FROM store_returns INNER JOIN reason ON sr_reason_sk = r_reason_sk WHERE r_reason_desc = 'Found a better price in a store' Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss
  • 19. © 2020 Cloudera, Inc. All rights reserved. 19 Data pruning Filter pushdown Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Project sr_net_loss Filter r_reason_desc=”Found...” BEFORE AFTER Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss
  • 20. © 2020 Cloudera, Inc. All rights reserved. 20 Data pruning Filter pushdown BEFORE AFTER
  • 21. © 2020 Cloudera, Inc. All rights reserved. 21 Data pruning Project pushdown BEFORE AFTER Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Project sr_net_loss Filter r_reason_desc=”Found...” Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss Project sr_reason_sk, sr_net_loss Project r_reason_sk
  • 22. © 2020 Cloudera, Inc. All rights reserved. 22 Data pruning Project pushdown BEFORE AFTER
  • 23. © 2020 Cloudera, Inc. All rights reserved. 23 Data pruning Semijoin reduction BEFORE AFTER Scan store_returns InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss Project sr_reason_sk, sr_net_loss Project r_reason_sk Scan store_returns SemiJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” Project sr_net_loss Project sr_reason_sk, sr_net_loss Project r_reason_sk
  • 24. © 2020 Cloudera, Inc. All rights reserved. 24 Data pruning Semijoin reduction BEFORE AFTER
  • 25. © 2020 Cloudera, Inc. All rights reserved. 25 Data pruning Late materialization BEFORE AFTER CE CE CE CE COMPRESSED ENCODED (CE)
  • 26. © 2020 Cloudera, Inc. All rights reserved. 26 Data pruning Benefit BASELINE FILTER PUSHDOWN PROJECT PUSHDOWN SEMIJOIN REDUCTION & LATE MATERIALIZATION CE CE CE CE
  • 27. © 2020 Cloudera, Inc. All rights reserved. 27 Data pruning Benefit BASELINE FILTER PUSHDOWN PROJECT PUSHDOWN SEMIJOIN REDUCTION & LATE MATERIALIZATION CE CE CE CE
  • 29. © 2020 Cloudera, Inc. All rights reserved. 29 Common names Semijoin reduction Invisible join Selective join pushdown Sideways information passing
  • 30. © 2020 Cloudera, Inc. All rights reserved. 30 Benefit Operators between the semijoin & and the reduced join treat less data: ➔ Reduce disk & network I/O ➔ Save CPU cycles ➔ Decrease memory footprint
  • 31. © 2020 Cloudera, Inc. All rights reserved. 31 TPC-DS Query Example Find the ten items that were returned due to a more competitive price SELECT ss_item_sk, sum(sr_net_loss) as net_loss, avg(ss_sales_price) as price FROM store_sales INNER JOIN store_returns ON ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number INNER JOIN reason ON sr_reason_sk = r_reason_sk WHERE r_reason_desc = 'Found a better price in a store' GROUP BY ss_item_sk ORDER BY net_loss DESC NULLS LAST, price DESC NULLS LAST LIMIT 10;
  • 32. © 2020 Cloudera, Inc. All rights reserved. 32 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Limit 10 Sort net_loss DESC, price DESC Scan store_sales Scan store_returns Aggregate group=ss_item_sk, agg0=avg(ss_sales_price), agg1=(sr_net_loss) InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...”
  • 33. © 2020 Cloudera, Inc. All rights reserved. 33 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Limit 10 Scan store_sales Scan store_returns Aggregate group=ss_item_sk, agg0=avg(ss_sales_price), agg1=(sr_net_loss) InnerJoin sr_reason_sk = r_reason_sk Scan reason Filter r_reason_desc=”Found...” A Sort net_loss DESC, price DESC
  • 34. © 2020 Cloudera, Inc. All rights reserved. 34 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Limit 10 Scan store_sales Aggregate group=ss_item_sk, agg0=avg(ss_sales_price), agg1=(sr_net_loss) A Sort net_loss DESC, price DESC
  • 35. © 2020 Cloudera, Inc. All rights reserved. 35 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Limit 10 Scan store_sales Aggregate group=ss_item_sk, agg0=avg(ss_sales_price), agg1=(sr_net_loss) A B Sort net_loss DESC, price DESC
  • 36. © 2020 Cloudera, Inc. All rights reserved. 36 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Scan store_sales A B
  • 37. © 2020 Cloudera, Inc. All rights reserved. 37 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Scan store_sales A B 2880404|46MB 8000|128KB 8000|160K
  • 38. © 2020 Cloudera, Inc. All rights reserved. 38 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Scan store_sales A B N1 N2 N3 Node
  • 39. © 2020 Cloudera, Inc. All rights reserved. 39 N3 N2N1 46MB 128KB 160KB Node
  • 40. © 2020 Cloudera, Inc. All rights reserved. 40 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Scan store_sales A B
  • 41. © 2020 Cloudera, Inc. All rights reserved. 41 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Scan store_sales A B SemiJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number
  • 42. © 2020 Cloudera, Inc. All rights reserved. 42 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Scan store_sales A B SemiJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number 2880404|46MB 8000|128KB 8000|160K 8000|128K
  • 43. © 2020 Cloudera, Inc. All rights reserved. 43 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Scan store_sales A B SemiJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number N3 N2N1 N1 Node
  • 44. © 2020 Cloudera, Inc. All rights reserved. 44 N3 N2N1 128KB 128KB 160KB 64KB Node
  • 45. © 2020 Cloudera, Inc. All rights reserved. 45 N3 N2N1 128KB 128KB 160KB 64KB N3 N2N1 46MB 128KB 160KB With semijoin Total I/O=480KB Without semijoin Total I/O=46MB
  • 46. © 2020 Cloudera, Inc. All rights reserved. 46 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Scan store_sales A B SemiJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Semijoin reduction in Hive
  • 47. © 2020 Cloudera, Inc. All rights reserved. 47 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Scan store_sales A B Filterss_item_sk BETWEEN :min_sr_item_sk AND :max_sr_item ss_ticket_number BETWEEN :min_sr_ticket_number AND :max_sr_ticket_number IN_BLOOM_FILTER(HASH(ss_item_sk, ss_ticket_number), :bloom_sr) Aggregatemin(sr_item_sk) max(sr_item_sk) min(sr_ticket_number) max(sr_ticket_number) bloom_filter(hash(sr_item_sk,sr_ticket_number)) Semijoin reduction in Hive
  • 48. © 2020 Cloudera, Inc. All rights reserved. 48 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Scan store_sales A B Filterss_item_sk BETWEEN :min_sr_item_sk AND :max_sr_item ss_ticket_number BETWEEN :min_sr_ticket_number AND :max_sr_ticket_number IN_BLOOM_FILTER(HASH(ss_item_sk, ss_ticket_number), :bloom_sr) Aggregate MIN(sr_item_sk) MAX(sr_item_sk) MIN(sr_ticket_number) MAX(sr_ticket_number) BLOOM_FILTER(HASH(sr_item_sk,sr_ticket_number)) Semijoin reduction in Hive
  • 49. © 2020 Cloudera, Inc. All rights reserved. 49 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Scan store_sales A B Filterss_item_sk BETWEEN :min_sr_item_sk AND :max_sr_item ss_ticket_number BETWEEN :min_sr_ticket_number AND :max_sr_ticket_number IN_BLOOM_FILTER(HASH(ss_item_sk, ss_ticket_number), :bloom_sr) Aggregatemin(sr_item_sk) max(sr_item_sk) min(sr_ticket_number) max(sr_ticket_number) bloom_filter(hash(sr_item_sk,sr_ticket_number)) Semijoin reduction in Hive
  • 50. © 2020 Cloudera, Inc. All rights reserved. 50 InnerJoin ss_item_sk=sr_item_sk AND ss_ticket_number=sr_ticket_number Scan store_sales A B Filter ss_item_sk BETWEEN :min_sr_item_sk AND :max_sr_item ss_ticket_number BETWEEN :min_sr_ticket_number AND :max_sr_ticket_number IN_BLOOM_FILTER(HASH(ss_item_sk, ss_ticket_number), :bloom_sr) Aggregatemin(sr_item_sk) max(sr_item_sk) min(sr_ticket_number) max(sr_ticket_number) bloom_filter(hash(sr_item_sk,sr_ticket_number)) Semijoin reduction in Hive
  • 51. © 2020 Cloudera, Inc. All rights reserved. 51 N1 64KB N3 N2N1 46MB 128KB 160KB With semijoin Total I/O=480KB Without semijoin Total I/O=46MB N3 N2 128KB 128KB 160KB N1 9KB With bloom semijoin Total I/O=425KB N3 N2 128KB 128KB 160KB
  • 52. © 2020 Cloudera, Inc. All rights reserved. 52 Semijoin reduction in Hive Four phase overview Create synthetic predicate (SyntheticJoinPredicate) Expand to semijoin reducers (DynamicPartitionPruningOptimization) Merge to multi column semijoin reducers (SemiJoinReductionMerge) • HIVE-21196 • HIVE-23976 Remove conditionally semijoin reducers (SemiJoinRemoval & TezCompiler)
  • 54. © 2020 Cloudera, Inc. All rights reserved. 54 Common names Row-level filtering Lazy decoding Probe-decode
  • 55. © 2020 Cloudera, Inc. All rights reserved. 55 Benefit Avoid materializing data, not needed for the evaluation of the remaining of a query ➔ Save CPU cycles ➔ Reduce memory allocations ➔ Eliminate data and filter operators from the pipeline
  • 56. © 2020 Cloudera, Inc. All rights reserved. 56 ORC row group selection • Data organized in stripes of row groups (row x 10K) • A single matched key of a row group leads to Ks of decoded rows • Even for low selectivity queries, most of the row-groups might be read and decoded
  • 57. © 2020 Cloudera, Inc. All rights reserved. 57 ORC row group decoding
  • 58. © 2020 Cloudera, Inc. All rights reserved. 58 ORC file structure The file footer contains: • Metadata • Stripe information • Postscript with compression details ORC file data divided into stripes: • Self contained sets of rows (organized by columns) • Smallest unit of work for tasks • ~64MB by default, often larger/smaller (16MB for Hive ACID deltas)
  • 59. © 2020 Cloudera, Inc. All rights reserved. 59 ORC file structure Stripe footer: • List of col streams and locations • Column encoding Index data: • Offset to seek to right row group • Stats (min, max, sum) per column • Orc.row.index.stride 10,000 Row data: • Index stream • Metadata stream • Data stream StripeNStripe1
  • 60. © 2020 Cloudera, Inc. All rights reserved. 60 ORC file structure Stripe footer: • List of col streams and locations • Column encoding Index data: • Offset to seek to right row group • Stats (min, max, sum) per column • Orc.row.index.stride 10,000 Row data: • Index stream • Metadata stream • Data stream • StripeNStripe1
  • 61. © 2020 Cloudera, Inc. All rights reserved. 61 ORC file structure Stripe footer: • List of col streams and locations • Column encoding Index data: • Offset to seek to right row group • Stats (min, max, sum) per column • Orc.row.index.stride 10,000 Row data: • Index stream • Metadata stream • Data stream StripeNStripe1
  • 62. © 2020 Cloudera, Inc. All rights reserved. 62 Hive can pass Search Arguments (SArgs) filter to the Reader: • Limited set of operations (=, <=>, <, <=, in, between, is null) • Compare one column to literal(s) e.g., “age > 100” Limitations: • Can’t support filters with: inequalities, UDFs, or columns with relationships • Can only eliminate entire row groups, stripes, or files SELECT x,y,z FROM foo WHERE x <> 1; ORC search arguments
  • 63. © 2020 Cloudera, Inc. All rights reserved. 63 SArgs filter RGs ONLY if they can guarantee that none of the rows can pass the filter. ORC row-level filteringRowgroup1 Row data Stripe SELECT x,y,z FROM foo WHERE x <> 1;
  • 64. © 2020 Cloudera, Inc. All rights reserved. 64 SArgs filter RGs ONLY if they can guarantee that none of the rows can pass the filter. [ORC-577] Finer grained row-level filtering! Read a subset of the columns and apply more detailed row level filter. ORC row-level filteringRowgroup1 Row data Stripe Early Expand 1 SELECT x,y,z FROM foo WHERE x <> 1;
  • 65. © 2020 Cloudera, Inc. All rights reserved. 65 SArgs filter RGs ONLY if they can guarantee that none of the rows can pass the filter. [ORC-577] Finer grained row-level filtering! Read a subset of the columns and apply more detailed row level filter. ORC row-level filteringRowgroup1 Row data Stripe Early Expand 1 2 Apply Filter Selected rows SELECT x,y,z FROM foo WHERE x <> 1;
  • 66. © 2020 Cloudera, Inc. All rights reserved. 66 SArgs filter RGs ONLY if they can guarantee that none of the rows can pass the filter. [ORC-577] Finer grained row-level filtering! Read a subset of the columns and apply more detailed row level filter. ORC row-level filteringRowgroup1 Row data Stripe Early Expand 1 2 Apply Filter Selected rows Read Remaining 3 SELECT x,y,z FROM foo WHERE x <> 1;
  • 67. © 2020 Cloudera, Inc. All rights reserved. 67 ORC row-level filtering evaluation [ORC-597] benchmark ORC Reader for complex types such as Decimals, Doubles and real datasets (Taxi, Sales, GitHub)
  • 68. © 2020 Cloudera, Inc. All rights reserved. 68 [HIVE-22959] introduce FilterContext as part of VectorRowBatch [ORC-577] extend ORC Reader with a new option: setRowFilter(String[] columnNames,Consumer<VectorizedRowBatch> filter) Consumer callback to implement any kind of filtering logic New ORC reader logic: • Early expand the given column(s) • Evaluate the filter callback (implemented on the consumer side) • Decode only the selected rows from the remaining columns ORC row-level filtering implementation
  • 69. © 2020 Cloudera, Inc. All rights reserved. 69 Probe-decode optimization Leveraging ORC row-level filtering SELECT * FROM catalog_sales INNER JOIN date_dim ON cs_sold_date_sk = d_date_sk WHERE d_year = 2000 and d_qoy = 2 LIMIT 10 Scan date_dim Filter dyear=2000 AND dqoy=2 Scan catalog_sales InnerJoin cs_sold_date_sk=d_date_sk Limit 10 Decode d_date_sk, dyear, dqoy 2_220_880_404 1_000_000 8_000 1_000_000 8_000
  • 70. © 2020 Cloudera, Inc. All rights reserved. 70 Probe-decode optimization Static row-level filtering SELECT * FROM catalog_sales INNER JOIN date_dim ON cs_sold_date_sk = d_date_sk WHERE d_year = 2000 and d_qoy = 2 LIMIT 10 Scan date_dim Filter dyear=2000 AND dqoy=2 Scan catalog_sales InnerJoin cs_sold_date_sk=d_date_sk Limit 10 ProbeDecode d_date_sk 2_220_880_404 1_000_000 8_000 8_000 8_000
  • 71. © 2020 Cloudera, Inc. All rights reserved. 71 Probe-decode optimization Dynamic MJ row-level filtering SELECT * FROM catalog_sales INNER JOIN date_dim ON cs_sold_date_sk = d_date_sk WHERE d_year = 2000 and d_qoy = 2 LIMIT 10 Scan date_dim Filter dyear=2000 AND dqoy=2 Scan catalog_sales InnerJoin cs_sold_date_sk=d_date_sk Limit 10 SemiJoin cs_sold_date_sk=d_date_sk ProbeDecode cs_item_sk 420_504 2_220_880_404 1_000_000 8_000 8_000 8_000ProbeDecode d_date_sk
  • 72. © 2020 Cloudera, Inc. All rights reserved. 72 Probe-decode plan Map 1 <- Map 4 (BROADCAST_EDGE), Map 5 (BROADCAST_EDGE) Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (SIMPLE_EDGE) Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales probeDecodeDetails: cacheKey:HASH_MAP_MAPJOIN_48_container, bigKeyColName:ss_item_sk, smallTablePos:1, keyRatio:1.6 Statistics: Num rows: 82510879939 Data size: 10343396725952 Basic stats: COMPLETE Column stats: COMPLETE TableScan Vectorization: native: true Select Operator expressions: ss_item_sk (type: bigint), ss_ext_sales_price (type: decimal(7,2)), ss_sold_date_sk (type: bigint) outputColumnNames: _col0, _col1, _col2 Select Vectorization: className: VectorSelectOperator native: true projectedOutputColumnNums: [1, 14, 22] Statistics: Num rows: 82510879939 Data size: 10343396725952 Basic stats: COMPLETE Column stats: COMPLETE Map Join Operator condition map: Inner Join 0 to 1 keys:" 0 _col2 (type: bigint) 1 _col0 (type: bigint) Map Join Vectorization: bigTableKeyColumns: 22:bigint className: VectorMapJoinInnerBigOnlyLongOperator input vertices: 1 Map 4 SET hive.optimize.scan.probedecode=true
  • 73. © 2020 Cloudera, Inc. All rights reserved. 73 ⬢ – – – – – DATA PLANE SERVICES EDWS Data Analytics Studio (DAS) Management Services COMPUTE CLUSTER SHARED SERVICES Ranger Atlas Metastore Tiller API Server DAS Web Service Query Coordinators Query Executors Registry Blobstore Indexer RDBMS Hive Server Long-running kubernetes cluster Inter-cluster communication Intra-cluster communication Ingress Controller or Load Balancer Internal Service Endpoint for ReplicaSet or StatefulSet Ephemeral kubernetes cluster
  • 74. © 2020 Cloudera, Inc. All rights reserved. 74 Data pruning • Prune early → increase query efficiency Semijoin reduction • Reduce disk & network I/O Lazy materialization • Save CPU cycles and memory allocations Conclusions
  • 75.
  • 76. © 2020 Cloudera, Inc. All rights reserved. 76 References 1. Stocker, Konrad, et al. "Integrating semi-join-reducers into state-of-the-art query processors." ICDE 2001 2. Abadi, Daniel J., Samuel R. Madden, and Nabil Hachem. "Column-stores vs. row-stores: how different are they really?." SIGMOD 2008 3. Lang, Harald, et al. "Performance-optimal filtering: Bloom overtakes cuckoo at high throughput." VLDB 2019