Impala 2.5 Performance Deep Dive

Impala 2.5 Performance Update
Tuesday March 1, 2016 meetup
http://www.meetup.com/Bay-Area-Impala-Users-Group/events/227944275/

© Cloudera, Inc. All rights reserved.
Apache Impala 2.5 (Incubating)
Performance improvements overview

Agenda
1. Brief overview of Impala quality and reliability improvements
2. Impala 2.5 vs. 2.3 performance improvements
3. Performance deep dive
   1. Improved cardinality estimates and join order
   2. Query startup improvements
   3. Runtime filters
   4. Additional codegen and code optimizations
   5. Decimal arithmetic improvements
   6. Pass-through aggregation
   7. Fast min/max values on partition columns
   8. Scheduling improvements
   9. Admission control

Improved Impala Stability and Reliability
Better quality, enable optimization
● Query Generator
● Stress
● Scale testing
● Fault injection testing
● Longevity testing
● Performance regression
● … More on the way

Query Generator and Stress
● Find correctness and crash bugs
○ Not found by other tests
● On GitHub
○ https://github.com/cloudera/Impala/tree/cdh5-trunk/tests/comparison
○ https://github.com/cloudera/Impala/blob/cdh5-trunk/tests/stress/concurrent_select.py
● Fixing them

Query Generator
● Built in-house, open-source Python (Taras Bobrovytsky)
● Random data generator as well
● Adapts to any schema
● Generates random queries based on a query model
○ Tunable complexity
○ Extensible
● Uses postgres for expected values
○ Query translation for syntax
○ Table flattening for nested types
● Fast, small: Uses Docker-ized cluster
● Adapted for use with Hive

Apache Impala: Open Source & Open Standard
1. > 1 MM downloads since GA
2. Majority adoption across Cloudera customers
3. Certification across key application partners: [partner logos]
4. De facto standard with multi-vendor support: [vendor logos] and others

SQL-on-Hadoop engines
[Figure: SQL-on-Hadoop engines; recoverable labels: SQL, Impala]

Impala 2.3 Vs Spark-SQL 1.5 & Hive on Tez (Upstream)
Full Details
http://tinyurl.com/gotkdlq

New in Impala 2.5
Performance and Scalability
• Better join ordering and cardinality estimation
• Query start-up improvements
• Runtime filters
• Additional codegen and code optimizations
• Decimal arithmetic improvements
• Fast min/max values on partition columns (with query option)
• Incremental metadata updates (DDL)
Integrations
• Support for EMC DSSD
Usability Enhancements
• Admission Control Improvements
• Null-safe join/equals

New in Impala 2.5: covered today
Performance and Scalability
• Runtime filters
• Improved Cardinality Estimates and Join Order
• Additional codegen and code optimizations
• Decimal arithmetic improvements
• Faster Query Startup
• Fast min/max values on partition columns (with query option)
Usability Enhancements
• Admission Control Improvements

How does Impala 2.5 fare against 2.3?
• 363% speedup for TPC-DS
• 92% speedup for TPC-H
• 71% speedup for TPC-H (Nested)

Improved Cardinality Estimates and Join Order
1. More robust scan cardinality estimation
• Mitigate correlated predicates (exponential backoff)
2. Improved join cardinality estimation
• Special treatment of common case of PK/FK joins
• Detect selective joins by applying the selectivity of build-side
predicates to the estimated join cardinality
3. More robust join strategy selection (broadcast vs. shuffle)
• Account for data serialization overhead vs. raw data

TPCH-Q8 on Impala 2.3
… #Rows #Est Rows
14:HASH JOIN | 728.85K | 1.94B |
|--23:EXCHANGE | 15 | 1 |
| 07:SCAN HDFS | 1 | 1 |
13:HASH JOIN | 3.65M | 1.94B |
|--22:EXCHANGE | 375 | 25 |
| 05:SCAN HDFS | 25 | 25 |
12:HASH JOIN | 3.65M | 1.94B |
|--21:EXCHANGE | 675.00M | 45.00M |
| 04:SCAN HDFS | 45.00M | 45.00M |
11:HASH JOIN | 3.65M | 1.81B |
|--20:EXCHANGE | 375 | 25 |
| 06:SCAN HDFS | 25 | 25 |
10:HASH JOIN | 3.65M | 1.81B |
|--19:EXCHANGE | 45.00M | 3.00M |
| 01:SCAN HDFS | 3.00M | 3.00M |
09:HASH JOIN | 3.65M | 1.80B |
|--18:EXCHANGE | 2.05B | 4.50M |
| 03:SCAN HDFS | 136.72M | 4.50M |
08:HASH JOIN | 12.02M | 1.80B |
|--17:EXCHANGE | 6.01M | 397.35K |
| 00:SCAN HDFS | 400.62K | 397.35K |
02:SCAN HDFS | 1.80B | 1.80B |
Run time: 91s

TPCH-Q8 on Impala 2.5
… #Rows #Est Rows
14:HASH JOIN | 728.85K | 238.41K |
|--26:EXCHANGE | 375 | 25 |
| 06:SCAN HDFS | 25 | 25 |
13:HASH JOIN | 728.85K | 238.41K |
|--25:EXCHANGE | 15 | 1 |
| 07:SCAN HDFS | 1 | 1 |
12:HASH JOIN | 3.65M | 1.19M |
|--24:EXCHANGE | 375 | 25 |
| 05:SCAN HDFS | 25 | 25 |
11:HASH JOIN | 3.65M | 1.19M |
|--23:EXCHANGE | 45.00M | 45.00M |
| 04:SCAN HDFS | 45.00M | 45.00M |
22:EXCHANGE | 3.65M | 1.19M |
10:HASH JOIN | 3.65M | 1.19M |
|--21:EXCHANGE | 3.00M | 3.00M |
| 01:SCAN HDFS | 3.00M | 3.00M |
20:EXCHANGE | 3.65M | 1.19M |
09:HASH JOIN | 3.65M | 1.19M |
|--19:EXCHANGE | 136.72M | 45.00M |
| 03:SCAN HDFS | 136.72M | 45.00M |
18:EXCHANGE | 12.02M | 11.92M |
08:HASH JOIN | 12.02M | 11.92M |
|--17:EXCHANGE | 6.01M | 397.35K |
| 00:SCAN HDFS | 400.62K | 397.35K |
02:SCAN HDFS | 12.40M | 1.80B |
Run time: 11s (>8x speedup)

•Why better cardinality estimation matters
•TPC-DS Q14
• 225 joins
• 285 scan nodes

Query start-up improvements (IMPALA-1599)
• At start-up, the coordinator has to tell every Impala daemon to run a set of fragments.
• Since data flows up from the leaves, receivers must be ready for the data produced by senders.
• Before Impala 2.5, we did the obvious thing: start the receivers first!
• But that serializes query start-up...

TPCDS-Q59: each colour is a different plan fragment.
[Figure sequence: TPCDS-Q59 plan fragments starting in waves 1 through 5]
Wave 5: only now can the tree with the light-green scan start to make progress!

Query start-up: what we did
• Instead of starting fragments wave-by-wave, start them all at once in any order.
• This required changing a lot of the sender / receiver logic so that senders can wait for receivers to arrive. Lots of tricky error conditions to deal with!
• Also moved all heavy lifting out of the synchronous RPC and into the fragment executor, making fragment start-up asynchronous.
• Reduced plan fragment size for partitioned tables
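To make the cost of wave-by-wave start-up concrete, here is a rough sketch under simplifying assumptions: every wave costs one fixed RPC round trip, and the fragment tree below is a hypothetical five-level plan (the names F0..F5 are made up), not the real TPCDS-Q59 plan.

```python
def count_waves(children, root):
    """Number of serialized start-up waves when every receiver must be
    running before its senders start: the depth of the fragment tree."""
    waves = 0
    frontier = [root]
    while frontier:
        waves += 1
        frontier = [c for f in frontier for c in children.get(f, [])]
    return waves


def startup_latency_ms(children, root, rpc_ms, all_at_once):
    """Modelled start-up latency: one RPC round trip per wave, or a
    single round trip when all fragments are started together."""
    waves = 1 if all_at_once else count_waves(children, root)
    return waves * rpc_ms


# Hypothetical plan: receiver -> list of sender fragments.
toy_plan = {"F0": ["F1", "F2"], "F1": ["F3"], "F3": ["F4"], "F4": ["F5"]}

serialized = startup_latency_ms(toy_plan, "F0", rpc_ms=10, all_at_once=False)
parallel = startup_latency_ms(toy_plan, "F0", rpc_ms=10, all_at_once=True)
```

In this model the serialized start-up grows with plan depth, while the all-at-once start-up stays at one round trip regardless of depth.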

Query start-up: performance impact

Query start-up: future improvements
• Work still to do:
• Batch the fragment start RPCs (rather than 10s of RPCs per backend)
• Batching will amortize cost of sending identical data structures.
• Also should trim these data structures down.

Runtime filtering
• General idea: some predicates can only be computed at runtime
• Example: SELECT count(*) FROM probe P JOIN build B on P.id = B.id;
• How does Impala execute this query?

SELECT count(*)
FROM probe P
JOIN build B
on P.id = B.id
Note: 29200 rows
scanned, but only 40
match.

Runtime filters: the opportunity
The planner doesn’t know what the set of B.id contains, even with statistics.
But there’s clearly an opportunity to save some work: why bother sending 29160 of those rows to the hash join node?
Runtime filters compute this predicate at runtime.

Step 1: planner tells 02:HASH JOIN to produce a filter containing all IDs from the build side.

Step 2: 02:HASH JOIN reads all 10 rows from the build side (right input), and computes a filter containing all distinct values of ID.

Step 3: 02:HASH JOIN sends the filter to 00:SCAN HDFS before the scan starts. The scan eliminates all rows that don’t match in the filter.

The result: only 40 rows are produced by the scan, reducing the amount of work the hash join has to do by > 99%!
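The three steps can be mimicked in a few lines. This is an illustrative sketch only: a Python set stands in for whatever compact filter structure the engine actually ships to the scan (for example a Bloom filter), and the table contents are synthetic, chosen to reproduce the 29200-row probe side, 10-row build side and 40 matching rows from the slides.

```python
def build_runtime_filter(build_rows, key):
    """Step 2: collect the distinct join-key values on the build side."""
    return {row[key] for row in build_rows}


def scan(probe_rows, key, runtime_filter=None):
    """Step 3: yield scanned rows, dropping those the filter rules out."""
    for row in probe_rows:
        if runtime_filter is None or row[key] in runtime_filter:
            yield row


def join_count(probe_rows, build_rows, key):
    """count(*) of an inner equi-join using a hash table on the build side."""
    build_counts = {}
    for row in build_rows:
        build_counts[row[key]] = build_counts.get(row[key], 0) + 1
    return sum(build_counts.get(row[key], 0) for row in probe_rows)


# Synthetic data: 10 build rows, 29200 probe rows, 40 of which match.
build = [{"id": i} for i in range(10)]
probe = [{"id": i % 7300} for i in range(29200)]

rf = build_runtime_filter(build, "id")
unfiltered = list(scan(probe, "id"))
filtered = list(scan(probe, "id", rf))
```

The join result is identical either way; the filter only changes how many rows reach the join (40 instead of 29200 in this synthetic case).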

Runtime filters: real-world results
Runtime filters can be highly effective. Some benchmark queries are more than
20 times faster in Impala 2.5.0.
As always, depends on your queries, your schemas and your cluster
environment.
By default, runtime filters are enabled in limited ‘local’ mode in Impala 2.5.0.
They can be enabled fully by setting RUNTIME_FILTER_MODE=GLOBAL.

LLVM Codegen Support in Impala
Operations:
• Hash join
• Aggregation
• Scans: Text, Sequence, Avro
• Expressions in all operators
• Sort
• Top-N
Data Types:
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• BOOLEAN
• STRING, VARCHAR
• DECIMAL
Legend (colour coding lost in extraction): New in Impala 2.5, Extended in Impala 2.5

Codegen for Order by & Top-N
l_orderkey   l_extendedprice   l_shipdate   l_shipmode
26039617     13515             12/8/1997    FOB
30525093     16218             12/16/1997   REG AIR
28809990     7208              10/19/1997   AIR

select
  l_extendedprice, l_orderkey,..
from
  lineitem
order by l_orderkey
limit 100

• SQL language offers great flexibility: the same query can just as well order by l_extendedprice
• Flexibility requires generic code, which is often inefficient
• Sorting relies on Compare; after sorting by l_orderkey the rows are:

26039617     13515             12/8/1997    FOB
28809990     7208              10/19/1997   AIR
30525093     16218             12/16/1997   REG AIR

int Compare(TupleRow* lhs, TupleRow* rhs) const {
  for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
    void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
    void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
    // Two NULLs compare equal on this column; move on to the next one.
    if (lhs_value == NULL && rhs_value == NULL) continue;
    if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
    if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
    int result = RawValue::Compare(lhs_value, rhs_value,
                                   sort_cols_lhs_[i]->root()->type());
    if (!is_asc_[i]) result = -result;
    if (result != 0) return result;
    // Otherwise, try the next Expr
  }
  return 0; // fully equivalent key
}

void* ExprContext::GetValue(Expr* e, TupleRow* row) {
  switch (e->type_.type) {
    case TYPE_BOOLEAN: {
      ..
      ..
    }
    case TYPE_TINYINT: {
      ..
      ..
    }
    case TYPE_INT: {
      ..
      ..
    }
  }
}

int RawValue::Compare(const void* v1, const void* v2,
                      const ColumnType& type) {
  switch (type.type) {
    case TYPE_INT: {
      int32_t i1 = *reinterpret_cast<const int32_t*>(v1);
      int32_t i2 = *reinterpret_cast<const int32_t*>(v2);
      return i1 > i2 ? 1 : (i1 < i2 ? -1 : 0);
    }
    case TYPE_BIGINT: {
      int64_t b1 = *reinterpret_cast<const int64_t*>(v1);
      int64_t b2 = *reinterpret_cast<const int64_t*>(v2);
      return b1 > b2 ? 1 : (b1 < b2 ? -1 : 0);
    }
    default:
      return 0; // placeholder: the remaining type cases are not shown
  }
}

Original code vs. codegen code

Codegen code:
int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
  // After unrolling, i is a constant for each sort column (i = 0, i = 1, ...).
  int64_t lhs_value = sort_columns[i]->GetBigIntVal(lhs);
  int64_t rhs_value = sort_columns[i]->GetBigIntVal(rhs);
  int result = lhs_value > rhs_value ? 1 :
               (lhs_value < rhs_value ? -1 : 0);
  if (result != 0) return result;
  // ... same straight-line code for the next sort column ...
  return 0;
}

Compared with the original code, the codegen code:
• Perfectly unrolls “for each grouping column” loop
• No switching on input type(s)
• Removes branching on ASCENDING/DESCENDING, NULLS FIRST/LAST

Result:
• 10x more efficient code
• 77% speedup in query due to Amdahl’s law
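The idea behind the specialized comparator can be imitated without LLVM. The sketch below is only an analogy: Impala emits LLVM IR at query start, whereas here a Python closure is built once per query so that the per-row path has no loop over sort columns and no ASC/DESC or NULL branching. It assumes a single ascending, non-nullable integer sort column; all function names are hypothetical.

```python
from functools import cmp_to_key


def generic_compare(lhs, rhs, sort_cols, is_asc, nulls_first):
    """Interpreted comparator: re-inspects the sort spec for every row pair."""
    for i, col in enumerate(sort_cols):
        left, right = lhs[col], rhs[col]
        if left is None and right is None:
            continue
        if left is None:
            return -1 if nulls_first[i] else 1
        if right is None:
            return 1 if nulls_first[i] else -1
        result = (left > right) - (left < right)
        if not is_asc[i]:
            result = -result
        if result != 0:
            return result
    return 0


def specialize_compare(col):
    """Build a comparator for ONE ascending, non-nullable integer column.
    All decisions about the sort spec are made here, once per query."""
    def compare(lhs, rhs):
        left, right = lhs[col], rhs[col]
        return (left > right) - (left < right)
    return compare


# (l_orderkey, l_extendedprice) pairs taken from the example rows.
rows = [(26039617, 13515), (30525093, 16218), (28809990, 7208)]
by_generic = sorted(
    rows,
    key=cmp_to_key(lambda a, b: generic_compare(a, b, [0], [True], [True])),
)
by_special = sorted(rows, key=cmp_to_key(specialize_compare(0)))
```

Both comparators produce the same order; the specialized one simply does less work per call, which is the effect the bullets above describe.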

Decimal arithmetic and aggregation
Float/Double vs. Decimal?
Pros for Float/Double
• Uses less memory.
• Faster because floating point math operations are natively supported by processors and better supported in codegen.
  (Note: Decimal uses fixed-point hardware types - int64 and __int128)
• Can represent a larger range of numbers.
Cons for Float/Double
• Precision errors compound during aggregations
• Can’t do math with wide number of significant digits (123456789.1 * .0000987654321)
No go for applications requiring high precision & accuracy
What about performance penalty?
SELECT l_returnflag,
       l_linestatus,
       Sum(l_quantity) AS SUM_QTY,
       Sum(l_extendedprice) AS SUM_BASE_PRICE,
       Sum(l_extendedprice * ( 1 - l_discount )) AS SUM_DISC_PRICE
FROM lineitem
GROUP BY l_returnflag,
         l_linestatus
ORDER BY l_returnflag,
         l_linestatus

3x speedup
● Simplified overflow check for decimal.
● Extended Codegen framework to support aggregations involving decimal.
● Bridged the performance gap between double and decimal
Distributed Aggregations in Impala
select cust_id, sum(dollars)
from sales group by cust_id;
[Diagram: on each of three nodes, Scan feeds Preagg; results cross the network to Merge]
• Impala aggregations have two phases:
  • Pre-aggregation phase
  • Merge phase
• The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value.
  • E.g. many sales per customer.
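A toy version of the two phases for `select cust_id, sum(dollars) from sales group by cust_id`, as a sketch only: each inner list plays the role of one node's scan output, and the data values are made up. It shows that pre-aggregating locally shrinks what has to cross the network when each customer has many sales.

```python
from collections import defaultdict


def pre_aggregate(rows):
    """Phase 1, per node: partial sum(dollars) for each cust_id."""
    partial = defaultdict(int)
    for cust_id, dollars in rows:
        partial[cust_id] += dollars
    return dict(partial)


def merge(partials):
    """Phase 2: combine the partial sums received from every node."""
    total = defaultdict(int)
    for partial in partials:
        for cust_id, dollars in partial.items():
            total[cust_id] += dollars
    return dict(total)


# Made-up scan output on three nodes: (cust_id, dollars).
node_scans = [
    [(1, 10), (1, 20), (2, 5), (1, 30)],
    [(2, 15), (2, 25), (1, 5)],
    [(3, 100), (3, 1), (3, 2), (3, 3)],
]

partials = [pre_aggregate(rows) for rows in node_scans]
rows_scanned = sum(len(rows) for rows in node_scans)
rows_sent = sum(len(partial) for partial in partials)
result = merge(partials)
```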

Downsides of Pre-aggregations
select distinct * from sales;
[Diagram: on each of three nodes, Scan feeds Preagg; results cross the network to Merge]
• Pre-aggregations consume:
  • Memory
  • CPU cycles
• Pre-aggregations are not always effective at reducing network traffic
  • E.g. select distinct for nearly-distinct rows
• Pre-aggregations can spill to disk under memory pressure
  • Disk I/O is bad - better to send to merge agg rather than disk
Streaming Pre-aggregations in Impala 2.5
select distinct * from sales;
[Diagram: Scan and Merge on each of three nodes, connected over the network]
• Reduction factor is dynamically estimated based on the actual data processed
• Pre-aggregation expands memory usage only if reduction factor is good
• Benefits:
  • Certain aggregations with low reduction factor see speedups of up to 25%
  • Memory consumption can be reduced by 50% or more
  • Streaming pre-aggregations don’t spill to disk
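One way to picture a streaming pre-aggregation is sketched below. The policy is an assumption made for illustration: after a few rows, compare rows seen to groups held; if the reduction factor is poor, stop growing the hash table and pass unseen keys straight through to the merge phase. The thresholds and names are hypothetical, not Impala's actual heuristics.

```python
def streaming_preagg(rows, min_reduction=2.0, check_after=4):
    """Return the (key, partial_sum) rows this node would send onward."""
    groups = {}          # in-memory hash table of partial sums
    seen = 0
    passthrough = False  # True once the table is no longer allowed to grow
    out = []
    for key, value in rows:
        seen += 1
        if key in groups:
            groups[key] += value        # cheap: key already has a slot
        elif passthrough:
            out.append((key, value))    # send on without using memory
        else:
            groups[key] = value
        if not passthrough and seen >= check_after:
            # Estimated reduction factor from the data processed so far.
            passthrough = seen / len(groups) < min_reduction
    out.extend(groups.items())
    return out


nearly_distinct = [(i, 1) for i in range(10)]   # poor reduction
repetitive = [(i % 2, 1) for i in range(10)]    # good reduction

sent_distinct = streaming_preagg(nearly_distinct)
sent_repetitive = streaming_preagg(repetitive)
```

For the nearly distinct input the table stops growing after four groups and the remaining rows stream through; for the repetitive input everything is reduced to two rows. The merge phase produces correct totals in both cases because passed-through rows are just unreduced partial results.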

Optimization for partition keys scan
• Use metadata to avoid table accesses for partition key scans:
  • select min(month), max(year) from functional.alltypes;
  • month, year are partition keys of the table
• Enabled by query option OPTIMIZE_PARTITION_KEY_SCANS
• Applicable:
  • min(), max(), ndv() and aggregate functions with distinct keyword
  • partition keys only

Plan without optimization:
03:AGGREGATE [FINALIZE]
|  output: min:merge(month), max:merge(year)
|
02:EXCHANGE [UNPARTITIONED]
|
01:AGGREGATE
|  output: min(month), max(year)
|
00:SCAN HDFS [functional.alltypes]
   partitions=24/24 files=24 size=478.45KB

Plan with optimization:
01:AGGREGATE [FINALIZE]
|  output: min(month),max(year)
|
00:UNION
   constant-operands=24
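The optimized plan replaces the scan with one constant row per partition. A minimal sketch of the same idea: the answer is computed from the partition list alone, with no data files read. The 24 partitions match the plan above, but the concrete year values (2009 and 2010) are an assumption made for the example.

```python
# Assumed partition metadata: 2 years x 12 months = 24 partitions.
partitions = [
    {"year": year, "month": month}
    for year in (2009, 2010)
    for month in range(1, 13)
]


def min_max_from_metadata(partition_list, min_key, max_key):
    """Answer min(min_key), max(max_key) using partition-key values only."""
    return (
        min(p[min_key] for p in partition_list),
        max(p[max_key] for p in partition_list),
    )


min_month, max_year = min_max_from_metadata(partitions, "month", "year")
```

This works because every row in a partition shares that partition's key values, so one value per partition is enough for min(), max() and distinct-style aggregates. A partition with no rows would still contribute its key here, which is one reason such an optimization is a natural candidate for an opt-in query option.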

Scheduling Small Queries
Query scheduler assigns scan ranges to workers (running impalad).
First it selects an HDFS datanode to read from.
Then it selects an impalad to perform the scan.
[Diagram: replicas A, B, C]
Selection will always start with the same replica to make optimal use of OS buffer caches.
This can lead to hot-spots for some workloads.
Improvement: Pick impalad at random.

New Query Option: random_replica
Disabled by default.
set random_replica = 1;
Also has a corresponding query hint:
SELECT AVG(c1) FROM t /* +SCHEDULE_RANDOM_REPLICA */;

What to Look out for

Where It Can Help
• Large number of small queries, each with few input tables.
• High load on only one of multiple replicas of a table.
• Queries are CPU bound.
• Benefit: Distribute load more evenly over replicas.
• Tradeoff: Distribution of local reads will increase buffer cache usage.
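The hot-spot effect and its fix can be simulated in a few lines. This is a sketch under simple assumptions: every query reads one scan range that has replicas on hosts A, B and C, and the scheduler either always takes the first replica or picks one uniformly at random. The function names are hypothetical.

```python
import random
from collections import Counter


def pick_replica(replicas, randomize, rng):
    """Deterministic choice favours cache reuse; random choice spreads load."""
    return rng.choice(replicas) if randomize else replicas[0]


def simulate(num_queries, replicas, randomize, seed=0):
    """Count how many of the small queries land on each replica host."""
    rng = random.Random(seed)
    return Counter(
        pick_replica(replicas, randomize, rng) for _ in range(num_queries)
    )


hosts = ["A", "B", "C"]
fixed_load = simulate(1000, hosts, randomize=False)
random_load = simulate(1000, hosts, randomize=True)
```

With the fixed choice all 1000 queries hit host A; with random selection the load is spread over all three hosts, at the cost of the same blocks being cached on more than one host.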

What’s Next
• Add possibility to prefer remote reads.
• Switch remote impalad selection from round-robin to load-based.
• Add rack-awareness.

Impala Admission Control
•Purpose: throttle workload to avoid oversubscription, maximize throughput
•Previously (since Impala 1.4):
• Throttle based on number of concurrent queries
• Knobs for throttling based on memory, but not recommended
•New in 2.5:
• Improved CM integration: configuration and monitoring
• Improved memory-based admission
• Improved control over pools w/ new configurations

Admission Control Improvements
•Use cases that work well with admission control:
1. [Since 1.4] Simple throttling: setting max number of running queries
2. [NEW] Well understood, memory-bound workloads
•Admission control accepts new per-pool configurations
• Query memory limits (and any other query options)
• Queue timeout (was previously a global setting only)
•Admission algorithm admits/queues based on:
• Request’s aggregate memory usage fitting within the pool’s configured mem
• Request’s per-node mem requirement fitting on each impalad’s available mem
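The two checks in the last bullet group can be written down directly. The sketch below is a simplified model, not Impala's admission controller: it assumes a query uses the same amount of memory on every node, and all parameter names and numbers are hypothetical.

```python
def admission_decision(per_node_mem_gb, pool_max_mem_gb, pool_mem_in_use_gb,
                       node_available_gb):
    """Return "admit" if both memory checks pass, otherwise "queue"."""
    num_nodes = len(node_available_gb)
    aggregate_gb = per_node_mem_gb * num_nodes
    # Check 1: aggregate usage must fit within the pool's configured memory.
    if pool_mem_in_use_gb + aggregate_gb > pool_max_mem_gb:
        return "queue"
    # Check 2: the per-node requirement must fit on every impalad.
    if any(avail < per_node_mem_gb for avail in node_available_gb):
        return "queue"
    return "admit"


# Hypothetical pool: 4 nodes, pool capped at 400 GB in aggregate.
idle_cluster = [200, 200, 200, 200]
busy_node = [200, 200, 200, 50]

first_big = admission_decision(100, 400, 0, idle_cluster)
second_big = admission_decision(100, 400, 400, idle_cluster)
blocked_by_node = admission_decision(100, 400, 0, busy_node)
```

A request is queued if either check fails: the second large query exceeds the pool cap, and the third request is held back by a single node that lacks the memory.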

Example
•Impala cluster has 10 nodes, 200 GB/node = 2 TB total
•Workload has queries with known memory requirements:
• Many small, fast queries (<10 GB/node)
• Some very large queries (100 GB/node)
•Need to run 1 big query at a time, many small queries
•Configure two resource pools:

Guidance for Using Memory-Based Admission
For a well known, memory bound workload (i.e. can set MEM_LIMIT on queries):
1. Group queries by resource requirements (e.g. small/HighThrpt, XXL)
2. Create resource pools for each group
3. Control concurrency for pools:
- large query pools: set pool max mem resources
- small query pools: set max num running queries limit
4. Set default query mem limits on all pools (upper bound)
5. Set REQUEST_POOL query option to direct queries into pools
- via ‘SET’ command
- [NEW] Add key-value pair to JDBC connection string

Admission control demo
CM Impala Admission Control Configuration:
http://vc0726.halxg.cloudera.com:7180/cmf/services/13/pools/configuration
Admission Control Dashboard: link

Impala Roadmap
2H 2015 1H 2016 2016
• SQL Support & Usability
• Nested structures
• Kudu updates (beta)
• Management & Security
• Record reader service (beta)
• Finer-grained security (Sentry)
• Integration
• Isilon support
• Python interface (Ibis)
• Performance & Scale
• Improved predictability under concurrency
• Continued scalability and concurrency
• Initial perf/scale improvements
• Improved admission control
• Resource utilization and showback
• Dynamic partitioning
• Improved timestamp compatibility
• >20x performance
• Multi-threaded joins/aggregations
• Continued scale work
• Improved YARN integration
• Automated metadata
• Integration
• S3 support
• Nested types with Avro
• Date type
• Added SQL extensions
