1© Cloudera, Inc. All rights reserved.
Apache Impala 2.5 (Incubating)
Performance improvements overview
Agenda
• What is Impala?
• Impala at Apache
• What is new in Impala 2.5 (CDH 5.7)
• Impala performance update
• Roadmap
• Q&A
SQL-on-Hadoop engines
SQL
Impala
SQL-on-Apache Hadoop – Choosing the right tool for the right
job
• General-purpose SQL engine
• Real-time queries in Apache Hadoop
• General availability (v1.0) release out since April 2013
• Analytic SQL functionality (v2.0) since October 2014
• Apache incubator project since December 2015
• Previous release 2.3 (CDH 5.5) released November 2015
• Current release 2.5 (CDH 5.7) April 2016
What is Impala?
Today’s topic
• Query speed over Hadoop that meets or exceeds that of a proprietary analytic DBMS
• General-purpose SQL query engine:
• Targeted for analytical workloads
• Supports queries that take from milliseconds to hours
• Runs directly within Hadoop:
• reads widely used Hadoop file formats
• talks to widely used Hadoop storage managers
• runs on same nodes that run Hadoop processes
• Highly available
• High performance:
• C++ instead of Java
• Run time code generation
Impala overview
Impala Use Cases
• Interactive BI/analytics on more data
• Asking new questions – exploration, ML (Ibis)
• Data processing with tight SLAs
• Query-able archive w/full fidelity
• Incubator project since
December 2015
• Development process slowly
moving to ASF infrastructure (see
IMPALA-3221)
• Help wanted!
Where to find the Impala community:
dev@impala.incubator.apache.org
user@impala.incubator.apache.org
http://impala.io
@apacheimpala
Impala at Apache
New in Impala 2.5
Usability Enhancements
• Admission Control Improvements
• Null-safe join/equals
Performance and Scalability
• Runtime filters
• Improved Cardinality Estimation and Join
Ordering
• Query start-up improvements
• Additional codegen and code
optimizations
• Decimal arithmetic improvements
• Fast min/max values on partition
columns (with query option)
Integrations
• Support for EMC DSSD
New in Impala 2.5
Performance and Scalability
• Runtime filters
• Improved Cardinality Estimation and Join
Ordering
• Query start-up improvements
• Additional codegen and code
optimizations
• Decimal arithmetic improvements
• Incremental metadata updates (DDL)
• Fast min/max values on partition
columns (with query option)
Covered today
Impala 2.5 (CDH 5.7) improvements vs Impala 2.3 (CDH 5.5)
• 2.2x speedup for TPC-H
• 1.7x speedup for TPC-H (Nested)
• 4.3X speedup for TPC-DS
Runtime filtering
• General idea: some predicates can only be computed at runtime
• Example: SELECT count(*) FROM date_dim dt ,store_sales WHERE dt.d_date_sk =
store_sales.ss_sold_date_sk AND dt.d_moy = 12;
• How does Impala execute this query?
SELECT dt.d_year
,item.i_brand brand
,sum(ss_ext_sales_price) sum_agg
FROM date_dim dt
,store_sales
,item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND i_category = "Books"
AND i_class = "fiction"
AND dt.d_moy = 12
GROUP BY dt.d_year
,item.i_brand
ORDER BY dt.d_year
,sum_agg DESC
,i_brand limit 100
Runtime filters
store_sales
43 billion rows
item
198 rows
Broadcast
Join #1
290 million rows
date_dim
6,200 rows
Broadcast
Join #2
Aggregate
47 million rows
SELECT dt.d_year
,item.i_brand brand
,sum(ss_ext_sales_price) sum_agg
FROM date_dim dt
,store_sales
,item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND i_category = "Books"
AND i_class = "fiction"
AND dt.d_moy = 12
GROUP BY dt.d_year
,item.i_brand
ORDER BY dt.d_year
,sum_agg DESC
,i_brand limit 100
Runtime filters
store_sales
43 billion rows
item
198 rows
Broadcast
Join #1
290 million rows
date_dim
6,200 rows
Broadcast
Join #2
Aggregate
47 million rows
Runtime filters: the opportunity
● The planner doesn’t know what the sets of
qualifying ss_sold_date_sk and ss_item_sk
values contain - even with statistics.
● Opportunity to save some work: why bother
sending all 43 billion rows to the joins?
● Runtime filtering computes these predicates
at runtime.
SELECT dt.d_year
,item.i_brand brand
,sum(ss_ext_sales_price) sum_agg
FROM date_dim dt
,store_sales
,item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND i_category = "Books"
AND i_class = "fiction"
AND dt.d_moy = 12
GROUP BY dt.d_year
,item.i_brand
ORDER BY dt.d_year
,sum_agg DESC
,i_brand limit 100
Runtime filters
store_sales
43 billion rows
item
198 rows
Broadcast
Join #1
290 million rows
date_dim
6,200 rows
Broadcast
Join #2
Aggregate
47 million rows
Step 1: the planner tells Join #1 to
produce a bloom filter of qualifying
i_item_sk values and Join #2 to
produce a bloom filter of qualifying
d_date_sk values
SELECT dt.d_year
,item.i_brand brand
,sum(ss_ext_sales_price) sum_agg
FROM date_dim dt
,store_sales
,item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND i_category = "Books"
AND i_class = "fiction"
AND dt.d_moy = 12
GROUP BY dt.d_year
,item.i_brand
ORDER BY dt.d_year
,sum_agg DESC
,i_brand limit 100
Runtime filters
store_sales
43 billion rows
item
198 rows
Broadcast
Join #1
290 million rows
date_dim
6,200 rows
Broadcast
Join #2
Aggregate
47 million rows
Step 2: each join reads all rows from
its build side (right input) and
computes a filter containing all
distinct values of i_item_sk or
d_date_sk respectively
SELECT dt.d_year
,item.i_brand brand
,sum(ss_ext_sales_price) sum_agg
FROM date_dim dt
,store_sales
,item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND i_category = "Books"
AND i_class = "fiction"
AND dt.d_moy = 12
GROUP BY dt.d_year
,item.i_brand
ORDER BY dt.d_year
,sum_agg DESC
,i_brand limit 100
Runtime filters
store_sales
43 billion rows
item
198 rows
Broadcast
Join #1
290 million rows
date_dim
6,200 rows
Broadcast
Join #2
Aggregate
47 million rows
Step 3: Joins #1 and #2 send their
filters to the store_sales scan,
which eliminates rows that have no
match in the bloom filters.
SELECT dt.d_year
,item.i_brand brand
,sum(ss_ext_sales_price) sum_agg
FROM date_dim dt
,store_sales
,item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND i_category = "Books"
AND i_class = "fiction"
AND dt.d_moy = 12
GROUP BY dt.d_year
,item.i_brand
ORDER BY dt.d_year
,sum_agg DESC
,i_brand limit 100
Runtime filters
store_sales
47 million rows
item
198 rows
Broadcast
Join #1
47 million rows
date_dim
6,200 rows
Broadcast
Join #2
Aggregate
47 million rows
The store_sales scan uses the bloom
filter from Join #2 to filter out
partitions (ss_sold_date_sk) and the
bloom filter from Join #1 to filter
out rows that don’t qualify
(ss_item_sk)
SELECT dt.d_year
,item.i_brand brand
,sum(ss_ext_sales_price) sum_agg
FROM date_dim dt
,store_sales
,item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND i_category = "Books"
AND i_class = "fiction"
AND dt.d_moy = 12
GROUP BY dt.d_year
,item.i_brand
ORDER BY dt.d_year
,sum_agg DESC
,i_brand limit 100
Runtime filters
store_sales
47 million rows
item
198 rows
Broadcast
Join #1
47 million rows
date_dim
6,200 rows
Broadcast
Join #2
Aggregate
47 million rows
914x reduction in number
of rows coming out of scan
43 billion -> 47 million
6x reduction in number of
rows coming out of join
290 million -> 47 million
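The three steps can be sketched with a toy Bloom filter. This is a simplified illustration only: Impala's actual runtime filter is a more sophisticated split Bloom filter with tuned hashing, and all class and function names below are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Toy Bloom filter: k hash probes over a bit array. An inserted key is
// always reported as present; a missing key is usually (not always)
// reported as absent, which is fine for a best-effort row filter.
class BloomFilter {
 public:
  explicit BloomFilter(size_t num_bits) : bits_(num_bits, false) {}

  void Insert(int64_t key) {
    for (int i = 0; i < kNumHashes; ++i) bits_[Probe(key, i)] = true;
  }

  bool MayContain(int64_t key) const {
    for (int i = 0; i < kNumHashes; ++i)
      if (!bits_[Probe(key, i)]) return false;
    return true;
  }

 private:
  static const int kNumHashes = 3;
  size_t Probe(int64_t key, int seed) const {
    return std::hash<int64_t>()(key * 1000003 + seed) % bits_.size();
  }
  std::vector<bool> bits_;
};

// Step 2: the join consumes its build side and inserts every distinct
// join-key value into the filter.
BloomFilter BuildFilter(const std::vector<int64_t>& build_keys,
                        size_t num_bits) {
  BloomFilter f(num_bits);
  for (size_t i = 0; i < build_keys.size(); ++i) f.Insert(build_keys[i]);
  return f;
}

// Step 3: the scan drops rows whose join key cannot possibly match.
std::vector<int64_t> FilterScan(const std::vector<int64_t>& scan_keys,
                                const BloomFilter& f) {
  std::vector<int64_t> survivors;
  for (size_t i = 0; i < scan_keys.size(); ++i)
    if (f.MayContain(scan_keys[i])) survivors.push_back(scan_keys[i]);
  return survivors;
}
```

A bigger bit array lowers the false-positive rate at the cost of memory and of network traffic, since filters have to be shipped from the joins to the scans.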
SELECT c_email_address
,sum(ss_ext_sales_price) sum_agg
FROM store_sales
,customer
,customer_demographics
WHERE ss_customer_sk = c_customer_sk
AND cd_demo_sk = c_current_cdemo_sk
AND cd_gender = 'M'
AND cd_purchase_estimate = 10000
AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC
Runtime filters variation: Global filters
Shuffle
Join #1
43 billion rows
customer_demo
2,400 rows
Broadcast
Join #2
Aggregate
49 million rows
store_sales
43 billion rows
customer
3.8 million
Shuffle Shuffle
SELECT c_email_address
,sum(ss_ext_sales_price) sum_agg
FROM store_sales
,customer
,customer_demographics
WHERE ss_customer_sk = c_customer_sk
AND cd_demo_sk = c_current_cdemo_sk
AND cd_gender = 'M'
AND cd_purchase_estimate = 10000
AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC
Runtime filters variation: Global filters
Shuffle
Join #1
43 billion rows
customer_demo
2,400 rows
Broadcast
Join #2
Aggregate
49 million rows
Joins #1 and #2 are expensive
since the left side of each join
has 43 billion rows
store_sales
43 billion rows
customer
3.8 million
Shuffle Shuffle
SELECT c_email_address
,sum(ss_ext_sales_price) sum_agg
FROM store_sales
,customer
,customer_demographics
WHERE ss_customer_sk = c_customer_sk
AND cd_demo_sk = c_current_cdemo_sk
AND cd_gender = 'M'
AND cd_purchase_estimate = 10000
AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC
Runtime filters variation: Global filters
Shuffle
Join #1
43 billion rows
customer_demo
2,400 rows
Broadcast
Join #2
Aggregate
49 million rows
Create a bloom filter from
Join #2 on cd_demo_sk and
push it down to the customer
table scan
store_sales
43 billion rows
customer
3.8 million
Shuffle Shuffle
SELECT c_email_address
,sum(ss_ext_sales_price) sum_agg
FROM store_sales
,customer
,customer_demographics
WHERE ss_customer_sk = c_customer_sk
AND cd_demo_sk = c_current_cdemo_sk
AND cd_gender = 'M'
AND cd_purchase_estimate = 10000
AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC
Runtime filters variation: Global filters
Shuffle
Join #1
43 billion rows
customer_demo
2,400 rows
Broadcast
Join #2
Aggregate
49 million rows
826x reduction in customer rows
3.8 million -> 4,600 rows
store_sales
43 billion rows
customer
4,600 rows
Shuffle Shuffle
SELECT c_email_address
,sum(ss_ext_sales_price) sum_agg
FROM store_sales
,customer
,customer_demographics
WHERE ss_customer_sk = c_customer_sk
AND cd_demo_sk = c_current_cdemo_sk
AND cd_gender = 'M'
AND cd_purchase_estimate = 10000
AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC
Runtime filters variation: Global filters
Shuffle
Join #1
43 billion rows
customer_demo
2,400 rows
Broadcast
Join #2
Aggregate
49 million rows
store_sales
43 billion rows
customer
4,600 rows
Shuffle Shuffle
Create a bloom filter from
Join #1 on c_customer_sk
and push it down to the
store_sales table scan
SELECT c_email_address
,sum(ss_ext_sales_price) sum_agg
FROM store_sales
,customer
,customer_demographics
WHERE ss_customer_sk = c_customer_sk
AND cd_demo_sk = c_current_cdemo_sk
AND cd_gender = ‘M’
AND cd_purchase_estimate = 10000
AND cd_credit_reting = ‘Low Risk’
GROUP BY c_email_address
ORDER BY sum_agg DESC
Runtime filters variation: Global filters
Shuffle
Join #1
49 million rows
customer_demo
2,400 rows
Broadcast
Join #2
Aggregate
49 million rows
store_sales
49 million rows
customer
4,600 rows
Shuffle Shuffle
877x reduction in rows
43 billion -> 49 million rows
set RUNTIME_FILTER_MODE=GLOBAL;
Runtime filters: real-world results
• Runtime filters can be highly effective. Some benchmark queries are more than 30
times faster in Impala 2.5.0.
• As always, the benefit depends on your queries, your schemas and your cluster environment.
• By default, runtime filters are enabled in limited ‘local’ mode in Impala 2.5.0. They
can be enabled fully by setting RUNTIME_FILTER_MODE=GLOBAL.
• Other runtime filter parameters include:
• RUNTIME_BLOOM_FILTER_SIZE (default: 1048576)
• RUNTIME_FILTER_WAIT_TIME_MS (default: 0)
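How big should RUNTIME_BLOOM_FILTER_SIZE be? The textbook Bloom-filter false-positive estimate gives a feel for the trade-off. This is the generic approximation with a hypothetical helper name, not Impala's internal accounting (its split Bloom filter differs in detail):

```cpp
#include <cassert>
#include <cmath>

// Classic Bloom-filter false-positive estimate for m bits, k hash
// functions and n inserted keys: p ~= (1 - e^(-k*n/m))^k.
// Illustrative only, but the trend holds: a bigger filter (or fewer
// distinct build-side keys) means fewer false matches at the scan.
double EstimateFpp(double m_bits, double k_hashes, double n_keys) {
  return std::pow(1.0 - std::exp(-k_hashes * n_keys / m_bits), k_hashes);
}
```

With the default 1 MB filter (about 2^23 bits) and a small build side like the 198-row item table, false positives are essentially nonexistent; the rate only becomes a concern when the build side's distinct-value count approaches the filter's capacity.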
Improved Cardinality Estimates and Join Order
1. More robust scan cardinality estimation
• Mitigate correlated predicates (exponential backoff)
2. Improved join cardinality estimation
• Special treatment of common case of PK/FK joins
• Detect selective joins by applying the selectivity of build-side predicates to the
estimated join cardinality
• TPC-H Q8 Impact: >8x speedup (91s in Impala 2.3 -> 11s in Impala 2.5)
SELECT *
FROM cars
WHERE
cars.make = 'Toyota'
AND cars.model = 'Camry'
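The exponential-backoff mitigation can be sketched as: sort the per-predicate selectivities so the most selective one counts fully, then dampen each additional one by a square-root factor. The slide does not give Impala's exact formula; the sketch below implements the generic technique with hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Naive independence assumption: multiply all selectivities. Correlated
// predicates (every Camry is a Toyota) make this estimate far too small.
double NaiveSelectivity(const std::vector<double>& sels) {
  double result = 1.0;
  for (size_t i = 0; i < sels.size(); ++i) result *= sels[i];
  return result;
}

// Exponential backoff: s1 * s2^(1/2) * s3^(1/4) * ... with s1 the most
// selective predicate; later predicates contribute progressively less.
double BackoffSelectivity(std::vector<double> sels) {
  std::sort(sels.begin(), sels.end());  // most selective first
  double result = 1.0;
  double exponent = 1.0;
  for (size_t i = 0; i < sels.size(); ++i) {
    result *= std::pow(sels[i], exponent);
    exponent /= 2.0;
  }
  return result;
}
```

If make = 'Toyota' matches 10% of rows and model = 'Camry' matches 1%, the naive estimate is 0.1% while backoff gives 0.01 * 0.1^(1/2), about 0.32%: still conservative, but much closer to the truth that every Camry is a Toyota.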
Query start-up: performance impact
LLVM Codegen Support in Impala
Operations:
• Hash join
• Aggregation
• Scans: Text, Sequence, Avro
• Expressions in all operators
• Sort
• Top-N
Data Types:
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• BOOLEAN
• STRING, VARCHAR
• DECIMAL
(Callouts: Sort and Top-N codegen are new in Impala 2.5; codegen data-type
coverage was extended in Impala 2.5.)
Codegen for Order by & Top-N
void* ExprContext::GetValue(Expr* e, TupleRow* row) {
switch (e->type_.type) {
case TYPE_BOOLEAN: {
..
..
}
case TYPE_TINYINT: {
..
..
}
case TYPE_INT: {
..
.
int Compare(TupleRow* lhs, TupleRow* rhs) const {
for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
int result = RawValue::Compare(lhs_value, rhs_value,
sort_cols_lhs_[i]->root()->type());
if (!is_asc_[i]) result = -result;
if (result != 0) return result;
// Otherwise, try the next Expr
}
return 0; // fully equivalent key
}
Codegen for Order by & Top-N
int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
int64_t lhs_value = sort_columns[i]->GetBigIntVal(lhs); // i = 0
int64_t rhs_value = sort_columns[i]->GetBigIntVal(rhs); // i = 0
int result = lhs_value > rhs_value ? 1 :
(lhs_value < rhs_value ? -1 : 0);
if (result != 0) return result;
// Otherwise, try the next Expr
return 0; // fully equivalent key
}
Codegen code
• Perfectly unrolls “for each sort column” loop
• No switching on input type(s)
• Removes branching on ASCENDING/DESCENDING,
NULLS FIRST/LAST
Original code
int Compare(TupleRow* lhs, TupleRow* rhs) const {
for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
int result = RawValue::Compare(lhs_value, rhs_value,
sort_cols_lhs_[i]->root()->type());
if (!is_asc_[i]) result = -result;
if (result != 0) return result;
// Otherwise, try the next Expr
}
return 0; // fully equivalent key
}
Codegen for Order by & Top-N
int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
int64_t lhs_value = sort_columns[i]->GetBigIntVal(lhs); // i = 0
int64_t rhs_value = sort_columns[i]->GetBigIntVal(rhs); // i = 0
int result = lhs_value > rhs_value ? 1 :
(lhs_value < rhs_value ? -1 : 0);
if (result != 0) return result;
// Otherwise, try the next Expr
return 0; // fully equivalent key
}
Codegen code
• Perfectly unrolls “for each sort column” loop
• No switching on input type(s)
• Removes branching on ASCENDING/DESCENDING,
NULLS FIRST/LAST
Original code
int Compare(TupleRow* lhs, TupleRow* rhs) const {
for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
int result = RawValue::Compare(lhs_value, rhs_value,
sort_cols_lhs_[i]->root()->type());
if (!is_asc_[i]) result = -result;
if (result != 0) return result;
// Otherwise, try the next Expr
}
return 0; // fully equivalent key
}
10x more efficient
code
Float/Double vs. Decimal?
Pros for Float/Double
• Uses less memory.
• Faster because floating point math operations are natively supported by processors.
(Note: Decimal uses fixed-point hardware types - int64 and __int128)
• Can represent a larger range of numbers.
Cons for Float/Double
• Precision errors compound during aggregations
• Can’t do math with wide number of significant digits (123456789.1 * .0000987654321)
Decimal arithmetic and aggregation
Float/Double is a no-go for applications requiring high precision & accuracy
What about the performance penalty of Decimal?
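The compounding-error con is easy to demonstrate: summing a price of 0.10 many times in 32-bit float drifts, while the fixed-point integer representation underlying DECIMAL stays exact. A minimal sketch with hypothetical helper names; the scale-by-100 layout mirrors how a DECIMAL(p,2) value is stored, not Impala's actual code:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Sum n copies of a 0.10 price in 32-bit float. 0.1 has no exact binary
// representation, and rounding error compounds as the running total grows.
float SumFloat(int n) {
  float total = 0.0f;
  for (int i = 0; i < n; ++i) total += 0.1f;
  return total;
}

// Same sum in fixed point: the value is held as an integer count of
// hundredths (the idea behind DECIMAL(p,2) storage). Exact, no drift.
int64_t SumFixedPointCents(int n) {
  int64_t total = 0;
  for (int i = 0; i < n; ++i) total += 10;  // 0.10 == 10 hundredths
  return total;
}
```

Over ten million rows the fixed-point total is exactly 1,000,000.00, while the float total drifts by a visible amount: the "precision errors compound during aggregations" con above.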
Decimal arithmetic and aggregation
SELECT l_returnflag,
l_linestatus,
Sum(l_quantity) AS SUM_QTY,
Sum(l_extendedprice)AS SUM_BASE_PRICE,
Sum(l_extendedprice * ( 1 - l_discount ))AS SUM_DISC_PRICE
FROM lineitem
GROUP BY l_returnflag,
l_linestatus
ORDER BY l_returnflag,
l_linestatus
3x speedup
● Simplified overflow check for decimal.
● Extended Codegen framework to support aggregations involving decimal.
● Bridged the performance gap between double and decimal
Network
Distributed Aggregations in Impala
Preagg Preagg Preagg
Merge Merge Merge
select cust_id, sum(dollars)
from sales group by cust_id;
Scan Scan Scan
• Impala aggregations have two phases:
• Pre-aggregation phase
• Merge phase
• The pre-aggregation phase greatly reduces
network traffic if there are many input
rows per grouping value.
• E.g. many sales per customer.
Network
Downsides of Pre-aggregations
Preagg Preagg Preagg
Merge Merge Merge
select distinct * from sales;
Scan Scan Scan
• Pre-aggregations consume:
• Memory
• CPU cycles
• Pre-aggregations are not always effective
at reducing network traffic
• E.g. select distinct for nearly-distinct rows
• Pre-aggregations can spill to disk under
memory pressure
• Disk I/O is costly: better to send rows to
the merge aggregation than to disk
Network
Streaming Pre-aggregations in Impala 2.5
Merge Merge Merge
select distinct * from sales;
Scan Scan Scan
• Reduction factor is dynamically estimated based
on the actual data processed
• Pre-aggregation expands memory usage only if
reduction factor is good
• Benefits:
• Certain aggregations with low reduction
factor see speedups of up to 40%
• Memory consumption can be reduced by
50% or more
• Streaming pre-aggregations don’t spill to
disk
Streaming Pre-aggregations in Impala 2.5
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
06:AGGREGATE 1 366.581ms 366.581ms 1 1 72.00 KB -1.00 B FINALIZE
05:EXCHANGE 1 149.923us 149.923us 15 1 0 -1.00 B UNPARTITIONED
02:AGGREGATE 15 243.604ms 248.701ms 15 1 12.00 KB 10.00 MB
04:AGGREGATE 15 8s887ms 9s585ms 450.00M 437.91M 1.53 GB 245.01 MB FINALIZE
03:EXCHANGE 15 827.770ms 932.785ms 450.00M 437.91M 0 0 HASH(o_orderkey)
01:AGGREGATE 15 9s995ms 11s484ms 450.00M 437.91M 1.64 GB 3.59 GB
00:SCAN HDFS 15 142.192ms 189.179ms 450.00M 450.00M 150.94 MB 88.00 MB tpch_300_parquet.orders
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
06:AGGREGATE 1 356.667ms 356.667ms 1 1 72.00 KB -1.00 B FINALIZE
05:EXCHANGE 1 110.924us 110.924us 15 1 0 -1.00 B UNPARTITIONED
02:AGGREGATE 15 246.188ms 250.408ms 15 1 12.00 KB 10.00 MB
04:AGGREGATE 15 11s174ms 11s753ms 450.00M 437.91M 1.51 GB 245.01 MB FINALIZE
03:EXCHANGE 15 750.620ms 805.099ms 450.00M 437.91M 0 0 HASH(o_orderkey)
01:AGGREGATE 15 5s670ms 6s715ms 450.00M 437.91M 153.40 MB 3.59 GB STREAMING
00:SCAN HDFS 15 151.746ms 201.804ms 450.00M 450.00M 150.95 MB 88.00 MB tpch_300_parquet.orders
Baseline (top profile) finished in 23.13 seconds
With streaming pre-aggregation enabled (bottom profile), it finished in 14.9 seconds
Optimization for partition key scans
• Use metadata to avoid table accesses for partition key scans:
• select min(month), max(year) from functional.alltypes;
• month, year are partition keys of the table
• Enabled by query option OPTIMIZE_PARTITION_KEY_SCANS
• Applicable to:
• min(), max(), ndv() and aggregate functions with the DISTINCT keyword
• partition key columns only
Plan with optimization:
01:AGGREGATE [FINALIZE]
| output: min(month), max(year)
|
00:UNION
constant-operands=24

Plan without optimization:
03:AGGREGATE [FINALIZE]
| output: min:merge(month), max:merge(year)
|
02:EXCHANGE [UNPARTITIONED]
|
01:AGGREGATE
| output: min(month), max(year)
|
00:SCAN HDFS [functional.alltypes]
partitions=24/24 files=24 size=478.45KB
Hardware
● 21-node cluster, each node with:
○ 384GB memory, 2 sockets, 12 total cores, Intel Xeon CPU E5-2630L 0 at 2.00GHz
○ 12 disk drives at 932GB each (one for the OS, the rest for HDFS)
Comparative Set
● Impala 2.5
○ RUNTIME_FILTER_MODE = 2 (GLOBAL);
● Spark SQL 1.6
○ Thrift JDBC server used to avoid startup cost
○ --master yarn --deploy-mode client --driver-memory 24G --driver-cores 8 --executor-memory 24G --num-executors 240
Workload
● TPC-DS 15TB stored in Parquet file format (default of 256MB block size)
● Un-modified TPC-DS queries : 3, 7, 8, 19, 25, 27, 34, 42, 43, 46, 47, 52, 53, 55, 59, 61, 63, 68, 73, 79, 88, 89, 96, 98
● Caveats:
○ Spark-SQL failed running:
■ Q25 : Bad plan
■ Q47 : StackOverflowError
■ Q89 : StackOverflowError
Competitive benchmark : TPC-DS
Q25 (Fact to fact joins)
SELECT i_item_id,i_item_desc, s_store_id, s_store_name,
Stddev_samp(ss_net_profit), Stddev_samp(sr_net_loss), Stddev_samp(cs_net_profit)
AS catalog_sales_profit
FROM store_sales,
store_returns,
catalog_sales,
date_dim d1,
date_dim d2,
date_dim d3,
store,
item
WHERE d1.d_moy = 4 AND d1.d_year = 2001 AND d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk AND s_store_sk = ss_store_sk AND ss_customer_sk =
sr_customer_sk AND ss_item_sk = sr_item_sk AND ss_ticket_number = sr_ticket_number
AND sr_returned_date_sk = d2.d_date_sk AND d2.d_moy BETWEEN 4 AND 10
AND d2.d_year = 2001 AND sr_customer_sk = cs_bill_customer_sk
AND sr_item_sk = cs_item_sk AND cs_sold_date_sk = d3.d_date_sk
AND d3.d_moy BETWEEN 4 AND 10 AND d3.d_year = 2001
GROUP BY i_item_id, i_item_desc,
s_store_id, s_store_name
ORDER BY i_item_id, i_item_desc,
s_store_id, s_store_name
LIMIT 100;
Competitive benchmark
Query complexity varied, from simple queries like Q3 (below) to fact-to-fact joins like Q25 (above)
SELECT dt.d_year,
item.i_brand_id brand_id,
item.i_brand brand,
Sum(ss_ext_sales_price) sum_agg
FROM date_dim dt,
store_sales,
item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND item.i_manufact_id = 436
AND dt.d_moy = 12
GROUP BY dt.d_year,
item.i_brand,
item.i_brand_id
ORDER BY dt.d_year,
sum_agg DESC,
brand_id
LIMIT 100;
Competitive benchmark
Impala 2.5 is 11x faster
(based on geomean)
Performance Benchmark Takeaways
• Impala unlocks BI usage directly on Hadoop
• Meets BI low-latency and multi-user requirements
• Advantage expands under multi-user load (10 users) vs. single-user
• Spark SQL enables easier Spark application development
• Enables mixed procedural Spark (Java/Scala) and SQL job development
• Mid-term trends will further favor Impala’s design approach for latency and concurrency
• More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
• CPU efficiency will increase in importance
• Native code enables easy optimizations for CPU instruction sets
• Available today in Impala 2.5:
• All the same Impala functionality, performance, and third-party integrations
• Supported across our cloud partners
• Deployment via Director
• Modular architecture enables cloud’s decoupled storage and elasticity future
• Available soon in Impala 2.6:
• Impala read/write to S3 in addition to local HDFS (IMPALA-1878)
• Dynamically sized runtime filters
• Parquet scanner optimization
• Faster joins, aggregations, sorts and decimal arithmetic
• Rack aware scheduling
• Faster code generation
Impala and Cloud
Impala Roadmap
2H 2015
• SQL Support & Usability
  • Nested structures
  • Kudu updates (beta)
• Management & Security
  • Record reader service (beta)
  • Finer-grained security (Sentry)
• Integration
  • Isilon support
  • Python interface (Ibis)
• Performance & Scale
  • Improved predictability under concurrency

1H 2016
• Performance & Scale
  • Continued scalability and concurrency
  • Initial perf/scale improvements
• Management & Security
  • Improved admission control
  • Resource utilization and showback
• SQL Support & Usability
  • Dynamic partitioning

2016
• Performance & Scale
  • >20x performance
  • Multi-threaded joins/aggregations
  • Continued scale work
• Cloud
  • S3 read/write support
• Management & Security
  • Improved YARN integration
  • Automated metadata
• SQL Support & Usability
  • Data type improvements
  • Added SQL extensions
Appendix.
• Pre Impala 2.5:
• Coordinator starts receiving fragments before
senders
• Problem:
• Serializes startup
• Startup slows as cluster scale and plan complexity grow
• Impala 2.5:
• Coordinator starts fragments in any order
• Added wait logic for senders and receivers
Query start-up improvements
Scheduling Small Queries
Query scheduler assigns scan ranges to workers (running impalad).
First it selects an HDFS datanode to read from.
(Diagram: a block’s three replicas on datanodes A, B, and C.)
Selection will always start with the same
replica to make optimal use of OS buffer
caches.
This can lead to hot-spots for some
workloads.
Improvement: Pick impalad at random.
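The scheduling change can be sketched as follows. Names are hypothetical, and Impala's scheduler actually works on scan ranges and datanode/impalad pairs rather than plain strings:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Old behaviour: deterministically take the same replica every time.
// Great for OS buffer cache reuse, but every query touching a hot block
// lands on the same node.
std::string PickFirstReplica(const std::vector<std::string>& replicas) {
  return replicas.front();
}

// With random_replica=1: pick uniformly at random, spreading CPU load
// across replicas at the cost of warming more OS buffer caches.
std::string PickRandomReplica(const std::vector<std::string>& replicas) {
  return replicas[std::rand() % replicas.size()];
}
```

Over many small queries, the random pick distributes scans roughly evenly across the three replicas instead of hammering the first one.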
New Query Option: random_replica
Disabled by default.
set random_replica = 1;
Also has a corresponding query hint:
SELECT AVG(c1) FROM t /* +SCHEDULE_RANDOM_REPLICA */;
Where It Can Help
• Large number of small queries, each with few input tables.
• High load on only one of multiple replicas of a table.
• Queries are CPU bound.
• Benefit: Distribute load more evenly over replicas.
• Tradeoff: Distribution of local reads will increase buffer cache usage.
What’s Next
• Add possibility to prefer remote reads.
• Switch remote impalad selection from round-robin to load-based.
• Add rack-awareness.
Catalog Improvements
• Incrementally update table metadata instead of force-reloading all table metadata
during DDL/DML operations
• Reload metadata of only ‘dirty’ partitions
• Reuse descriptors of HDFS files to avoid loading file/block metadata for files that
haven’t been modified
• Significantly reduces the latency of DDL/DML operations that change a small
fraction of table metadata (e.g. alter table foo partition (year = 2010) set
location 'blah')
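The dirty-partition idea can be sketched with a hypothetical catalog structure. The names and the version counter below are illustrative only, not Impala's actual catalog classes:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <unordered_map>

// Sketch of incremental catalog updates: a DDL/DML operation marks the
// partitions it touched as dirty, and a refresh reloads only those
// instead of force-reloading the whole table's metadata.
struct TableMetadata {
  std::unordered_map<std::string, int64_t> partition_versions;
  std::set<std::string> dirty_partitions;

  void MarkDirty(const std::string& partition) {
    dirty_partitions.insert(partition);
  }

  // Returns how many partitions were actually reloaded.
  int ReloadDirty() {
    int reloaded = 0;
    for (std::set<std::string>::iterator it = dirty_partitions.begin();
         it != dirty_partitions.end(); ++it) {
      ++partition_versions[*it];  // stand-in for re-reading file metadata
      ++reloaded;
    }
    dirty_partitions.clear();
    return reloaded;
  }
};
```

An ALTER on one partition of a table with thousands of partitions then costs one partition reload, not thousands.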
Catalog Improvements - Results

Apache Impala (incubating) 2.5 Performance Update

  • 1.
    1© Cloudera, Inc.All rights reserved. Apache Impala 2.5 (Incubating) Performance improvements overview
  • 2.
    2© Cloudera, Inc.All rights reserved. Agenda • What is Impala? • Impala at Apache • What is new in Impala 2.5 (CDH 5.7) • Impala performance update • Roadmap • Q&A
  • 3.
    3© Cloudera, Inc.All rights reserved. SQL-on-Hadoop engines SQL Impala SQL-on-Apache Hadoop – Choosing the right tool for the right job
  • 4.
    4© Cloudera, Inc.All rights reserved. • General-purpose SQL engine • Real-time queries in Apache Hadoop • General availability (v1.0) release out since April 2013 • Analytic SQL functionality (v2.0) since October 2014 • Apache incubator project since December 2015 • Previous release 2.3 (CDH 5.5) released November 2015 • Current release 2.5 (CDH 5.7) April 2016 What is Impala? Today’s topic
  • 5.
    5© Cloudera, Inc.All rights reserved. • Query speed over Hadoop that meets or exceeds that of a proprietary analytic DBMS • General-purpose SQL query engine: • Targeted for analytical workloads • Supports queries that take from milliseconds to hours • Runs directly within Hadoop: • reads widely used Hadoop file formats • talks to widely used Hadoop storage managers • runs on same nodes that run Hadoop processes • Highly available • High performance: • C++ instead of Java • Run time code generation Impala overview
  • 6.
    6© Cloudera, Inc.All rights reserved. Impala Use Cases •Interactive BI/analytics on more data •Asking new questions – exploration, ML (Ibis) •Data processing with tight SLAs •Query-able archive w/full fidelity
  • 7.
    7© Cloudera, Inc.All rights reserved. • Incubator project since December 2015 • Development process slowly moving to ASF infrastructure (see IMPALA-3221) • Help wanted! Where to find the Impala community: dev@impala.incubator.apache.org user@impala.incubator.apache.org http://impala.io @apacheimpala Impala at Apache
  • 8.
    8© Cloudera, Inc.All rights reserved. New in Impala 2.5 Usability Enhancements • Admission Control Improvements • Null-safe join/equals Performance and Scalability • Runtime filters • Improved Cardinality Estimation and Join Ordering • Query start-up improvements • Additional codegen and code optimizations • Decimal arithmetic improvements • Fast min/max values on partition columns(with query option) Integrations •Support for EMC DSSD
  • 9.
    9© Cloudera, Inc.All rights reserved. New in Impala 2.5 Performance and Scalability • Runtime filters • Improved Cardinality Estimation and Join Ordering • Query start-up improvements • Additional codegen and code optimizations • Decimal arithmetic improvements • Incremental metadata updates (DDL) • Fast min/max values on partition columns(with query option) Covered today
  • 10.
    10© Cloudera, Inc.All rights reserved. Impala 2.5 (CDH 5.7) improvements vs Impala 2.3 (CDH 5.5) • 2.2x speedup for TPC-H • 1.7x speedup for TPC-H (Nested) • 4.3X speedup for TPC-DS
  • 11.
    11© Cloudera, Inc.All rights reserved. Runtime filtering • General idea: some predicates can only be computed at runtime • Example: SELECT count(*) FROM date_dim dt ,store_sales WHERE dt.d_date_sk = store_sales.ss_sold_date_sk AND dt.d_moy = 12; • How does Impala execute this query?
  • 12.
    12© Cloudera, Inc.All rights reserved. SELECT dt.d_year ,item.i_brand brand ,sum(ss_ext_sales_price) sum_agg FROM date_dim dt ,store_sales ,item WHERE dt.d_date_sk = store_sales.ss_sold_date_sk AND store_sales.ss_item_sk = item.i_item_sk AND i_category = "Books" AND i_class = "fiction" AND dt.d_moy = 12 GROUP BY dt.d_year ,item.i_brand ORDER BY dt.d_year ,sum_agg DESC ,i_brand limit 100 Runtime filters store_sales 43 billion rows item 198 rows Broadcast Join #1 290 million rows date_dim 6,200 rows Broadcast Join #2 Aggregate 47 million rows
  • 13.
13© Cloudera, Inc. All rights reserved.
Runtime filters: the opportunity (same query and plan as slide 12)
• The planner doesn’t know what the set of qualifying ss_sold_date_sk and ss_item_sk values contains, even with statistics.
• Opportunity to save work: why bother sending all 43 billion rows to the joins?
• Runtime filters compute these predicates at runtime.
14© Cloudera, Inc. All rights reserved.
Runtime filters (same query and plan as slide 12)
Step 1: the planner tells Join #1 to produce a bloom filter of qualifying i_item_sk values and Join #2 to produce a bloom filter of qualifying d_date_sk values.
15© Cloudera, Inc. All rights reserved.
Runtime filters (same query and plan as slide 12)
Step 2: each join reads all rows from its build side (right input) and computes a filter containing all distinct values of i_item_sk and d_date_sk, respectively.
16© Cloudera, Inc. All rights reserved.
Runtime filters (same query and plan as slide 12)
Step 3: Joins #1 and #2 send their filters to the store_sales scan. The scan eliminates rows that don’t have a match in the bloom filters.
17© Cloudera, Inc. All rights reserved.
Runtime filters (same query as slide 12; scan output now 47 million rows)
The store_sales scan uses the bloom filter from Join #2 to filter out partitions (ss_sold_date_sk) and the bloom filter from Join #1 to filter out rows that don’t qualify (ss_item_sk).
18© Cloudera, Inc. All rights reserved.
Runtime filters (same query and plan as slide 12)
• 914x reduction in rows coming out of the scan: 43 billion -> 47 million
• 6x reduction in rows coming out of the join: 290 million -> 47 million
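The filter data structure in these steps is a bloom filter: a compact bitmap that answers "definitely absent" or "possibly present". Below is a minimal sketch of that idea only; Impala's actual implementation differs (it uses a cache-friendly split-block design and different hash functions), and the class and method names here are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Minimal bloom filter: k hash probes into a bit vector.
// False positives are possible, false negatives are not - exactly the
// property a runtime filter needs: it may pass a few extra rows to the
// join, but it never drops a row that would have matched.
class BloomFilter {
 public:
  explicit BloomFilter(size_t num_bits, int num_hashes = 3)
      : bits_(num_bits, false), num_hashes_(num_hashes) {}

  void Insert(int64_t key) {
    for (int i = 0; i < num_hashes_; ++i) bits_[Probe(key, i)] = true;
  }

  // "Maybe" semantics: true can be a false positive, false is definite.
  bool MayContain(int64_t key) const {
    for (int i = 0; i < num_hashes_; ++i) {
      if (!bits_[Probe(key, i)]) return false;
    }
    return true;
  }

 private:
  size_t Probe(int64_t key, int i) const {
    // Derive k hash functions from one by salting with the probe index.
    uint64_t salted = static_cast<uint64_t>(key) ^ (0x9e3779b97f4a7c15ULL * (i + 1));
    return std::hash<uint64_t>{}(salted) % bits_.size();
  }
  std::vector<bool> bits_;
  int num_hashes_;
};
```

The build side of each join inserts its distinct key values; the scan then calls MayContain() per row (or per partition key) and skips rows that definitely have no match.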
19© Cloudera, Inc. All rights reserved.
Runtime filter variation: global filters
SELECT c_email_address, sum(ss_ext_sales_price) sum_agg
FROM store_sales, customer, customer_demographics
WHERE ss_customer_sk = c_customer_sk
  AND cd_demo_sk = c_current_cdemo_sk
  AND cd_gender = 'M'
  AND cd_purchase_estimate = 10000
  AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC
Plan: customer (3.8 million rows) joins customer_demographics (2,400 rows, broadcast) in Join #2; store_sales (43 billion rows) shuffle-joins that result in Join #1 (43 billion rows); the Aggregate produces 49 million rows.
20© Cloudera, Inc. All rights reserved.
Runtime filter variation: global filters (same query and plan as slide 19)
Joins #1 and #2 are expensive because the left (probe) side of the join pipeline carries 43 billion rows.
21© Cloudera, Inc. All rights reserved.
Runtime filter variation: global filters (same query and plan as slide 19)
Create a bloom filter from Join #2 on cd_demo_sk and push it down to the customer table scan.
22© Cloudera, Inc. All rights reserved.
Runtime filter variation: global filters (same query and plan as slide 19)
Customer rows reduced by 826x: 3.8 million -> 4,600 rows.
23© Cloudera, Inc. All rights reserved.
Runtime filter variation: global filters (same query and plan as slide 19)
Create a bloom filter from Join #1 on c_customer_sk and push it down to the store_sales table scan.
24© Cloudera, Inc. All rights reserved.
Runtime filter variation: global filters (same query as slide 19)
set RUNTIME_FILTER_MODE=GLOBAL;
877x reduction in rows: 43 billion -> 49 million
Both Join #1 and the store_sales scan now process 49 million rows.
25© Cloudera, Inc. All rights reserved.
Runtime filters: real-world results
• Runtime filters can be highly effective: some benchmark queries are more than 30x faster in Impala 2.5.0.
• As always, results depend on your queries, your schemas, and your cluster environment.
• By default, runtime filters are enabled in a limited ‘local’ mode in Impala 2.5.0. They can be enabled fully by setting RUNTIME_FILTER_MODE=GLOBAL.
• Other runtime filter parameters (with defaults):
  • RUNTIME_BLOOM_FILTER_SIZE: 1048576
  • RUNTIME_FILTER_WAIT_TIME_MS: 0
26© Cloudera, Inc. All rights reserved.
Improved Cardinality Estimates and Join Ordering
1. More robust scan cardinality estimation
• Mitigate correlated predicates (exponential backoff), e.g.:
  SELECT * FROM cars WHERE cars.make = 'Toyota' AND cars.model = 'Camry'
2. Improved join cardinality estimation
• Special treatment of the common case of PK/FK joins
• Detect selective joins by applying the selectivity of build-side predicates to the estimated join cardinality
• TPC-H Q8 impact: >8x speedup (91s in Impala 2.3 -> 11s in Impala 2.5)
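The cars example shows why naively multiplying per-predicate selectivities goes wrong: make = 'Toyota' and model = 'Camry' are highly correlated, so the product badly underestimates the surviving rows. One common form of exponential backoff dampens each additional predicate by a successive square root; this sketch shows that heuristic in general form, not Impala's exact formula:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Naive independence assumption: multiply all predicate selectivities.
// For correlated predicates this compounds into a severe underestimate.
double IndependentSelectivity(std::vector<double> sels) {
  double result = 1.0;
  for (double s : sels) result *= s;
  return result;
}

// Exponential backoff: apply the most selective predicate fully, then
// dampen each additional one by a successive square root. A common
// optimizer heuristic; the exact formula Impala uses may differ.
double BackoffSelectivity(std::vector<double> sels) {
  std::sort(sels.begin(), sels.end());  // most selective first
  double result = 1.0;
  double exponent = 1.0;
  for (double s : sels) {
    result *= std::pow(s, exponent);
    exponent /= 2.0;
  }
  return result;
}
```

With two predicates of selectivity 0.1 each, independence predicts 1% of rows survive, while backoff predicts about 3.2% - a less aggressive, more robust estimate when the predicates overlap.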
27© Cloudera, Inc. All rights reserved.
Query start-up: performance impact (chart)
28© Cloudera, Inc. All rights reserved.
LLVM Codegen Support in Impala
Operations:
• Hash join
• Aggregation
• Scans: Text, Sequence, Avro
• Expressions in all operators
• Sort (new in Impala 2.5)
• Top-N (new in Impala 2.5)
Data Types:
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• BOOLEAN
• STRING, VARCHAR
• DECIMAL (extended in Impala 2.5)
29© Cloudera, Inc. All rights reserved.
Codegen for ORDER BY & Top-N
Interpreted path: every GetValue() call switches on the column type:

void* ExprContext::GetValue(Expr* e, TupleRow* row) {
  switch (e->type_.type) {
    case TYPE_BOOLEAN: { .. }
    case TYPE_TINYINT: { .. }
    case TYPE_INT: { .. }
    ..

int Compare(TupleRow* lhs, TupleRow* rhs) const {
  for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
    void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
    void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
    if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
    if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
    int result = RawValue::Compare(lhs_value, rhs_value,
        sort_cols_lhs_[i]->root()->type());
    if (!is_asc_[i]) result = -result;
    if (result != 0) return result;  // otherwise, try the next Expr
  }
  return 0;  // fully equivalent key
}
30© Cloudera, Inc. All rights reserved.
Codegen for ORDER BY & Top-N
Codegen’d code (for a single BIGINT sort column):

int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
  int64_t lhs_value = sort_columns[0]->GetBigIntVal(lhs);
  int64_t rhs_value = sort_columns[0]->GetBigIntVal(rhs);
  int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
  if (result != 0) return result;  // otherwise, try the next Expr
  return 0;  // fully equivalent key
}

Compared with the interpreted Compare() on the previous slide, the generated code:
• Fully unrolls the “for each sort column” loop
• Eliminates switching on input type(s)
• Removes branching on ASCENDING/DESCENDING and NULLS FIRST/LAST
31© Cloudera, Inc. All rights reserved.
Codegen for ORDER BY & Top-N (same comparison as slide 30)
Result: 10x more efficient code
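The effect of this specialization can be illustrated in plain C++ with templates standing in for LLVM codegen: a generic comparator that branches on a runtime type tag for every row pair, versus one instantiated for a known column type. This is a conceptual analogue only; Impala generates the specialized function with LLVM at query compile time, and all names here are illustrative:

```cpp
#include <cassert>
#include <cstdint>

enum class ColType { kBigInt, kDouble };

// Generic comparator: branches on the runtime type tag on every call,
// like the interpreted Compare() shown on slide 29.
int CompareGeneric(ColType type, const void* lhs, const void* rhs) {
  switch (type) {
    case ColType::kBigInt: {
      int64_t l = *static_cast<const int64_t*>(lhs);
      int64_t r = *static_cast<const int64_t*>(rhs);
      return l > r ? 1 : (l < r ? -1 : 0);
    }
    case ColType::kDouble: {
      double l = *static_cast<const double*>(lhs);
      double r = *static_cast<const double*>(rhs);
      return l > r ? 1 : (l < r ? -1 : 0);
    }
  }
  return 0;
}

// Specialized comparator: the type is fixed at compile time, so the
// switch and the pointer indirection disappear entirely - the effect
// that runtime code generation achieves per query.
template <typename T>
int CompareSpecialized(T lhs, T rhs) {
  return lhs > rhs ? 1 : (lhs < rhs ? -1 : 0);
}
```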
32© Cloudera, Inc. All rights reserved.
Decimal arithmetic and aggregation: FLOAT/DOUBLE vs DECIMAL
Pros for FLOAT/DOUBLE:
• Uses less memory
• Faster, because floating-point math is natively supported by processors (DECIMAL uses fixed-point integer types: int64 and __int128)
• Can represent a larger range of numbers
Cons for FLOAT/DOUBLE:
• Precision errors compound during aggregations
• Can’t do math with a wide number of significant digits (123456789.1 * .0000987654321)
• A no-go for applications requiring high precision and accuracy
So what about DECIMAL’s performance penalty?
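The compounding-error hazard is easy to demonstrate: summing many doubles accumulates rounding error, while a fixed-point representation (an integer count of the smallest unit, which is how narrow DECIMAL values are typically stored in an int64) stays exact. A sketch of the contrast, not Impala's implementation:

```cpp
#include <cassert>
#include <cstdint>

// Sum n copies of 0.1 as binary doubles. 0.1 has no exact binary
// representation, so each addition rounds and the error compounds.
double SumAsDouble(int n) {
  double total = 0.0;
  for (int i = 0; i < n; ++i) total += 0.1;
  return total;
}

// Fixed point: represent 0.1 as 1 tenth (scale = 1) and sum integers.
// Exact as long as the total fits in int64 - the DECIMAL approach.
int64_t SumAsFixedPointTenths(int n) {
  int64_t total = 0;
  for (int i = 0; i < n; ++i) total += 1;  // 0.1 == 1 tenth
  return total;
}
```

Summing 0.1 a thousand times as doubles does not yield exactly 100.0, while the fixed-point sum is exactly 1000 tenths - which is why financial aggregations use DECIMAL despite the arithmetic being slower.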
33© Cloudera, Inc. All rights reserved.
Decimal arithmetic and aggregation
SELECT l_returnflag,
       l_linestatus,
       sum(l_quantity) AS sum_qty,
       sum(l_extendedprice) AS sum_base_price,
       sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price
FROM lineitem
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus
• Simplified the overflow check for decimal
• Extended the codegen framework to support aggregations involving decimal
• Bridged the performance gap between double and decimal
Result: 3x speedup
34© Cloudera, Inc. All rights reserved.
Distributed Aggregations in Impala
select cust_id, sum(dollars) from sales group by cust_id;
Plan: Scan -> Pre-aggregation -> network exchange -> Merge aggregation
• Impala aggregations have two phases:
  • Pre-aggregation phase
  • Merge phase
• The pre-aggregation phase greatly reduces network traffic when there are many input rows per grouping value (e.g. many sales per customer).
35© Cloudera, Inc. All rights reserved.
Downsides of Pre-aggregations
select distinct * from sales;
• Pre-aggregations consume memory and CPU cycles.
• Pre-aggregations are not always effective at reducing network traffic (e.g. select distinct over nearly-distinct rows).
• Pre-aggregations can spill to disk under memory pressure; disk I/O is costly, so it is better to send rows to the merge aggregation than to disk.
36© Cloudera, Inc. All rights reserved.
Streaming Pre-aggregations in Impala 2.5
select distinct * from sales;
• The reduction factor is dynamically estimated from the actual data processed.
• The pre-aggregation expands memory usage only if the reduction factor is good.
• Benefits:
  • Certain aggregations with a low reduction factor see speedups of up to 40%
  • Memory consumption can be reduced by 50% or more
  • Streaming pre-aggregations don’t spill to disk
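The policy above can be sketched as: watch the observed reduction factor (input rows per distinct group) while the pre-aggregation hash table fills; if grouping is not reducing the data, stop growing the table and stream unmatched rows straight through to the merge phase. This is a simplified model with made-up thresholds, not Impala's executor code:

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Streaming pre-aggregation sketch: count rows per key locally, but once
// the observed reduction factor (rows seen / distinct keys) drops below a
// threshold, stop admitting new keys and pass unmatched rows onward.
struct PreaggResult {
  std::unordered_map<int, long> partials;  // locally aggregated counts
  long passed_through = 0;                 // rows streamed on unaggregated
};

PreaggResult StreamingPreagg(const std::vector<int>& keys,
                             double min_reduction = 2.0) {
  PreaggResult r;
  long rows_seen = 0;
  bool expanding = true;  // still willing to grow the hash table
  for (int k : keys) {
    ++rows_seen;
    auto it = r.partials.find(k);
    if (it != r.partials.end()) {
      ++it->second;  // existing group: aggregate locally
    } else if (expanding) {
      r.partials[k] = 1;
      // Re-estimate the reduction factor as the table grows.
      double reduction = static_cast<double>(rows_seen) / r.partials.size();
      if (r.partials.size() >= 16 && reduction < min_reduction) expanding = false;
    } else {
      ++r.passed_through;  // low reduction: don't spend memory, stream the row
    }
  }
  return r;
}
```

With nearly-distinct keys the table stops growing and most rows stream through (bounded memory, no spill); with heavily repeated keys everything aggregates locally and network traffic collapses, matching the two cases on slides 34-36.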
37© Cloudera, Inc. All rights reserved.
Streaming Pre-aggregations in Impala 2.5
Baseline (finished in 23.13 seconds):
Operator      #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
06:AGGREGATE  1       366.581ms  366.581ms  1        1           72.00 KB   -1.00 B        FINALIZE
05:EXCHANGE   1       149.923us  149.923us  15       1           0          -1.00 B        UNPARTITIONED
02:AGGREGATE  15      243.604ms  248.701ms  15       1           12.00 KB   10.00 MB
04:AGGREGATE  15      8s887ms    9s585ms    450.00M  437.91M     1.53 GB    245.01 MB      FINALIZE
03:EXCHANGE   15      827.770ms  932.785ms  450.00M  437.91M     0          0              HASH(o_orderkey)
01:AGGREGATE  15      9s995ms    11s484ms   450.00M  437.91M     1.64 GB    3.59 GB
00:SCAN HDFS  15      142.192ms  189.179ms  450.00M  450.00M     150.94 MB  88.00 MB       tpch_300_parquet.orders
With streaming pre-aggregation enabled (finished in 14.9 seconds):
Operator      #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
06:AGGREGATE  1       356.667ms  356.667ms  1        1           72.00 KB   -1.00 B        FINALIZE
05:EXCHANGE   1       110.924us  110.924us  15       1           0          -1.00 B        UNPARTITIONED
02:AGGREGATE  15      246.188ms  250.408ms  15       1           12.00 KB   10.00 MB
04:AGGREGATE  15      11s174ms   11s753ms   450.00M  437.91M     1.51 GB    245.01 MB      FINALIZE
03:EXCHANGE   15      750.620ms  805.099ms  450.00M  437.91M     0          0              HASH(o_orderkey)
01:AGGREGATE  15      5s670ms    6s715ms    450.00M  437.91M     153.40 MB  3.59 GB        STREAMING
00:SCAN HDFS  15      151.746ms  201.804ms  450.00M  450.00M     150.95 MB  88.00 MB       tpch_300_parquet.orders
Note the 01:AGGREGATE (STREAMING) operator: peak memory drops from 1.64 GB to 153.40 MB.
38© Cloudera, Inc. All rights reserved.
Optimization for partition key scans
• Use metadata to avoid table accesses for partition-key-only scans:
  select min(month), max(year) from functional.alltypes;
  (month and year are partition keys of the table)
• Enabled by the query option OPTIMIZE_PARTITION_KEY_SCANS
• Applicable to:
  • min(), max(), ndv(), and aggregate functions with the distinct keyword
  • partition keys only
Plan without optimization:
03:AGGREGATE [FINALIZE]
|  output: min:merge(month), max:merge(year)
|
02:EXCHANGE [UNPARTITIONED]
|
01:AGGREGATE
|  output: min(month), max(year)
|
00:SCAN HDFS [functional.alltypes]
   partitions=24/24 files=24 size=478.45KB
Plan with optimization:
01:AGGREGATE [FINALIZE]
|  output: min(month), max(year)
|
00:UNION
   constant-operands=24
39© Cloudera, Inc. All rights reserved.
Competitive benchmark: TPC-DS
Hardware: 21-node cluster, each node with
• 384GB memory, 2 sockets, 12 total cores, Intel Xeon CPU E5-2630L 0 at 2.00GHz
• 12 disk drives at 932GB each (one for the OS, the rest for HDFS)
Comparative set:
• Impala 2.5
  • RUNTIME_FILTER_MODE = 2;
• Spark SQL 1.6
  • Thrift JDBC server used to avoid startup cost
  • --master yarn --deploy-mode client --driver-memory 24G --driver-cores 8 --executor-memory 24G --num-executors 240
Workload:
• TPC-DS 15TB stored in Parquet file format (default 256MB block size)
• Unmodified TPC-DS queries: 3, 7, 8, 19, 25, 27, 34, 42, 43, 46, 47, 52, 53, 55, 59, 61, 63, 68, 73, 79, 88, 89, 96, 98
• Caveat: Spark SQL failed to run
  • Q25: bad plan
  • Q47: StackOverflowError
  • Q89: StackOverflowError
40© Cloudera, Inc. All rights reserved.
Competitive benchmark: query complexity varied
Q3 (simple):
SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand,
       sum(ss_ext_sales_price) sum_agg
FROM date_dim dt, store_sales, item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND item.i_manufact_id = 436
  AND dt.d_moy = 12
GROUP BY dt.d_year, item.i_brand, item.i_brand_id
ORDER BY dt.d_year, sum_agg DESC, brand_id
LIMIT 100;
Q25 (fact-to-fact joins):
SELECT i_item_id, i_item_desc, s_store_id, s_store_name,
       stddev_samp(ss_net_profit), stddev_samp(sr_net_loss),
       stddev_samp(cs_net_profit) AS catalog_sales_profit
FROM store_sales, store_returns, catalog_sales,
     date_dim d1, date_dim d2, date_dim d3, store, item
WHERE d1.d_moy = 4 AND d1.d_year = 2001
  AND d1.d_date_sk = ss_sold_date_sk
  AND i_item_sk = ss_item_sk
  AND s_store_sk = ss_store_sk
  AND ss_customer_sk = sr_customer_sk
  AND ss_item_sk = sr_item_sk
  AND ss_ticket_number = sr_ticket_number
  AND sr_returned_date_sk = d2.d_date_sk
  AND d2.d_moy BETWEEN 4 AND 10 AND d2.d_year = 2001
  AND sr_customer_sk = cs_bill_customer_sk
  AND sr_item_sk = cs_item_sk
  AND cs_sold_date_sk = d3.d_date_sk
  AND d3.d_moy BETWEEN 4 AND 10 AND d3.d_year = 2001
GROUP BY i_item_id, i_item_desc, s_store_id, s_store_name
ORDER BY i_item_id, i_item_desc, s_store_id, s_store_name
LIMIT 100;
41© Cloudera, Inc. All rights reserved.
Competitive benchmark (results chart)
42© Cloudera, Inc. All rights reserved.
Competitive benchmark
Impala 2.5 is 11x faster than Spark SQL 1.6 (based on geomean)
43© Cloudera, Inc. All rights reserved.
Performance Benchmark Takeaways
• Impala unlocks BI usage directly on Hadoop
  • Meets BI low-latency and multi-user requirements
  • Impala’s advantage expands at 10 concurrent users compared with single-user runs
• Spark SQL enables easier Spark application development
  • Enables mixed procedural Spark (Java/Scala) and SQL job development
• Mid-term trends will further favor Impala’s design approach for latency and concurrency
  • More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
  • CPU efficiency will increase in importance
  • Native code enables easy optimizations for CPU instruction sets
44© Cloudera, Inc. All rights reserved.
Impala and Cloud
• Available today in Impala 2.5:
  • All the same Impala functionality, performance, and third-party integrations
  • Supported across our cloud partners
  • Deployment via Director
  • Modular architecture enables the cloud’s decoupled-storage and elasticity future
• Available soon in Impala 2.6:
  • Impala read/write to S3 in addition to local HDFS (IMPALA-1878)
  • Dynamically sized runtime filters
  • Parquet scanner optimization
  • Faster joins, aggregations, sorts, and decimal arithmetic
  • Rack-aware scheduling
  • Faster code generation
45© Cloudera, Inc. All rights reserved.
Impala Roadmap
2H 2015:
• SQL Support & Usability: nested structures; Kudu updates (beta)
• Management & Security: record reader service (beta); finer-grained security (Sentry)
• Integration: Isilon support; Python interface (Ibis)
• Performance & Scale: improved predictability under concurrency
1H 2016:
• Performance & Scale: continued scalability and concurrency; initial perf/scale improvements
• Management & Security: improved admission control; resource utilization and showback
• SQL Support & Usability: dynamic partitioning
2016:
• Performance & Scale: >20x performance; multi-threaded joins/aggregations; continued scale work
• Cloud: S3 read/write support
• Management & Security: improved YARN integration; automated metadata
• SQL Support & Usability: data type improvements; added SQL extensions
46© Cloudera, Inc. All rights reserved.
Appendix
48© Cloudera, Inc. All rights reserved.
Query start-up improvements
• Before Impala 2.5:
  • The coordinator started receiving fragments before senders
  • Problem: this serialized start-up; greater scale and plan complexity meant slower start-up
• Impala 2.5:
  • The coordinator starts fragments in any order
  • Added wait logic for senders and receivers
49© Cloudera, Inc. All rights reserved.
Scheduling Small Queries
• The query scheduler assigns scan ranges to workers (running impalad); first it selects an HDFS datanode to read from.
• Selection always starts with the same replica, to make optimal use of OS buffer caches. This can lead to hot-spots for some workloads.
• Improvement: pick an impalad at random.
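The two policies can be contrasted in a few lines: a deterministic pick always returns the first replica (maximizing buffer cache reuse but risking a hot node), while a random pick spreads many small queries across all replicas. Illustrative only; the names and structure below are not Impala's scheduler code:

```cpp
#include <cassert>
#include <random>
#include <string>
#include <vector>

// Replica selection sketch. With random_replica off, the first replica
// always wins (good cache locality, possible hot-spot). With it on,
// each query draws a replica uniformly at random.
const std::string& PickReplica(const std::vector<std::string>& replicas,
                               bool random_replica, std::mt19937& rng) {
  if (!random_replica) return replicas.front();
  std::uniform_int_distribution<size_t> dist(0, replicas.size() - 1);
  return replicas[dist(rng)];
}
```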
50© Cloudera, Inc. All rights reserved.
New Query Option: random_replica
Disabled by default; enable with:
set random_replica = 1;
It also has a corresponding query hint:
SELECT AVG(c1) FROM t /* +SCHEDULE_RANDOM_REPLICA */;
51© Cloudera, Inc. All rights reserved.
Where It Can Help
• Large number of small queries, each with few input tables
• High load on only one of multiple replicas of a table
• Queries are CPU bound
• Benefit: distributes load more evenly over replicas
• Tradeoff: distributing local reads will increase buffer cache usage
What’s Next
• Add the option to prefer remote reads
• Switch remote impalad selection from round-robin to load-based
• Add rack-awareness
52© Cloudera, Inc. All rights reserved.
Catalog Improvements
• Incrementally update table metadata instead of force-reloading all table metadata during DDL/DML operations
• Reload metadata of only ‘dirty’ partitions
• Reuse descriptors of HDFS files to avoid loading file/block metadata for files that haven’t been modified
• Significantly reduces the latency of DDL/DML operations that change a small fraction of table metadata (e.g. alter table foo partition (year = 2010) set location ‘blah’)
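The dirty-partition idea can be sketched as: a DDL that touches one partition marks only that partition dirty, and a subsequent refresh reloads just the dirty set rather than the whole table. This is an illustrative model with made-up names, not catalogd code:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Incremental metadata reload sketch: track which partitions a DDL
// dirtied, and reload only those instead of force-reloading everything.
struct TableMeta {
  std::map<std::string, std::string> partition_location;
  std::set<std::string> dirty;
};

void AlterPartitionLocation(TableMeta& t, const std::string& part,
                            const std::string& new_loc) {
  t.partition_location[part] = new_loc;
  t.dirty.insert(part);  // only this partition needs a reload
}

// Returns how many partitions were actually reloaded.
int RefreshDirtyPartitions(TableMeta& t) {
  int reloaded = static_cast<int>(t.dirty.size());
  // A real implementation would re-read file/block metadata here.
  t.dirty.clear();
  return reloaded;
}
```

For a table with thousands of partitions, an ALTER on one partition thus costs one partition reload instead of thousands, which is where the DDL/DML latency reduction comes from.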
53© Cloudera, Inc. All rights reserved.
Catalog Improvements: results (chart)