Impala 2.5 Performance Update
Tuesday March 1, 2016 meetup
http://www.meetup.com/Bay-Area-Impala-Users-Group/events/227944275/
© Cloudera, Inc. All rights reserved.
Apache Impala 2.5 (Incubating)
Performance improvements overview
Agenda
1. Brief overview of Impala quality and reliability improvements
2. Impala 2.5 vs. 2.3 performance improvements
3. Performance deep dive
   1. Improved Cardinality Estimates and Join Order
   2. Query startup improvements
   3. Runtime filters
   4. Additional codegen and code optimizations
   5. Decimal arithmetic improvements
   6. Pass through aggregation
   7. Fast min/max values on partition columns
   8. Scheduling improvements
   9. Admission control
Improved Impala Stability and Reliability
Better quality, enable optimization
● Query Generator
● Stress
● Scale testing
● Fault injection testing
● Longevity testing
● Performance regression
● … More on the way
Query Generator and Stress
● Find correctness and crash bugs
○ Not found by other tests
● On GitHub
○ https://github.com/cloudera/Impala/tree/cdh5-trunk/tests/comparison
○ https://github.com/cloudera/Impala/blob/cdh5-trunk/tests/stress/concurrent_select.py
● Fixing them
Query Generator
● Built in-house, open-source Python (Taras Bobrovytsky)
● Random data generator as well
● Adapts to any schema
● Generates random queries based on a query model
○ Tunable complexity
○ Extensible
● Uses PostgreSQL for expected values
○ Query translation for syntax
○ Table flattening for nested types
● Fast, small: uses a Dockerized cluster
● Adapted for use with Hive
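Checking a generated query against a reference database comes down to comparing two result sets. The sketch below is one minimal way to do that, assuming every column value has already been rendered as text; `Row` and `SameResults` are hypothetical names, not part of the actual query generator.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One result row, with every column value rendered as text.
typedef std::vector<std::string> Row;

// Returns true when both result sets hold the same rows with the same
// multiplicities, ignoring row order (SQL gives no order without ORDER BY).
bool SameResults(std::vector<Row> expected, std::vector<Row> actual) {
  std::sort(expected.begin(), expected.end());
  std::sort(actual.begin(), actual.end());
  return expected == actual;
}
```

Sorting both sides first makes the comparison insensitive to row order while still catching missing or duplicated rows.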
Apache Impala: Open Source & Open Standard
1. More than 1 million downloads since GA
2. Majority adoption across Cloudera customers
3. Certification across key application partners:
4. De facto standard with multi-vendor support:
and others
SQL-on-Hadoop engines
SQL
Impala
Impala 2.3 vs. Spark SQL 1.5 & Hive on Tez (Upstream)
Full details: http://tinyurl.com/gotkdlq
New in Impala 2.5
Performance and Scalability
• Better join ordering and cardinality estimation
• Query start-up improvements
• Runtime filters
• Additional codegen and code optimizations
• Decimal arithmetic improvements
• Fast min/max values on partition columns (with query option)
• Incremental metadata updates (DDL)
Integrations
• Support for EMC DSSD
Usability Enhancements
• Admission Control Improvements
• Null-safe join/equals
New in Impala 2.5: covered today
Performance and Scalability
• Runtime filters
• Improved Cardinality Estimates and Join Order
• Additional codegen and code optimizations
• Decimal arithmetic improvements
• Faster Query Startup
• Fast min/max values on partition columns (with query option)
Usability Enhancements
• Admission Control Improvements
How does Impala 2.5 fare against 2.3?
• 363% speedup for TPC-DS
• 92% speedup for TPC-H
• 71% speedup for TPC-H (Nested)
Improved Cardinality Estimates and Join Order
1. More robust scan cardinality estimation
• Mitigate correlated predicates (exponential backoff)
2. Improved join cardinality estimation
• Special treatment of common case of PK/FK joins
• Detect selective joins by applying the selectivity of build-side predicates to the estimated join cardinality
3. More robust join strategy selection (broadcast vs. shuffle)
• Account for data serialization overhead vs. raw data
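To illustrate the exponential backoff idea for correlated predicates, here is a small sketch. The slides do not give the exact formula, so the exponent schedule (1, 1/2, 1/4, ...) and the function names are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Combines per-predicate selectivities (each in [0, 1]) with exponential
// backoff: the most selective predicate counts fully, and every further
// predicate is damped by a shrinking exponent (1, 1/2, 1/4, ...).
// The exponent schedule is an assumption made for this sketch.
double CombinedSelectivity(std::vector<double> selectivities) {
  std::sort(selectivities.begin(), selectivities.end());
  double result = 1.0;
  double exponent = 1.0;
  for (size_t i = 0; i < selectivities.size(); ++i) {
    result *= std::pow(selectivities[i], exponent);
    exponent /= 2.0;
  }
  return result;
}

// Baseline that assumes fully independent predicates.
double IndependentSelectivity(const std::vector<double>& selectivities) {
  double result = 1.0;
  for (size_t i = 0; i < selectivities.size(); ++i) result *= selectivities[i];
  return result;
}
```

With two predicates of selectivity 0.1 each, the independent estimate is 0.01, while the backoff estimate is about 0.032. The higher estimate is less likely to undershoot badly when the predicates are in fact correlated.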
Improved Cardinality Estimates and Join Order
TPCH-Q8 on Impala 2.3
… #Rows #Est Rows
14:HASH JOIN | 728.85K | 1.94B |
|--23:EXCHANGE | 15 | 1 |
| 07:SCAN HDFS | 1 | 1 |
13:HASH JOIN | 3.65M | 1.94B |
|--22:EXCHANGE | 375 | 25 |
| 05:SCAN HDFS | 25 | 25 |
12:HASH JOIN | 3.65M | 1.94B |
|--21:EXCHANGE | 675.00M | 45.00M |
| 04:SCAN HDFS | 45.00M | 45.00M |
11:HASH JOIN | 3.65M | 1.81B |
|--20:EXCHANGE | 375 | 25 |
| 06:SCAN HDFS | 25 | 25 |
10:HASH JOIN | 3.65M | 1.81B |
|--19:EXCHANGE | 45.00M | 3.00M |
| 01:SCAN HDFS | 3.00M | 3.00M |
09:HASH JOIN | 3.65M | 1.80B |
|--18:EXCHANGE | 2.05B | 4.50M |
| 03:SCAN HDFS | 136.72M | 4.50M |
08:HASH JOIN | 12.02M | 1.80B |
|--17:EXCHANGE | 6.01M | 397.35K |
| 00:SCAN HDFS | 400.62K | 397.35K |
02:SCAN HDFS | 1.80B | 1.80B |
Run time: 91s
TPCH-Q8 on Impala 2.5
… #Rows #Est Rows
14:HASH JOIN | 728.85K | 238.41K |
|--26:EXCHANGE | 375 | 25 |
| 06:SCAN HDFS | 25 | 25 |
13:HASH JOIN | 728.85K | 238.41K |
|--25:EXCHANGE | 15 | 1 |
| 07:SCAN HDFS | 1 | 1 |
12:HASH JOIN | 3.65M | 1.19M |
|--24:EXCHANGE | 375 | 25 |
| 05:SCAN HDFS | 25 | 25 |
11:HASH JOIN | 3.65M | 1.19M |
|--23:EXCHANGE | 45.00M | 45.00M |
| 04:SCAN HDFS | 45.00M | 45.00M |
22:EXCHANGE | 3.65M | 1.19M |
10:HASH JOIN | 3.65M | 1.19M |
|--21:EXCHANGE | 3.00M | 3.00M |
| 01:SCAN HDFS | 3.00M | 3.00M |
20:EXCHANGE | 3.65M | 1.19M |
09:HASH JOIN | 3.65M | 1.19M |
|--19:EXCHANGE | 136.72M | 45.00M |
| 03:SCAN HDFS | 136.72M | 45.00M |
18:EXCHANGE | 12.02M | 11.92M |
08:HASH JOIN | 12.02M | 11.92M |
|--17:EXCHANGE | 6.01M | 397.35K |
| 00:SCAN HDFS | 400.62K | 397.35K |
02:SCAN HDFS | 12.40M | 1.80B |
Run time: 11s (>8x speedup)
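One way to quantify how much the estimates improved is the ratio between estimated and actual row counts. The helper below is a sketch; the name `EstimationError` is not from the slides.

```cpp
#include <algorithm>

// Ratio by which an estimate is off, always >= 1 (1 means a perfect estimate).
// Both inputs are assumed to be positive row counts.
double EstimationError(double actual_rows, double estimated_rows) {
  return std::max(actual_rows / estimated_rows, estimated_rows / actual_rows);
}
```

For the top join (14:HASH JOIN), Impala 2.3 estimated 1.94B rows against 728.85K actual rows, an error of roughly 2,660x. Impala 2.5 estimated 238.41K rows, an error of roughly 3x.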
Improved Cardinality Estimates and Join Order
• Why better cardinality estimation matters
• TPC-DS Q14
  • 225 joins
  • 285 scan nodes
Query start-up improvements (IMPALA-1599)
• At start-up, the coordinator has to tell every Impala daemon to run a set of fragments.
• Since data flows up from the leaves, receivers must be ready for the data produced by senders.
• Before Impala 2.5, we did the obvious thing: start the receivers first!
• But that serializes query start-up...
TPCDS-Q59: Each colour is a different plan fragment.
TPCDS-Q59: Wave 1
TPCDS-Q59: Wave 2
TPCDS-Q59: Wave 3
TPCDS-Q59: Wave 4
TPCDS-Q59: Wave 5. Only now can the tree with the light-green scan start to make progress!
Query start-up: what we did
• Instead of starting fragments wave by wave, start them all at once, in any order.
• This required changing a lot of the sender/receiver logic so that senders can wait for receivers to arrive. Lots of tricky error conditions to deal with!
• Also moved all heavy lifting out of the synchronous RPC to the fragment executor and into asynchronous fragment start-up.
• Reduced plan fragment size for partitioned tables
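A toy model shows why wave-by-wave start-up hurts deep plans. If receivers must start before their senders, the number of sequential waves equals the depth of the fragment tree. The `parent` array encoding and the function name are hypothetical, chosen only for this sketch.

```cpp
#include <algorithm>
#include <vector>

// parent[i] is the index of the fragment that consumes fragment i's output,
// or -1 for the root (coordinator) fragment. Returns the number of start-up
// waves needed when receivers must be started before their senders, which is
// the depth of the deepest fragment in the tree.
int StartupWaves(const std::vector<int>& parent) {
  int waves = 0;
  for (size_t i = 0; i < parent.size(); ++i) {
    int depth = 1;
    for (int p = parent[i]; p != -1; p = parent[p]) ++depth;
    waves = std::max(waves, depth);
  }
  return waves;
}
```

A chain of five fragments needs five sequential waves in this model, whereas starting all fragments at once takes a single round regardless of depth.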
Query start-up: performance impact
Query start-up: future improvements
• Work still to do:
• Batch the fragment start RPCs (rather than tens of RPCs per backend)
• Batching will amortize the cost of sending identical data structures.
• We should also trim these data structures down.
Runtime filtering
• General idea: some predicates can only be computed at runtime
• Example: SELECT count(*) FROM probe P JOIN build B on P.id = B.id;
• How does Impala execute this query?
SELECT count(*)
FROM probe P
JOIN build B
ON P.id = B.id
Note: 29200 rows scanned, but only 40 match.
Runtime filters: the opportunity
The planner doesn’t know what the set of B.id values contains, even with statistics.
But there’s clearly an opportunity to save some work: why bother sending 29160 of those rows to the hash join node?
Runtime filters compute this predicate at runtime.
Step 1: the planner tells 02:HASH JOIN to produce a filter containing all ID values from the build side.
Step 2: 02:HASH JOIN reads all 10 rows from the build side (right input) and computes a filter containing all distinct values of ID.
Step 3: 02:HASH JOIN sends the filter to 00:SCAN HDFS before the scan starts. The scan eliminates all rows that don’t match in the filter.
The result: only 40 rows are produced by the scan, reducing the amount of work the hash join has to do by > 99%!
Runtime filters: real-world results
Runtime filters can be highly effective. Some benchmark queries are more than 20 times faster in Impala 2.5.0.
As always, this depends on your queries, your schemas and your cluster environment.
By default, runtime filters are enabled in limited ‘local’ mode in Impala 2.5.0. They can be enabled fully by setting RUNTIME_FILTER_MODE=GLOBAL.
LLVM Codegen Support in Impala
Operations:
• Hash join
• Aggregation
• Scans: Text, Sequence, Avro
• Expressions in all operators
• Sort
• Top-N
Data Types:
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• BOOLEAN
• STRING, VARCHAR
• DECIMAL
New in Impala 2.5
Extended in Impala 2.5
Codegen for Order by & Top-N
l_orderkey  l_extendedprice  l_shipdate  l_shipmode
26039617    13515            12/8/1997   FOB
30525093    16218            12/16/1997  REG AIR
28809990    7208             10/19/1997  AIR
select
  l_extendedprice, l_orderkey,..
from
  lineitem
order by l_orderkey
limit 100
SQL language offers great flexibility
select
  l_extendedprice, l_orderkey,..
from
  lineitem
order by l_extendedprice
limit 100
Flexibility requires generic code which is often inefficient
Compare: sorting compares pairs of rows and swaps them as needed, e.g. after one swap:
l_orderkey  l_extendedprice  l_shipdate  l_shipmode
26039617    13515            12/8/1997   FOB
28809990    7208             10/19/1997  AIR
30525093    16218            12/16/1997  REG AIR
Codegen for Order by & Top-N
select
  l_extendedprice, l_orderkey
from
  lineitem
order by l_extendedprice
limit 100
int Compare(TupleRow* lhs, TupleRow* rhs) const {
  for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
    void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
    void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
    if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
    if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
    int result = RawValue::Compare(lhs_value, rhs_value,
                                   sort_cols_lhs_[i]->root()->type());
    if (!is_asc_[i]) result = -result;
    if (result != 0) return result;
    // Otherwise, try the next Expr
  }
  return 0; // fully equivalent key
}
Codegen for Order by & Top-N
Compare calls GetValue, which switches on the expression type for every value:
void* ExprContext::GetValue(Expr* e, TupleRow* row) {
  switch (e->type_.type) {
    case TYPE_BOOLEAN: {
      ..
      ..
    }
    case TYPE_TINYINT: {
      ..
      ..
    }
    case TYPE_INT: {
      ..
      .
Codegen for Order by & Top-N
Compare then calls RawValue::Compare, which switches on the column type again:
int RawValue::Compare(const void* v1, const void* v2,
                      const ColumnType& type) {
  switch (type.type) {
    case TYPE_INT:
      i1 = *reinterpret_cast<const int32_t*>(v1);
      i2 = *reinterpret_cast<const int32_t*>(v2);
      return i1 > i2 ? 1 : (i1 < i2 ? -1 : 0);
    case TYPE_BIGINT:
      b1 = *reinterpret_cast<const int64_t*>(v1);
      b2 = *reinterpret_cast<const int64_t*>(v2);
      return b1 > b2 ? 1 : (b1 < b2 ? -1 : 0);
    ..
Codegen for Order by & Top-N
Codegen code
int CompareCodgened(TupleRow* lhs, TupleRow* rhs) const {
  int64_t lhs_value = sort_columns[i]->GetBigIntVal(lhs); // i = 0
  int64_t rhs_value = sort_columns[i]->GetBigIntVal(rhs); // i = 0
  int result = lhs_value > rhs_value ? 1 :
      (lhs_value < rhs_value ? -1 : 0);
  if (result != 0) return result;
  // Otherwise, try the next Expr
  return 0; // fully equivalent key
}
• Perfectly unrolls the “for each sort column” loop
• No switching on input type(s)
• Removes branching on ASCENDING/DESCENDING, NULLS FIRST/LAST
Original code: the generic Compare(TupleRow* lhs, TupleRow* rhs) function shown earlier
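The effect of removing the type switch can be imitated in plain C++. Impala generates the specialized code at query run time with LLVM; in the sketch below a compile-time template stands in for that step. The `ColumnType` enum and both function names are simplified stand-ins, not Impala's own definitions.

```cpp
#include <cstdint>

enum ColumnType { TYPE_INT, TYPE_BIGINT };

// Interpreted path: the type is inspected on every call.
int GenericCompare(const void* v1, const void* v2, ColumnType type) {
  switch (type) {
    case TYPE_INT: {
      int32_t a = *static_cast<const int32_t*>(v1);
      int32_t b = *static_cast<const int32_t*>(v2);
      return a > b ? 1 : (a < b ? -1 : 0);
    }
    case TYPE_BIGINT: {
      int64_t a = *static_cast<const int64_t*>(v1);
      int64_t b = *static_cast<const int64_t*>(v2);
      return a > b ? 1 : (a < b ? -1 : 0);
    }
  }
  return 0;
}

// Specialized path: the type is a compile-time parameter, so the switch
// disappears and the comparison can be inlined.
template <typename T>
int SpecializedCompare(const T& a, const T& b) {
  return a > b ? 1 : (a < b ? -1 : 0);
}
```

Both paths return the same ordering. The specialized one simply does less work per row, which matters when the comparison runs once per pair of rows in a large sort.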
Codegen for Order by & Top-N
Codegen code vs. original code:
• 10x more efficient comparison code
• 77% speedup for the query as a whole; the overall gain is smaller than 10x due to Amdahl’s law
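The relationship between the 10x code speedup and the 77% query speedup can be worked through with Amdahl's law. The sketch assumes that "77% speedup" means the query runs 1.77 times faster; the slides do not spell this out.

```cpp
// Overall speedup when a fraction `p` of the run time is made `s` times faster.
double AmdahlSpeedup(double p, double s) {
  return 1.0 / ((1.0 - p) + p / s);
}

// Inverse: the fraction of run time that must have been accelerated by `s`
// to observe the given overall speedup.
double AcceleratedFraction(double overall_speedup, double s) {
  return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / s);
}
```

Under that assumption, `AcceleratedFraction(1.77, 10.0)` is about 0.48, which would mean roughly half of the original query time was spent in the comparison code.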
Decimal arithmetic and aggregation
Float/Double vs. Decimal?
Pros for Float/Double
• Uses less memory.
• Faster because floating-point math operations are natively supported by processors and floating point is better supported in codegen.
(Note: Decimal uses fixed-point hardware types - int64 and __int128)
• Can represent a larger range of numbers.
Cons for Float/Double
• Precision errors compound during aggregations
• Can’t do math with a wide number of significant digits (123456789.1 * .0000987654321)
No go for applications requiring high precision & accuracy
What about the performance penalty?
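The precision problem with floating-point aggregation shows up even in a tiny sum. The sketch below compares adding 0.1 repeatedly as a `double` with adding it as a scaled integer, which is the basic idea behind a fixed-point decimal. The function names are illustrative only.

```cpp
#include <cstdint>

// Adds 0.1 `n` times using binary floating point.
double SumTenthsAsDouble(int n) {
  double total = 0.0;
  for (int i = 0; i < n; ++i) total += 0.1;
  return total;
}

// Adds 0.1 `n` times as a scaled integer: the value is stored as a count of
// tenths (scale 1), which is how a fixed-point decimal avoids rounding.
int64_t SumTenthsAsFixedPoint(int n) {
  int64_t total_tenths = 0;
  for (int i = 0; i < n; ++i) total_tenths += 1;
  return total_tenths;
}
```

With IEEE 754 doubles, ten additions of 0.1 do not produce exactly 1.0, while the fixed-point sum is exactly 10 tenths. Over billions of rows such rounding errors accumulate.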
Decimal arithmetic and aggregation
SELECT l_returnflag,
       l_linestatus,
       Sum(l_quantity) AS SUM_QTY,
       Sum(l_extendedprice) AS SUM_BASE_PRICE,
       Sum(l_extendedprice * (1 - l_discount)) AS SUM_DISC_PRICE
FROM lineitem
GROUP BY l_returnflag,
         l_linestatus
ORDER BY l_returnflag,
         l_linestatus
3x speedup
● Simplified overflow check for decimal.
● Extended Codegen framework to support aggregations involving decimal.
● Bridged the performance gap between double and decimal
Distributed Aggregations in Impala
select cust_id, sum(dollars)
from sales group by cust_id;
[Diagram labels: Scan, Preagg, Network, Merge]
• Impala aggregations have two phases:
  • Pre-aggregation phase
  • Merge phase
• The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value.
  • E.g. many sales per customer.
Downsides of Pre-aggregations
select distinct * from sales;
[Diagram labels: Scan, Preagg, Network, Merge]
• Pre-aggregations consume:
  • Memory
  • CPU cycles
• Pre-aggregations are not always effective at reducing network traffic
  • E.g. select distinct for nearly-distinct rows
• Pre-aggregations can spill to disk under memory pressure
  • Disk I/O is bad: better to send rows to the merge aggregation than to disk
Streaming Pre-aggregations in Impala 2.5
select distinct * from sales;
[Diagram labels: Scan, Network, Merge]
• Reduction factor is dynamically estimated based on the actual data processed
• Pre-aggregation expands memory usage only if reduction factor is good
• Benefits:
  • Certain aggregations with low reduction factor see speedups of up to 25%
  • Memory consumption can be reduced by 50% or more
  • Streaming pre-aggregations don’t spill to disk
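The decision at the heart of a streaming pre-aggregation can be sketched as follows. The struct, the function names and the `min_reduction` threshold are hypothetical; the slides only say that the reduction factor is estimated from the data actually processed.

```cpp
#include <cstdint>

// Tracks how well a pre-aggregation is collapsing its input.
struct PreaggStats {
  int64_t rows_in;     // input rows consumed so far
  int64_t groups_out;  // distinct groups currently held
};

// Observed reduction factor: input rows per output group.
double ReductionFactor(const PreaggStats& stats) {
  if (stats.groups_out == 0) return 1.0;
  return static_cast<double>(stats.rows_in) / stats.groups_out;
}

// Grow the hash table only when aggregation is paying off; otherwise new
// rows are passed through to the merge phase. `min_reduction` is a
// hypothetical tuning threshold, not a documented Impala setting.
bool ShouldExpandHashTable(const PreaggStats& stats, double min_reduction) {
  return ReductionFactor(stats) >= min_reduction;
}
```

A "many sales per customer" workload has a high reduction factor and keeps aggregating locally. A nearly-distinct input stays close to 1, so rows stream straight through to the merge phase instead of consuming memory.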
Optimization for partition keys scan
• Use metadata to avoid table accesses for partition key scans:
  • select min(month), max(year) from functional.alltypes;
  • month, year are partition keys of the table
• Enabled by query option OPTIMIZE_PARTITION_KEY_SCANS
• Applicable to:
  • min(), max(), ndv() and aggregate functions with the distinct keyword
  • partition keys only
Plan without optimization
03:AGGREGATE [FINALIZE]
| output: min:merge(month), max:merge(year)
|
02:EXCHANGE [UNPARTITIONED]
|
01:AGGREGATE
| output: min(month), max(year)
|
00:SCAN HDFS [functional.alltypes]
   partitions=24/24 files=24 size=478.45KB
Plan with optimization
01:AGGREGATE [FINALIZE]
| output: min(month),max(year)
|
00:UNION
   constant-operands=24
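The idea behind the optimized plan is that partition key values are already known from metadata, so the aggregate can be computed over that short list instead of over the data files. The sketch below illustrates this; `PartitionKey` and `MinMonthMaxYear` are hypothetical names.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Key values of one partition, e.g. (year, month), as recorded in metadata.
struct PartitionKey {
  int year;
  int month;
};

// Answers "select min(month), max(year)" from partition metadata alone;
// no data file is opened. Assumes at least one partition exists.
std::pair<int, int> MinMonthMaxYear(const std::vector<PartitionKey>& parts) {
  int min_month = parts[0].month;
  int max_year = parts[0].year;
  for (size_t i = 1; i < parts.size(); ++i) {
    min_month = std::min(min_month, parts[i].month);
    max_year = std::max(max_year, parts[i].year);
  }
  return std::make_pair(min_month, max_year);
}
```

The work is proportional to the number of partitions (24 in the plan above) rather than the number of rows. One thing worth checking before relying on this: a metadata-only answer may include partitions that contain no rows, which could be why the feature sits behind a query option.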
Scheduling Small Queries
Query scheduler assigns scan ranges to workers (running impalad).
First it selects an HDFS datanode to read from.
Then it selects an impalad to perform the scan.
[Diagram labels: A, B, C]
Selection will always start with the same replica to make optimal use of OS buffer caches.
This can lead to hot-spots for some workloads.
Improvement: Pick impalad at random.
New Query Option: random_replica
Disabled by default.
set random_replica = 1;
Also has a corresponding query hint:
SELECT AVG(c1) FROM t /* +SCHEDULE_RANDOM_REPLICA */;
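The difference between the default and the random policy can be sketched as two small selection functions. This is an illustration of the idea only; the function names are hypothetical and the real scheduler takes more factors into account.

```cpp
#include <cstddef>
#include <random>

// Default behaviour described above: always start with the same replica.
std::size_t PickFirstReplica(std::size_t num_replicas) {
  (void)num_replicas;
  return 0;
}

// random_replica behaviour: choose uniformly among the replicas so that
// repeated small queries spread over all hosts holding the data.
std::size_t PickRandomReplica(std::size_t num_replicas, std::mt19937& rng) {
  std::uniform_int_distribution<std::size_t> pick(0, num_replicas - 1);
  return pick(rng);
}
```

With three replicas, the first policy sends every repeated query to the same host, while the random policy spreads them over all three at the cost of warming three buffer caches instead of one.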
What to Look Out For
Where It Can Help
• Large number of small queries, each with few input tables.
• High load on only one of multiple replicas of a table.
• Queries are CPU bound.
• Benefit: Distribute load more evenly over replicas.
• Tradeoff: Distribution of local reads will increase buffer cache usage.
What’s Next
• Add possibility to prefer remote reads.
• Switch remote impalad selection from round-robin to load-based.
• Add rack-awareness.
Impala Admission Control
• Purpose: throttle workload to avoid oversubscription and maximize throughput
• Previously (since Impala 1.4):
  • Throttle based on number of concurrent queries
  • Knobs for throttling based on memory, but not recommended
• New in 2.5:
  • Improved CM integration: configuration and monitoring
  • Improved memory-based admission
  • Improved control over pools with new configurations
Admission Control Improvements
• Use cases that work well with admission control:
  1. [Since 1.4] Simple throttling: setting the maximum number of running queries
  2. [NEW] Well-understood, memory-bound workloads
• Admission control accepts new per-pool configurations
  • Query memory limits (and any other query options)
  • Queue timeout (was previously a global setting only)
• Admission algorithm admits/queues based on:
  • Request’s aggregate memory usage fitting within the pool’s configured memory
  • Request’s per-node memory requirement fitting within each impalad’s available memory
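The two admission checks can be sketched as a single predicate. The structs, field names and the uniform `node_available_bytes` parameter are simplifications invented for this example, not Impala's internal data structures.

```cpp
#include <cstdint>

struct PoolState {
  int64_t max_mem_bytes;       // pool's configured memory (cluster-wide)
  int64_t reserved_mem_bytes;  // memory of queries already admitted to the pool
};

struct Request {
  int64_t per_node_mem_bytes;  // e.g. derived from the query's MEM_LIMIT
  int num_nodes;               // number of impalads the query runs on
};

// Admit only if both checks pass: the aggregate need fits in the pool and
// the per-node need fits in what each impalad has available. For brevity,
// every node is assumed to have the same `node_available_bytes`.
bool CanAdmit(const PoolState& pool, const Request& req,
              int64_t node_available_bytes) {
  const int64_t aggregate = req.per_node_mem_bytes * req.num_nodes;
  const bool fits_pool =
      pool.reserved_mem_bytes + aggregate <= pool.max_mem_bytes;
  const bool fits_nodes = req.per_node_mem_bytes <= node_available_bytes;
  return fits_pool && fits_nodes;
}
```

A request that fails either check would be queued rather than admitted. For instance, a pool capped at the aggregate size of one large query admits the first such query and queues the second until the first finishes.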
Example
• Impala cluster has 10 nodes with 200 GB/node = 2 TB total
• Workload has queries with known memory requirements:
  • Many small, fast queries (<10 GB/node)
  • Some very large queries (100 GB/node)
• Need to run one big query at a time, alongside many small queries
• Configure two resource pools:
Guidance for Using Memory-Based Admission
For a well-known, memory-bound workload (i.e. you can set MEM_LIMIT on queries):
1. Group queries by resource requirements (e.g. small/HighThrpt, XXL)
2. Create resource pools for each group
3. Control concurrency for pools:
   - large query pools: set pool max mem resources
   - small query pools: set max num running queries limit
4. Set default query mem limits on all pools (upper bound)
5. Set REQUEST_POOL query option to direct queries into pools
   - via ‘SET’ command
   - [NEW] Add key-value pair to JDBC connection string
Admission control demo
CM Impala Admission Control Configuration: http://vc0726.halxg.cloudera.com:7180/cmf/services/13/pools/configuration
Admission Control Dashboard: link
Impala Roadmap
2H 2015 1H 2016 2016
• SQL Support & Usability
  • Nested structures
  • Kudu updates (beta)
• Management & Security
  • Record reader service (beta)
  • Finer-grained security (Sentry)
• Integration
  • Isilon support
  • Python interface (Ibis)
• Performance & Scale
  • Improved predictability under concurrency
• Performance & Scale
  • Continued scalability and concurrency
  • Initial perf/scale improvements
• Management & Security
  • Improved admission control
  • Resource utilization and showback
• SQL Support & Usability
  • Dynamic partitioning
  • Improved timestamp compatibility
• Performance & Scale
  • >20x performance
  • Multi-threaded joins/aggregations
  • Continued scale work
• Management & Security
  • Improved YARN integration
  • Automated metadata
• Integration
  • S3 support
• SQL Support & Usability
  • Nested types with Avro
  • Date type
  • Added SQL extensions

Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 

Recently uploaded (20)

Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 

Impala 2.5 Performance Deep Dive

  • 1. Impala 2.5 Performance Update Tuesday March 1, 2016 meetup http://www.meetup.com/Bay-Area-Impala-Users-Group/events/227944275/
  • 2. Apache Impala 2.5 (Incubating): performance improvements overview. © Cloudera, Inc. All rights reserved.
  • 3. Agenda: (1) Brief overview of Impala quality and reliability improvements. (2) Impala 2.5 vs. 2.3 performance improvements. (3) Performance deep dive: improved cardinality estimates and join order; query startup improvements; runtime filters; additional codegen and code optimizations; decimal arithmetic improvements; pass-through aggregation; fast min/max values on partition columns; scheduling improvements; admission control.
  • 4. Improved Impala Stability and Reliability. Better quality enables optimization. Testing: query generator; stress; scale testing; fault injection testing; longevity testing; performance regression; more on the way.
  • 5. Query Generator and Stress. Find correctness and crash bugs not found by other tests. In GitHub: https://github.com/cloudera/Impala/tree/cdh5-trunk/tests/comparison and https://github.com/cloudera/Impala/blob/cdh5-trunk/tests/stress/concurrent_select.py (bugs found are being fixed).
  • 6. Query Generator. Built in-house, open-source Python (Taras Bobrovytsky). Includes a random data generator. Adapts to any schema. Generates random queries based on a query model (tunable complexity, extensible). Uses Postgres for expected values (query translation for syntax; table flattening for nested types). Fast and small: uses a Docker-ized cluster. Adapted for use with Hive.
  • 7. Apache Impala: Open Source & Open Standard. (1) More than 1 MM downloads since GA. (2) Majority adoption across Cloudera customers. (3) Certification across key application partners [logos]. (4) De facto standard with multi-vendor support [logos] and others.
  • 8. SQL-on-Hadoop engines [diagram labels: SQL, Impala]
  • 9. Impala 2.3 vs. Spark-SQL 1.5 and Hive on Tez (upstream). Full details: http://tinyurl.com/gotkdlq
  • 10. New in Impala 2.5. Performance and Scalability: better join ordering and cardinality estimation; query start-up improvements; runtime filters; additional codegen and code optimizations; decimal arithmetic improvements; fast min/max values on partition columns (with query option); incremental metadata updates (DDL). Integrations: support for EMC DSSD. Usability Enhancements: admission control improvements; null-safe join/equals.
  • 11. New in Impala 2.5: covered today. Performance and Scalability: runtime filters; improved cardinality estimates and join order; additional codegen and code optimizations; decimal arithmetic improvements; faster query startup; fast min/max values on partition columns (with query option). Usability Enhancements: admission control improvements.
  • 12. How does Impala 2.5 fare against 2.3? 363% speedup for TPC-DS; 92% speedup for TPC-H; 71% speedup for TPC-H (Nested).
  • 13. Improved Cardinality Estimates and Join Order. (1) More robust scan cardinality estimation: mitigate correlated predicates (exponential backoff). (2) Improved join cardinality estimation: special treatment of the common case of PK/FK joins; detect selective joins by applying the selectivity of build-side predicates to the estimated join cardinality. (3) More robust join strategy selection (broadcast vs. shuffle): account for data serialization overhead vs. raw data.
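The "exponential backoff" idea for correlated predicates can be sketched as follows. This is a minimal illustration, assuming one common form of the technique (sort selectivities from most to least selective, then damp each successive one with exponent 1/2^i); the slides do not give Impala's exact formula, and the name `CombinedSelectivity` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Combines per-predicate selectivities (each in [0, 1]) into one estimate.
// Assumption: backoff exponents 1, 1/2, 1/4, ... applied from the most
// selective predicate onward. Multiplying raw selectivities would assume the
// predicates are independent and tends to underestimate when they correlate.
double CombinedSelectivity(std::vector<double> selectivities) {
  std::sort(selectivities.begin(), selectivities.end());
  double result = 1.0;
  double exponent = 1.0;
  for (std::size_t i = 0; i < selectivities.size(); ++i) {
    assert(selectivities[i] >= 0.0 && selectivities[i] <= 1.0);
    result *= std::pow(selectivities[i], exponent);
    exponent /= 2.0;  // each further predicate counts for less
  }
  return result;
}
```

With two predicates of selectivity 0.25 each, the independent estimate would be 0.0625, while this sketch yields 0.25 * 0.25^(1/2) = 0.125, a less aggressive reduction of the scan cardinality.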
  • 14. Improved Cardinality Estimates and Join Order: TPCH-Q8 plan summaries.
      TPCH-Q8 on Impala 2.3 (run time: 91s)
        Operator        | #Rows   | #Est Rows
        14:HASH JOIN    | 728.85K | 1.94B
        |--23:EXCHANGE  | 15      | 1
        |  07:SCAN HDFS | 1       | 1
        13:HASH JOIN    | 3.65M   | 1.94B
        |--22:EXCHANGE  | 375     | 25
        |  05:SCAN HDFS | 25      | 25
        12:HASH JOIN    | 3.65M   | 1.94B
        |--21:EXCHANGE  | 675.00M | 45.00M
        |  04:SCAN HDFS | 45.00M  | 45.00M
        11:HASH JOIN    | 3.65M   | 1.81B
        |--20:EXCHANGE  | 375     | 25
        |  06:SCAN HDFS | 25      | 25
        10:HASH JOIN    | 3.65M   | 1.81B
        |--19:EXCHANGE  | 45.00M  | 3.00M
        |  01:SCAN HDFS | 3.00M   | 3.00M
        09:HASH JOIN    | 3.65M   | 1.80B
        |--18:EXCHANGE  | 2.05B   | 4.50M
        |  03:SCAN HDFS | 136.72M | 4.50M
        08:HASH JOIN    | 12.02M  | 1.80B
        |--17:EXCHANGE  | 6.01M   | 397.35K
        |  00:SCAN HDFS | 400.62K | 397.35K
        02:SCAN HDFS    | 1.80B   | 1.80B
      TPCH-Q8 on Impala 2.5 (run time: 11s, >8x speedup)
        Operator        | #Rows   | #Est Rows
        14:HASH JOIN    | 728.85K | 238.41K
        |--26:EXCHANGE  | 375     | 25
        |  06:SCAN HDFS | 25      | 25
        13:HASH JOIN    | 728.85K | 238.41K
        |--25:EXCHANGE  | 15      | 1
        |  07:SCAN HDFS | 1       | 1
        12:HASH JOIN    | 3.65M   | 1.19M
        |--24:EXCHANGE  | 375     | 25
        |  05:SCAN HDFS | 25      | 25
        11:HASH JOIN    | 3.65M   | 1.19M
        |--23:EXCHANGE  | 45.00M  | 45.00M
        |  04:SCAN HDFS | 45.00M  | 45.00M
        22:EXCHANGE     | 3.65M   | 1.19M
        10:HASH JOIN    | 3.65M   | 1.19M
        |--21:EXCHANGE  | 3.00M   | 3.00M
        |  01:SCAN HDFS | 3.00M   | 3.00M
        20:EXCHANGE     | 3.65M   | 1.19M
        09:HASH JOIN    | 3.65M   | 1.19M
        |--19:EXCHANGE  | 136.72M | 45.00M
        |  03:SCAN HDFS | 136.72M | 45.00M
        18:EXCHANGE     | 12.02M  | 11.92M
        08:HASH JOIN    | 12.02M  | 11.92M
        |--17:EXCHANGE  | 6.01M   | 397.35K
        |  00:SCAN HDFS | 400.62K | 397.35K
        02:SCAN HDFS    | 12.40M  | 1.80B
  • 15. Improved Cardinality Estimates and Join Order. Why better cardinality estimation matters: TPC-DS Q14 has 225 joins and 285 scan nodes.
  • 16. Query start-up improvements (IMPALA-1599). At start-up, the coordinator has to tell every Impala daemon to run a set of fragments. Since data flows up from the leaves, receivers must be ready for the data produced by senders. Before Impala 2.5, we did the obvious thing: start the receivers first. But that serializes query start-up.
  • 17. TPCDS-Q59: Each colour is a different plan fragment.
  • 18. TPCDS-Q59: Wave 1
  • 19. TPCDS-Q59: Wave 2
  • 20. TPCDS-Q59: Wave 3
  • 21. TPCDS-Q59: Wave 4
  • 22. TPCDS-Q59: Wave 5. Only now can the tree with the light-green scan start to make progress!
  • 23. Query start-up: what we did. Instead of starting fragments wave by wave, start them all at once in any order. This required changing much of the sender/receiver logic so that senders can wait for receivers to arrive, with lots of tricky error conditions to deal with. Also moved all heavy lifting out of the synchronous RPC to the fragment executor, into asynchronous fragment start-up. Reduced plan fragment size for partitioned tables.
  • 24. Query start-up: performance impact
  • 25. Query start-up: future improvements. Work still to do: batch the fragment start RPCs (rather than tens of RPCs per backend); batching will amortize the cost of sending identical data structures; these data structures should also be trimmed down.
  • 26. Runtime filtering. General idea: some predicates can only be computed at runtime. Example: SELECT count(*) FROM probe P JOIN build B on P.id = B.id; How does Impala execute this query?
  • 27. SELECT count(*) FROM probe P JOIN build B on P.id = B.id. Note: 29200 rows scanned, but only 40 match.
  • 28. Runtime filters: the opportunity. The planner doesn’t know what the set of B.id contains, even with statistics. But there is clearly an opportunity to save some work: why bother sending 29160 of those rows to the hash join node? Runtime filters compute this predicate at runtime.
  • 29. Step 1: the planner tells 02:HASH JOIN to produce a filter containing all IDs from the build side.
  • 30. Step 2: 02:HASH JOIN reads all 10 rows from the build side (right input) and computes a filter containing all distinct values of ID.
  • 31. Step 3: 02:HASH JOIN sends the filter to 00:SCAN HDFS before the scan starts. The scan eliminates all rows that don’t match the filter.
  • 32. The result: only 40 rows are produced by the scan, reducing the amount of work the hash join has to do by more than 99%!
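Steps 2 and 3 above can be sketched in a few lines. This is a simplified illustration, assuming an exact `std::unordered_set` as the filter; a production engine would more likely use a compact probabilistic structure, and the function names `BuildFilter` and `ScanWithFilter` are hypothetical.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Step 2 (sketch): the join reads the build side and collects the distinct
// join-key values into a filter.
std::unordered_set<int64_t> BuildFilter(const std::vector<int64_t>& build_ids) {
  return std::unordered_set<int64_t>(build_ids.begin(), build_ids.end());
}

// Step 3 (sketch): the scan drops probe rows whose key cannot match, so only
// candidate rows are sent on to the hash join.
std::vector<int64_t> ScanWithFilter(const std::vector<int64_t>& probe_ids,
                                    const std::unordered_set<int64_t>& filter) {
  std::vector<int64_t> surviving;
  for (int64_t id : probe_ids) {
    if (filter.count(id) != 0) surviving.push_back(id);
  }
  return surviving;
}
```

The join itself is unchanged; it simply receives fewer probe rows, which is where the saving on the slides comes from.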
  • 33. Runtime filters: real-world results. Runtime filters can be highly effective: some benchmark queries are more than 20 times faster in Impala 2.5.0. As always, results depend on your queries, your schemas and your cluster environment. By default, runtime filters are enabled in a limited ‘local’ mode in Impala 2.5.0. They can be enabled fully by setting RUNTIME_FILTER_MODE=GLOBAL.
  • 35. LLVM Codegen Support in Impala. Operations: hash join; aggregation; scans (Text, Sequence, Avro); expressions in all operators; sort; Top-N. Data types: TINYINT, SMALLINT, INT, BIGINT; FLOAT, DOUBLE; BOOLEAN; STRING, VARCHAR; DECIMAL. Legend: new in Impala 2.5; extended in Impala 2.5.
  • 36. Codegen for Order by & Top-N. SQL language offers great flexibility. Query: select l_extendedprice, l_orderkey,.. from lineitem order by l_orderkey limit 100. Sample rows (l_orderkey, l_extendedprice, l_shipdate, l_shipmode): (26039617, 13515, 12/8/1997, FOB); (30525093, 16218, 12/16/1997, REG AIR); (28809990, 7208, 10/19/1997, AIR).
  • 37. Codegen for Order by & Top-N. SQL language offers great flexibility. Query: select l_extendedprice, l_orderkey,.. from lineitem order by l_extendedprice limit 100. Sample rows (l_orderkey, l_extendedprice, l_shipdate, l_shipmode): (26039617, 13515, 12/8/1997, FOB); (30525093, 16218, 12/16/1997, REG AIR); (28809990, 7208, 10/19/1997, AIR).
  • 38. Codegen for Order by & Top-N. Flexibility requires generic code, which is often inefficient. Query: select l_extendedprice, l_orderkey,.. from lineitem order by l_extendedprice limit 100. Sample rows (l_orderkey, l_extendedprice, l_shipdate, l_shipmode): (26039617, 13515, 12/8/1997, FOB); (30525093, 16218, 12/16/1997, REG AIR); (28809990, 7208, 10/19/1997, AIR).
  • 39. Codegen for Order by & Top-N. Callout: Compare. Query: select l_extendedprice, l_orderkey,.. from lineitem order by l_extendedprice limit 100. Sample rows (l_orderkey, l_extendedprice, l_shipdate, l_shipmode): (26039617, 13515, 12/8/1997, FOB); (30525093, 16218, 12/16/1997, REG AIR); (28809990, 7208, 10/19/1997, AIR).
  • 40. Codegen for Order by & Top-N. Callout: Compare. Query: select l_extendedprice, l_orderkey,.. from lineitem order by l_extendedprice limit 100. Sample rows after the comparison (l_orderkey, l_extendedprice, l_shipdate, l_shipmode): (26039617, 13515, 12/8/1997, FOB); (28809990, 7208, 10/19/1997, AIR); (30525093, 16218, 12/16/1997, REG AIR).
  • 41. Codegen for Order by & Top-N. Query: select l_extendedprice, l_orderkey from lineitem order by l_extendedprice limit 100. Sample rows (l_orderkey, l_extendedprice, l_shipdate, l_shipmode): (26039617, 13515, 12/8/1997, FOB); (30525093, 16218, 12/16/1997, REG AIR); (28809990, 7208, 10/19/1997, AIR). Generic comparator:
      int Compare(TupleRow* lhs, TupleRow* rhs) const {
        for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
          void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
          void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
          if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
          if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
          int result = RawValue::Compare(lhs_value, rhs_value,
                                         sort_cols_lhs_[i]->root()->type());
          if (!is_asc_[i]) result = -result;
          if (result != 0) return result;
          // Otherwise, try the next Expr
        }
        return 0; // fully equivalent key
      }
  • 42. Codegen for Order by & Top-N. Compare() (slide 41) calls GetValue(), which switches on the expression type for every value (excerpt, elided as on the slide):
      void* ExprContext::GetValue(Expr* e, TupleRow* row) {
        switch (e->type_.type) {
          case TYPE_BOOLEAN: { .. .. }
          case TYPE_TINYINT: { .. .. }
          case TYPE_INT: { .. .
  • 44. Codegen for Order by & Top-N. Compare() (slide 41) also calls RawValue::Compare(), which switches on the column type again (excerpt, truncated as on the slide):
      int RawValue::Compare(const void* v1, const void* v2, const ColumnType& type) {
        switch (type.type) {
          case TYPE_INT:
            i1 = *reinterpret_cast<const int32_t*>(v1);
            i2 = *reinterpret_cast<const int32_t*>(v2);
            return i1 > i2 ? 1 : (i1 < i2 ? -1 : 0);
          case TYPE_BIGINT:
            b1 = *reinterpret_cast<const int64_t*>(v1);
            b2 = *reinterpret_cast<const int64_t*>(v2);
            return b1 > b2 ? 1 : (b1 < b2 ? -1 : 0);
  • 46. Codegen for Order by & Top-N. Codegen code, shown next to the original Compare() (slide 41):
      int CompareCodgened(TupleRow* lhs, TupleRow* rhs) const {
        int64_t lhs_value = sort_columns[i]->GetBigIntVal(lhs); // i = 0
        int64_t rhs_value = sort_columns[i]->GetBigIntVal(rhs); // i = 1
        int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
        if (result != 0) return result; // Otherwise, try the next Expr
        return 0; // fully equivalent key
      }
    The codegen'd version perfectly unrolls the “for each grouping column” loop; does no switching on input type(s); and removes branching on ASCENDING/DESCENDING and NULLS FIRST/LAST.
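The effect of specializing the comparator can be approximated in plain C++ with templates. This is only an analogy under stated assumptions: the slides describe Impala generating specialized code at query time with LLVM, whereas templates specialize at compile time. The names `ColType`, `CompareGeneric` and `CompareSpecialized` are hypothetical and not taken from Impala.

```cpp
#include <cstdint>

// Hypothetical runtime type tag, standing in for the engine's column type.
enum class ColType { kInt32, kInt64 };

// Generic path: the column type and sort direction are inspected on every call.
int CompareGeneric(const void* lhs, const void* rhs, ColType type, bool is_asc) {
  int result = 0;
  switch (type) {
    case ColType::kInt32: {
      const int32_t a = *static_cast<const int32_t*>(lhs);
      const int32_t b = *static_cast<const int32_t*>(rhs);
      result = a > b ? 1 : (a < b ? -1 : 0);
      break;
    }
    case ColType::kInt64: {
      const int64_t a = *static_cast<const int64_t*>(lhs);
      const int64_t b = *static_cast<const int64_t*>(rhs);
      result = a > b ? 1 : (a < b ? -1 : 0);
      break;
    }
  }
  return is_asc ? result : -result;
}

// Specialized path: type and direction are fixed when the function is
// instantiated, so no type switch or direction branch remains at call time.
template <typename T, bool kAscending>
int CompareSpecialized(const T& lhs, const T& rhs) {
  const int result = lhs > rhs ? 1 : (lhs < rhs ? -1 : 0);
  return kAscending ? result : -result;
}
```

Both paths return the same ordering; the specialized one simply has the per-query constants folded in, which is the property the bullets on the slide describe.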
  • 50. Codegen for Order by & Top-N. Codegen code, shown next to the original Compare() (slide 41):
      int CompareCodgened(TupleRow* lhs, TupleRow* rhs) const {
        int64_t lhs_value = sort_columns[i]->GetBigIntVal(lhs);
        int64_t rhs_value = sort_columns[i]->GetBigIntVal(rhs);
        int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
        if (result != 0) return result; // Otherwise, try the next Expr
        return 0; // fully equivalent key
      }
    The codegen'd version does no switching on input type(s); perfectly unrolls the “for each grouping column” loop; and removes branching on ASCENDING/DESCENDING and NULLS FIRST/LAST. Result: 10x more efficient code, which gives a 77% speedup in the query due to Amdahl’s law.
  • 51. Decimal arithmetic and aggregation: Float/Double vs. Decimal? Pros for Float/Double: uses less memory; faster because floating point is better supported in codegen (note: Decimal uses fixed-point hardware types, int64 and __int128); can represent a larger range of numbers. Cons for Float/Double: precision errors compound during aggregations; can’t do math with a wide number of significant digits (123456789.1 * .0000987654321).
  • 52. Decimal arithmetic and aggregation: Float/Double vs. Decimal? Pros: uses less memory; faster because floating-point math operations are natively supported by processors; can represent a larger range of numbers. Cons: precision errors compound during aggregations; can’t do math with a wide number of significant digits (123456789.1 * .0000987654321). Float/Double is a no-go for applications requiring high precision and accuracy. What about the performance penalty?
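The "precision errors compound during aggregations" point can be seen with a very small sum. This is an illustrative sketch, not Impala code: it contrasts summing 0.1 as a `double` with summing the same values as fixed-point integers (a count of tenths, in the spirit of a decimal with scale 1). Function names are hypothetical.

```cpp
#include <cstdint>

// Sums n copies of 0.1 in binary floating point. 0.1 has no exact binary
// representation, so a small error is added on every step.
double SumAsDouble(int n) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i) sum += 0.1;
  return sum;
}

// Sums n copies of 0.1 as fixed point: each value is stored as an integer
// number of tenths, so addition is exact integer arithmetic.
int64_t SumAsTenths(int n) {
  int64_t sum_tenths = 0;
  for (int i = 0; i < n; ++i) sum_tenths += 1;  // 0.1 is exactly 1 tenth
  return sum_tenths;
}
```

On IEEE 754 doubles, `SumAsDouble(10)` does not compare equal to `1.0`, while `SumAsTenths(10)` is exactly 10 tenths.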
  • 53. Decimal arithmetic and aggregation. Example query:
      SELECT l_returnflag, l_linestatus,
             Sum(l_quantity) AS SUM_QTY,
             Sum(l_extendedprice) AS SUM_BASE_PRICE,
             Sum(l_extendedprice * (1 - l_discount)) AS SUM_DISC_PRICE
      FROM lineitem
      GROUP BY l_returnflag, l_linestatus
      ORDER BY l_returnflag, l_linestatus
    Result: 3x speedup. Simplified the overflow check for decimal; extended the codegen framework to support aggregations involving decimal; bridged the performance gap between double and decimal.
  • 54. Distributed Aggregations in Impala. Example: select cust_id, sum(dollars) from sales group by cust_id; [diagram labels: Scan, Preagg, Network, Merge; three instances of each operator]. Impala aggregations have two phases: a pre-aggregation phase and a merge phase. The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value, e.g. many sales per customer.
  • 55. Downsides of Pre-aggregations. Example: select distinct * from sales; [diagram labels: Scan, Preagg, Network, Merge; three instances of each operator]. Pre-aggregations consume memory and CPU cycles. Pre-aggregations are not always effective at reducing network traffic, e.g. select distinct for nearly-distinct rows. Pre-aggregations can spill to disk under memory pressure; disk I/O is bad, so it is better to send rows to the merge aggregation than to disk.
  • 56. Streaming Pre-aggregations in Impala 2.5. Example: select distinct * from sales; [diagram labels: Scan, Network, Merge; three instances of each operator]. The reduction factor is dynamically estimated based on the actual data processed. The pre-aggregation expands memory usage only if the reduction factor is good. Benefits: certain aggregations with a low reduction factor see speedups of up to 25%; memory consumption can be reduced by 50% or more; streaming pre-aggregations don’t spill to disk.
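The streaming pre-aggregation idea can be sketched as below. This is a toy model under stated assumptions, not Impala's implementation: the hash table starts with a fixed capacity, the observed reduction factor is taken as rows absorbed per group, the table doubles only while that factor meets a threshold, and otherwise new keys are passed through unaggregated for the merge phase to handle. All names and the doubling policy are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::pair<int64_t, int64_t>;  // (group key, value to sum)

// Returns partial results for a downstream merge aggregation: one row per
// group held in the table, plus any rows passed through unaggregated.
std::vector<Row> StreamingPreAgg(const std::vector<Row>& input,
                                 std::size_t initial_capacity,
                                 double min_reduction) {
  assert(initial_capacity > 0);
  std::unordered_map<int64_t, int64_t> groups;
  std::vector<Row> output;
  std::size_t capacity = initial_capacity;
  std::size_t rows_absorbed = 0;  // rows folded into the hash table so far
  for (const Row& row : input) {
    auto it = groups.find(row.first);
    if (it != groups.end()) {
      it->second += row.second;
      ++rows_absorbed;
      continue;
    }
    if (groups.size() >= capacity) {
      const double reduction = static_cast<double>(rows_absorbed) /
                               static_cast<double>(groups.size());
      if (reduction < min_reduction) {
        output.push_back(row);  // poor reduction: stream the row through
        continue;
      }
      capacity *= 2;  // good reduction: worth spending more memory
    }
    groups.emplace(row.first, row.second);
    ++rows_absorbed;
  }
  for (const auto& group : groups) {
    output.emplace_back(group.first, group.second);
  }
  return output;
}
```

With nearly distinct input the table stays at its initial size and most rows go straight to the merge phase, which matches the "don't spill to disk" behaviour described on the slide; with repetitive input the rows collapse into a few groups as in a regular pre-aggregation.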
  • 57. Optimization for partition keys scan. Use metadata to avoid table accesses for partition key scans, e.g. select min(month), max(year) from functional.alltypes; where month and year are partition keys of the table. Enabled by query option OPTIMIZE_PARTITION_KEY_SCANS. Applicable to min(), max(), ndv() and aggregate functions with the distinct keyword, on partition keys only.
      Plan without optimization:
        03:AGGREGATE [FINALIZE]
        |  output: min:merge(month), max:merge(year)
        |
        02:EXCHANGE [UNPARTITIONED]
        |
        01:AGGREGATE
        |  output: min(month), max(year)
        |
        00:SCAN HDFS [functional.alltypes]
           partitions=24/24 files=24 size=478.45KB
      Plan with optimization:
        01:AGGREGATE [FINALIZE]
        |  output: min(month),max(year)
        |
        00:UNION
           constant-operands=24
  • 58. Scheduling Small Queries
      • Query scheduler assigns scan ranges to workers (running impalad).
      • First it selects an HDFS datanode to read from.
      [Diagram: nodes A, B, C]
  • 60. Scheduling Small Queries
      • Query scheduler assigns scan ranges to workers (running impalad).
      • First it selects an HDFS datanode to read from.
      • Then it selects an impalad to perform the scan.
      [Diagram: nodes A, B, C]
  • 62. Scheduling Small Queries
      • Query scheduler assigns scan ranges to workers (running impalad).
      • First it selects an HDFS datanode to read from.
      [Diagram: nodes A, B, C]
      • Selection will always start with the same replica to make optimal use of OS buffer caches. This can lead to hot-spots for some workloads.
  • 63. Scheduling Small Queries
      • Query scheduler assigns scan ranges to workers (running impalad).
      • First it selects an HDFS datanode to read from.
      [Diagram: nodes A, B, C]
      • Selection will always start with the same replica to make optimal use of OS buffer caches. This can lead to hot-spots for some workloads.
      • Improvement: Pick impalad at random.
  • 64. New Query Option: random_replica
      • Disabled by default.
          set random_replica = 1;
      • Also has a corresponding query hint:
          SELECT AVG(c1) FROM t /* +SCHEDULE_RANDOM_REPLICA */;
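The difference between the default replica selection and the random option can be sketched with a small simulation. This is not the scheduler's code: `pick_replica`, the three replica names, and the number of simulated scan ranges are hypothetical, and "always start with the same replica" is modeled simply as always taking the first entry.

```python
import random
from collections import Counter

def pick_replica(replicas, random_replica=False, rng=random):
    """Return the replica to read from for one scan range."""
    if random_replica:
        return rng.choice(replicas)   # spread reads over all replicas
    return replicas[0]                # default: always the same replica

replicas = ["A", "B", "C"]
rng = random.Random(0)  # seeded so the simulation is repeatable

# Simulate 3000 small scans of the same block under each policy.
default_load = Counter(pick_replica(replicas) for _ in range(3000))
random_load = Counter(
    pick_replica(replicas, random_replica=True, rng=rng) for _ in range(3000)
)
```

Under the default policy every simulated read lands on one node, which is the hot-spot case; with random selection the reads are spread across all three replicas, at the cost of caching the same data on more nodes.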
  • 65. What to Look Out For
  • 66. Where It Can Help
      • Large number of small queries, each with few input tables.
      • High load on only one of multiple replicas of a table.
      • Queries are CPU bound.
      • Benefit: Distribute load more evenly over replicas.
      • Tradeoff: Distribution of local reads will increase buffer cache usage.
  • 67. What’s Next
      • Add possibility to prefer remote reads.
      • Switch remote impalad selection from round-robin to load-based.
      • Add rack-awareness.
  • 68. Impala Admission Control
      • Purpose: throttle workload to avoid oversubscription, maximize throughput
      • Previously (since Impala 1.4):
          • Throttle based on number of concurrent queries
          • Knobs for throttling based on memory, but not recommended
      • New in 2.5:
          • Improved CM integration: configuration and monitoring
          • Improved memory-based admission
          • Improved control over pools w/ new configurations
  • 69. Admission Control Improvements
      • Use cases that work well with admission control:
          1. [Since 1.4] Simple throttling: setting the max number of running queries
          2. [NEW] Well-understood, memory-bound workloads
      • Admission control accepts new per-pool configurations
          • Query memory limits (and any other query options)
          • Queue timeout (was previously a global setting only)
      • Admission algorithm admits/queues based on:
          • Request’s aggregate memory usage fitting within the pool’s configured mem
          • Request’s per-node mem requirement fitting on each impalad’s available mem
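The two admission checks listed above can be written as a short Python sketch. It is a simplification under stated assumptions, not Impala's algorithm: the function `admission_decision`, its parameters, and all memory figures (in GB) are hypothetical, and the request is assumed to need the same amount of memory on every node.

```python
def admission_decision(per_node_mem, nodes, pool_max_mem, pool_mem_in_use,
                       node_mem_available):
    """Return "admit" or "queue" for one request (all sizes in GB)."""
    aggregate_mem = per_node_mem * len(nodes)

    # Check 1: aggregate memory must fit within the pool's configured memory.
    if pool_mem_in_use + aggregate_mem > pool_max_mem:
        return "queue"

    # Check 2: per-node requirement must fit on every participating impalad.
    if any(node_mem_available[node] < per_node_mem for node in nodes):
        return "queue"

    return "admit"

nodes = ["n1", "n2", "n3", "n4"]            # hypothetical 4-node cluster
free = {node: 64 for node in nodes}         # 64 GB available on each node

# Empty pool: 20 GB/node * 4 nodes = 80 GB fits in a 128 GB pool.
first = admission_decision(20, nodes, pool_max_mem=128, pool_mem_in_use=0,
                           node_mem_available=free)

# Same request while 80 GB is already in use: 160 GB exceeds the pool.
second = admission_decision(20, nodes, pool_max_mem=128, pool_mem_in_use=80,
                            node_mem_available=free)

# Pool has room, but one node has only 10 GB available.
tight = dict(free, n3=10)
third = admission_decision(20, nodes, pool_max_mem=128, pool_mem_in_use=0,
                           node_mem_available=tight)
```

A request is queued as soon as either check fails, so a pool limit can throttle concurrency even when individual nodes still have memory to spare.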
  • 70. Example
      • Impala cluster has 10 nodes, 200 GB/node = 2 TB total
      • Workload has queries with known memory requirements:
          • Many small, fast queries (<10 GB/node)
          • Some very large queries (100 GB/node)
      • Need to run 1 big query at a time, many small queries
      • Configure two resource pools:
  • 71. Guidance for Using Memory-Based Admission
      For a well-known, memory-bound workload (i.e. one where you can set MEM_LIMIT on queries):
      1. Group queries by resource requirements (e.g. small/HighThrpt, XXL)
      2. Create resource pools for each group
      3. Control concurrency for pools:
          - large query pools: set pool max mem resources
          - small query pools: set max num running queries limit
      4. Set default query mem limits on all pools (upper bound)
      5. Set REQUEST_POOL query option to direct queries into pools
          - via ‘SET’ command
          - [NEW] Add key-value pair to JDBC connection string
  • 72. Admission control demo
      • CM Impala Admission Control Configuration: http://vc0726.halxg.cloudera.com:7180/cmf/services/13/pools/configuration
      • Admission Control Dashboard: link
  • 73. Impala Roadmap
      2H 2015
      • SQL Support & Usability
          • Nested structures
          • Kudu updates (beta)
      • Management & Security
          • Record reader service (beta)
          • Finer-grained security (Sentry)
      • Integration
          • Isilon support
          • Python interface (Ibis)
      • Performance & Scale
          • Improved predictability under concurrency
      1H 2016
      • Performance & Scale
          • Continued scalability and concurrency
          • Initial perf/scale improvements
      • Management & Security
          • Improved admission control
          • Resource utilization and showback
      • SQL Support & Usability
          • Dynamic partitioning
          • Improved timestamp compatibility
      2016
      • Performance & Scale
          • >20x performance
          • Multi-threaded joins/aggregations
          • Continued scale work
      • Management & Security
          • Improved YARN integration
          • Automated metadata
      • Integration
          • S3 support
      • SQL Support & Usability
          • Nested types with Avro
          • Date type
          • Added SQL extensions