
Impala Performance Update

Cloudera’s performance engineering team recently completed a new round of benchmark testing based on Impala 2.5 and the most recent stable releases of the major SQL engine options for the Apache Hadoop platform, including Apache Hive-on-Tez and Apache Spark/Spark SQL. This presentation explains the methodology and results.

Published in: Software


  1. Impala 2.5 Performance Update. Tuesday, March 1, 2016 meetup: http://www.meetup.com/Bay-Area-Impala-Users-Group/events/227944275/
  2. Apache Impala 2.5 (Incubating): Performance improvements overview. © Cloudera, Inc. All rights reserved.
  3. Agenda
      1. Brief overview of Impala quality and reliability improvements
      2. Impala 2.5 vs. 2.3 performance improvements
      3. Performance deep dive
         1. Improved cardinality estimates and join order
         2. Query start-up improvements
         3. Runtime filters
         4. Additional codegen and code optimizations
         5. Decimal arithmetic improvements
         6. Pass-through aggregation
         7. Fast min/max values on partition columns
         8. Scheduling improvements
         9. Admission control
  4. Improved Impala Stability and Reliability
      Better quality, enable optimization
      ● Query Generator
      ● Stress
      ● Scale testing
      ● Fault injection testing
      ● Longevity testing
      ● Performance regression
      ● … More on the way
  5. Query Generator and Stress
      ● Find correctness and crash bugs
        ○ Not found by other tests
      ● In GitHub
        ○ https://github.com/cloudera/Impala/tree/cdh5-trunk/tests/comparison
        ○ https://github.com/cloudera/Impala/blob/cdh5-trunk/tests/stress/concurrent_select.py
      ● Fixing them
  6. Query Generator
      ● Built in-house, open-source Python (Taras Bobrovytsky)
      ● Random data generator as well
      ● Adapts to any schema
      ● Generates random queries based on a query model
        ○ Tunable complexity
        ○ Extensible
      ● Uses Postgres for expected values
        ○ Query translation for syntax
        ○ Table flattening for nested types
      ● Fast, small: uses a Docker-ized cluster
      ● Adapted for use with Hive
  7. Apache Impala: Open Source & Open Standard
      1. > 1 MM downloads since GA
      2. Majority adoption across Cloudera customers
      3. Certification across key application partners: [not captured in transcript]
      4. De facto standard with multi-vendor support: [not captured in transcript] and others
  8. SQL-on-Hadoop engines [diagram labels: SQL, Impala]
  9. Impala 2.3 vs. Spark SQL 1.5 & Hive-on-Tez (upstream). Full details: http://tinyurl.com/gotkdlq
  10. New in Impala 2.5
      Performance and Scalability
      • Better join ordering and cardinality estimation
      • Query start-up improvements
      • Runtime filters
      • Additional codegen and code optimizations
      • Decimal arithmetic improvements
      • Fast min/max values on partition columns (with query option)
      • Incremental metadata updates (DDL)
      Integrations
      • Support for EMC DSSD
      Usability Enhancements
      • Admission control improvements
      • Null-safe join/equals
  11. New in Impala 2.5: covered today
      Performance and Scalability
      • Runtime filters
      • Improved cardinality estimates and join order
      • Additional codegen and code optimizations
      • Decimal arithmetic improvements
      • Faster query start-up
      • Fast min/max values on partition columns (with query option)
      Usability Enhancements
      • Admission control improvements
  12. How does Impala 2.5 fare against 2.3?
      • 363% speedup for TPC-DS
      • 92% speedup for TPC-H
      • 71% speedup for TPC-H (Nested)
  13. Improved Cardinality Estimates and Join Order
      1. More robust scan cardinality estimation
         • Mitigate correlated predicates (exponential backoff)
      2. Improved join cardinality estimation
         • Special treatment of the common case of PK/FK joins
         • Detect selective joins by applying the selectivity of build-side predicates to the estimated join cardinality
      3. More robust join strategy selection (broadcast vs. shuffle)
         • Account for data serialization overhead vs. raw data
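To make the "exponential backoff" bullet concrete, here is a minimal sketch of one common way to combine the selectivities of possibly correlated predicates. The function names and the exact damping schedule (exponents 1, 1/2, 1/4, ...) are assumptions for illustration; the slide does not state the formula Impala uses. With this schedule, three predicates of selectivity 0.1 combine to roughly 0.018 instead of the 0.001 that a plain independence assumption would give.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Combine per-predicate selectivities (each in (0, 1]) with exponential
// backoff: the most selective predicate counts fully, and every further
// predicate is damped by a progressively smaller exponent.
// NOTE: illustrative schedule only, not taken from the Impala source.
double CombinedSelectivity(std::vector<double> selectivities) {
  std::sort(selectivities.begin(), selectivities.end());  // most selective first
  double result = 1.0;
  double exponent = 1.0;
  for (double s : selectivities) {
    result *= std::pow(s, exponent);
    exponent /= 2.0;  // 1, 1/2, 1/4, ...
  }
  return result;
}

// Baseline for comparison: treat all predicates as independent.
double IndependentSelectivity(const std::vector<double>& selectivities) {
  double result = 1.0;
  for (double s : selectivities) result *= s;
  return result;
}
```

The damped estimate never drops below the independent one, which guards against the severe underestimates that correlated predicates tend to cause.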
  14. Improved Cardinality Estimates and Join Order
      TPCH-Q8 on Impala 2.3
      Operator          | #Rows   | #Est Rows
      14:HASH JOIN      | 728.85K | 1.94B
      |--23:EXCHANGE    | 15      | 1
      |  07:SCAN HDFS   | 1       | 1
      13:HASH JOIN      | 3.65M   | 1.94B
      |--22:EXCHANGE    | 375     | 25
      |  05:SCAN HDFS   | 25      | 25
      12:HASH JOIN      | 3.65M   | 1.94B
      |--21:EXCHANGE    | 675.00M | 45.00M
      |  04:SCAN HDFS   | 45.00M  | 45.00M
      11:HASH JOIN      | 3.65M   | 1.81B
      |--20:EXCHANGE    | 375     | 25
      |  06:SCAN HDFS   | 25      | 25
      10:HASH JOIN      | 3.65M   | 1.81B
      |--19:EXCHANGE    | 45.00M  | 3.00M
      |  01:SCAN HDFS   | 3.00M   | 3.00M
      09:HASH JOIN      | 3.65M   | 1.80B
      |--18:EXCHANGE    | 2.05B   | 4.50M
      |  03:SCAN HDFS   | 136.72M | 4.50M
      08:HASH JOIN      | 12.02M  | 1.80B
      |--17:EXCHANGE    | 6.01M   | 397.35K
      |  00:SCAN HDFS   | 400.62K | 397.35K
      02:SCAN HDFS      | 1.80B   | 1.80B
      Run time: 91s

      TPCH-Q8 on Impala 2.5
      Operator          | #Rows   | #Est Rows
      14:HASH JOIN      | 728.85K | 238.41K
      |--26:EXCHANGE    | 375     | 25
      |  06:SCAN HDFS   | 25      | 25
      13:HASH JOIN      | 728.85K | 238.41K
      |--25:EXCHANGE    | 15      | 1
      |  07:SCAN HDFS   | 1       | 1
      12:HASH JOIN      | 3.65M   | 1.19M
      |--24:EXCHANGE    | 375     | 25
      |  05:SCAN HDFS   | 25      | 25
      11:HASH JOIN      | 3.65M   | 1.19M
      |--23:EXCHANGE    | 45.00M  | 45.00M
      |  04:SCAN HDFS   | 45.00M  | 45.00M
      22:EXCHANGE       | 3.65M   | 1.19M
      10:HASH JOIN      | 3.65M   | 1.19M
      |--21:EXCHANGE    | 3.00M   | 3.00M
      |  01:SCAN HDFS   | 3.00M   | 3.00M
      20:EXCHANGE       | 3.65M   | 1.19M
      09:HASH JOIN      | 3.65M   | 1.19M
      |--19:EXCHANGE    | 136.72M | 45.00M
      |  03:SCAN HDFS   | 136.72M | 45.00M
      18:EXCHANGE       | 12.02M  | 11.92M
      08:HASH JOIN      | 12.02M  | 11.92M
      |--17:EXCHANGE    | 6.01M   | 397.35K
      |  00:SCAN HDFS   | 400.62K | 397.35K
      02:SCAN HDFS      | 12.40M  | 1.80B
      Run time: 11s (>8x speedup)
  15. Improved Cardinality Estimates and Join Order
      • Why better cardinality estimation matters
      • TPC-DS Q14
        • 225 joins
        • 285 scan nodes
  16. Query start-up improvements (IMPALA-1599)
      • At start-up, the coordinator has to tell every Impala daemon to run a set of fragments.
      • Since data flows up from the leaves, we need to make sure that receivers are ready for the data produced by senders.
      • Before Impala 2.5, we did the obvious thing: start the receivers first!
      • But that serializes query start-up...
  17. TPCDS-Q59: Each colour is a different plan fragment.
  18. TPCDS-Q59: Wave 1
  19. TPCDS-Q59: Wave 2
  20. TPCDS-Q59: Wave 3
  21. TPCDS-Q59: Wave 4
  22. TPCDS-Q59: Wave 5. Only now can the tree with the light-green scan start to make progress!
  23. Query start-up: what we did
      • Instead of starting fragments wave by wave, start them all at once in any order.
      • This required changing a lot of the sender/receiver logic so that senders can wait for receivers to arrive. Lots of tricky error conditions to deal with!
      • Also moved all heavy lifting out of the synchronous RPC to the fragment executor and into asynchronous fragment start-up.
      • Reduced plan fragment size for partitioned tables.
  24. Query start-up: performance impact
  25. Query start-up: future improvements
      • Work still to do:
        • Batch the fragment start RPCs (rather than 10s of RPCs per backend)
        • Batching will amortize the cost of sending identical data structures.
        • We should also trim these data structures down.
  26. Runtime filtering
      • General idea: some predicates can only be computed at runtime
      • Example: SELECT count(*) FROM probe P JOIN build B ON P.id = B.id;
      • How does Impala execute this query?
  27. SELECT count(*) FROM probe P JOIN build B ON P.id = B.id
      Note: 29200 rows scanned, but only 40 match.
  28. Runtime filters: the opportunity
      The planner doesn’t know what the set of B.id contains, even with statistics. But there’s clearly an opportunity to save some work: why bother sending 29160 of those rows to the hash join node? Runtime filtering computes this predicate at runtime.
  29. Step 1: the planner tells 02:HASH JOIN to produce a filter containing all IDs from the build side.
  30. Step 2: 02:HASH JOIN reads all 10 rows from the build side (right input) and computes a filter containing all distinct values of ID.
  31. Step 3: 02:HASH JOIN sends the filter to 00:SCAN HDFS before the scan starts. The scan eliminates all rows that don’t match the filter.
  32. The result: only 40 rows are produced by the scan, reducing the amount of work the hash join has to do by > 99%!
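The three steps can be sketched in a few lines. This is a simplified illustration, not Impala's implementation: an exact `std::unordered_set` stands in for the filter (a production engine would more likely use a compact probabilistic structure), and `BuildFilter`, `ScanWithFilter`, and `MakeProbeIds` are hypothetical names. The synthetic data is shaped to match the row counts quoted on the slides: 29200 probe rows and 10 build rows, with 40 probe rows surviving the filter.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Step 2: the join reads the build side and collects its distinct IDs.
// Assumption: an exact hash set plays the role of the runtime filter.
std::unordered_set<int64_t> BuildFilter(const std::vector<int64_t>& build_ids) {
  return std::unordered_set<int64_t>(build_ids.begin(), build_ids.end());
}

// Step 3: the scan applies the filter and only emits rows that can match.
std::vector<int64_t> ScanWithFilter(const std::vector<int64_t>& probe_ids,
                                    const std::unordered_set<int64_t>& filter) {
  std::vector<int64_t> out;
  for (int64_t id : probe_ids) {
    if (filter.count(id) != 0) out.push_back(id);
  }
  return out;
}

// Hypothetical probe table: 29200 rows in which each of 7300 IDs occurs 4 times.
std::vector<int64_t> MakeProbeIds() {
  std::vector<int64_t> ids;
  ids.reserve(29200);
  for (int64_t i = 0; i < 29200; ++i) ids.push_back(i % 7300);
  return ids;
}
```

Filtering the output of `MakeProbeIds()` with a filter built from the IDs 0 through 9 leaves 40 rows, so the join sees 40 rows instead of 29200.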
  33. Runtime filters: real-world results
      Runtime filters can be highly effective. Some benchmark queries are more than 20 times faster in Impala 2.5.0. As always, the benefit depends on your queries, your schemas and your cluster environment.
      By default, runtime filters are enabled in a limited ‘local’ mode in Impala 2.5.0. They can be enabled fully by setting RUNTIME_FILTER_MODE=GLOBAL.
  34. 34. 34© Cloudera, Inc. All rights reserved.
  35. LLVM Codegen Support in Impala
      Operations:
      • Hash join
      • Aggregation
      • Scans: Text, Sequence, Avro
      • Expressions in all operators
      • Sort
      • Top-N
      Data Types:
      • TINYINT, SMALLINT, INT, BIGINT
      • FLOAT, DOUBLE
      • BOOLEAN
      • STRING, VARCHAR
      • DECIMAL
      Legend: New in Impala 2.5; Extended in Impala 2.5
  36. Codegen for Order by & Top-N
      l_orderkey | l_extendedprice | l_shipdate | l_shipmode
      26039617   | 13515           | 12/8/1997  | FOB
      30525093   | 16218           | 12/16/1997 | REG AIR
      28809990   | 7208            | 10/19/1997 | AIR
      select l_extendedprice, l_orderkey,.. from lineitem order by l_orderkey limit 100
      SQL language offers great flexibility
  37. Codegen for Order by & Top-N
      l_orderkey | l_extendedprice | l_shipdate | l_shipmode
      26039617   | 13515           | 12/8/1997  | FOB
      30525093   | 16218           | 12/16/1997 | REG AIR
      28809990   | 7208            | 10/19/1997 | AIR
      select l_extendedprice, l_orderkey,.. from lineitem order by l_extendedprice limit 100
      SQL language offers great flexibility
  38. Codegen for Order by & Top-N
      l_orderkey | l_extendedprice | l_shipdate | l_shipmode
      26039617   | 13515           | 12/8/1997  | FOB
      30525093   | 16218           | 12/16/1997 | REG AIR
      28809990   | 7208            | 10/19/1997 | AIR
      select l_extendedprice, l_orderkey,.. from lineitem order by l_extendedprice limit 100
      Flexibility requires generic code, which is often inefficient
  39. Codegen for Order by & Top-N
      l_orderkey | l_extendedprice | l_shipdate | l_shipmode
      26039617   | 13515           | 12/8/1997  | FOB
      30525093   | 16218           | 12/16/1997 | REG AIR
      28809990   | 7208            | 10/19/1997 | AIR
      select l_extendedprice, l_orderkey,.. from lineitem order by l_extendedprice limit 100
      Compare
  40. Codegen for Order by & Top-N
      l_orderkey | l_extendedprice | l_shipdate | l_shipmode
      26039617   | 13515           | 12/8/1997  | FOB
      28809990   | 7208            | 10/19/1997 | AIR
      30525093   | 16218           | 12/16/1997 | REG AIR
      select l_extendedprice, l_orderkey,.. from lineitem order by l_extendedprice limit 100
      Compare
  41. Codegen for Order by & Top-N
      select l_extendedprice, l_orderkey from lineitem order by l_extendedprice limit 100
      l_orderkey | l_extendedprice | l_shipdate | l_shipmode
      26039617   | 13515           | 12/8/1997  | FOB
      30525093   | 16218           | 12/16/1997 | REG AIR
      28809990   | 7208            | 10/19/1997 | AIR

      int Compare(TupleRow* lhs, TupleRow* rhs) const {
        for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
          void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
          void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
          if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
          if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
          int result = RawValue::Compare(lhs_value, rhs_value,
                                         sort_cols_lhs_[i]->root()->type());
          if (!is_asc_[i]) result = -result;
          if (result != 0) return result;
          // Otherwise, try the next Expr
        }
        return 0; // fully equivalent key
      }
  42. Codegen for Order by & Top-N
      void* ExprContext::GetValue(Expr* e, TupleRow* row) {
        switch (e->type_.type) {
          case TYPE_BOOLEAN: { .. .. }
          case TYPE_TINYINT: { .. .. }
          case TYPE_INT: { .. .

      int Compare(TupleRow* lhs, TupleRow* rhs) const {
        for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
          void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
          void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
          if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
          if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
          int result = RawValue::Compare(lhs_value, rhs_value,
                                         sort_cols_lhs_[i]->root()->type());
          if (!is_asc_[i]) result = -result;
          if (result != 0) return result;
          // Otherwise, try the next Expr
        }
        return 0; // fully equivalent key
      }
  44. Codegen for Order by & Top-N
      int RawValue::Compare(const void* v1, const void* v2, const ColumnType& type) {
        switch (type.type) {
          case TYPE_INT:
            i1 = *reinterpret_cast<const int32_t*>(v1);
            i2 = *reinterpret_cast<const int32_t*>(v2);
            return i1 > i2 ? 1 : (i1 < i2 ? -1 : 0);
          case TYPE_BIGINT:
            b1 = *reinterpret_cast<const int64_t*>(v1);
            b2 = *reinterpret_cast<const int64_t*>(v2);
            return b1 > b2 ? 1 : (b1 < b2 ? -1 : 0);

      int Compare(TupleRow* lhs, TupleRow* rhs) const {
        for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
          void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
          void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
          if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
          if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
          int result = RawValue::Compare(lhs_value, rhs_value,
                                         sort_cols_lhs_[i]->root()->type());
          if (!is_asc_[i]) result = -result;
          if (result != 0) return result;
          // Otherwise, try the next Expr
        }
        return 0; // fully equivalent key
      }
  45. Codegen for Order by & Top-N
      Codegen code
      int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
        int64_t lhs_value = sort_columns[i]->GetBigIntVal(lhs);
        int64_t rhs_value = sort_columns[i]->GetBigIntVal(rhs);
        int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
        if (result != 0) return result;
        // Otherwise, try the next Expr
        return 0; // fully equivalent key
      }

      Original code
      int Compare(TupleRow* lhs, TupleRow* rhs) const {
        for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
          void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
          void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
          if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
          if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
          int result = RawValue::Compare(lhs_value, rhs_value,
                                         sort_cols_lhs_[i]->root()->type());
          if (!is_asc_[i]) result = -result;
          if (result != 0) return result;
          // Otherwise, try the next Expr
        }
        return 0; // fully equivalent key
      }
  46. Codegen for Order by & Top-N
      Codegen code
      int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
        int64_t lhs_value = sort_columns[i]->GetBigIntVal(lhs); // i = 0
        int64_t rhs_value = sort_columns[i]->GetBigIntVal(rhs); // i = 0
        int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
        if (result != 0) return result;
        // Otherwise, try the next Expr
        return 0; // fully equivalent key
      }
      • Perfectly unrolls the “for each sort column” loop
      • No switching on input type(s)
      • Removes branching on ASCENDING/DESCENDING, NULLS FIRST/LAST

      Original code
      int Compare(TupleRow* lhs, TupleRow* rhs) const {
        for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
          void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
          void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
          if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
          if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
          int result = RawValue::Compare(lhs_value, rhs_value,
                                         sort_cols_lhs_[i]->root()->type());
          if (!is_asc_[i]) result = -result;
          if (result != 0) return result;
          // Otherwise, try the next Expr
        }
        return 0; // fully equivalent key
      }
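The contrast between the interpreted and the specialized comparator can be reproduced in a small, standalone form. The sketch below is not Impala code: `Row`, `SortSpec`, `CompareGeneric`, and `CompareSpecialized` are hypothetical names, rows are plain vectors of 64-bit integers, and NULL handling is left out. The specialized function is written by hand to show what a query-specific comparator for a single ascending sort column reduces to once the loop and the ascending/descending branch are resolved ahead of time.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Row = std::vector<int64_t>;  // simplified row: all columns are BIGINT

struct SortSpec {
  std::size_t column;  // index of the sort column within the row
  bool ascending;      // sort direction, only known at run time
};

// Generic comparator: loops over the sort keys and branches on the
// direction for every pair of rows, as an interpreted engine would.
int CompareGeneric(const Row& lhs, const Row& rhs,
                   const std::vector<SortSpec>& keys) {
  for (const SortSpec& key : keys) {
    int64_t l = lhs[key.column];
    int64_t r = rhs[key.column];
    int result = l > r ? 1 : (l < r ? -1 : 0);
    if (!key.ascending) result = -result;
    if (result != 0) return result;  // otherwise, try the next key
  }
  return 0;  // fully equivalent key
}

// Specialized comparator for "ORDER BY <column 1> ASC": the loop is
// unrolled to a single key and the direction branch has disappeared.
int CompareSpecialized(const Row& lhs, const Row& rhs) {
  int64_t l = lhs[1];
  int64_t r = rhs[1];
  return l > r ? 1 : (l < r ? -1 : 0);
}
```

Both functions return the same result for the sort specification `{column 1, ascending}`; only the amount of per-comparison work differs.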
  50. Codegen for Order by & Top-N
      Original code
      int Compare(TupleRow* lhs, TupleRow* rhs) const {
        for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
          void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
          void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
          if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
          if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
          int result = RawValue::Compare(lhs_value, rhs_value,
                                         sort_cols_lhs_[i]->root()->type());
          if (!is_asc_[i]) result = -result;
          if (result != 0) return result;
          // Otherwise, try the next Expr
        }
        return 0; // fully equivalent key
      }

      Codegen code
      int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
        int64_t lhs_value = sort_columns[i]->GetBigIntVal(lhs);
        int64_t rhs_value = sort_columns[i]->GetBigIntVal(rhs);
        int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
        if (result != 0) return result;
        // Otherwise, try the next Expr
        return 0; // fully equivalent key
      }
      • No switching on input type(s)
      • Perfectly unrolls the “for each sort column” loop
      • Removes branching on ASCENDING/DESCENDING, NULLS FIRST/LAST
      10x more efficient code
      77% speedup in query due to Amdahl’s law
  51. Decimal arithmetic and aggregation
      Float/Double vs. Decimal?
      Pros for Float/Double
      • Uses less memory.
      • Faster because floating point is better supported in codegen. (Note: Decimal uses fixed-point hardware types: int64 and __int128)
      • Can represent a larger range of numbers.
      Cons for Float/Double
      • Precision errors compound during aggregations
      • Can’t do math with a large number of significant digits (123456789.1 * .0000987654321)
  52. Decimal arithmetic and aggregation
      Float/Double vs. Decimal?
      Pros
      • Uses less memory.
      • Faster because floating-point math operations are natively supported by processors.
      • Can represent a larger range of numbers.
      Cons
      • Precision errors compound during aggregations
      • Can’t do math with a large number of significant digits (123456789.1 * .0000987654321)
      No go for applications requiring high precision & accuracy
      What about the performance penalty?
  53. Decimal arithmetic and aggregation
      SELECT l_returnflag,
             l_linestatus,
             Sum(l_quantity) AS SUM_QTY,
             Sum(l_extendedprice) AS SUM_BASE_PRICE,
             Sum(l_extendedprice * (1 - l_discount)) AS SUM_DISC_PRICE
      FROM lineitem
      GROUP BY l_returnflag, l_linestatus
      ORDER BY l_returnflag, l_linestatus
      3x speedup
      ● Simplified overflow check for decimal.
      ● Extended the codegen framework to support aggregations involving decimal.
      ● Bridged the performance gap between double and decimal.
  54. Distributed Aggregations in Impala
      select cust_id, sum(dollars) from sales group by cust_id;
      [Diagram labels: Scan ×3, Preagg ×3, Network, Merge ×3]
      • Impala aggregations have two phases:
        • Pre-aggregation phase
        • Merge phase
      • The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value.
        • E.g. many sales per customer.
  55. Downsides of Pre-aggregations
      select distinct * from sales;
      [Diagram labels: Scan ×3, Preagg ×3, Network, Merge ×3]
      • Pre-aggregations consume:
        • Memory
        • CPU cycles
      • Pre-aggregations are not always effective at reducing network traffic
        • E.g. select distinct for nearly-distinct rows
      • Pre-aggregations can spill to disk under memory pressure
        • Disk I/O is bad: better to send rows to the merge aggregation than to disk
  56. Streaming Pre-aggregations in Impala 2.5
      select distinct * from sales;
      [Diagram labels: Scan ×3, Network, Merge ×3]
      • The reduction factor is dynamically estimated based on the actual data processed
      • Pre-aggregation expands memory usage only if the reduction factor is good
      • Benefits:
        • Certain aggregations with a low reduction factor see speedups of up to 25%
        • Memory consumption can be reduced by 50% or more
        • Streaming pre-aggregations don’t spill to disk
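The streaming behaviour described above can be sketched as a small class. This is an illustration of the idea only, under stated assumptions: the class name, the `max_groups` budget, the `min_reduction` threshold, and the definition of the reduction factor (rows absorbed into the hash table divided by groups in it) are all hypothetical and not taken from Impala. A row with a new key is added to the hash table while the table is under budget or the observed reduction is good; otherwise it is passed through unaggregated to the merge phase rather than growing the table or spilling.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using KeyValue = std::pair<int64_t, int64_t>;

class StreamingPreAgg {
 public:
  // max_groups: groups the table may hold before growth must be justified.
  // min_reduction: rows-per-group ratio required to keep growing.
  StreamingPreAgg(std::size_t max_groups, double min_reduction)
      : max_groups_(max_groups), min_reduction_(min_reduction) {}

  // Consume one row for "sum(value) group by key". Rows that are not
  // absorbed locally are appended to *passthrough for the merge phase.
  void Add(int64_t key, int64_t value, std::vector<KeyValue>* passthrough) {
    auto it = sums_.find(key);
    if (it != sums_.end()) {  // existing group: always aggregate
      it->second += value;
      ++rows_absorbed_;
      return;
    }
    if (sums_.size() < max_groups_ || ReductionFactor() >= min_reduction_) {
      sums_.emplace(key, value);  // grow the table
      ++rows_absorbed_;
    } else {
      passthrough->emplace_back(key, value);  // stream the row onward
    }
  }

  // Observed reduction so far: absorbed input rows per group.
  double ReductionFactor() const {
    if (sums_.empty()) return 1.0;
    return static_cast<double>(rows_absorbed_) / static_cast<double>(sums_.size());
  }

  std::size_t NumGroups() const { return sums_.size(); }

 private:
  std::size_t max_groups_;
  double min_reduction_;
  std::size_t rows_absorbed_ = 0;
  std::unordered_map<int64_t, int64_t> sums_;
};
```

With nearly distinct keys the table stops growing at its budget and almost every row is streamed through; with heavily repeated keys the reduction factor rises and the table is allowed to grow to hold all groups.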
  57. Optimization for partition key scans
      • Use metadata to avoid table accesses for partition key scans:
        • select min(month), max(year) from functional.alltypes;
        • month, year are partition keys of the table
      • Enabled by query option OPTIMIZE_PARTITION_KEY_SCANS
      • Applicable:
        • min(), max(), ndv() and aggregate functions with the distinct keyword
        • partition keys only

      Plan with optimization:
        01:AGGREGATE [FINALIZE]
        |  output: min(month),max(year)
        |
        00:UNION
           constant-operands=24

      Plan without optimization:
        03:AGGREGATE [FINALIZE]
        |  output: min:merge(month), max:merge(year)
        |
        02:EXCHANGE [UNPARTITIONED]
        |
        01:AGGREGATE
        |  output: min(month), max(year)
        |
        00:SCAN HDFS [functional.alltypes]
           partitions=24/24 files=24 size=478.45KB
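The idea behind the optimized plan is that the partition key values are already known from table metadata, so the aggregate can be computed over one constant per partition instead of over scanned rows. A minimal sketch, with a hypothetical `PartitionKey` struct and function name, and with the partition layout supplied by the caller:

```cpp
#include <algorithm>
#include <vector>

// One entry per partition, as it might be listed in table metadata.
struct PartitionKey {
  int year;
  int month;
};

struct MinMonthMaxYear {
  int min_month;
  int max_year;
};

// Answer "select min(month), max(year)" from partition metadata alone.
// Precondition: `partitions` is non-empty. No data files are read.
MinMonthMaxYear FromPartitionMetadata(const std::vector<PartitionKey>& partitions) {
  MinMonthMaxYear result{partitions.front().month, partitions.front().year};
  for (const PartitionKey& p : partitions) {
    result.min_month = std::min(result.min_month, p.month);
    result.max_year = std::max(result.max_year, p.year);
  }
  return result;
}
```

Note that in a sketch like this a partition with no rows still contributes its key values, which may be one reason the slide presents the feature as an opt-in query option.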
  58. Scheduling Small Queries
      The query scheduler assigns scan ranges to workers (running impalad). First it selects an HDFS datanode to read from.
      [Diagram labels: A, B, C]
  60. Scheduling Small Queries
      The query scheduler assigns scan ranges to workers (running impalad). First it selects an HDFS datanode to read from. Then it selects an impalad to perform the scan.
      [Diagram labels: A, B, C]
  62. Scheduling Small Queries
      The query scheduler assigns scan ranges to workers (running impalad). First it selects an HDFS datanode to read from.
      [Diagram labels: A, B, C]
      Selection will always start with the same replica to make optimal use of OS buffer caches. This can lead to hot-spots for some workloads.
  63. Scheduling Small Queries
      The query scheduler assigns scan ranges to workers (running impalad). First it selects an HDFS datanode to read from.
      [Diagram labels: A, B, C]
      Selection will always start with the same replica to make optimal use of OS buffer caches. This can lead to hot-spots for some workloads.
      Improvement: Pick impalad at random.
  64. New Query Option: random_replica
      Disabled by default.
      set random_replica = 1;
      Also has a corresponding query hint:
      SELECT AVG(c1) FROM t /* +SCHEDULE_RANDOM_REPLICA */;
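The effect of the option on load distribution can be illustrated with a small simulation. This is a toy model rather than Impala's scheduler: `PickReplica` and `SimulateLoad` are hypothetical names, every query is assumed to read one block that has `num_replicas` replicas (at least one), and the default policy is modelled as always choosing the first replica.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Choose which replica serves a scan range. With random_replica off the
// first replica is always chosen (cache friendly, but a potential
// hot-spot); with it on, the choice is uniform over all replicas.
std::size_t PickReplica(std::size_t num_replicas, bool random_replica,
                        std::mt19937& rng) {
  if (!random_replica || num_replicas <= 1) return 0;
  std::uniform_int_distribution<std::size_t> dist(0, num_replicas - 1);
  return dist(rng);
}

// Count how many of num_queries small queries land on each replica.
// Precondition: num_replicas >= 1.
std::vector<int> SimulateLoad(int num_queries, std::size_t num_replicas,
                              bool random_replica, unsigned seed) {
  std::mt19937 rng(seed);
  std::vector<int> load(num_replicas, 0);
  for (int q = 0; q < num_queries; ++q) {
    ++load[PickReplica(num_replicas, random_replica, rng)];
  }
  return load;
}
```

With the option off, all queries hit replica 0; with it on, the load spreads roughly evenly, at the cost of the same data being cached on more nodes.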
  65. What to Look Out For
  66. Where It Can Help
      • Large number of small queries, each with few input tables.
      • High load on only one of multiple replicas of a table.
      • Queries are CPU bound.
      • Benefit: Distributes load more evenly over replicas.
      • Tradeoff: Distributing local reads will increase buffer cache usage.
  67. What’s Next
      • Add the possibility to prefer remote reads.
      • Switch remote impalad selection from round-robin to load-based.
      • Add rack-awareness.
  68. Impala Admission Control
      • Purpose: throttle workload to avoid oversubscription, maximize throughput
      • Previously (since Impala 1.4):
        • Throttle based on number of concurrent queries
        • Knobs for throttling based on memory, but not recommended
      • New in 2.5:
        • Improved CM integration: configuration and monitoring
        • Improved memory-based admission
        • Improved control over pools with new configurations
  69. Admission Control Improvements
      • Use cases that work well with admission control:
        1. [Since 1.4] Simple throttling: setting the max number of running queries
        2. [NEW] Well-understood, memory-bound workloads
      • Admission control accepts new per-pool configurations
        • Query memory limits (and any other query options)
        • Queue timeout (was previously a global setting only)
      • Admission algorithm admits/queues based on:
        • Request’s aggregate memory usage fitting within the pool’s configured mem
        • Request’s per-node mem requirement fitting on each impalad’s available mem
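The two admission checks listed above can be written as a single predicate. The sketch below is a simplification for illustration: the struct and function names are hypothetical, memory is tracked in whole gigabytes, and every node is assumed to have the same amount of available memory, which a real cluster would track per impalad.

```cpp
#include <cstdint>

struct PoolState {
  int64_t max_mem_gb;       // memory configured for the pool, cluster-wide
  int64_t admitted_mem_gb;  // aggregate memory of queries already admitted
};

struct Request {
  int64_t per_node_mem_gb;  // query memory limit on each node
  int num_nodes;            // nodes the query will run on
};

enum class Decision { kAdmit, kQueue };

// Admit only if both checks pass:
//  1. the request's aggregate memory fits within the pool's configured memory;
//  2. the per-node requirement fits into each node's available memory
//     (assumed identical on all nodes in this sketch).
Decision Decide(const PoolState& pool, const Request& req,
                int64_t node_available_mem_gb) {
  const int64_t aggregate_gb = req.per_node_mem_gb * req.num_nodes;
  const bool fits_pool = pool.admitted_mem_gb + aggregate_gb <= pool.max_mem_gb;
  const bool fits_node = req.per_node_mem_gb <= node_available_mem_gb;
  return (fits_pool && fits_node) ? Decision::kAdmit : Decision::kQueue;
}
```

As a usage example with assumed numbers: on 10 nodes with 200 GB each, a pool capped at 1000 GB admits one query that needs 100 GB per node and queues a second one until the first finishes.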
  70. Example
      • Impala cluster has 10 nodes, 200 GB/node = 2 TB total
      • Workload has queries with known memory requirements:
        • Many small, fast queries (<10 GB/node)
        • Some very large queries (100 GB/node)
      • Need to run 1 big query at a time, many small queries
      • Configure two resource pools:
  71. Guidance for Using Memory-Based Admission
      For a well-known, memory-bound workload (i.e. you can set MEM_LIMIT on queries):
      1. Group queries by resource requirements (e.g. small/HighThrpt, XXL)
      2. Create resource pools for each group
      3. Control concurrency for pools:
         - large query pools: set pool max mem resources
         - small query pools: set max num running queries limit
      4. Set default query mem limits on all pools (upper bound)
      5. Set the REQUEST_POOL query option to direct queries into pools
         - via the ‘SET’ command
         - [NEW] Add a key-value pair to the JDBC connection string
  72. Admission control demo
      CM Impala Admission Control Configuration: http://vc0726.halxg.cloudera.com:7180/cmf/services/13/pools/configuration
      Admission Control Dashboard: link
  73. Impala Roadmap
      2H 2015
      • SQL Support & Usability
        • Nested structures
        • Kudu updates (beta)
      • Management & Security
        • Record reader service (beta)
        • Finer-grained security (Sentry)
      • Integration
        • Isilon support
        • Python interface (Ibis)
      • Performance & Scale
        • Improved predictability under concurrency
      1H 2016
      • Performance & Scale
        • Continued scalability and concurrency
        • Initial perf/scale improvements
      • Management & Security
        • Improved admission control
        • Resource utilization and showback
      • SQL Support & Usability
        • Dynamic partitioning
        • Improved timestamp compatibility
      2016
      • Performance & Scale
        • >20x performance
        • Multi-threaded joins/aggregations
        • Continued scale work
      • Management & Security
        • Improved YARN integration
        • Automated metadata
      • Integration
        • S3 support
      • SQL Support & Usability
        • Nested types with Avro
        • Date type
        • Added SQL extensions
