Photon Technical Deep Dive: How to Think Vectorized

Description

Photon is a new vectorized execution engine powering Databricks written from scratch in C++. In this deep dive, I will introduce you to the basic building blocks of a vectorized engine by walking you through the evaluation of an example query with code snippets. You will learn about expression evaluation, compute kernels, runtime adaptivity, filter evaluation, and vectorized operations against hash tables.

Transcript

  1. 1. Technical Deep Dive: How to Think Vectorized Alex Behm Tech Lead, Photon
  2. 2. Agenda Introduction Delta Engine, vectorization, micro-benchmarks Expressions Compute kernels, adaptivity, lazy filters Aggregation Hash tables, mixed row/columnar kernels End-to-End Performance
  3. 3. Hardware Changes since 2015 (2010 / 2015 / 2020) Storage: 50 MB/s (HDD) / 500 MB/s (SSD) / 16 GB/s (NVMe), 10X. Network: 1 Gbps / 10 Gbps / 100 Gbps, 10X. CPU: ~3 GHz / ~3 GHz / ~3 GHz ☹ CPUs continue to be the bottleneck. How do we achieve next level performance?
  4. 4. Workload Trends Businesses are moving faster, and as a result organizations spend less time in data modeling, leading to worse performance: ▪ Most columns don’t have “NOT NULL” defined ▪ Strings are convenient, and many date columns are stored as strings ▪ Raw → Bronze → Silver → Gold: from nothing to pristine schema/quality Can we get both agility and performance?
  5. 5. Query Optimizer Photon Execution Engine SQL Spark DataFrame Koalas Caching Delta Engine
  6. 6. Photon New execution engine for Delta Engine to accelerate Spark SQL Built from scratch in C++, for performance: ▪ Vectorization: data-level and instruction-level parallelism ▪ Optimize for modern structured and semi-structured workloads
  7. 7. Vectorization ● Decompose query into compute kernels that process vectors of data ● Typically: Columnar in-memory format ● Cache and CPU friendly: simple predictable loops, many data items, SIMD ● Adaptive: Batch-level specialization, e.g., NULLs or no NULLs ● Modular: Can optimize individual kernels as needed Sounds great! But… what does it really mean? How does it work? Is it worth it? This talk: I will teach you how to think vectorized!
  8. 8. Microbenchmarks Does not necessarily reflect speedups on end-to-end queries
  9. 9. Let’s build a simple engine from scratch. 1. Expression evaluation and adaptivity 2. Filters and laziness 3. Hash tables and mixed column/row operations Vectorization: Basic Building Blocks
  10. 10. Expressions
  11. 11. Running Example SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2 Scan Filter c1 + c2 < 10 Aggregate SUM(c3) We’re not covering this part Operators pass batches of columnar data
  12. 12. Expression Evaluation c1 c2 + < 10 Out SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  13. 13. Expression Evaluation c1 c2 + < 10 Out Kernels! SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  14. 14. Expression Evaluation void PlusKernel(const int64_t* left, const int64_t* right, int32_t num_rows, int64_t* output) { for (int32_t i = 0; i < num_rows; ++i) { output[i] = left[i] + right[i]; } } SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  15. 15. Expression Evaluation void PlusKernel(const int64_t* left, const int64_t* right, int32_t num_rows, int64_t* output) { for (int32_t i = 0; i < num_rows; ++i) { output[i] = left[i] + right[i]; } } 🤔 What about NULLs? SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  16. 16. Expression Evaluation void PlusKernel(const int64_t* left, const bool* left_nulls, const int64_t* right, const bool* right_nulls, int32_t num_rows, int64_t* output, bool* output_nulls) { for (int32_t i = 0; i < num_rows; ++i) { bool is_null = left_nulls[i] || right_nulls[i]; if (!is_null) output[i] = left[i] + right[i]; output_nulls[i] = is_null; } } SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  17. 17. Expression Evaluation void PlusKernel(const int64_t* left, const bool* left_nulls, const int64_t* right, const bool* right_nulls, int32_t num_rows, int64_t* output, bool* output_nulls) { for (int32_t i = 0; i < num_rows; ++i) { bool is_null = left_nulls[i] || right_nulls[i]; if (!is_null) output[i] = left[i] + right[i]; output_nulls[i] = is_null; } } > 30% slower with NULL checks SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  18. 18. Expression Evaluation: Runtime Adaptivity void PlusKernelNoNulls(...); void PlusKernel(...); void PlusEval(Column left, Column right, Column output) { if (!left.has_nulls() && !right.has_nulls()) { PlusKernelNoNulls(left.data(), right.data(), output.data()); } else { PlusKernel(left.data(), left.nulls(), …); } } But what if my data rarely has NULLs?
  19. 19. Expression Evaluation c1 c2 + < 10 Out ● Similar kernel approach ● Can optimize for literals, ~25% faster SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  20. 20. Filters SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2 c1 c2 + < 10 Out ??? ● What exactly is the output? ● What should we do with our input column batch?
  21. 21. Filters: Lazy Representation as Active Rows 5 4 3 2 1 7 2 3 8 5 1 1 1 1 1 7 7 7 7 7 4 5 3 4 5 c1 c2 c3 g1 g2 Scan Filter c1 + c2 < 10 Aggregate SUM(c3) {c1, c2, c3, g1, g2} {c1, c2, c3, g1, g2} Column Batch 3 2 0 c1 + c2 < 10 Active Rows SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  22. 22. Filters: Lazy Representation as Active Rows void PlusNoNullsSomeActiveKernel( const int64_t* left, const int64_t* right, const int32_t* active_rows, int32_t num_rows, int64_t* output) { for (int32_t i = 0; i < num_rows; ++i) { int32_t active_idx = active_rows[i]; output[active_idx] = left[active_idx] + right[active_idx]; } } Active rows concept must be supported throughout the engine ● Adds complexity, code ● Will come in handy for advanced operations like aggregation/join
  23. 23. Aggregation
  24. 24. Hash Aggregation Basic Algorithm 1. Hash and find bucket 2. If bucket empty, initialize entry with keys and aggregation buffers 3. Compare keys and follow probing strategy to resolve collisions 4. Update aggregation buffers according to aggregation function and input Hash Table {g1, g2, SUM}
  25. 25. Hash Aggregation Think vectorized! ● Columnar, batch-oriented ● Type specialized Basic Algorithm 1. Hash and find bucket 2. If bucket empty, initialize entry with keys and aggregation buffers 3. Compare keys and follow probing strategy to resolve collisions 4. Update aggregation buffers according to aggregation function and input
  26. 26. Microbenchmarks Does not necessarily reflect speedups on end-to-end queries SELECT col1, SUM(col2) FROM t GROUP BY col1
  27. 27. Hash Aggregation Hash Table {g1, g2, SUM} {7, 5, 10} {7, 4, 3} 1 1 1 1 1 7 7 7 7 7 4 5 3 4 5 c3 g1 g2 Column Batch h2 h1 h1 h2 h1 hashes
  28. 28. Hash Aggregation 1 1 1 1 1 7 7 7 7 7 4 5 3 4 5 c3 g1 g2 Column Batch h2 h1 h1 h2 h1 hashes buckets Hash Table {g1, g2, SUM} {7, 5, 10} {7, 4, 3}
  29. 29. Hash Aggregation 1 1 1 1 1 7 7 7 7 7 4 5 3 4 5 c3 g1 g2 Column Batch h2 h1 h1 h2 h1 hashes buckets Hash Table {g1, g2, SUM} {7, 5, 10} {7, 4, 3} ● Compare keys ● Create active rows for non-matches (collisions) Collision
  30. 30. Hash Aggregation 1 1 1 1 1 7 7 7 7 7 4 5 3 4 5 c3 g1 g2 Column Batch h2 h1 h1 h2 h1 hashes buckets Hash Table {g1, g2, SUM} {7, 5, 10} {7, 3, 0} {7, 4, 3} ● Advance buckets for all collisions and compare keys ● Repeat until match or empty bucket
  31. 31. Hash Aggregation 1 1 1 1 1 7 7 7 7 7 4 5 3 4 5 c3 g1 g2 Column Batch h2 h1 h1 h2 h1 hashes buckets Hash Table {g1, g2, SUM} {7, 5, 12} {7, 3, 1} {7, 4, 5} ● Update the aggregation state for each aggregate
  32. 32. Mixed Column/Row Kernel Example void AggKernel(AggFn* fn, int64_t* input, int8_t** buckets, int64_t buffer_offset, int32_t num_rows) { for (int32_t i = 0; i < num_rows; ++i) { // Memory access into large array. Good to have a tight loop. int8_t* bucket = buckets[i]; // Make sure this gets inlined. fn->update(input[i], bucket + buffer_offset); } } A “column” whose values are sprayed across rows in the hash table
  33. 33. End-to-End Performance
  34. 34. Why go to the trouble? TPC-DS 30TB Queries/Hour 3.3x speedup 110 32 (Higher is better)
  35. 35. 32 23 columns mixed types 1 column
  36. 36. Real-World Queries ▪ Several preview customers from different industries ▪ Need to have a suitable workload with sufficient Photon feature coverage ▪ Typical experience: 2-3x speedup end-to-end ▪ Mileage varies, best speedup: From 80 → 5 minutes!
  37. 37. ▪ Vectorization: Decompose query into simple loops over vectors of data ▪ Batch-level adaptivity, e.g., NULLs vs no-NULLs ▪ Lazy filter evaluation with active rows → useful concept ▪ Mixed column/row operations for accessing hash tables Recap
  38. 38. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
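
The expression slides (14 to 22) can be tied together in one compilable sketch: a plus kernel with and without NULL checks, batch-level dispatch between the two, and a filter that produces an active-rows list instead of copying data. The `Column` struct, the `uint8_t` null flags, and the `LessThanLiteral` helper are assumptions made for illustration and are not Photon's actual types or API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical column type for this sketch (not Photon's real representation).
struct Column {
  std::vector<int64_t> data;
  std::vector<uint8_t> nulls;  // 1 marks a NULL row; same length as data
  bool has_nulls = false;      // batch-level flag, assumed to be set by the producer
};

// Fast path: no NULL checks, a simple loop the compiler can auto-vectorize.
void PlusKernelNoNulls(const int64_t* left, const int64_t* right,
                       int32_t num_rows, int64_t* output) {
  for (int32_t i = 0; i < num_rows; ++i) output[i] = left[i] + right[i];
}

// General path: the result is NULL if either input is NULL.
void PlusKernel(const int64_t* left, const uint8_t* left_nulls,
                const int64_t* right, const uint8_t* right_nulls,
                int32_t num_rows, int64_t* output, uint8_t* output_nulls) {
  for (int32_t i = 0; i < num_rows; ++i) {
    bool is_null = left_nulls[i] || right_nulls[i];
    if (!is_null) output[i] = left[i] + right[i];
    output_nulls[i] = is_null;
  }
}

// Runtime adaptivity: choose the kernel once per batch, not once per row.
Column PlusEval(const Column& left, const Column& right) {
  assert(left.data.size() == right.data.size());
  const int32_t n = static_cast<int32_t>(left.data.size());
  Column out;
  out.data.assign(n, 0);
  out.nulls.assign(n, 0);
  if (!left.has_nulls && !right.has_nulls) {
    PlusKernelNoNulls(left.data.data(), right.data.data(), n, out.data.data());
  } else {
    PlusKernel(left.data.data(), left.nulls.data(), right.data.data(),
               right.nulls.data(), n, out.data.data(), out.nulls.data());
    out.has_nulls = true;  // conservative: at least one input may contain NULLs
  }
  return out;
}

// Lazy filter: return the positions of passing rows and leave the batch untouched.
// A NULL comparison result is treated as "does not pass".
std::vector<int32_t> LessThanLiteral(const Column& in, int64_t literal) {
  std::vector<int32_t> active_rows;
  const int32_t n = static_cast<int32_t>(in.data.size());
  for (int32_t i = 0; i < n; ++i) {
    bool is_null = in.has_nulls && in.nulls[i];
    if (!is_null && in.data[i] < literal) active_rows.push_back(i);
  }
  return active_rows;
}
```

With made-up inputs c1 = {1, 9, 3} and c2 = {2, 5, 4}, `PlusEval` yields {3, 14, 7} and `LessThanLiteral(sum, 10)` returns the active rows {0, 2}. Downstream kernels would then iterate over those positions, as in the `PlusNoNullsSomeActiveKernel` slide.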

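The hash aggregation slides (24 to 32) describe a batch-at-a-time probe: hash every row, look up buckets, keep the collisions as active rows, advance them, and finally update the aggregation buffers through per-row bucket pointers. The sketch below follows that shape under several assumptions: a fixed-capacity open-addressing table with linear probing, an arbitrary hash function, and bucket lookup fused with key comparison in one loop for brevity. The names `Entry`, `HashAgg`, `AddBatch`, and `Find` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One hash table row: grouping keys plus the SUM aggregation buffer.
struct Entry {
  bool used = false;
  int64_t g1 = 0, g2 = 0, sum = 0;
};

struct HashAgg {
  std::vector<Entry> table;  // never resized here, so Entry pointers stay valid

  // Capacity must be a power of two and larger than the number of distinct groups.
  explicit HashAgg(std::size_t capacity) : table(capacity) {
    assert(capacity > 0 && (capacity & (capacity - 1)) == 0);
  }

  // Arbitrary mixing function chosen for this sketch.
  static uint64_t Hash(int64_t a, int64_t b) {
    return (static_cast<uint64_t>(a) * 0x9E3779B97F4A7C15ULL) ^
           (static_cast<uint64_t>(b) * 0xC2B2AE3D27D4EB4FULL);
  }

  void AddBatch(const int64_t* g1, const int64_t* g2, const int64_t* c3,
                int32_t num_rows) {
    const uint64_t mask = table.size() - 1;
    std::vector<uint64_t> slots(num_rows);
    std::vector<Entry*> buckets(num_rows, nullptr);
    std::vector<int32_t> active(num_rows), next_active;

    // Step 1: hash all rows; initially every row is active.
    for (int32_t i = 0; i < num_rows; ++i) {
      slots[i] = Hash(g1[i], g2[i]) & mask;
      active[i] = i;
    }
    // Steps 2-3: claim empty buckets, compare keys, and keep only the
    // collisions as active rows for the next probing round.
    while (!active.empty()) {
      next_active.clear();
      for (int32_t row : active) {
        Entry& e = table[slots[row]];
        if (!e.used) { e.used = true; e.g1 = g1[row]; e.g2 = g2[row]; }
        if (e.g1 == g1[row] && e.g2 == g2[row]) {
          buckets[row] = &e;
        } else {
          slots[row] = (slots[row] + 1) & mask;  // advance bucket (linear probing)
          next_active.push_back(row);
        }
      }
      active.swap(next_active);
    }
    // Step 4: mixed column/row kernel: columnar input, row-wise buffers.
    for (int32_t i = 0; i < num_rows; ++i) buckets[i]->sum += c3[i];
  }

  // Returns the entry for a group, or nullptr if the group is absent.
  const Entry* Find(int64_t k1, int64_t k2) const {
    const uint64_t mask = table.size() - 1;
    for (uint64_t s = Hash(k1, k2) & mask; table[s].used; s = (s + 1) & mask) {
      if (table[s].g1 == k1 && table[s].g2 == k2) return &table[s];
    }
    return nullptr;
  }
};
```

Seeding the table with groups (7, 5) = 10 and (7, 4) = 3 and then adding the slide's batch (g1 all 7, g2 = 4 5 3 4 5, c3 all 1) gives sums 12, 5, and 1 for groups (7, 5), (7, 4), and (7, 3), matching the final hash table state on slide 31.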