In Spark SQL, the physical plan provides the fundamental information about how a query will be executed. The objective of this talk is to build understanding and familiarity with query plans in Spark SQL, and to use that knowledge to achieve better performance from Apache Spark queries. We will walk you through the most common operators you might find in a query plan and explain the relevant information they expose about the execution. Once you understand the query plan, you can look for weak spots and rewrite the query to obtain a better plan and more efficient execution.
The main content of this talk is based on the Spark source code, but it also reflects real-life queries that we run while processing data. We will show examples of query plans, explain how to interpret them, and describe what information can be taken from them. We will also describe what happens under the hood when the plan is generated, focusing mainly on the physical planning phase. In short, we want to share what we have learned from both the Spark source code and the real-life queries we run in our daily data processing.
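For readers who want to follow along, here is a minimal PySpark sketch of how to print the plans discussed in the talk; the session name and the toy query are invented, and explain(mode="formatted") requires Spark 3.0+:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("plan-inspection").getOrCreate()

    orders = spark.range(1_000_000).withColumn("amount", F.rand() * 100)
    agg = (orders
           .groupBy((F.col("id") % 10).alias("bucket"))
           .agg(F.sum("amount").alias("total")))

    # Prints the parsed, analyzed and optimized logical plans plus the physical plan.
    agg.explain(True)
    # Spark 3.0+ also offers a more readable operator-by-operator view:
    # agg.explain(mode="formatted")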
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing using analytics database techniques. Relational queries are compiled into executable physical plans consisting of transformations and actions on RDDs, together with generated Java code. The code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT compiler into native machine code at runtime. This talk will take a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
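As a small illustration of the generated-code path described above, recent Spark versions can dump the Java source produced by whole-stage code generation for a query; a sketch assuming an active SparkSession named spark (the query itself is arbitrary):

    df = (spark.range(100_000)
               .selectExpr("id", "id * 2 AS doubled")
               .filter("doubled > 100"))

    # "codegen" mode (Spark 3.0+) prints the Java source produced by whole-stage code generation;
    # on the Scala side, df.queryExecution.debug.codegen() gives similar output on older versions.
    df.explain(mode="codegen")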
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai (Databricks)
Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees.
In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Optimizing Spark jobs through a true understanding of Spark core. Learn: What is a partition? What is the difference between read, shuffle, and write partitions? How do you increase parallelism and decrease the number of output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
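A few of the knobs those questions point at, sketched in PySpark; the values and paths are placeholders, not recommendations, and an active SparkSession named spark is assumed:

    # Number of partitions produced by shuffles (joins, aggregations); default is 200.
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    df = spark.read.parquet("/data/events")        # hypothetical input path
    print(df.rdd.getNumPartitions())               # read partitions

    # Increase parallelism by redistributing on a key before a wide operation...
    repartitioned = df.repartition(400, "customer_id")

    # ...and reduce the number of output files without a full shuffle when writing.
    repartitioned.coalesce(32).write.mode("overwrite").parquet("/data/events_out")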
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of the Data Source API V2, with a comparison to the Data Source API V1. We will also demonstrate how to implement a file-based data source using the Data Source API V2 to show its generality and flexibility.
Properly shaping partitions and your jobs to enable powerful optimizations, eliminate skew, and maximize cluster utilization. We will explore various Spark partition-shaping methods along with several optimization strategies, including join optimizations, aggregate optimizations, salting, and multi-dimensional parallelism.
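One common way to implement the salting strategy mentioned above is to add a random salt column on the skewed side and replicate the smaller side across all salt values; a sketch with invented DataFrame and column names:

    from pyspark.sql import functions as F

    NUM_SALTS = 16  # assumes DataFrames `facts` (large, skewed) and `dims` (small) sharing a customer_id key

    # Skewed side: append a random salt so one hot key spreads across NUM_SALTS tasks.
    salted_facts = facts.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

    # Small side: replicate each row once per salt value so every salted key still matches.
    salted_dims = dims.withColumn(
        "salt", F.explode(F.array(*[F.lit(i) for i in range(NUM_SALTS)]))
    )

    joined = salted_facts.join(salted_dims, ["customer_id", "salt"]).drop("salt")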
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... (Databricks)
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate, and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka, and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL, spanning the entire lifecycle of a query execution. The audience will gain a deeper understanding of Spark SQL and learn how to tune its performance.
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... (Databricks)
Parquet is a very popular column-based format. Spark can automatically skip useless data using Parquet file statistics, such as min-max statistics, via pushdown filters. In addition, Spark users can enable the Parquet vectorized reader to read Parquet files in batches. These features improve Spark performance greatly and save both CPU and I/O. Parquet is the default data format of the data warehouse at Bytedance. In practice, we find that Parquet pushdown filters work poorly and far too much unnecessary data is read, because the statistics have no discrimination across Parquet row groups (column data is out of order when written to Parquet files by ETL jobs).
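The two features discussed here are controlled by session settings and by writing filters that Spark can push into the Parquet scan; a hedged sketch with invented paths and columns, assuming an active SparkSession named spark:

    from pyspark.sql import functions as F

    # Both settings default to true in recent Spark releases.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")          # row-group skipping via min/max stats
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")  # batch (columnar) decoding

    events = spark.read.parquet("/warehouse/events")                    # hypothetical path

    # A plain comparison on a top-level column can be pushed into the Parquet reader;
    # the PushedFilters entry of the scan node in explain() shows whether it was.
    recent = events.filter(F.col("event_date") >= "2020-01-01")
    recent.explain()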
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
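For reference, the main knobs of the unified memory model this talk covers can be set when the session is created; a sketch showing the default values, not tuning advice:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("memory-config-sketch")
        # Fraction of (heap - 300MB) shared by execution and storage (default 0.6).
        .config("spark.memory.fraction", "0.6")
        # Portion of that unified region shielded for cached (storage) data (default 0.5).
        .config("spark.memory.storageFraction", "0.5")
        .getOrCreate()
    )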
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J... (Databricks)
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide performance tuning and testing tips for your Spark applications.
Fine Tuning and Enhancing Performance of Apache Spark Jobs (Databricks)
Apache Spark defaults provide decent performance for large data sets but leave room for significant performance gains if you tune parameters based on your resources and job.
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark (Bo Yang)
The slides explain how shuffle works in Spark and help people understand more details about Spark internals. They show how the major classes are implemented, including ShuffleManager (SortShuffleManager), ShuffleWriter (SortShuffleWriter, BypassMergeSortShuffleWriter, UnsafeShuffleWriter), and ShuffleReader (BlockStoreShuffleReader).
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
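To make the "many small files" point concrete, one common pattern is to repartition by the partitioning column before writing so each partition directory receives a bounded number of files; a sketch with an invented DataFrame, column, and output path:

    # Assumes a DataFrame `df` with an event_date column and an active SparkSession.
    (df
     .repartition("event_date")              # co-locate each date's rows so a partition directory gets one file, not hundreds
     .write
     .partitionBy("event_date")              # Hive-style directory layout usable for partition pruning
     .option("compression", "snappy")
     .mode("overwrite")
     .parquet("/warehouse/events_parquet"))  # hypothetical output path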
Building a SIMD Supported Vectorized Native Engine for Spark SQL (Databricks)
Spark SQL works very well with structured row-based data. Vectorized readers and writers for Parquet/ORC can make I/O much faster, and whole-stage code generation improves performance through Java JIT-compiled code. However, the Java JIT is usually not very good at utilizing the latest SIMD instructions under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as Gandiva, an LLVM-based SQL engine. These native libraries can accelerate Spark SQL by reducing the CPU usage for both I/O and execution.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F... (Databricks)
In the big data field, Spark SQL is an important data processing module that lets Apache Spark work with structured, row-based data in the majority of operators. A field-programmable gate array (FPGA) with highly customized intellectual property (IP) can not only bring better performance but also lower power consumption when accelerating the CPU-intensive segments of an application.
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... (Databricks)
Nowadays, people are creating, sharing, and storing data at a faster pace than ever before, and effective data compression and decompression can significantly reduce the cost of data usage. Apache Spark is a general distributed computing engine for big data analytics, and because it stores and shuffles large amounts of data across the cluster at runtime, the compression/decompression codecs can affect end-to-end application performance in many ways.
However, there is a trade-off between storage size and compression/decompression throughput (CPU computation). Balancing compression speed and ratio is a very interesting topic, particularly while both software algorithms and the CPU instruction set keep evolving. Apache Spark provides a very flexible compression codec interface with default implementations such as GZip, Snappy, LZ4, and ZSTD, and the Intel Big Data Technologies team has also implemented more codecs based on the latest Intel platforms, such as ISA-L (igzip), LZ4-IPP, Zlib-IPP, and ZSTD. In this session, we compare the characteristics of those algorithms and implementations by running different micro workloads as well as end-to-end workloads on different generations of Intel x86 platforms and disks.
This session should help big data software engineers choose the proper compression/decompression codecs for their applications, and we will also present methodologies for measuring and tuning the performance bottlenecks of typical Apache Spark workloads.
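For context, the codec choices discussed here are exposed as ordinary Spark configuration; a sketch of where they are set (the values are examples, not recommendations):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("codec-sketch")
        # Codec for shuffle, broadcast and spill blocks; lz4 is the default, snappy/zstd/lzf also ship with Spark.
        .config("spark.io.compression.codec", "zstd")
        # Codec used when writing Parquet files; snappy is the default.
        .config("spark.sql.parquet.compression.codec", "zstd")
        .getOrCreate()
    )

    spark.range(1_000_000).write.mode("overwrite").parquet("/tmp/events_zstd")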
Apache Spark presentation at HasGeek FifthElephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Apache Spark™ is a fast and general engine for large-scale data processing. Spark is written in Scala and runs on top of the JVM, but Python is one of the officially supported languages. How does it actually work? How can Python communicate with Java/Scala? In this talk, we'll dive into the PySpark internals and try to understand how to write and test high-performance PySpark applications.
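One concrete consequence of that Python-to-JVM boundary is serialization cost; Arrow-backed pandas UDFs move data in columnar batches instead of row by row. A sketch assuming an active SparkSession named spark (the type-hint form shown requires Spark 3.0+):

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
        # Executed on whole Arrow batches, avoiding per-row Python <-> JVM serialization.
        return (temp_f - 32.0) * 5.0 / 9.0

    readings = spark.range(1_000).withColumn("temp_f", F.rand() * 100)
    readings.withColumn("temp_c", fahrenheit_to_celsius(F.col("temp_f"))).show(5)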
Dynamic Partition Pruning in Apache Spark (Databricks)
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPCDS queries.
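The mechanism described here is enabled by default in Spark 3.0+ and shows up in the plan as a dynamic pruning subquery on the fact table's partition column; a sketch with an invented star schema, assuming an active SparkSession named spark:

    from pyspark.sql import functions as F

    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")  # default in 3.0+

    sales = spark.read.parquet("/warehouse/sales")      # hypothetical fact table partitioned by date_id
    dates = spark.read.parquet("/warehouse/dim_date")   # small dimension table

    q = (sales.join(dates, "date_id")
              .filter(F.col("fiscal_quarter") == "2020-Q1")
              .groupBy("store_id")
              .agg(F.sum("amount").alias("revenue")))

    # With dynamic partition pruning, the sales scan only reads the date_id partitions
    # that survive the dimension filter.
    q.explain()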
Join operations in Apache Spark are often the biggest source of performance problems and even full-blown exceptions. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames, down to the level of how Spark distributes the data within the cluster. You'll also find out how to work around common errors and even handle the trickiest corner cases we've encountered! After this talk, you should be able to write performant joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
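A minimal illustration of the two basic strategies these sessions discuss (shuffle sort-merge join vs. broadcast hash join) and how to influence the choice; the tables and paths are invented, and the broadcast hint is only appropriate when that side genuinely fits in executor memory:

    from pyspark.sql.functions import broadcast

    orders = spark.read.parquet("/warehouse/orders")         # large
    customers = spark.read.parquet("/warehouse/customers")   # small

    # Default for two large inputs: shuffle both sides and run a sort-merge join on the key.
    merged = orders.join(customers, "customer_id")

    # Ship the small table to every executor and run a broadcast hash join instead.
    hinted = orders.join(broadcast(customers), "customer_id")

    # Tables below this size (default 10MB) are broadcast automatically.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

    hinted.explain()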
This is the presentation I gave at JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames, and some other high-level material, and can be used as an introduction to Apache Spark.
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Hive Bucketing in Apache Spark with Tejas Patil (Databricks)
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you'll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook's performance tests have shown bucketing to make Spark 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in 2-3x savings when compared to Hive. You'll also hear about real-world applications of bucketing, like loading cumulative tables with daily deltas, and the characteristics that help identify suitable candidate jobs that can benefit from bucketing.
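For orientation, the write side of the bucketing technique described above looks roughly like this in the DataFrame API; the table, column, and bucket count are invented:

    # Assumes a DataFrame `clicks` and a Hive-compatible metastore for saveAsTable.
    (clicks
     .write
     .bucketBy(64, "user_id")     # hash user_id into 64 buckets per writing task
     .sortBy("user_id")           # optional: pre-sort within each bucket file
     .mode("overwrite")
     .format("parquet")
     .saveAsTable("warehouse.clicks_bucketed"))

    # A join of two tables bucketed identically on user_id can run as a single-stage sort-merge join.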
Apache Spark in Depth: Core Concepts, Architecture & Internals (Anton Kirillov)
The slides cover core Apache Spark concepts such as RDDs, the DAG, the execution workflow, how stages of tasks are formed, and the shuffle implementation, and also describe the architecture and main components of the Spark driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo containing Spark application examples and a dockerized Hadoop environment to experiment with.
Databricks: What We Have Learned by Eating Our Dog Food (Databricks)
"Databricks Unified Analytics Platform (UAP) is a cloud-based service for running all analytics in one place - from highly reliable and performant data pipelines to state-of-the-art Machine Learning. From the original creators of Apache Spark and MLflow, it provides data science and engineering teams ready to use pre-packaged clusters with optimized Apache Spark and various ML frameworks coupled with powerful collaboration capabilities to improve productivity across the ML lifecycle. Yada yada yada... But in addition to being a vendor Databricks is also a user of UAP.
So, what have we learned by eating our own dogfood? Attend a “from the trenches report” from Suraj Acharya, Director Engineering responsible for Databricks’ in-house data engineering team how his team put Databricks technology to use, the lessons they have learned along the way and best practices for using Databricks for data engineering.
"
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... (Databricks)
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized Spark environment.
In our examples, we will gather Spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Anatomy of Data Frame API: A deep dive into Spark Data Frame API (datamantra)
In this presentation, we discuss the internals of the Spark DataFrame API. All the code discussed in this presentation is available at https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK (zmhassan)
As Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized Spark environment. In our examples, we will gather Spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
This short text will get you up to speed in no time on creating visualizations using R's ggplot2 package. It was developed as part of a training for those who had no prior experience in R and limited knowledge of general programming concepts. It's a must-have initial guide for those exploring the field of data science.
Simplifying Change Data Capture using Databricks Delta (Databricks)
In this talk, we will present recent enhancements to the techniques previously discussed in this blog: https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html. We will start by discussing the different CDC architectures that can be deployed in concert with Databricks Delta. We will then use notebooks to demonstrate updated CDC SQL and look at performance tuning considerations for both batch as well as streaming CDC pipelines into Delta.
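As a rough idea of what such CDC SQL looks like (not the exact notebooks from the talk), applying a batch of change records to a Delta table is typically a single MERGE statement; the updates view, column names, and target table are invented:

    # `updates_df` is assumed to hold one row per change event from the CDC feed,
    # with an `op` column marking deletes; `warehouse.customers` is an existing Delta table.
    updates_df.createOrReplaceTempView("updates")

    spark.sql("""
      MERGE INTO warehouse.customers AS t
      USING updates AS s
        ON t.customer_id = s.customer_id
      WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)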
Managing Apache Spark Workload and Automatic Optimizing (Databricks)
eBay relies heavily on Spark as one of its most significant data engines. In the data warehouse domain, millions of batch queries run every day against 6000+ key DW tables, which hold over 22 PB of data (compressed) and keep growing every year. In the machine learning domain, Spark is playing an increasingly significant role. We presented our migration from an MPP database to Apache Spark at last year's Europe Summit. Looking at the entire infrastructure, however, managing workload and efficiency for all Spark jobs across our data centers is still a big challenge. Our team runs the whole big data platform infrastructure and the management tools on top of it, helping our customers - not only DW engineers and data scientists, but also AI engineers - get on the same page. In this session, we will introduce how all of them benefit from a self-service workload management portal/system. First, we will share the basic architecture of this system to illustrate how it collects metrics from multiple data centers and how it detects abnormal workloads in real time. We developed a component called Profiler that extends Spark core to support customized metric collection. Next, we will demonstrate some real user stories at eBay to show how the self-service system reduces effort on both the customer side and the infrastructure-team side; that is the highlight of the Spark job analysis and diagnosis part. Finally, we will introduce some upcoming advanced features that move toward an automatic optimization workflow rather than just alerting.
Speaker: Lantao Jin
"Lessons learned using Apache Spark for self-service data prep in SaaS world"Pavel Hardak
Slide deck for the presentation we delivered at Spark+AI Summit 2019 in San Francisco.
In this talk, we will share how we benefited from using Apache Spark to build Workday's new analytics product, as well as some of the challenges we faced along the way. Workday Prism Analytics was launched in September 2017 and went from zero to one hundred enterprise customers in under 15 months. Leveraging innovative technologies from Platfora acquisition gave us a jump-start, but it still required a considerable engineering effort to integrate with Workday ecosystem. We enhanced workflows, added new functionalities and transformed Hadoop-based on-premises engines to run on Workday cloud. All of this would not have been possible without Spark, to which we migrated most of earlier MapReduce code. This enabled us to shorten time to market while adding advanced functionality with high performance and rock-solid reliability. One of the key components of our product is Self-Service Data Prep. Powerful and intuitive UI empowers users to create ETL-like pipelines, blending Workday and external data, while providing immediate feedback by re-executing the pipelines on sampled data. Behind the scenes, we compile these pipelines into plans to be executed by Spark SQL, taking advantage of the years of work done by the open source community to improve the engine's query optimizer and physical execution. We will outline the high-level implementation of product features, mapping logical models and sub-systems, adding new data types on top of Spark, and using caches effectively and securely, in multiple Spark clusters running under YARN, while sharing HDFS resources. We will also describe several real-life war stories, caused by customers stretching the product boundaries in complexity and performance.
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World (Databricks)
In this talk, we will share how we benefited from using Apache Spark to build Workday's new analytics product, as well as some of the challenges we faced along the way. Workday Prism Analytics was launched in September 2017, and went from zero to one hundred enterprise customers in under 15 months. Leveraging innovative technologies from Platfora acquisition gave us a jump-start, but it still required a considerable engineering effort to integrate with Workday ecosystem. We enhanced workflows, added new functionalities and transformed Hadoop-based on-premises engines to run on Workday cloud. All of this would not have been possible without Spark, to which we migrated most of earlier MapReduce code. This enabled us to shorten time to market while adding advanced functionality with high performance and rock-solid reliability. One of the key components of our product is Self-Service Data Prep. Powerful and intuitive UI empower users to create ETL-like pipelines, blending Workday and external data, while providing immediate feedback by re-executing the pipelines on sampled data. Behind the scenes, we compile these pipelines into plans to be executed by Spark SQL, taking advantage of the years of work done by the open source community to improve the engine's query optimizer and physical execution. We will outline the high-level implementation of product features, mapping logical models and sub-systems, adding new data types on top of Spark, and using caches effectively and securely, in multiple Spark clusters running under YARN, while sharing HDFS resources. We will also describe several real-life war stories, caused by customers stretching the product boundaries in complexity and performance. We conclude with the unique Spark tuning guidelines distilled from our experience of running it in production, in order to ensure that the system is able to execute complex, nested pipelines with multiple self-joins and self-unions.
Speakers: Pavel Hardak, Jianneng Li
Data and AI summit: data pipelines observability with open lineage (Julien Le Dem)
Presentation of data lineage and observability with OpenLineage at the "Data and AI Summit" (formerly Spark Summit), with a focus on the Apache Spark integration for OpenLineage.
Observability for Data Pipelines With OpenLineage (Databricks)
Data is increasingly becoming core to many products. Whether to provide recommendations for users, getting insights on how they use the product, or using machine learning to improve the experience. This creates a critical need for reliable data operations and understanding how data is flowing through our systems. Data pipelines must be auditable, reliable, and run on time. This proves particularly difficult in a constantly changing, fast-paced environment.
Collecting this lineage metadata as data pipelines are running provides an understanding of dependencies between many teams consuming and producing data and how constant changes impact them. It is the underlying foundation that enables the many use cases related to data operations. The OpenLineage project is an API standardizing this metadata across the ecosystem, reducing complexity and duplicate work in collecting lineage information. It enables many projects, consumers of lineage in the ecosystem whether they focus on operations, governance or security.
Marquez is an open source project part of the LF AI & Data foundation which instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making visible dependencies across organizations and technologies as they change over time.
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A... (Databricks)
Project Hydrogen is a major Apache Spark initiative to bring state-of-the-art AI and Big Data solutions together.
It contains three major projects:
1) barrier execution mode
2) optimized data exchange and
3) accelerator-aware scheduling.
A basic implementation of barrier execution mode was merged into Apache Spark 2.4.0, and the community is working on the latter two. In this talk, we will present progress updates to Project Hydrogen and discuss the next steps.
First, we will review the barrier execution mode implementation from Spark 2.4.0. It enables developers to embed distributed training jobs properly on a Spark cluster. We will demonstrate distributed AI integrations built on top of it, e.g., Horovod and Distributed TensorFlow. We will also discuss the technical challenges of implementing those integrations and future work.
Second, we will give updates on accelerator-aware scheduling and how it shall help accelerate your Spark training jobs. We will also outline on-going work for optimized data exchange.
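For reference, barrier execution mode itself is a small RDD-level API available since Spark 2.4; a toy sketch that gang-schedules all tasks of a stage before doing any work, assuming an active SparkSession named spark:

    from pyspark import BarrierTaskContext

    def train_slice(rows):
        ctx = BarrierTaskContext.get()
        ctx.barrier()   # every task of this stage is now running, so they can rendezvous
        # A real integration (e.g. Horovod) would launch its training worker for this slice here.
        yield (ctx.partitionId(), sum(1 for _ in rows))

    rdd = spark.sparkContext.parallelize(range(10_000), numSlices=4)
    print(rdd.barrier().mapPartitions(train_slice).collect())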
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks' open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
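The stage-level scheduling API referred to above is exposed on RDDs in Spark 3.1+; a hedged sketch of requesting GPU executors for just the training stage (resource names and amounts are illustrative, and the cluster must be configured for GPU discovery):

    from pyspark.resource import (ExecutorResourceRequests, TaskResourceRequests,
                                  ResourceProfileBuilder)

    # Resources for the training stage only; the preceding ETL keeps the default profile.
    exec_reqs = ExecutorResourceRequests().cores(4).memory("16g").resource("gpu", 1)
    task_reqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
    gpu_profile = ResourceProfileBuilder().require(exec_reqs).require(task_reqs).build

    # `etl_df` is a hypothetical DataFrame produced by the data-preparation stages.
    training_rdd = etl_df.rdd.withResources(gpu_profile)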
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered: understanding key traits of Apache Spark on Kubernetes; things to know when running Apache Spark on Kubernetes, such as autoscaling; and a demonstration of analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations over change data that are not “abelian groups”.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query it N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
3. ● David Vrba Ph.D.
● Data Scientist & Data Engineer at Socialbakers
○ Developing ETL pipelines in Spark
○ Optimizing Spark jobs
○ Productionalizing Spark applications
● Lecturing Spark trainings and workshops
○ Studying Spark source code
3#UnifiedDataAnalytics #SparkAISummit
4. Goal
● Share what we have learned by studying Spark source code and by using
Spark on a daily basis
○ Processing data from social networks (Facebook, Twitter, Instagram, ...)
○ Regularly processing data on scales up to 10s of TBs
○ Understanding the query plan is a must for efficient processing
4#UnifiedDataAnalytics #SparkAISummit
5. Outline
● Two part talk
○ In the first part we cover some theory
■ How query execution works
■ Physical Plan operators
■ Where to look for relevant information in Spark UI
○ In the second part we show some examples
■ Model queries with particular optimizations
■ Useful tips
○ Q/A
5#UnifiedDataAnalytics #SparkAISummit
6. Part I
● Query Execution
● Physical Plan Operators
● Spark UI
6#UnifiedDataAnalytics #SparkAISummit
7. Query Execution
7#UnifiedDataAnalytics #SparkAISummit
Query
→ Logical Planning: Parser → Unresolved Plan → Analyzer → Analyzed Plan → Cache Manager → Optimizer → Optimized Plan
→ Physical Planning: Query Planner → Spark Plan → Preparation → Executed Plan
→ Execution: RDD DAG → DAG Scheduler → Stages + Tasks → Task Scheduler → Executor
(The planning phases are the focus of today’s session.)
8. Logical Planning
● Logical plan is created, analyzed and optimized
● Logical plan
○ Tree representation of the query
○ It is an abstraction that carries information about what is supposed to happen
○ It does not contain precise information on how it happens
○ Composed of
■ Relational operators
– Filter, Join, Project, ... (they represent DataFrame transformations)
■ Expressions
– Column transformations, filtering conditions, joining conditions, ...
8#UnifiedDataAnalytics #SparkAISummit
9. Physical Planning
● The logical planning phase produces the Optimized Logical Plan
● The execution layer does not understand DataFrames / the Logical Plan (LP)
● The Logical Plan has to be converted to a Physical Plan (PP)
● Physical Plan
○ Bridge between LP and RDDs
○ Similarly to Logical Plan it is a tree
○ Contains more specific description of how things should happen (specific
choice of algorithms)
○ Uses lower level primitives - RDDs
9#UnifiedDataAnalytics #SparkAISummit
10. Physical Planning - 2 phases
10#UnifiedDataAnalytics #SparkAISummit
Spark Plan → (Additional Rules) → Executed Plan
● Spark Plan
  ○ Generated by the Query Planner using Strategies
  ○ For each node in LP there is a node in PP
● Strategy example: JoinSelection
  ○ Join in the Logical Plan becomes SortMergeJoin or BroadcastHashJoin in the Physical Plan
● Executed Plan
  ○ Final version of the query plan (after the Additional Rules are applied)
  ○ This is what will be executed
  ○ Generates the RDD code
See the plans:
df.queryExecution.sparkPlan
df.queryExecution.executedPlan
df.explain()
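For example, a minimal sketch of inspecting the plans from a Spark shell (the table and column names are illustrative, following the posts_fb example used later):

import spark.implicits._

val df = spark.table("posts_fb")
  .filter($"month" === 5)
  .groupBy("profile_id")
  .count()

df.queryExecution.sparkPlan    // Spark Plan, before the preparation rules
df.queryExecution.executedPlan // Executed Plan, after the preparation rules
df.explain()                   // prints the physical plan
df.explain(true)               // prints the parsed, analyzed, optimized and physical plans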
17. Let’s see some operators
● FileScan
● Exchange
● HashAggregate, SortAggregate, ObjectHashAggregate
● SortMergeJoin
● BroadcastHashJoin
17#UnifiedDataAnalytics #SparkAISummit
18. FileScan
18#UnifiedDataAnalytics #SparkAISummit
● Represents reading the data from a file format
spark.table(“posts_fb”)
.filter($“month” === 5)
.filter($”profile_id” === ...)
● table: posts_fb
● partitioned by month
● bucketed by profile_id into 20 buckets
● 1 file per bucket
26. 26#UnifiedDataAnalytics #SparkAISummit
We read from a bucketed table (20 buckets, 1 file per bucket)
Size of the whole partition: this is what will be read if we turn off bucketing (no bucket pruning)
If bucketing is OFF:
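A small sketch of how bucketing can be toggled for this comparison; the configuration key below is the standard Spark setting (true by default):

// Turn bucketing off: the FileScan can no longer prune buckets and reads the whole partition
spark.conf.set("spark.sql.sources.bucketing.enabled", "false")

// Turn it back on to get bucket pruning on the profile_id filter again
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")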
41. Rules applied in preparation
• EnsureRequirements
• ReuseExchange
• ...
41#UnifiedDataAnalytics #SparkAISummit
Spark Plan → (Additional Rules) → Executed Plan
42. EnsureRequirements
● Adds Exchange & Sort operators to the plan
● How does it work?
● Each operator in PP has:
○ outputPartitioning
○ outputOrdering
○ requiredChildDistribution (this is a requirement for partitioning)
○ requiredChildOrdering (this is a requirement for ordering)
● If the requirement is not met, Exchange or Sort are used
42#UnifiedDataAnalytics #SparkAISummit
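A minimal sketch (illustrative data) of EnsureRequirements at work: we force a SortMergeJoin whose required child distribution and ordering are not met by the inputs, so Exchange and Sort get added to the plan:

import spark.implicits._

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // force a SortMergeJoin

val left  = spark.range(10000000).select(($"id" % 1000).as("key"), $"id".as("a"))
val right = spark.range(10000000).select(($"id" % 1000).as("key"), $"id".as("b"))

left.join(right, "key").explain()
// Neither input is hash-partitioned or sorted by "key", so EnsureRequirements inserts
// Exchange hashpartitioning(key, 200) and Sort [key ASC] on both sides of the SortMergeJoin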
45. 45#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
  requires from each child: HashPartitioning(id, 200) and ordering by id
  left:  Project <- FileScan   (outputPartitioning: Unknown, outputOrdering: None)
  right: Project <- FileScan   (outputPartitioning: Unknown, outputOrdering: None)
Since neither requirement is met, Exchange and Sort will be added to both branches.
49. Example with bucketing
● If the sources are tables bucketed by id into 50 buckets:
○ FileScan will have outputPartitioning as HashPartitioning(id, 50)
○ It will pass it to Project
○ This will happen in both branches of the Join
○ The requirement for partitioning is met => no Exchange needed
49#UnifiedDataAnalytics #SparkAISummit
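A sketch of this scenario (dfA, dfB, the paths and table names are illustrative); both sides are written with the same bucketing spec and then joined:

// Write both join inputs bucketed by the join key into the same number of buckets
dfA.write.bucketBy(50, "id").sortBy("id").option("path", pathA).saveAsTable("table_a")
dfB.write.bucketBy(50, "id").sortBy("id").option("path", pathB).saveAsTable("table_b")

// With matching bucketing on both sides the partitioning requirement of the
// SortMergeJoin is already met, so the plan should contain no Exchange
spark.table("table_a").join(spark.table("table_b"), "id").explain()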
50. 50#UnifiedDataAnalytics #SparkAISummit
SortMergeJoin (id)
  requires from each child: HashPartitioning(id, 200) and ordering by id
  left:  Project <- FileScan   (outputPartitioning: HashPartitioning(id, 50))
  right: Project <- FileScan   (outputPartitioning: HashPartitioning(id, 50))
FileScan reports HashPartitioning(id, 50) and Project passes it on, so the partitioning requirement is met on both sides: no need for Exchange.
53. What about the Sort ?
53#UnifiedDataAnalytics #SparkAISummit
dataDF
  .write
  .bucketBy(50, “id”)
  .sortBy(“id”)
  .option(“path”, outputPath)
  .saveAsTable(tableName)
It depends on bucketing:
● If one file per bucket is created
  ○ FileScan will pick up the information about order
  ○ No need for Sort in the final plan
● If there are more files per bucket
  ○ FileScan will not use the information about order (the data would have to be sorted globally)
  ○ Spark will add Sort to the final plan (see the sketch below for one way to get a single file per bucket)
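A sketch (same illustrative names as above) of one way, under these assumptions, to end up with exactly one file per bucket so the sort order can be reused:

// Repartition by the bucketing key into the same number of partitions as buckets,
// so each task writes exactly one file into its bucket
dataDF
  .repartition(50, $"id")
  .write
  .bucketBy(50, "id")
  .sortBy("id")
  .option("path", outputPath)
  .saveAsTable(tableName)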
58. ReuseExchange
● Result of each Exchange (shuffle) is written to disk (shuffle-write)
58#UnifiedDataAnalytics #SparkAISummit
● The rule checks if the plan has the same Exchange-branches
○ if the branches are the same, the shuffle-write will also be the same
● Spark can reuse it since the data (shuffle-write) is persisted on disks
61. ReuseExchange
● The branches must be the same
● Only Exchange can be reused
● Can be turned off by internal configuration
○ spark.sql.exchange.reuse
● Otherwise you cannot control it directly, but there is an indirect way: tweaking the query (see the example later)
● It reduces the I/O and network cost:
○ One scan over the data
○ One shuffle write
61#UnifiedDataAnalytics #SparkAISummit
62. Part I conclusion
● Physical Plan operators carry information about execution
○ it is useful to pair information from the query plan with info about stages/tasks
● ReuseExchange allows Spark to reduce I/O and network cost
○ if Exchange sub-branches are the same Spark will compute it only once
● EnsureRequirements adds Exchange and Sort to the query plan
○ it makes sure that requirements of the operators are met
62#UnifiedDataAnalytics #SparkAISummit
67. Example I - exchange reuse
67#UnifiedDataAnalytics #SparkAISummit
● Take all profiles where
○ sum of interactions is bigger than 100
○ or sum of interactions is less than 20
71. 71#UnifiedDataAnalytics #SparkAISummit
val dfSumBig = df
.groupBy(“profile_id”)
.agg(sum(“interactions”).alias(“sum”))
.filter($”sum” > 100)
No need to optimize
val dfSumSmall = df
.groupBy(“profile_id”)
.agg(sum(“interactions”).alias(“sum”))
.filter($”sum” < 20)
dfSumBig.union(dfSumSmall)
.write(...)
This Exchange is
reused. The data is
scanned only once
In our example:
72. 72#UnifiedDataAnalytics #SparkAISummit
val dfSumBig = df
.groupBy(“profile_id”)
.agg(sum(“interactions”).alias(“sum”))
.filter($”sum” > 100)
val dfSumSmall = df
.groupBy(“profile_id”)
.agg(sum(“interactions”).alias(“sum”))
.filter($”sum” < 20)
dfSumBig.union(dfSumSmall)
.write(...)
Let’s suppose the assignment has changed:
● For dfSumSmall we want to consider
only specific profiles (profile_id is not
null)
73. 73#UnifiedDataAnalytics #SparkAISummit
Now add the additional filter to one DF:
val dfSumBig = df
  .groupBy(“profile_id”)
  .agg(sum(“interactions”).alias(“sum”))
  .filter($”sum” > 100)

val dfSumSmall = df
  .filter($”profile_id”.isNotNull)
  .groupBy(“profile_id”)
  .agg(sum(“interactions”).alias(“sum”))
  .filter($”sum” < 20)

dfSumBig.union(dfSumSmall)
  .write(...)

The two aggregation branches are no longer identical, so the Exchange is not reused.
75. val dfSumSmall = df
.groupBy(“profile_id”)
.agg(sum(“interactions”).alias(“sum”))
.filter($”sum” < 20)
.filter($”profile_id”.isNotNull)
● How can we optimize this?
● Just calling the filter in a different place does not help
75#UnifiedDataAnalytics #SparkAISummit
The Spark optimizer will move this filter back by applying the PushDownPredicate rule
76. ● We can limit the optimizer by excluding the rule
76#UnifiedDataAnalytics #SparkAISummit
spark.conf.set(
“spark.sql.optimizer.excludeRules”,
“org.apache.spark.sql.catalyst.optimizer.PushDownPredicate”
)
val dfSumSmall = df
.groupBy(“profile_id”)
.agg(sum(“interactions”).alias(“metricValue”))
.filter($”metricValue” < 20)
.filter($”profile_id”.isNotNull)
It keeps both filter conditions after the aggregation, so the Exchange branch stays identical and can be reused
77. Limiting the optimizer
● Available since Spark 2.4
● Useful if you need to change the order of operators in the plan
○ reposition Filter, Exchange
● Queries:
○ one data source (which is expensive to read)
○ with multiple computations (using groupBy or Window)
■ combined together using Union or Join
77#UnifiedDataAnalytics #SparkAISummit
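A sketch of scoping the exclusion to just the query that needs it (the write target is illustrative); spark.conf.unset restores the default optimizer behavior afterwards:

spark.conf.set(
  "spark.sql.optimizer.excludeRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")

dfSumBig.union(dfSumSmall).write.parquet(outputPath)   // the query that benefits from the reused Exchange

spark.conf.unset("spark.sql.optimizer.excludeRules")   // re-enable PushDownPredicate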
78. Reused computation
● Similar effect to ReuseExchange can be achieved also by caching
○ caching may not be that useful with large datasets (if the data does not fit into the caching layer)
○ caching brings additional overhead when putting the data into the caching layer (memory or disk)
78#UnifiedDataAnalytics #SparkAISummit
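A sketch of this alternative, reusing the names from Example I (the write target is illustrative): the aggregation is computed once, cached, and both filtered DataFrames are derived from it:

import org.apache.spark.sql.functions.sum
import spark.implicits._

val dfSum = df
  .groupBy("profile_id")
  .agg(sum("interactions").alias("sum"))
  .cache()                                  // materialized on first use

val dfSumBig   = dfSum.filter($"sum" > 100)
val dfSumSmall = dfSum.filter($"sum" < 20)

dfSumBig.union(dfSumSmall).write.parquet(outputPath)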
79. Example II
● Get only records (posts) with max interactions for each profile
79#UnifiedDataAnalytics #SparkAISummit
Input table (posts):
post_id | profile_id | interactions | date
      1 |          1 |           20 | 2019-01-01
      2 |          1 |           50 | 2019-01-01
      3 |          1 |           50 | 2019-02-01
      4 |          2 |            0 | 2019-01-01
      5 |          2 |          100 | 2019-03-01

Expected output:
post_id | profile_id | interactions | date
      2 |          1 |           50 | 2019-01-01
      3 |          1 |           50 | 2019-02-01
Assume profile_id has non-null values.
80. Example II
● Three common ways to write the query:
○ Using Window function
○ Using groupBy + join
○ Using correlated subquery in SQL
80#UnifiedDataAnalytics #SparkAISummit
Which one is the most efficient?
81. 81#UnifiedDataAnalytics #SparkAISummit
val w = Window.partitionBy(“profile_id”)
posts
.withColumn(“maxCount”, max(“interactions”).over(w))
.filter($”interactions” === $”maxCount”)
Using Window:
82. 82#UnifiedDataAnalytics #SparkAISummit
val maxCount = posts
.groupBy(“profile_id”)
.agg(max(“interactions”).alias(“maxCount”))
posts.join(maxCount, Seq(“profile_id”))
.filter($”interactions” === $”maxCount”)
Using groupBy + join:
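The third variant from the list (correlated subquery in SQL) is not shown on the slides; a sketch, assuming posts is registered as a temporary view:

posts.createOrReplaceTempView("posts")

spark.sql("""
  SELECT *
  FROM posts p
  WHERE p.interactions = (SELECT max(m.interactions)
                          FROM posts m
                          WHERE m.profile_id = p.profile_id)
""")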
89. 89#UnifiedDataAnalytics #SparkAISummit
Query with window:
● Exchange + Sort
Query with join or correlated subquery:
● HashAggregate with reduced shuffle for the groupBy
● plus either SortMergeJoin (2 Exchanges + 1 Sort) or BroadcastHashJoin (broadcast exchange)
90. Window vs join + groupBy
90#UnifiedDataAnalytics #SparkAISummit
Both tables are large:
● Go with window
● SortMergeJoin will be more expensive (3 exchanges, 2 sorts)
Reduced table is much smaller and can be broadcasted:
● Go with broadcast join - it will be much faster
Both tables are small and comparable in size:
● It is not a big deal
● Broadcast will also generate traffic
93. Example III
● Sum interactions for each profile and each date
● Join an additional table about profiles
93#UnifiedDataAnalytics #SparkAISummit
Table B: profiles (Facebook pages)
profile_id | about               | lang
         1 | “some string”       | en
         2 | “some other string” | en

posts
  .groupBy(“profile_id”, “date”)
  .agg(sum(“interactions”))
  .join(profiles, Seq(“profile_id”))
103. 103#UnifiedDataAnalytics #SparkAISummit
Add repartition and eliminate one shuffle:
posts
  .repartition(“profile_id”)
  .groupBy(“profile_id”, “date”)
  .agg(sum(“interactions”))
  .join(profiles, Seq(“profile_id”))
● The repartition adds an Exchange with HashPartitioning(profile_id)
● The operators generated by the strategy (the groupBy aggregation and the join) are both OK with this partitioning, so one shuffle is eliminated
107. Adding repartition
107#UnifiedDataAnalytics #SparkAISummit
● What is the cost of using it?
  ○ Now we shuffle all the data
  ○ Before, we shuffled the reduced dataset
● What is more efficient?
  ○ It depends on the properties of the data
  ■ here, the cardinality of distinct (profile_id, date)
111. Adding repartition
111#UnifiedDataAnalytics #SparkAISummit
Use repartition to reduce the number of shuffles if:
● Cardinality of (profile_id, date) is comparable with row_count
● Each profile has only a few posts per date
● The data is not reduced much by the groupBy aggregation
● The reduced shuffle is comparable with the full shuffle
Use the original plan if:
● Cardinality of (profile_id, date) is much lower than row_count
● Each profile has many posts per date
● The data is reduced a lot by the groupBy aggregation
● The reduced shuffle is much smaller than the full shuffle
A quick way to check which case applies is sketched below.
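A quick sketch (reusing the names from this example) of estimating the reduction factor before deciding:

val rowCount   = posts.count()
val groupCount = posts.select("profile_id", "date").distinct().count()

// groupCount close to rowCount -> the groupBy barely reduces the data: repartition tends to pay off
// groupCount << rowCount       -> the groupBy reduces the data a lot: keep the original plan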
112. Useful tip I
● Filters are not pushed through nondeterministic expressions (and collect_list is one of them)
112#UnifiedDataAnalytics #SparkAISummit
posts
  .groupBy(“profile_id”)
  .agg(collect_list(“interactions”))
  .filter($”profile_id”.isNotNull )
● Here the filter stays above the aggregation, so it is applied only after the Exchange
114. Useful tip I
● Filters are not pushed through nondeterministic expressions
114#UnifiedDataAnalytics #SparkAISummit
posts
.filter($”profile_id”.isNotNull )
.groupBy(“profile_id”)
.agg(collect_list(“interactions”))
Make sure your filters are positioned at the right place to achieve efficient execution:
here the filter is applied before the Exchange
116. Useful tip IIa
● Important settings related to BroadcastHashJoin:
116#UnifiedDataAnalytics #SparkAISummit
spark.sql.autoBroadcastJoinThreshold
● Default value is 10MB
● Spark will broadcast if
  ○ Spark thinks that the size of the data is below the threshold
  ○ or you use the broadcast hint
● Compute stats so that Spark can make good size estimates:
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, ...
● spark.sql.cbo.enabled (enables the cost-based optimizer, which uses these statistics)
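A sketch of applying these settings from code (table/column names and values are illustrative):

spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString) // 50 MB

spark.sql("ANALYZE TABLE posts_fb COMPUTE STATISTICS FOR COLUMNS profile_id, interactions")

// Or bypass the size estimates entirely with an explicit hint:
import org.apache.spark.sql.functions.broadcast
posts.join(broadcast(profiles), Seq("profile_id"))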
119. Useful tip IIb
● Important settings related to BroadcastHashJoin:
119#UnifiedDataAnalytics #SparkAISummit
spark.sql.broadcastTimeout
● Default value is 300s
● If Spark does not make it in time:
SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
121. 3 basic solutions:
1. Disable the broadcast by setting the threshold to -1
2. Increase the timeout
3. Use caching
121#UnifiedDataAnalytics #SparkAISummit
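Sketches of the first two options (the third one, caching, is shown on the next slides):

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // 1. disable automatic broadcasting
spark.conf.set("spark.sql.broadcastTimeout", "600")           // 2. raise the timeout (seconds)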
122. 122#UnifiedDataAnalytics #SparkAISummit
val df = profiles
.some_udf_call
.groupBy(“profile_id”)
.agg(...some_aggregation...)
posts
.join(broadcast(df), Seq(“profile_id”))
Intense transformations:
● UDF call
● aggregation, …
The computation may take longer than 5 minutes.
If the size of df is small, we want to broadcast it.
123. 123#UnifiedDataAnalytics #SparkAISummit
val df = profiles
.some_udf_call
.groupBy(“profile_id”)
.agg(...some_aggregation...)
.cache()
posts
.join(broadcast(df), Seq(“profile_id”))
df.count()
Three jobs:
1) count materializes the data into the cache (memory)
2) broadcast (fast because the data is taken from RAM)
3) join - will leverage the broadcasted data
If the size of df is small, we want to broadcast it.
We can use caching (but we have to materialize it immediately).
124. Conclusion
● Understanding the physical plan is important
● By limiting the optimizer you can get the Exchange reused
● Choice between Window vs groupBy+join depends on data properties
● Adding repartition can avoid an unnecessary Exchange
○ considering the data properties is important
● Be aware of nondeterministic expressions
● Fine-tune broadcast joins with configuration settings
○ make sure to have good size estimates using CBO
124#UnifiedDataAnalytics #SparkAISummit