Properly shaping partitions and your jobs to enable powerful optimizations, eliminate skew and maximize cluster utilization. We will explore various Spark Partition shaping methods along with several optimization strategies including join optimizations, aggregate optimizations, salting and multi-dimensional parallelism.
The document discusses tuning Spark parameters to optimize performance. It describes how to control Spark's resource usage through parameters like num-executors, executor-cores, and executor-memory. Advanced parameters like spark.shuffle.memoryFraction and spark.reducer.maxSizeInFlight are also covered. Dynamic allocation allows scaling resources up and down based on workload. Tips provided include tuning memory usage, choosing serialization and storage levels, setting parallelism, and avoiding operations like groupByKey. An example recommends tuning the collaborative filtering algorithm in the RW project, reducing runtime from 27 minutes to under 7 minutes.
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
Apache Spark defaults provide decent performance for large data sets but leave room for significant performance gains if able to tune parameters based on resources and job.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
Spark SQL is a highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
The slides explain how shuffle works in Spark and help people understand more details about Spark internal. It shows how the major classes are implemented, including: ShuffleManager (SortShuffleManager), ShuffleWriter (SortShuffleWriter, BypassMergeSortShuffleWriter, UnsafeShuffleWriter), ShuffleReader (BlockStoreShuffleReader).
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
Slides cover Spark core concepts of Apache Spark such as RDD, DAG, execution workflow, forming stages of tasks and shuffle implementation and also describes architecture and main components of Spark Driver. The workshop part covers Spark execution modes , provides link to github repo which contains Spark Applications examples and dockerized Hadoop environment to experiment with
Properly shaping partitions and your jobs to enable powerful optimizations, eliminate skew and maximize cluster utilization. We will explore various Spark Partition shaping methods along with several optimization strategies including join optimizations, aggregate optimizations, salting and multi-dimensional parallelism.
The document discusses tuning Spark parameters to optimize performance. It describes how to control Spark's resource usage through parameters like num-executors, executor-cores, and executor-memory. Advanced parameters like spark.shuffle.memoryFraction and spark.reducer.maxSizeInFlight are also covered. Dynamic allocation allows scaling resources up and down based on workload. Tips provided include tuning memory usage, choosing serialization and storage levels, setting parallelism, and avoiding operations like groupByKey. An example recommends tuning the collaborative filtering algorithm in the RW project, reducing runtime from 27 minutes to under 7 minutes.
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
Apache Spark defaults provide decent performance for large data sets but leave room for significant performance gains if able to tune parameters based on resources and job.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
Spark SQL is a highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
The slides explain how shuffle works in Spark and help people understand more details about Spark internal. It shows how the major classes are implemented, including: ShuffleManager (SortShuffleManager), ShuffleWriter (SortShuffleWriter, BypassMergeSortShuffleWriter, UnsafeShuffleWriter), ShuffleReader (BlockStoreShuffleReader).
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
Slides cover Spark core concepts of Apache Spark such as RDD, DAG, execution workflow, forming stages of tasks and shuffle implementation and also describes architecture and main components of Spark Driver. The workshop part covers Spark execution modes , provides link to github repo which contains Spark Applications examples and dockerized Hadoop environment to experiment with
Optimizing spark jobs through a true understanding of spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
The document discusses optimizations made to Spark SQL performance when working with Parquet files at ByteDance. It describes how Spark originally reads Parquet files and identifies two main areas for optimization: Parquet filter pushdown and the Parquet reader. For filter pushdown, sorting columns improved statistics and reduced data reads by 30%. For the reader, splitting it to first filter then read other columns prevented loading unnecessary data. These changes improved Spark SQL performance at ByteDance without changing jobs.
Deep Dive: Memory Management in Apache SparkDatabricks
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to the executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed at runtime by JVM and optimized by JIT to native machine code at runtime. This talk will take a deep dive into Spark SQL execution engine. The talk includes pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, lineage based RDD transformation and action.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
How to Automate Performance Tuning for Apache SparkDatabricks
Spark has made writing big data pipelines much easier than before. But a lot of effort is required to maintain performant and stable data pipelines in production over time. Did I choose the right type of infrastructure for my application? Did I set the Spark configurations correctly? Can my application keep running smoothly as the volume of ingested data grows over time? How to make sure that my pipeline always finishes on time and meets its SLA?
These questions are not easy to answer even for a handful of jobs, and this maintenance work can become a real burden as you scale to dozens, hundreds, or thousands of jobs. This talk will review what we found to be the most useful piece of information and parameters to look at for manual tuning, and the different options available to engineers who want to automate this work, from open-source tools to managed services provided by the data platform or third parties like the Data Mechanics platform.
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is a open-source storage layer that brings ACID transactions to Apache Spark and big data workloads Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiples ways to solve the same problem. So understanding the requirements carefully helps you to architect your pipeline that solves your business needs in the most resource efficient manner.
In this talk, I am going examine a number common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem can automatically much clarity on the ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
Apache Spark presentation at HasGeek FifthElelephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workload. You’ll also learn about Facebook’s new efforts towards automatically tuning several important configurations based on nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.ing several important configurations based on nature of the workload. We will conclude by sharing our result with automatic tuning and future directions for the project.
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
The document discusses 5 common mistakes people make when writing Spark applications:
1) Not properly sizing executors for memory and cores.
2) Having shuffle blocks larger than 2GB which can cause jobs to fail.
3) Not addressing data skew which can cause joins and shuffles to be very slow.
4) Not properly managing the DAG to minimize shuffles and stages.
5) Classpath conflicts from mismatched dependencies causing errors.
How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing.
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
Structured Streaming provides stateful stream processing capabilities in Spark SQL through built-in operations like aggregations and joins as well as user-defined stateful transformations. It handles state automatically through watermarking to limit state size by dropping old data. For arbitrary stateful logic, MapGroupsWithState requires explicit state management by the user.
This is the presentation I made on JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level staff and can be used as an introduction to Apache Spark
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
Making Structured Streaming Ready for ProductionDatabricks
In mid-2016, we introduced Structured Steaming, a new stream processing engine built on Spark SQL that revolutionized how developers can write stream processing application without having to reason about having to reason about streaming. It allows the user to express their streaming computations the same way you would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously updating the final result as streaming data continues to arrive. It truly unifies batch, streaming and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.
The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as streaming source and sink. Since then, we have put in a lot of work to make it ready for production use. In this talk, Tathagata Das will cover in more detail about the major features we have added, the recipes for using them in production, and the exciting new features we have plans for in future releases. Some of these features are as follows:
- Design and use of the Kafka Source
- Support for watermarks and event-time processing
- Support for more operations and output modes
Speaker: Tathagata Das
This talk was originally presented at Spark Summit East 2017.
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
By using Spark UI and simple metrics, explore how to diagnose and remedy issues on jobs:
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
Spark Streaming allows processing of live data streams in Spark. It integrates streaming data and batch processing within the same Spark application. Spark SQL provides a programming abstraction called DataFrames and can be used to query structured data in Spark. Structured Streaming in Spark 2.0 provides a high-level API for building streaming applications on top of Spark SQL's engine. It allows running the same queries on streaming data as on batch data and unifies streaming, interactive, and batch processing.
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance from 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark have resulted in a 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading of cumulative tables with daily delta, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
This document provides an overview of Apache Spark's architectural components through the life of simple Spark jobs. It begins with a simple Spark application analyzing airline on-time arrival data, then covers Resilient Distributed Datasets (RDDs), the cluster architecture, job execution through Spark components like tasks and scheduling, and techniques for writing better Spark applications like optimizing partitioning and reducing shuffle size.
Optimizing spark jobs through a true understanding of spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
The document discusses optimizations made to Spark SQL performance when working with Parquet files at ByteDance. It describes how Spark originally reads Parquet files and identifies two main areas for optimization: Parquet filter pushdown and the Parquet reader. For filter pushdown, sorting columns improved statistics and reduced data reads by 30%. For the reader, splitting it to first filter then read other columns prevented loading unnecessary data. These changes improved Spark SQL performance at ByteDance without changing jobs.
Deep Dive: Memory Management in Apache SparkDatabricks
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to the executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed at runtime by JVM and optimized by JIT to native machine code at runtime. This talk will take a deep dive into Spark SQL execution engine. The talk includes pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, lineage based RDD transformation and action.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
How to Automate Performance Tuning for Apache SparkDatabricks
Spark has made writing big data pipelines much easier than before. But a lot of effort is required to maintain performant and stable data pipelines in production over time. Did I choose the right type of infrastructure for my application? Did I set the Spark configurations correctly? Can my application keep running smoothly as the volume of ingested data grows over time? How to make sure that my pipeline always finishes on time and meets its SLA?
These questions are not easy to answer even for a handful of jobs, and this maintenance work can become a real burden as you scale to dozens, hundreds, or thousands of jobs. This talk will review what we found to be the most useful piece of information and parameters to look at for manual tuning, and the different options available to engineers who want to automate this work, from open-source tools to managed services provided by the data platform or third parties like the Data Mechanics platform.
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is a open-source storage layer that brings ACID transactions to Apache Spark and big data workloads Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiples ways to solve the same problem. So understanding the requirements carefully helps you to architect your pipeline that solves your business needs in the most resource efficient manner.
In this talk, I am going examine a number common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem can automatically much clarity on the ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
Apache Spark presentation at HasGeek FifthElelephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workload. You’ll also learn about Facebook’s new efforts towards automatically tuning several important configurations based on nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.ing several important configurations based on nature of the workload. We will conclude by sharing our result with automatic tuning and future directions for the project.
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
The document discusses 5 common mistakes people make when writing Spark applications:
1) Not properly sizing executors for memory and cores.
2) Having shuffle blocks larger than 2GB which can cause jobs to fail.
3) Not addressing data skew which can cause joins and shuffles to be very slow.
4) Not properly managing the DAG to minimize shuffles and stages.
5) Classpath conflicts from mismatched dependencies causing errors.
How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing.
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
Structured Streaming provides stateful stream processing capabilities in Spark SQL through built-in operations like aggregations and joins as well as user-defined stateful transformations. It handles state automatically through watermarking to limit state size by dropping old data. For arbitrary stateful logic, MapGroupsWithState requires explicit state management by the user.
This is the presentation I made on JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level staff and can be used as an introduction to Apache Spark
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
Making Structured Streaming Ready for ProductionDatabricks
In mid-2016, we introduced Structured Steaming, a new stream processing engine built on Spark SQL that revolutionized how developers can write stream processing application without having to reason about having to reason about streaming. It allows the user to express their streaming computations the same way you would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously updating the final result as streaming data continues to arrive. It truly unifies batch, streaming and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.
The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as streaming source and sink. Since then, we have put in a lot of work to make it ready for production use. In this talk, Tathagata Das will cover in more detail about the major features we have added, the recipes for using them in production, and the exciting new features we have plans for in future releases. Some of these features are as follows:
- Design and use of the Kafka Source
- Support for watermarks and event-time processing
- Support for more operations and output modes
Speaker: Tathagata Das
This talk was originally presented at Spark Summit East 2017.
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
By using Spark UI and simple metrics, explore how to diagnose and remedy issues on jobs:
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
Spark Streaming allows processing of live data streams in Spark. It integrates streaming data and batch processing within the same Spark application. Spark SQL provides a programming abstraction called DataFrames and can be used to query structured data in Spark. Structured Streaming in Spark 2.0 provides a high-level API for building streaming applications on top of Spark SQL's engine. It allows running the same queries on streaming data as on batch data and unifies streaming, interactive, and batch processing.
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance from 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark have resulted in a 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading of cumulative tables with daily delta, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
This document provides an overview of Apache Spark's architectural components through the life of simple Spark jobs. It begins with a simple Spark application analyzing airline on-time arrival data, then covers Resilient Distributed Datasets (RDDs), the cluster architecture, job execution through Spark components like tasks and scheduling, and techniques for writing better Spark applications like optimizing partitioning and reducing shuffle size.
Spark is a fast and general engine for large-scale data processing. It was designed to be fast, easy to use and supports machine learning. Spark achieves high performance by keeping data in-memory as much as possible using its Resilient Distributed Datasets (RDDs) abstraction. RDDs allow data to be partitioned across nodes and operations are performed in parallel. The Spark architecture uses a master-slave model with a driver program coordinating execution across worker nodes. Transformations operate on RDDs to produce new RDDs while actions trigger job execution and return results.
This document provides recommendations for optimizing Spark jobs. It suggests reducing I/O by running the Spark cluster on the same machines as the data. It recommends avoiding functions that collect data to the driver to reduce memory I/O. It also suggests using caching to avoid read I/O. The document discusses configuring resources like memory and cores and tuning configurations like backpressure to improve performance of Spark streaming jobs. Finally, it recommends using efficient serialization formats like Kryo, Avro and Parquet.
In the session, we discussed the End-to-end working of Apache Spark that mainly focused on "Why What and How" factors. We discussed RDDs and the high-level API like Dataframe, Dataset. The session also took care of the internals of Spark.
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty
AWS Big Data Demystified is all about knowledge sharing b/c knowledge should be given for free. in this lecture we will dicusss the advantages of working with Zeppelin + spark sql, jdbc + thrift, ganglia, r+ spark r + livy, and a litte bit about ganglia on EMR.\
subscribe to you youtube channel to see the video of this lecture:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
Vous avez récemment commencé à travailler sur Spark et vos jobs prennent une éternité pour se terminer ? Cette présentation est faite pour vous.
Himanshu Arora et Nitya Nand YADAV ont rassemblé de nombreuses bonnes pratiques, optimisations et ajustements qu'ils ont appliqué au fil des années en production pour rendre leurs jobs plus rapides et moins consommateurs de ressources.
Dans cette présentation, ils nous apprennent les techniques avancées d'optimisation de Spark, les formats de sérialisation des données, les formats de stockage, les optimisations hardware, contrôle sur la parallélisme, paramétrages de resource manager, meilleur data localité et l'optimisation du GC etc.
Ils nous font découvrir également l'utilisation appropriée de RDD, DataFrame et Dataset afin de bénéficier pleinement des optimisations internes apportées par Spark.
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but also adds substantial performance overheads due to the fact that all data and intermediate state of compute task is stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
A Java Implementer's Guide to Better Apache Spark PerformanceTim Ellison
This document discusses techniques for improving the performance of Apache Spark applications. It describes optimizing the Java virtual machine by enhancing the just-in-time compiler, improving the object serializer, enabling faster I/O using technologies like RDMA networking and CAPI flash storage, and offloading tasks to graphics processors. The document provides examples of code style guidelines and specific Spark optimizations that further improve performance, such as leveraging hardware accelerators and tuning JVM heuristics.
EMR Spark tuning involves configuring Spark and YARN parameters like executor memory and cores to optimize performance. The default Spark configurations depend on the deployment method (Thrift, Zeppelin etc). YARN is used for resource management in cluster mode, and allocates resources to containers based on minimum and maximum thresholds. When tuning, factors like available cluster resources, executor instances and cores should be considered to avoid overcommitting resources.
Spark supports four cluster managers: Local, Standalone, YARN, and Mesos. YARN is highly recommended for production use. When running Spark on YARN, careful tuning of configuration settings like the number of executors, executor memory and cores, and dynamic allocation is important to optimize performance and resource utilization. Configuring queues also allows separating different applications by priority and resource needs.
1. The document discusses various technologies for building big data architectures, including NoSQL databases, distributed file systems, and data partitioning techniques.
2. Key-value stores, document databases, and graph databases are introduced as alternatives to relational databases for large, unstructured data.
3. The document also covers approaches for scaling databases horizontally, such as sharding, replication, and partitioning data across multiple servers.
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs. Other solutions include running deep learning frameworks in an Apache Spark cluster, or use workflow orchestrators like Kubeflow to stitch distributed programs. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP which allows you to start an Apache Spark job on Ray in your python program and utilize Ray’s in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.
This document discusses programmatically tuning Spark jobs. It recommends collecting historical metrics like stage durations and task metrics from previous job runs. These metrics can then be used along with information about the execution environment and input data size to optimize configuration settings like memory, cores, partitions for new jobs. The document demonstrates using the Robin Sparkles library to save metrics and get an optimized configuration based on prior run data and metrics. Tuning goals include reducing out of memory errors, shuffle spills, and improving cluster utilization.
This slide introduces Hadoop Spark.
Just to help you construct an idea of Spark regarding its architecture, data flow, job scheduling, and programming.
Not all technical details are included.
Project Tungsten: Bringing Spark Closer to Bare MetalDatabricks
As part of the Tungsten project, Spark has started an ongoing effort to dramatically improve performance to bring the execution closer to bare metal. In this talk, we’ll go over the progress that has been made so far and the areas we’re looking to invest in next. This talk will discuss the architectural changes that are being made as well as some discussion into how Spark users can expect their application to benefit from this effort. The focus of the talk will be on Spark SQL but the improvements are general and applicable to multiple Spark technologies.
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
Tech-talk at Bay Area Apache Spark Meetup.
Apache Spark 2.0 will ship with the second generation Tungsten engine. Building upon ideas from modern compilers and MPP databases, and applying them to data processing queries, we have started an ongoing effort to dramatically improve Spark’s performance and bringing execution closer to bare metal. In this talk, we’ll take a deep dive into Apache Spark 2.0’s execution engine and discuss a number of architectural changes around whole-stage code generation/vectorization that have been instrumental in improving CPU efficiency and gaining performance.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Apache Spark 2.0 includes improvements that provide considerable speedups for CPU-intensive queries through techniques like code generation. Profiling tools like flame graphs can help analyze where CPU cycles are spent by visualizing stack traces. Flame graphs are useful for performance troubleshooting but have limitations. Testing Spark applications locally and through unit tests allows faster iteration compared to running on clusters and saves resources. It is also important to test with local approximations of distributed components like HDFS and Hive.
This document summarizes an environmental impact assessment report for a proposed coal-based power plant in Gujarat, India. It describes the existing environmental conditions, including water quality, land use, flora and fauna in the area. It also outlines the potential impacts of the plant on water resources and biodiversity during construction and operation. Mitigation measures are proposed, such as developing green belts, controlling air emissions, and treating wastewater, to minimize adverse environmental impacts.
Summer traning report BRPNNL by Amit Raj 14CE10005Amit Raj
The document provides a training report on the construction of the Mithapur to Chiraiyatad flyover project in Bihar, India. It includes three sections: components of a bridge, construction procedures and dimensions, and quality control. The first section describes the foundation, substructure, and superstructure components. The second section details the construction process and specifications of piles, pile caps, piers, pier caps, I-girders, decks, bearings, and crash barriers. The third section discusses quality control tests for aggregates, cement, and performance testing of the bridge structures.
This document provides a training report on the construction of the Mithapur to Chiraiyatad flyover project in Bihar, India. It includes three sections: components of a bridge, construction procedures and dimensions, and quality control tests. The report describes the various components of the bridge, including foundations, substructures, and superstructures. It provides details on constructing piles, pile caps, piers, pier caps, I-girders, decks, bearings, and crash barriers. Dimensional details and construction procedures are outlined for each component. Finally, the report discusses quality control tests for aggregates, cement, and load testing of the completed structures.
The document provides a cost benefit analysis of the proposed Haripur Nuclear Power Plant in West Bengal, India. Key points:
- The plant was proposed in 2006 but faced public opposition and was suspended. It would have had a capacity of 10,000 MW generated from 6 reactors.
- The site at Haripur is a fertile agricultural and fishing area that supports many local livelihoods. Building the plant would have displaced over 80,000 people.
- The analysis identifies and quantifies the various costs and benefits of the proposed plant to determine if it would provide a net benefit to society. Factors like energy production, employment, and environmental impacts are considered.
- While the plant may have
This document contains information from a traffic study conducted on a road connecting various departments at a university. The study found that:
1) The 85th percentile speed on the road was higher than the posted speed limit of 25 kmph, especially in the straight section in front of the Biotechnology department.
2) Bicycles made up the majority of vehicles on the road, with generally low traffic during the day.
3) Speed breakers were not fully effective in reducing speeds to the limit, and adding another breaker in front of the Biotechnology department was recommended.
The document discusses the e-commerce market in India and strategies for small online retailers to succeed. It notes that a few major players like Flipkart, OLX, and Snapdeal dominate the online retail market. This creates enormous competition for small retailers to attract customers. However, small retailers can prosper by adopting innovative strategies like building their brand through aggressive initial marketing, making their site easy to use, and establishing credibility by exceeding customer expectations. The document also proposes an "umbrella website" business model that allows multiple small retailers to sell through one platform.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system including the required equation to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setup to advance their power stability, especially during power outages. The presented information supports researchers and plant owners to complete the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical works and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty approach for the sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line which enhances the safety of the electrical network
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Rainfall intensity duration frequency curve statistical analysis and modeling...bijceesjournal
Using data from 41 years in Patna’ India’ the study’s goal is to analyze the trends of how often it rains on a weekly, seasonal, and annual basis (1981−2020). First, utilizing the intensity-duration-frequency (IDF) curve and the relationship by statistically analyzing rainfall’ the historical rainfall data set for Patna’ India’ during a 41 year period (1981−2020), was evaluated for its quality. Changes in the hydrologic cycle as a result of increased greenhouse gas emissions are expected to induce variations in the intensity, length, and frequency of precipitation events. One strategy to lessen vulnerability is to quantify probable changes and adapt to them. Techniques such as log-normal, normal, and Gumbel are used (EV-I). Distributions were created with durations of 1, 2, 3, 6, and 24 h and return times of 2, 5, 10, 25, and 100 years. There were also mathematical correlations discovered between rainfall and recurrence interval.
Findings: Based on findings, the Gumbel approach produced the highest intensity values, whereas the other approaches produced values that were close to each other. The data indicates that 461.9 mm of rain fell during the monsoon season’s 301st week. However, it was found that the 29th week had the greatest average rainfall, 92.6 mm. With 952.6 mm on average, the monsoon season saw the highest rainfall. Calculations revealed that the yearly rainfall averaged 1171.1 mm. Using Weibull’s method, the study was subsequently expanded to examine rainfall distribution at different recurrence intervals of 2, 5, 10, and 25 years. Rainfall and recurrence interval mathematical correlations were also developed. Further regression analysis revealed that short wave irrigation, wind direction, wind speed, pressure, relative humidity, and temperature all had a substantial influence on rainfall.
Originality and value: The results of the rainfall IDF curves can provide useful information to policymakers in making appropriate decisions in managing and minimizing floods in the study area.
Introduction- e - waste – definition - sources of e-waste– hazardous substances in e-waste - effects of e-waste on environment and human health- need for e-waste management– e-waste handling rules - waste minimization techniques for managing e-waste – recycling of e-waste - disposal treatment methods of e- waste – mechanism of extraction of precious metal from leaching solution-global Scenario of E-waste – E-waste in India- case studies.
Design and optimization of ion propulsion dronebjmsejournal
Electric propulsion technology is widely used in many kinds of vehicles in recent years, and aircrafts are no exception. Technically, UAVs are electrically propelled but tend to produce a significant amount of noise and vibrations. Ion propulsion technology for drones is a potential solution to this problem. Ion propulsion technology is proven to be feasible in the earth’s atmosphere. The study presented in this article shows the design of EHD thrusters and power supply for ion propulsion drones along with performance optimization of high-voltage power supply for endurance in earth’s atmosphere.
3. Introduction
● Apache Spark is Open Source, in-memory computation framework.
● It gives high performance for both batch as well as streaming job.
● It deals of big data processing.
● it is approx 100 times faster than mapreduce, because of in-memory computation
As it deals with the big data processing application it also involves lot of uses of resources such as
CPU, RAM and Storage. Optimising one or more together will leads to saving a lot cost reduction.
In the upcoming 40 minute we will learn about the approaches which will help to do so.
4. Ways to Optimise
Code Level:-
Here we will learn the best practices to follow in order to achieve high performance in minimal
resources such as:- Caching, Broadcasting, Serialization, use DataSet/DF over RDD, Avoid
UDF, Filter Data at earliest , Reduce Shuffle
Beyond Code:-
Here we will learn to tune the config parameter cluster resources level tuning such as:-
File Format, Level of Parallelism, Executor config, Memory Tuning, Batch Interval
5. Major Bottleneck
● CPU
● Network Bandwidth
● Memory
Our Goal is to optimise each of them as much as possible in order to reduce the resources used
and reduce the computation time to achieve optimum performance.
6. Caching
Suppose in our analytics project we have a text file and we have to read them and get number of flights leaving
from a particular country and same is being used multiple times.
● Raw Data is in text file
● Reading Text File as DF1
● Grouping by origin country DF2
7. Caching
JOB1:- Now number of flights leaving US as DF3
JOB2:- number of flights leaving Singapore as DF4
JOB3:- number of flights leaving India as DF5
Execution plan for JOB1 :- DF1>DF2 >DF3
Execution plan for JOB2 :- DF1>DF2 >DF4 after cache DF2 > DF4 no need of DF1 > DF2 step.
Execution plan for JOB3 :- DF1>DF2 >DF5 after cache DF2 > DF5 no need of DF1 > DF2 step.
here instead of calculating the DF1 and DF2 again we can cache the last reusable DF in memory so that we can
use it in another job to reduce computation resources and save time.
8. Broadcasting
Broadcast variable allows us to keep a read only variable cached on each executor hence we don’t have to send it with
task every time. which helps in reducing the network bandwidth and time consumption.
When to Use Broadcast Variable:-
Suppose we have a lookup data and that data need to be used by each executor while performing task.
We have 100 partitions and 10 executor node cluster (every executor has to take care for 10 partition)
we need to execute at least 100 task hence we have to send the lookup data 100 time to executor(once with every task).
But if we use broadcast then we need to send the lookup data to each executor only once and only 10 copies will be
sent.
Benefit= sending 100 copy vs sending 10 copy
val states = Map(("NY","New York"),("CA","California"),("FL","Florida"))
val countries = Map(("USA","United States of America"),("IN","India"))
val broadcastStates = spark.sparkContext.broadcast(states)
val broadcastCountries = spark.sparkContext.broadcast(countries)
9. – - Continue
In the above diagram m is broadcast variable and it’s sitting in memory of each executor and getting used while task execution.
Hence driver don’t need to ship the variable(m) with task and reduce the time of network IO and time.
10. Serialization
From the above diagram it is clear that serialization is needed when we write data in some storage.
De-Serialization is needed when we need to read from the some source.
In Spark ecosystem we always have to deal with both of them while cache, broadcast, shuffling etc.
Hence it becomes very important to optimize the serialization process.
11. Serialization
Kyro serialization over Java serialization:-
kyro is 10 times faster and more compact than java serialization but it doesn’t support all serializable types and requires
to register the classes not supported by it.
val spark = SparkSession.builder().appName("Broadcast").master("local").getOrCreate()
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Further Optimization is to register the class with kyro in advance if row size is too big as if you don’t register the class
it will store the class name with each object of it (for every row)
conf.set("spark.kryo.registrationRequired", "true")
conf.registerKryoClasses(Array(classOf[Foo]))
12. DataSet/DataFrame over RDD
RDD does sterilization and deserialization of data whenever it distributes the data across clusters such as during repartition
and shuffle, and we all know that Serialization and de-serialization are very expensive operations in spark.
On the other hand, DataFrame stores the data as binary using off-heap storage, no need for deserialization and serialization
of data when it distributes to clusters. We see a big performance improvement in DataFrame over RDD
13. Avoid UDF
When we use UDFs we end up losing all the optimization Spark does on our Dataframe/Dataset. Hence
whenever we can use inbuilt spark function we should use them and avoid UDF as much as possible.
but by any chance we have to use it then first we have to define a function like a normal scala function and we
have to register it with spark udf class
● val plusOne = udf((x: Int) => x + 1) //defined function
● spark.udf.register("plusOne", plusOne) //register udf
● spark.sql("SELECT plusOne(5)").show() // calling udf
// |UDF(5)| // result
// +------+
// | 6|
14. Filter Data at Earliest
example:- suppose we have a data set of employees and have column like patient Number, age, gender, salary, department, city, address,
pastexp, marital status, ……………………….. etc.
Bu we have to find number of employees belonging to a particular city. in this case we have to perform groupby operation on city column
and other column becomes irrelevant.
df.select(name,city).groupby(“city”).show()
df.groupby(“City”).select(“City”, “count”).show()
Scan
Aggregate
Filter
Scan
Aggregate
Filter
15. Shuffling
Shuffling is a mechanism Spark uses to redistribute the data across different executors and even across
machines. Spark shuffling triggers when we perform certain transformation operations like gropByKey(),
reducebyKey(), join() on RDD and DataFrame. It involves
● Disk I/O
● Involves data serialization and deserialization
● Network I/O
16. Reduce Shuffle Operation
We cannot completely avoid shuffle operations but when possible try to reduce the number of shuffle operations
remove any unused operations.
Spark provides spark.sql.shuffle.partitions configurations to control the partitions of the shuffle, By tuning this property
you can improve Spark performance.
spark.conf.set("spark.sql.shuffle.partitions",100)
Here 100 is the shuffle partition count we can tune this number by hit and trial based on datasize, If we have less data then we
don’t need 100 shuffle partition, If we have much bigger data and can execute large number of parallel task then we can increase
it to 200 or more.
17. File Format
Suppose we have system like this DataSource > SparkJob1 > Database > SparkJob2 > Database
As we are reading the data from source 1 from SparkJob1 and then we are writing data in Database2 then SparkJob2 reades
from Database2 and perform calculation then writes in Databse3.
as Database2 involves writing the data into and reading the data from it.
In the above scenario we should prefer writing an intermediate file in Serialized and optimized formats like Avro, Parquet
e.t.c,
Any transformations on these formats performs better than text, CSV, and JSON.
Spark Job1
Spark Job2
DataBase2 Database3
DataBase1
18. Executor Config
● JOB > Stage > Task
● one job can have multiple Stage, One stage can have multiple task.
● And number of core = number of parallel task
● Here we have to give proper number of core to each executor in order to optimise the resources.
● Allocating more number of core to each executor will leads to more parallel task on each executor which can
lead to outofmemory(OOM) error.
● Allocating less core per executor will reduce the parallelism and will the the benefit of it. Also the executor
memory will not be fully optimised.
● After Many iterations people recommend to allocate 5 cores per executor in order to get maximum benefit of
parallelism and proper memory uses.
./bin/spark-submit --driver-memory 8G --executor-memory 16G --num-executors 3 --executor-cores 5
19. Memory Tuning
There are three considerations in tuning memory usage:
● the amount of memory used by your objects (you may want your entire dataset to fit in memory),
● the cost of accessing those objects, and
● the overhead of garbage collection
● String data types uses less storage space compared to Linked List and Map as these objects not only has a
header, but also pointers (typically 8 bytes each) to the next object in the list.
● We can also optimise the memory uses by storing data in a serialized format.
● Java Objects are fast to access but consumes 2-5 times more space than the “raw” data inside their fields.
● using data structures with fewer objects and caching data in serialized format can help in reduce the Garbage
collection cost. Broadcasting variable also help us in reducing GC.
20. Thank You !
Get in touch with us:
Amit Raj
Senior Data Engineer
IIT Kharagpur
amitraj.iitkgp@gmail.com / 7548095242