- Aparup Chatterjee
Introduction to Apache Spark 3.X Dynamic Partition Pruning (DPP)
Spark 3.0 based Environment Details:
 Hadoop 3.2
 Spark 3.1
 Python 3.8
Spark 2.0 based Environment Details:
 Hadoop 2.9
 Spark 2.3
 Python 2.7.14
Sources: SPARK+AI SUMMIT NORTH AMERICA 2020, SPARK 3.X OFFICIAL DOCS & DATABRICKS
Another big improvement in Spark 3.X is Dynamic Partition Pruning (DPP).
Before going into the details of DPP, I would like to explain the two main building blocks behind it:
1. Filter Pushdown
2. Partition Pruning
Details of the GCP-based Big Data components used
Filter Pushdown / Predicate Pushdown
Filter Pushdown, or Predicate Pushdown, is an optimization in Spark SQL.
One strategy for improving Spark SQL query performance is to reduce the amount of data read (I/O) and transferred from
the data storage to the executors.
Generally, when we use a Filter/Where condition in a Spark SQL query, the Spark Catalyst Optimizer attempts to “push
down” the filtering operation to the data source layer to save I/O cost (pic. Filter Push-down), instead of reading the files in
full, sending them across to the executors (pic. Basic data-flow), and only then filtering.
Example:
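A minimal PySpark sketch of the idea, assuming a hypothetical Parquet dataset at /data/people_parquet with an integer age column; the explain() output is where the pushed filter shows up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-pushdown-demo").getOrCreate()

# Hypothetical Parquet dataset with columns name, city and an integer age.
df = spark.read.parquet("/data/people_parquet")

# The comparison is pushed down to the Parquet reader instead of being
# applied after a full scan of every row.
adults = df.filter(df.age > 40)

# Look for "PushedFilters: [IsNotNull(age), GreaterThan(age,40)]"
# in the FileScan node of the physical plan.
adults.explain()
```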
Filter Pushdown Limitation
Filter Pushdown depends on the data schema: if the filter condition requires casting the content of a field, the cast
cannot be pushed down, and Spark will read the full file.
Example: the age column has data type String when read from a CSV file or Hive table.
So, as data engineers, we should cast column types where required before they pass through a filter condition, so that
Spark can use Filter Pushdown for optimization.
You can achieve this by casting on the fly or by adding a custom schema to the DataFrame; both approaches are sketched below.
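A hedged sketch of the two approaches the slides mention, assuming a hypothetical CSV file at /data/people_csv whose age column arrives as a string:

```python
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Option 1: cast on the fly, so the filter compares integers, not strings.
df_csv = spark.read.option("header", "true").csv("/data/people_csv")
df_cast = df_csv.withColumn("age", col("age").cast("int"))
df_cast.filter(col("age") > 40).explain()

# Option 2: declare a custom schema up front, so age is an integer
# from the moment the file is read and the filter stays pushdown-friendly.
schema = StructType([
    StructField("name", StringType()),
    StructField("city", StringType()),
    StructField("age", IntegerType()),
])
df_typed = spark.read.option("header", "true").schema(schema).csv("/data/people_csv")
df_typed.filter(col("age") > 40).explain()
```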
Partition Pruning
Partition pruning is another optimization; it builds on the Predicate Pushdown/Filter Pushdown methodology.
When we filter on a partitioned column, the Catalyst optimizer pushes the partition filters down to the data source level. The scan reads only the
directories (not the actual data content) that match the partition filters, thus reducing disk I/O.
Example: when a partition column carries the values the filter condition matches on, there is no need for a regular predicate pushdown;
instead, Spark pushes down only the partition filter, reads just the matching partition, and the remaining partitions are pruned/skipped, thereby avoiding
unnecessary I/O.
Let's take the same dataset, with the table partitioned by age; a sketch follows.
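How that layout could be produced and queried, reusing the same hypothetical dataset as above:

```python
# Write the same dataset partitioned by age; each distinct age value
# becomes its own directory, e.g. /data/people_by_age/age=41/.
df.write.mode("overwrite").partitionBy("age").parquet("/data/people_by_age")

partitioned_df = spark.read.parquet("/data/people_by_age")

# Look for "PartitionFilters: [isnotnull(age), (age > 40)]" together
# with an empty "PushedFilters: []" in the physical plan.
partitioned_df.filter("age > 40").explain()
```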
Let's have a look at the Physical Plan:
As you can see, Spark prunes/skips the irrelevant partitions and reads only the
required partition, whose filter is further pushed down at the data source level. This data
pruning/skipping allows for a big performance boost.
This is also called Static Partition Pruning, because the partitions to read are already known when the query is compiled.
Quick Recap of Filter Pushdown and Partition Pruning
Filter Pushdown: When you filter on a column that isn't a partition column, Spark has to consider every part file in the folder of that
Parquet table. With pushdown filtering, Spark uses the footer of every part file (where min, max and count
statistics are stored) to determine whether your search value lies within that range. If yes, Spark reads the file in full; if not, Spark
skips the whole file, saving you the cost of a full read.
When we filter on the non-partitioned df, the pushed filters are:
PushedFilters: [IsNotNull(age), GreaterThan(age,40)]
Partition Pruning: When you filter on the columns you partitioned by, Spark skips the non-matching files completely, and they
don't cost you any I/O.
When we filter on the partitioned DF, the pushed filters are:
PartitionFilters: [isnotnull(age#102), (age#102 > 40)], PushedFilters: []
Spark doesn't need to push the age filter down when working off the partitioned DF, because it can use a partition filter, which is a lot
faster.
Dynamic Partition Pruning
 Dynamic partition pruning improves job performance by more accurately selecting the specific partitions within a table
that need to be read and processed for a specific query.
 Dynamic partition pruning allows the Spark engine to dynamically infer, at runtime, which partitions need to be read and
which can be safely eliminated, based on column statistics calculated over the data in the selected columns.
 By reducing the amount of data read and processed, significant time is saved in job execution.
In real life it is very useful for data engineers working with the Star Schema data warehousing concept, where we generally join a big fact table with
a comparatively small dimension table.
Let's look into the example below:
Tables: withPartition_fact (fact table), partitioned by age; withoutPartition_dim (dimension table), non-partitioned.
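A hedged reconstruction of the setup and query (the table names come from the slides; the schemas, data, and exact query are my assumptions):

```python
from pyspark.sql import Row

dim = spark.createDataFrame([
    Row(age=25, city="Bangalore"),
    Row(age=42, city="Bangalore"),
    Row(age=33, city="Mumbai"),
])
fact = spark.range(0, 10000).selectExpr("id", "CAST(id % 50 + 20 AS INT) AS age")

# Fact table partitioned by age; dimension table left non-partitioned.
fact.write.mode("overwrite").partitionBy("age").saveAsTable("withPartition_fact")
dim.write.mode("overwrite").saveAsTable("withoutPartition_dim")

result = spark.sql("""
    SELECT f.*
    FROM withPartition_fact f
    JOIN withoutPartition_dim d ON f.age = d.age
    WHERE d.city = 'Bangalore'
""")

# With DPP enabled, the fact table's scan node shows a PartitionFilter
# containing dynamicpruningexpression(...), derived from the dimension filter.
result.explain()
```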
Physical Plan
Now it's time to deep dive.
Dimension Table Phase
Spark uses the Filter Pushdown/Predicate Pushdown method, which skips reading irrelevant data from the dimension table
before the actual data-fetching phase.
Fact Table Phase
Here you can notice the Partition Filter being applied: it is formed internally from the dimension table filter, and finally the
dynamic pruning expression is formed.
Internal Architecture and DAG Level
From the views above we can see that Spark picks only the fact-table partitions that are relevant to the dimension table's
filter condition, and prunes the other partitions of the fact table.
According to our DataFrame query, the dimension table filter is “city = Bangalore”. From the fact table dataset we can see that only 2
records match, with different age values (i.e. 2 partitions).
From the DAG we can also see that, using Dynamic Partition Pruning, Spark fetched only those 2 partitions (number of
files read: 2) and pruned out all the other, irrelevant partitions.
The Spark 3 config responsible for DPP is spark.sql.optimizer.dynamicPartitionPruning.enabled, which is true by default.
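A quick way to verify or toggle the flag from a session (the configuration key is Spark's own; the comparison workflow is just a sketch, reusing the result DataFrame from above):

```python
# Enabled by default in Spark 3.x; flip it off to compare plans.
print(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled"))

spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
result.explain()  # the dynamicpruningexpression disappears from the fact scan

spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
```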
Spark Internal Dynamic Partition Pruning Steps
1. Spark builds a hash table from the dimension table and forms an inner subquery out of it, which is broadcast across all the
executors (Subquery Broadcast).
2. Using the Subquery Broadcast, we are able to execute the join without requiring a shuffle.
3. Spark then probes that hash table with the rows coming from the fact table on each worker node during its
scanning phase, so that it doesn't carry any irrelevant data into the join phase. A conceptual sketch of the resulting rewrite follows this list.
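Conceptually (my paraphrase, not Spark's actual internal plan), the broadcast subquery makes the fact-table scan behave like this hand-written query:

```python
# Hand-written equivalent of what DPP achieves automatically: only the
# fact partitions whose age values survive the dimension filter are read.
# Spark performs this internally by reusing the broadcast hash table,
# without materialising the subquery twice.
pruned = spark.sql("""
    SELECT *
    FROM withPartition_fact
    WHERE age IN (SELECT age FROM withoutPartition_dim
                  WHERE city = 'Bangalore')
""")
pruned.explain()
```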
Spark Dynamic Partition Pruning Factors
 It is enabled by default in Spark 3.
 The fact/big table must be partitioned by a column key.
 It is best suited to the Star Schema data warehousing model.
Useful Resources
• https://www.youtube.com/watch?v=WSplTjBKijU - SPARK+AI SUMMIT NORTH AMERICA 2020
• https://spark.apache.org/releases/spark-release-3-0-0.html - Spark 3 Official Docs
• https://databricks.com/session_eu19/dynamic-partition-pruning-in-apache-spark - Databricks