Spark Shuffle Deep Dive
Bo Yang
Content
• Overview
• Major Classes
• Shuffle Writer
• Spark Serializer
• Shuffle Reader
• External Shuffle Service
• Suggestions
Shuffle Overview
Mapper 1: Orange 3, Apple 2, Peach 5, Pear 1
Mapper 2: Peach 3, Banana 2, Grape 5
Reducer 1: Apple 2, Peach 8, Pear 1
Reducer 2: Grape 5, Orange 3
Reducer 3: Banana 2
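The fruit example on this slide can be traced in a few lines of Python. This is only a sketch: the key-to-reducer assignment is hard-coded to match the slide (a real partitioner would typically hash the key), and `shuffle`, `MAPPER_OUTPUTS`, and `REDUCER_OF` are hypothetical names, not Spark APIs.

```python
from collections import defaultdict

# Map outputs from the slide: (key, count) pairs per mapper.
MAPPER_OUTPUTS = {
    "Mapper 1": [("Orange", 3), ("Apple", 2), ("Peach", 5), ("Pear", 1)],
    "Mapper 2": [("Peach", 3), ("Banana", 2), ("Grape", 5)],
}

# Assumption: key-to-reducer assignment copied from the slide.
REDUCER_OF = {
    "Apple": 1, "Peach": 1, "Pear": 1,
    "Grape": 2, "Orange": 2,
    "Banana": 3,
}


def shuffle(mapper_outputs, reducer_of):
    """Route every record to its reducer and sum the counts per key."""
    reducers = defaultdict(lambda: defaultdict(int))
    for records in mapper_outputs.values():
        for key, count in records:
            reducers[reducer_of[key]][key] += count
    # Plain dicts with sorted keys, so the printed output is stable.
    return {r: dict(sorted(kv.items())) for r, kv in sorted(reducers.items())}


result = shuffle(MAPPER_OUTPUTS, REDUCER_OF)
for reducer, totals in result.items():
    print(f"Reducer {reducer}: {totals}")
```

Peach appears in both mappers, so its two counts (5 and 3) meet at Reducer 1 and are summed to 8.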
High Level Abstraction
• Pluggable Interface: ShuffleManager
• registerShuffle(…)
• getWriter(…)
• getReader(…)
• Configurable: spark.shuffle.manager=xxx
• Mapper: ShuffleWriter
• write(records: Iterator)
• Reducer: ShuffleReader
• read(): Iterator
Implementations
• SortShuffleManager (extends ShuffleManager)
• Three Writers (optimized for different scenarios)
• SortShuffleWriter: uses ExternalSorter
• BypassMergeSortShuffleWriter: no sorter
• UnsafeShuffleWriter: uses ShuffleExternalSorter
• One Reader
• BlockStoreShuffleReader, uses
• ExternalAppendOnlyMap
• ExternalSorter (if ordering)
Writer Output Example (Shuffle Files)
[Diagram: Mapper 1 and Mapper 2 each write one Data File and one Index File. The Data File holds Partition 1, Partition 2, and Partition 3; the Index File holds Offset 1, Offset 2, and Offset 3, one per partition. Reducer 1, Reducer 2, and Reducer 3 each read their own partition from every mapper.]
Number of Partitions == Number of Reducers
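A minimal sketch of the data file plus index file idea, using in-memory bytes instead of real files. The function names are hypothetical, and the on-disk encoding is an assumption: the sketch stores one extra trailing offset as big-endian 8-byte integers so that each partition's end is known.

```python
import struct


def write_shuffle_output(partitions):
    """Concatenate per-partition bytes into one data blob plus an offset index.

    The index holds len(partitions) + 1 cumulative offsets, so partition i
    occupies data[offsets[i]:offsets[i + 1]].
    """
    data = bytearray()
    offsets = [0]
    for payload in partitions:
        data.extend(payload)
        offsets.append(len(data))
    # Assumption: offsets are stored as big-endian signed 8-byte integers.
    index = struct.pack(f">{len(offsets)}q", *offsets)
    return bytes(data), index


def read_partition(data, index, partition_id):
    """Slice one reducer's partition out of the data blob via the index."""
    start, end = struct.unpack_from(">2q", index, partition_id * 8)
    return data[start:end]


data_file, index_file = write_shuffle_output([b"apple,peach", b"", b"banana"])
print(read_partition(data_file, index_file, 0))
print(read_partition(data_file, index_file, 2))
```

A reducer only needs two adjacent offsets from the index to fetch its slice, which is why one data file per mapper is enough regardless of the number of reducers.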
Three Shuffle Writers
• Different Writer Algorithms
• SortShuffleWriter
• BypassMergeSortShuffleWriter
• UnsafeShuffleWriter
• Used in different situations (optimizations)
• Things to consider
• Reduce total number of files
• Reduce serialization/deserialization when possible
When Are the Different Writers Used?
• Small number of partitions?
---> BypassMergeSortShuffleWriter
• Able to sort records in serialized form?
---> UnsafeShuffleWriter
• Otherwise
---> SortShuffleWriter
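The decision order above can be written as a small function. This is a sketch under stated assumptions, not Spark's actual code: `choose_writer` and its parameter names are hypothetical, the threshold comparison is assumed to be inclusive, and the default of 200 is assumed to match `spark.shuffle.sort.bypassMergeThreshold`.

```python
def choose_writer(num_partitions, map_side_combine, serializer_relocatable,
                  bypass_merge_threshold=200):
    """Pick a shuffle writer name following the order on the slide."""
    # Few partitions and no map-side combine: one temp file per partition.
    # Assumption: the comparison is inclusive.
    if not map_side_combine and num_partitions <= bypass_merge_threshold:
        return "BypassMergeSortShuffleWriter"
    # Records can be sorted in serialized form only if the serializer
    # supports relocation and no map-side aggregation is needed.
    if serializer_relocatable and not map_side_combine:
        return "UnsafeShuffleWriter"
    # Fallback that handles every remaining case.
    return "SortShuffleWriter"


print(choose_writer(10, map_side_combine=False, serializer_relocatable=False))
print(choose_writer(1000, map_side_combine=False, serializer_relocatable=True))
print(choose_writer(1000, map_side_combine=True, serializer_relocatable=True))
```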
BypassMergeSortShuffleWriter
One file for each partition, then merge them
[Diagram: Mapper → BypassMergeSortShuffleWriter → write → Temp File: Partition 0, Temp File: Partition 1, …, Temp File: Partition X → merge → Data File and Index File]
BypassMergeSortShuffleWriter (cont’d)
Used when
• No map side combine
• Number of partitions <= spark.shuffle.sort.bypassMergeThreshold (default: 200)
Pros
• Simple
Cons
• 1 to 1 mapping between temp file and partition
• Many temp files
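The "one temp file per partition, then merge" approach can be sketched as follows. The function name, the text record format, and the toy partitioner (`len(key) % 3`) are all illustrative assumptions; the point is only that no sorting happens and that the number of open temp files equals the number of partitions.

```python
import os
import tempfile


def bypass_merge_write(records, num_partitions, partition_of, out_dir):
    """Write one temp file per partition, then concatenate them in order."""
    temp_paths = [os.path.join(out_dir, f"temp_partition_{p}")
                  for p in range(num_partitions)]
    temp_files = [open(path, "wb") for path in temp_paths]
    try:
        # No sorting: each record goes straight to its partition's file.
        for key, value in records:
            temp_files[partition_of(key)].write(f"{key},{value}\n".encode("utf-8"))
    finally:
        for handle in temp_files:
            handle.close()

    # Merge step: append the temp files in partition order into one data file.
    data_path = os.path.join(out_dir, "shuffle.data")
    lengths = []
    with open(data_path, "wb") as data_file:
        for path in temp_paths:
            with open(path, "rb") as temp:
                chunk = temp.read()
            data_file.write(chunk)
            lengths.append(len(chunk))
            os.remove(path)
    return data_path, lengths


records = [("Orange", 3), ("Apple", 2), ("Peach", 5), ("Pear", 1)]
with tempfile.TemporaryDirectory() as tmp:
    path, partition_lengths = bypass_merge_write(
        records, num_partitions=3,
        partition_of=lambda key: len(key) % 3, out_dir=tmp)
    with open(path, "rb") as merged_file:
        merged = merged_file.read()
print(partition_lengths)
```

The per-partition lengths returned here are exactly what an index file would need in order to record the offsets.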
SortShuffleWriter
• Why sort?
• Sort records by PartitionId, to separate records by different partitions
• Reduce number of files: number of spill files < number of partitions
• Buffer (in memory):
• PartitionedAppendOnlyMap (when there is map side combine)
• PartitionedPairBuffer (when there is no map side combine)
[Diagram: Mapper → SortShuffleWriter → ExternalSorter Buffer → Spill File (Sorted), …, Spill File (Sorted) → merge → Data File and Index File]
SortShuffleWriter (cont’d)
Used when
• Has map side combine, or
• Many partitions and the serializer does not support record relocation
Pros
• Flexible, supports all shuffle situations
Cons
• Serializes/deserializes multiple times
Internal configuration to control spill behavior
(inside Spillable.scala):
• spark.shuffle.spill.initialMemoryThreshold
• spark.shuffle.spill.numElementsForceSpillThreshold
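A rough sketch of the buffer, spill, and merge cycle. Spill "files" are plain in-memory lists here, `max_buffered` is a hypothetical stand-in for Spark's memory-based and element-count spill thresholds, and the partitioner is a toy one.

```python
import heapq
from operator import itemgetter


def sort_shuffle_write(records, partition_of, max_buffered=3):
    """Buffer records, spill sorted runs when the buffer fills, then merge.

    Returns (merged_records, number_of_spills); each merged record is a
    (partition_id, key, value) tuple ordered by partition_id.
    """
    by_partition = itemgetter(0)
    spills, buffer = [], []
    for key, value in records:
        buffer.append((partition_of(key), key, value))
        if len(buffer) >= max_buffered:
            # Spill: sort the buffered records by partition ID and set them aside.
            spills.append(sorted(buffer, key=by_partition))
            buffer = []
    if buffer:
        spills.append(sorted(buffer, key=by_partition))
    # Merge the sorted runs into one stream ordered by partition ID.
    merged = list(heapq.merge(*spills, key=by_partition))
    return merged, len(spills)


records = [("Orange", 3), ("Apple", 2), ("Peach", 5), ("Pear", 1),
           ("Peach", 3), ("Banana", 2), ("Grape", 5)]
merged, num_spills = sort_shuffle_write(records, lambda key: len(key) % 3)
print(num_spills, [partition for partition, _, _ in merged])
```

The number of spill runs depends on how much data fits in the buffer, not on the number of partitions, which is the file-count advantage over the bypass writer.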
UnsafeShuffleWriter
• Record serialized once, then stored in memory pages
• 8-byte record pointer (pointing to: memory page + offset)
• All record pointers stored in a long array
• Sort record pointers (long array)
• Small memory footprint
• Better fit for CPU cache
• Sorter class: ShuffleExternalSorter
[Diagram: serialized records are stored in Memory Page 1, Memory Page 2, …, Memory Page xxx; the 8-byte pointers (Record 1, Record 2, …) are stored and sorted as an array]
UnsafeShuffleWriter (cont’d)
Used when
• Serializer supports record relocation
• No aggregator
Pros
• Single serialization, no deserialization/serialization for merging spill files
• Sorting is CPU cache friendly
Cons
• Not supported with the default serializer (JavaSerializer); supported with KryoSerializer
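The pointer-sorting idea can be sketched with plain integers. The slide only says the pointer refers to a memory page and an offset; the sketch additionally assumes that the partition ID sits in the upper bits, so that sorting the 8-byte values groups records by partition without touching the serialized bytes. The 24/13/27 bit split and all names below are assumptions for illustration.

```python
from array import array

# Assumed layout, 64 bits in total: partition ID | page number | offset in page.
PARTITION_BITS, PAGE_BITS, OFFSET_BITS = 24, 13, 27


def pack_pointer(partition_id, page, offset):
    """Pack the three fields into a single 64-bit integer."""
    if not 0 <= partition_id < (1 << PARTITION_BITS):
        raise ValueError("partition_id out of range")
    if not 0 <= page < (1 << PAGE_BITS):
        raise ValueError("page out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (partition_id << (PAGE_BITS + OFFSET_BITS)) | (page << OFFSET_BITS) | offset


def unpack_pointer(pointer):
    """Recover (partition_id, page, offset) from a packed pointer."""
    offset = pointer & ((1 << OFFSET_BITS) - 1)
    page = (pointer >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    partition_id = pointer >> (PAGE_BITS + OFFSET_BITS)
    return partition_id, page, offset


# 'Q' is an unsigned 8-byte integer: one array slot per record pointer.
pointers = array("Q", [
    pack_pointer(2, 0, 0),
    pack_pointer(0, 1, 64),
    pack_pointer(1, 0, 128),
    pack_pointer(0, 0, 32),
])
sorted_pointers = array("Q", sorted(pointers))
print([unpack_pointer(p) for p in sorted_pointers])
```

Only the compact integer array is sorted; the record bytes stay where they were written, which is why the approach has a small footprint and is cache friendly.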
Serializer: JavaSerializer
• Default serializer in Spark
• spark.serializer=org.apache.spark.serializer.JavaSerializer
• Uses object references in the serialized stream
• Writes a reference instead of the whole object for a repeated (same) object
• Does not support record relocation
• Cannot move a record within the serialized stream because of object references
• Pros: supports serialization in all situations
• Cons: poor performance (slow, with large serialized output)
Serializer: KryoSerializer
• Uses the Kryo library
• Does not use object references across records in the serialized stream by default
• Supports record relocation
• Because there are no object references between records, each serialized object is independent
• Classes should be explicitly registered for serialization; otherwise, Kryo writes the fully qualified class name with each serialized object
• Pros: good performance for common classes and registered classes (see KryoSerializer.scala)
• Cons: poor performance for custom classes that are not registered, so they need to be registered explicitly
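Why stream-level object references prevent record relocation can be shown with an analogy in Python's `pickle` module. This is not how JavaSerializer or KryoSerializer are implemented; it only reproduces the same effect. A single `Pickler` reused across records remembers objects it has already written and emits a back-reference for a repeat, while a fresh serialization per record keeps every record self-contained. All function names are hypothetical.

```python
import io
import pickle


def serialize_shared(records):
    """One Pickler for the whole stream: repeated objects become back-references."""
    stream = io.BytesIO()
    pickler = pickle.Pickler(stream)
    chunks, start = [], 0
    for record in records:
        pickler.dump(record)
        end = stream.tell()
        chunks.append(stream.getvalue()[start:end])
        start = end
    return chunks


def serialize_independent(records):
    """Fresh pickling state per record: every chunk is self-contained."""
    return [pickle.dumps(record) for record in records]


def relocatable(chunks, records):
    """True if each chunk still decodes correctly on its own after reordering."""
    try:
        decoded = [pickle.loads(chunk) for chunk in reversed(chunks)]
    except Exception:
        # A back-reference to an object that is no longer earlier in the stream.
        return False
    return decoded == list(reversed(records))


shared = ["Peach", 3]
records = [shared, shared]  # the same object written twice

shared_chunks = serialize_shared(records)
independent_chunks = serialize_independent(records)
print(relocatable(shared_chunks, records))
print(relocatable(independent_chunks, records))
```

With the shared stream, the second chunk is just a short reference to the first, so it cannot be moved or decoded alone. That is the property a sorter working on serialized bytes cannot tolerate.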
Shuffle Reader: BlockStoreShuffleReader
[Diagram: the reducer's BlockStoreShuffleReader reads its partition from the Data File of Mapper 1 and Mapper 2, located through each Index File (Offset 1, Offset 2, Offset 3 for Partition 1, Partition 2, Partition 3). Records pass through the Aggregator, which uses ExternalAppendOnlyMap (with HashComparator) and may write Spill Files, and come out as an Iterator. If ordering by key, the records then go through ExternalSorter, which produces the final Iterator.]
External Shuffle Service
• YarnShuffleService / MesosExternalShuffleService
• YarnShuffleService: runs inside the YARN Node Manager as an AuxiliaryService
• Runs on each machine in the YARN/Mesos cluster
• Gets shuffle files from local disk and streams them to reducers
• Uses a file name convention to locate shuffle files (ExternalShuffleBlockResolver)
• "shuffle_" + shuffleId + "_" + mapId + "_0.index"
• "shuffle_" + shuffleId + "_" + mapId + "_0.data"
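The naming convention above is simple enough to express directly. The helper name is hypothetical; the string pattern is the one quoted on the slide.

```python
def shuffle_file_names(shuffle_id, map_id):
    """Build the index and data file names for one map output."""
    base = f"shuffle_{shuffle_id}_{map_id}_0"
    return f"{base}.index", f"{base}.data"


index_name, data_name = shuffle_file_names(shuffle_id=3, map_id=7)
print(index_name)
print(data_name)
```

Because the names are derived only from the shuffle ID and map ID, the service can locate a mapper's output on local disk without asking the executor that wrote it.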
Suggestions / Takeaway
• Shuffle is expensive; avoid unnecessary shuffles
• Shuffle vs Cache (Dataset.persist(…))
• Shuffle files provide the full data set for the next stage's execution
• Caching may not be necessary when there is a shuffle (unless cache replicas are wanted)
• Use KryoSerializer if possible
• Tune the relevant configurations
• spark.shuffle.sort.bypassMergeThreshold
• spark.shuffle.spill.initialMemoryThreshold
• spark.shuffle.spill.numElementsForceSpillThreshold

Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark

  • 1. Spark Shuffle Deep Dive Bo Yang
  • 2. Content • Overview • Major Classes • Shuffle Writer • Spark Serializer • Shuffle Reader • External Shuffle Service • Suggestions
  • 3. Shuffle Overview Mapper 1 Orange 3 Apple 2 Peach 5 Pear 1 Mapper 2 Peach 3 Banana 2 Grape 5 Reducer 1 Apple 2 Peach 8 Pear 1 Reducer 2 Grape 5 Orange 3 Reducer 3 Banana 2
  • 4. High Level Abstraction • Pluggable Interface: ShuffleManager • registerShuffle(…) • getWriter(…) • getReader(…) • Configurable: spark.shuffle.manager=xxx • Mapper: ShuffleWriter • write(records: Iterator) • Reducer: ShuffleReader • read(): Iterator
  • 5. Implementations • SortShuffleManager (extends ShuffleManager) • Three Writers (optimized for different scenarios) • SortShuffleWriter: uses ExternalSorter • BypassMergeSortShuffleWriter: no sorter • UnsafeShuffleWriter: uses ShuffleExternalSorter • One Reader • BlockStoreShuffleReader, uses • ExternalAppendOnlyMap • ExternalSorter (if ordering)
  • 6. Writer Output Example (Shuffle Files) Mapper 1 Data File Index File Reducer 1 Reducer 2 Reducer 3 Offset 1 Partition 1 Partition 2 Partition 3 Offset 2 Offset 3 Mapper 2 Data File Index File Offset 1 Partition 1 Partition 2 Partition 3 Offset 2 Offset 3 Number of Partitions == Number of Reducers
  • 7. Three Shuffle Writers • Different Writer Algorithms • SortShuffleWriter • BypassMergeSortShuffleWriter • UnsafeShuffleWriter • Used in different situations (optimizations) • Things to consider • Reduce total number of files • Reduce serialization/deserialization when possible
  • 8. When Different Writers Are Used? • Small number of partitions? ---> BypassMergeSortShuffleWriter • Able to sort record in serialized form? ---> UnsafeShuffleWriter • Otherwise ---> SortShuffleWriter
  • 9. BypassMergeSortShuffleWriter One file for each partition, then merge them Mapper BypassMergeSort ShuffleWriter Temp File: Partition 0 … Temp File: Partition X Index File Data File merge Temp File: Partition 1 write
  • 10. BypassMergeSortShuffleWriter (cont’d) Used when • No map side combine • Number of partitions < spark.shuffle.sort.bypassMergeThreshold Pros • Simple Cons • 1 to 1 mapping between temp file and partition • Many temp files
  • 11. SortShuffleWriter • Why sort? • Sort records by PartitionId, to separate records by different partitions • Reduce number of files: number of spill files < number of partitions • Buffer (in memory): • PartitionedAppendOnlyMap (when there is map side combine) • PartitionedPairBuffer (when there is no map side combine) • Mapper ---> SortShuffleWriter ---> ExternalSorter (Buffer) ---> Spill File (Sorted), …, Spill File (Sorted) ---> (merge) Data File + Index File
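
The buffer, spill, and merge flow can be sketched in a few lines. This is an illustration under simplifying assumptions rather than ExternalSorter itself: in-memory lists stand in for spill files, the buffer limit counts records instead of bytes, and `SpillingSorter` is a hypothetical name.

```python
import heapq


class SpillingSorter:
    """Buffers (partition_id, record) pairs, spilling sorted runs when full."""

    def __init__(self, max_buffered):
        self.max_buffered = max_buffered
        self.buffer = []
        self.spills = []  # each spill is a list sorted by partition id

    def insert(self, partition_id, record):
        self.buffer.append((partition_id, record))
        if len(self.buffer) >= self.max_buffered:
            self._spill()

    def _spill(self):
        # Sort by partition id only; sorted() is stable, so records keep
        # their arrival order within a partition.
        self.spills.append(sorted(self.buffer, key=lambda pair: pair[0]))
        self.buffer = []

    def merged(self):
        """Merge all sorted runs into one list grouped by partition id."""
        if self.buffer:
            self._spill()
        return list(heapq.merge(*self.spills, key=lambda pair: pair[0]))


sorter = SpillingSorter(max_buffered=3)
for pid, rec in [(2, "grape"), (0, "apple"), (1, "orange"),
                 (0, "peach"), (2, "banana")]:
    sorter.insert(pid, rec)
output = sorter.merged()
```

Because each spill is already sorted by partition id, the final merge is a streaming operation, and the number of spills depends on the buffer size rather than on the number of partitions.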
  • 12. SortShuffleWriter (cont’d) Used when • Has map side combine, or many partitions • Or the serializer does not support record relocation Pros • Flexible, supports all shuffle situations Cons • Serializes/deserializes multiple times Internal configuration to control spill behavior (inside Spillable.scala): • spark.shuffle.spill.initialMemoryThreshold • spark.shuffle.spill.numElementsForceSpillThreshold
  • 13. UnsafeShuffleWriter • Record serialized once, then stored in memory pages • 8-byte record pointer (pointing to: memory page + offset) • All record pointers stored in a long array • Sort record pointers (long array) • Small memory footprint • Better fit for the CPU cache • Sorter class: ShuffleExternalSorter • Memory Page 1, Memory Page 2, …, Memory Page xxx hold the serialized records • Record pointer 1 (8 bytes), Record pointer 2 (8 bytes), … are stored and sorted as an array
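
A sketch of the 8-byte pointer idea: pack everything needed to sort and locate a record into one 64-bit integer, then sort the integers. The bit widths below (24-bit partition id, 13-bit page number, 27-bit offset) are my recollection of Spark's PackedRecordPointer and should be treated as an assumption; the slide itself only mentions page and offset. `pack` and `unpack` are hypothetical helper names.

```python
from array import array

PARTITION_BITS = 24  # assumed width of the partition id
PAGE_BITS = 13       # assumed width of the memory page number
OFFSET_BITS = 27     # assumed width of the offset within a page


def pack(partition_id, page, offset):
    """Pack partition id, page number, and in-page offset into 64 bits."""
    if not 0 <= partition_id < (1 << PARTITION_BITS):
        raise ValueError("partition id out of range")
    if not 0 <= page < (1 << PAGE_BITS):
        raise ValueError("page number out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (partition_id << (PAGE_BITS + OFFSET_BITS)) | (page << OFFSET_BITS) | offset


def unpack(pointer):
    """Inverse of pack: return (partition_id, page, offset)."""
    offset = pointer & ((1 << OFFSET_BITS) - 1)
    page = (pointer >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    partition_id = pointer >> (PAGE_BITS + OFFSET_BITS)
    return partition_id, page, offset


# One unsigned 64-bit slot per record; the record bytes stay in their pages.
pointers = array("Q", [pack(2, 0, 0), pack(0, 0, 64), pack(1, 1, 0), pack(0, 1, 128)])
# The partition id sits in the top bits, so a plain numeric sort groups
# pointers by partition without touching the serialized records.
sorted_pointers = array("Q", sorted(pointers))
```

Sorting only moves 8-byte integers, which is what keeps the memory footprint small and the sort cache friendly.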
  • 14. UnsafeShuffleWriter (cont’d) Used when • Serializer supports record relocation • No aggregator Pros • Single serialization, no deserialization/serialization for merging spill files • Sorting is CPU cache friendly Cons • Not supported when using default serializer (JavaSerializer), supported when using KryoSerializer
  • 15. Serializer: JavaSerializer • Default serializer in Spark • spark.serializer=org.apache.spark.serializer.JavaSerializer • Uses object references in the serialized stream • Writes a reference instead of the whole object for a repeated (same) object • Does not support record relocation • Cannot move a record within the serialized stream because of object references • Pros: supports serialization in all situations • Cons: poor performance
  • 16. Serializer: KryoSerializer • Uses the kryo library • Does not use object references in the serialized stream by default • Supports record relocation • Because there are no object references, each serialized object is independent • Need to explicitly register classes for serialization; otherwise, it writes the fully qualified class name for each serialized object • Pros: good performance for common classes and registered classes (see KryoSerializer.scala) • Cons: poor performance for custom classes that are not explicitly registered
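
Why object references prevent record relocation can be shown with an analogy in Python's `pickle` module. This is only an analogy, not how Spark's JavaSerializer or KryoSerializer is implemented: a reused `Pickler` remembers objects it has already written and emits a back-reference for a repeat, while independent `pickle.dumps` calls produce records that stand alone. `loads_standalone` is a hypothetical helper.

```python
import io
import pickle

record = ["Peach", 3]

# One long-lived Pickler keeps a memo of objects it has already written,
# so the second dump emits a back-reference instead of the full record.
stream = io.BytesIO()
shared = pickle.Pickler(stream)
shared.dump(record)
first_end = stream.tell()
shared.dump(record)
second_bytes = stream.getvalue()[first_end:]


def loads_standalone(blob):
    """Try to decode one record without the bytes that preceded it."""
    try:
        return pickle.loads(blob)
    except Exception:  # the back-reference cannot be resolved on its own
        return None


# Independent serialization: every record carries everything it needs.
independent_bytes = pickle.dumps(record)
```

The second record from the shared stream is shorter, but it is meaningless once moved away from the first one; the independently serialized record can be copied or reordered freely, which is the property a sorter over serialized bytes depends on.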
  • 17. Shuffle Reader: BlockStoreShuffleReader • Mapper 1 and Mapper 2: each has a Data File (Partition 1, Partition 2, Partition 3) + Index File (Offset 1, Offset 2, Offset 3) • Reducer: BlockStoreShuffleReader ---> Aggregator: ExternalAppendOnlyMap (uses HashComparator; Spill File, …, Spill File) ---> Iterator • If ordering by key ---> ExternalSorter ---> Iterator
  • 18. External Shuffle Service • YarnShuffleService / MesosExternalShuffleService • YarnShuffleService: runs inside the YARN Node Manager as an AuxiliaryService • Runs on each machine in the YARN/Mesos cluster • Gets shuffle files from local disk and streams them to reducers • Uses a file name convention to locate shuffle files (ExternalShuffleBlockResolver) • "shuffle_" + shuffleId + "_" + mapId + "_0.index" • "shuffle_" + shuffleId + "_" + mapId + "_0.data"
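
The file name convention on this slide translates directly into a helper. `shuffle_file_names` is a hypothetical name used only for illustration; the pattern itself is the one quoted on the slide.

```python
def shuffle_file_names(shuffle_id, map_id):
    """Build the index and data file names for one mapper's shuffle output."""
    base = f"shuffle_{shuffle_id}_{map_id}_0"
    return f"{base}.index", f"{base}.data"


index_name, data_name = shuffle_file_names(3, 7)
```

Because the names depend only on the shuffle id and map id, a separate service process can find the files without any state from the executor that wrote them.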
  • 19. Suggestions / Takeaway • Shuffle is expensive, avoid unnecessary shuffles • Shuffle vs Cache (Dataset.persist(…)) • Shuffle files provide the full data set for next stage execution • Cache may not be necessary when there is a shuffle (unless cache replicas are wanted) • Use KryoSerializer if possible • Tune the relevant configurations • spark.shuffle.sort.bypassMergeThreshold • spark.shuffle.spill.initialMemoryThreshold • spark.shuffle.spill.numElementsForceSpillThreshold
