A Deeper Understanding of Spark’s Internals
Aaron Davidson
07/01/2014
This Talk
•  Goal: Understanding how Spark runs, with a focus on performance
•  Major core components:
– Execution Model
– The Shuffle
– Caching
Why understand internals?
Goal: Find number of distinct names per "first letter"

sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues(names => names.toSet.size)
  .collect()

Walking the example data through each operation:
Input:               Andy, Pat, Ahir
After map():         (A, Andy), (P, Pat), (A, Ahir)
After groupByKey():  (A, [Ahir, Andy]), (P, [Pat])
After toSet:         (A, Set(Ahir, Andy)), (P, Set(Pat))
After .size:         (A, 2), (P, 1)
After collect():     res0 = [(A, 2), (P, 1)]
Spark Execution Model
1.  Create DAG of RDDs to represent computation
2.  Create logical execution plan for DAG
3.  Schedule and execute individual tasks
Step 1: Create RDDs
sc.textFile("hdfs:/names")
map(name => (name.charAt(0), name))
groupByKey()
mapValues(names => names.toSet.size)
collect()
Step 1: Create RDDs
HadoopRDD → map() → groupBy() → mapValues() → collect()
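These calls only build the DAG; nothing runs until the action at the end. A minimal sketch of that laziness, assuming a live SparkContext sc (and, on the Spark 1.x of this talk, the pair-RDD implicits):

import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

// Transformations are lazy: these calls only construct RDDs and record
// their lineage; no cluster work happens here.
val grouped = sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
// Only an action such as grouped.collect() triggers a job.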
Step 2: Create execution plan
•  Pipeline as much as possible
•  Split into "stages" based on need to reorganize data

Stage 1: HadoopRDD → map()
Stage 2: groupBy() → mapValues() → collect()

Stage 1 input:   Andy, Pat, Ahir
Stage 1 output:  (A, Andy), (P, Pat), (A, Ahir)
Stage 2 input:   (A, [Ahir, Andy]), (P, [Pat])
Stage 2 output:  (A, 2), (P, 1)
res0 = [(A, 2), (P, 1)]
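You can see these stage boundaries yourself: RDD.toDebugString prints the lineage, and each new indentation level marks a shuffle dependency, i.e. a stage split. A sketch, assuming a live SparkContext sc:

import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

val rdd = sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
// Each indented block in the printed lineage corresponds to one stage.
println(rdd.toDebugString)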
Step 3: Schedule tasks
•  Split each stage into tasks
•  A task is data + computation
•  Execute all tasks within a stage before moving on
Step 3: Schedule tasks
Each Stage 1 task pairs one piece of data with the stage's computation (HadoopRDD → map()):

Task 0: hdfs:/names/0.gz
Task 1: hdfs:/names/1.gz
Task 2: hdfs:/names/2.gz
Task 3: hdfs:/names/3.gz
…
Step 3: Schedule tasks
Over time, the four Stage 1 tasks run across a three-node HDFS cluster whose blocks are replicated:

HDFS node 1: /names/0.gz, /names/3.gz
HDFS node 2: /names/1.gz, /names/2.gz
HDFS node 3: /names/2.gz, /names/3.gz

Tasks are placed for data locality: the HadoopRDD → map() task for /names/0.gz runs on node 1, the one for /names/1.gz on node 2, and the one for /names/2.gz on node 3. As a slot frees up, the remaining task (/names/3.gz) runs on a node that holds that block.
The Shuffle
Stage 1: HadoopRDD → map()
Stage 2: groupBy() → mapValues() → collect()
The Shuffle
Stage 1 → Stage 2
•  Redistributes data among partitions
•  Hash keys into buckets
•  Optimizations:
– Avoided when possible, if data is already properly partitioned
– Partial aggregation reduces data movement (see the sketch below)
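A one-line illustration of partial aggregation, sketched assuming a live SparkContext sc: reduceByKey combines values within each map-side partition before the shuffle, so far fewer records cross the network than with groupByKey().

import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

val pairs = sc.parallelize(Seq(("A", 1), ("A", 1), ("P", 1)))
// Partial sums are computed map-side; only one record per key per
// partition is shuffled.
val counts = pairs.reduceByKey(_ + _)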
The Shuffle
Stage 1 → Disk → Stage 2
•  Pull-based, not push-based
•  Write intermediate files to disk
Execution of a groupBy()
•  Build hash map within each partition
•  Note: Can spill across keys, but a single key-value pair must fit in memory
A => [Arsalan, Aaron, Andrew, Andrew, Andy, Ahir, Ali, …],
E => [Erin, Earl, Ed, …]
…
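An illustrative sketch of that per-partition grouping in plain Scala (not Spark's actual implementation, which can spill to disk between keys): every value for a key is appended to one in-memory buffer, which is why a single huge key hurts.

import scala.collection.mutable

// Group records within one post-shuffle partition: all values for a key
// accumulate in a single buffer, so that buffer must fit in memory.
def groupWithinPartition[K, V](records: Iterator[(K, V)]): Iterator[(K, Seq[V])] = {
  val groups = mutable.HashMap.empty[K, mutable.ArrayBuffer[V]]
  for ((k, v) <- records)
    groups.getOrElseUpdate(k, mutable.ArrayBuffer.empty[V]) += v
  groups.iterator.map { case (k, buf) => (k, buf.toSeq) }
}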
Done!
Stage 1: HadoopRDD → map()
Stage 2: groupBy() → mapValues() → collect()
What went wrong?
•  Too few partitions to get good concurrency
•  Large per-key groupBy()
•  Shipped all data across the cluster
Common issue checklist
1.  Ensure enough partitions for concurrency
2.  Minimize memory consumption (esp. of sorting and large keys in groupBys)
3.  Minimize amount of data shuffled
4.  Know the standard library
1 & 2 are about tuning number of partitions!
Importance of Partition Tuning
•  Main issue: too few partitions
–  Less concurrency
–  More susceptible to data skew
–  Increased memory pressure for groupBy, reduceByKey, sortByKey, etc.
•  Secondary issue: too many partitions
•  Need a "reasonable number" of partitions
–  Commonly between 100 and 10,000 partitions
–  Lower bound: at least ~2x number of cores in cluster (see the sketch below)
–  Upper bound: ensure tasks take at least 100ms
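A quick way to check and adjust this, sketched assuming a live SparkContext sc (the target of 200 is illustrative, for a hypothetical ~100-core cluster):

val names = sc.textFile("hdfs:/names")
// How many partitions did we actually get? (Often one per HDFS block.)
println(names.partitions.size)
// Starting point: at least ~2x the cluster's total core count.
val tuned = names.repartition(200)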
Memory Problems
•  Symptoms:
–  Inexplicably bad performance
–  Inexplicable executor/machine failures (can indicate too many shuffle files too)
•  Diagnosis:
–  Set spark.executor.extraJavaOptions to include
•  -XX:+PrintGCDetails
•  -XX:+HeapDumpOnOutOfMemoryError
–  Check dmesg for oom-killer logs
•  Resolution:
–  Increase spark.executor.memory (see the sketch below)
–  Increase number of partitions
–  Re-evaluate program structure (!)
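A sketch of wiring up those diagnosis and resolution knobs in application code (property names as above; the 4g value is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("names")
  // Diagnosis: emit GC logs and dump the heap on OutOfMemoryError.
  .set("spark.executor.extraJavaOptions",
       "-XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError")
  // Resolution: give each executor more memory.
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)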
Fixing our mistakes
sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues { names => names.toSet.size }
  .collect()
1.  Ensure enough partitions for concurrency
2.  Minimize memory consumption (esp. of large groupBys and sorting)
3.  Minimize data shuffle
4.  Know the standard library
Fixing our mistakes
sc.textFile("hdfs:/names")
  .repartition(6)
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues { names => names.toSet.size }
  .collect()
Checklist item 1: repartition(6) ensures enough partitions for concurrency.
Fixing our mistakes
sc.textFile("hdfs:/names")
  .repartition(6)
  .distinct()
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues { names => names.toSet.size }
  .collect()
Checklist item 2: distinct() deduplicates names early, shrinking the per-key groups held in memory.
Fixing our mistakes
sc.textFile("hdfs:/names")
  .repartition(6)
  .distinct()
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues { names => names.size }
  .collect()
With the names already distinct, names.size replaces names.toSet.size, avoiding building a Set per key.
Fixing our mistakes
sc.textFile("hdfs:/names")
  .distinct(numPartitions = 6)
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues { names => names.size }
  .collect()
Checklist item 4: distinct(numPartitions = 6) folds the repartition into distinct's own shuffle, saving a whole shuffle.
Fixing our mistakes
sc.textFile("hdfs:/names")
  .distinct(numPartitions = 6)
  .map(name => (name.charAt(0), 1))
  .reduceByKey(_ + _)
  .collect()
Checklist items 3 & 4: reduceByKey aggregates counts map-side, so the shuffle moves only one small record per key per partition.
Fixing our mistakes
sc.textFile("hdfs:/names")
  .distinct(numPartitions = 6)
  .map(name => (name.charAt(0), 1))
  .reduceByKey(_ + _)
  .collect()

Original:
sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues { names => names.toSet.size }
  .collect()
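One more standard-library option worth knowing, sketched here as an alternative (it trades exactness for speed and is not what the final version above uses): PairRDDFunctions.countApproxDistinctByKey estimates the distinct values per key with bounded relative error, with no exact distinct() pass at all.

import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  // Approximate distinct-name count per first letter (~5% relative error).
  .countApproxDistinctByKey(0.05)
  .collect()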
Questions?
