Tuning and Debugging in Apache Spark
Patrick Wendell @pwendell
February 20, 2015
About Me
Apache Spark committer, PMC member, and release manager
Worked on Spark at UC Berkeley when the project started
Today, managing Spark efforts at Databricks
About Databricks
Founded by creators of Spark in 2013
Donated Spark to the ASF and remains the largest contributor
End-to-End hosted service: Databricks Cloud
Today’s Talk
Help you understand and debug Spark programs
Related talk this afternoon:
Assumes you know Spark core API concepts; focused on internals
Spark’s Execution Model
The key to tuning Spark apps is a sound grasp of Spark’s internal mechanisms.
Key Question
How does a user program get translated into units of physical execution (jobs, stages, and tasks)?
RDD API Refresher
RDDs are distributed collections of records
rdd = sc.parallelize(range(10000), 10)
Transformations create new RDDs from existing ones
errors = rdd.filter(lambda line: "ERROR" in line)
Actions materialize a value in the user program
size = errors.count()
RDD API Example
// Read input file
val input = sc.textFile("input.txt")
val tokenized = input
  .map(line => line.split(" "))
  .filter(words => words.size > 0) // remove empty lines
val counts = tokenized // frequency of log levels
  .map(words => (words(0), 1))
  .reduceByKey((a, b) => a + b, 2)
input.txt:
INFO Server started
INFO Bound to port 8080
WARN Cannot find srv.conf
Transformations
sc.textFile().map().filter().map().reduceByKey()
DAG View of RDDs
textFile() map() filter() map() reduceByKey()
[Figure: DAG of RDDs for the example. Hadoop RDD -> Mapped RDD -> Filtered RDD -> Mapped RDD (3 partitions each) -> Shuffle RDD (2 partitions), labeled input, tokenized, counts.]
Transformations build up a DAG, but don’t “do anything”
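The point that transformations only build a DAG can be illustrated with a minimal pure-Python sketch. LazyRDD is a hypothetical toy class written for this illustration, not Spark's RDD: map and filter only record a new node pointing at its parent, and nothing is evaluated until the count action runs.

```python
class LazyRDD:
    """Hypothetical toy RDD: records lineage, computes only on an action."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent   # upstream node in the DAG (None for a source)
        self.fn = fn           # per-partition function: iterator -> iterator
        self.data = data       # list of partitions, only set on a source node

    @staticmethod
    def parallelize(records, num_partitions):
        records = list(records)
        parts = [records[i::num_partitions] for i in range(num_partitions)]
        return LazyRDD(data=parts)

    # Transformations: build a new node, do no work.
    def map(self, f):
        return LazyRDD(parent=self, fn=lambda it: (f(x) for x in it))

    def filter(self, pred):
        return LazyRDD(parent=self, fn=lambda it: (x for x in it if pred(x)))

    def _compute(self, index):
        if self.parent is None:
            return iter(self.data[index])
        return self.fn(self.parent._compute(index))

    def num_partitions(self):
        node = self
        while node.parent is not None:
            node = node.parent
        return len(node.data)

    # Action: the only place where partitions are actually evaluated.
    def count(self):
        return sum(sum(1 for _ in self._compute(i))
                   for i in range(self.num_partitions()))


calls = []


def tokenize(line):
    calls.append(line)          # side effect shows when work really happens
    return line.split(" ")


lines = LazyRDD.parallelize(
    ["INFO Server started", "WARN Cannot find srv.conf"], 2)
tokenized = lines.map(tokenize).filter(lambda words: len(words) > 0)
calls_before_action = len(calls)   # nothing has run yet, only the DAG exists
n = tokenized.count()              # the action triggers evaluation
```

Recording calls to tokenize makes the laziness observable: the list stays empty until count() is invoked.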
Evaluation of the DAG
We mentioned “actions” a few slides ago. Let’s forget them for a minute.
DAGs are materialized through a method, sc.runJob:
def runJob[T, U](
  rdd: RDD[T],               // 1. RDD to compute
  partitions: Seq[Int],      // 2. Which partitions
  func: (Iterator[T]) => U   // 3. Fn to produce results
): Array[U]
How runJob Works
To compute an RDD, runJob needs to compute its parents, their parents, and so on, all the way back to an RDD with no dependencies (e.g. HadoopRDD).
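The recursive walk back through the parents can be sketched in plain Python. Node and run_job below are hypothetical stand-ins that mirror the simplified runJob signature from the slides (RDD, partitions, function); they are not Spark's implementation. The trace list records the order in which RDDs are visited.

```python
class Node:
    """Hypothetical RDD node: a source holds data, others hold a parent."""

    def __init__(self, name, parent=None, fn=None, partitions=None):
        self.name = name
        self.parent = parent          # None means "no dependencies"
        self.fn = fn                  # iterator -> iterator over one partition
        self.partitions = partitions  # only set on the source node

    def compute(self, index, trace):
        trace.append(self.name)       # record the order of the lineage walk
        if self.parent is None:
            return iter(self.partitions[index])
        return self.fn(self.parent.compute(index, trace))


def run_job(rdd, partitions, func, trace):
    """Apply func to the iterator of each requested partition of rdd."""
    return [func(rdd.compute(i, trace)) for i in partitions]


source = Node("HadoopRDD", partitions=[["INFO a", "WARN b"], ["INFO c"], [""]])
mapped = Node("MappedRDD", parent=source,
              fn=lambda it: (line.split(" ") for line in it))
filtered = Node("FilteredRDD", parent=mapped,
                fn=lambda it: (w for w in it if w[0] != ""))

trace = []
sizes = run_job(filtered, [0, 1, 2], lambda it: sum(1 for _ in it), trace)
```

For each requested partition the walk goes FilteredRDD, MappedRDD, HadoopRDD, stopping at the node that has no parent.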
[Figure: runJob(counts) on the example DAG. Hadoop RDD -> Mapped RDD -> Filtered RDD -> Mapped RDD (3 partitions each) -> Shuffle RDD (2 partitions), labeled input, tokenized, counts.]
Physical Optimizations
1. Certain types of transformations can be pipelined.
2. If dependent RDDs have already been cached (or persisted in a shuffle), the graph can be truncated.
Once pipelining and truncation occur, Spark produces a set of stages; each stage is composed of tasks.
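A rough way to picture pipelining and truncation is a function that walks the lineage backwards, groups RDDs into stages, cuts at shuffle boundaries, and stops at a cached RDD. RDDNode, build_stages, and example_lineage are hypothetical names for this sketch, which assumes a single linear lineage and is not how Spark's scheduler is actually written.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RDDNode:
    name: str
    parent: Optional["RDDNode"] = None
    shuffle: bool = False   # True if the dependency on `parent` is a shuffle
    cached: bool = False    # True if this RDD is already materialized


def build_stages(final: RDDNode) -> List[List[str]]:
    """Return stages in execution order; each stage lists pipelined RDDs."""
    stages = []
    current = []
    node = final
    while node is not None:
        current.append(node.name)
        if node.cached:
            break                          # truncation: skip the ancestors
        if node.shuffle:
            stages.append(current[::-1])   # close the stage at the shuffle
            current = []
        node = node.parent
    if current:
        stages.append(current[::-1])
    return stages[::-1]


def example_lineage(filtered_is_cached: bool = False) -> RDDNode:
    """Lineage of the log-level count example from the earlier slides."""
    hadoop = RDDNode("HadoopRDD")
    mapped = RDDNode("MappedRDD", parent=hadoop)
    filtered = RDDNode("FilteredRDD", parent=mapped, cached=filtered_is_cached)
    keyed = RDDNode("MappedRDD", parent=filtered)
    return RDDNode("ShuffledRDD", parent=keyed, shuffle=True)


stages = build_stages(example_lineage())
stages_after_caching = build_stages(example_lineage(filtered_is_cached=True))
```

Without caching, the four narrow RDDs collapse into one stage followed by the shuffle stage; with the filtered RDD cached, the first stage starts at that RDD instead of at the input.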
Stage Graph
Stage 1 (Task 1, Task 2, Task 3): Input -> Shuffle write
Each task will:
1. Read Hadoop input
2. Perform maps and filters
3. Write partial sums
Stage 2 (Task 1, Task 2): Shuffle read
Each task will:
1. Read partial sums
2. Invoke user function passed to runJob.
Units of Physical Execution
Jobs: Work required to compute the RDD passed to runJob.
Stages: A wave of work within a job, corresponding to one or more pipelined RDDs.
Tasks: A unit of work within a stage, corresponding to one RDD partition.
Shuffle: The transfer of data between stages.
Seeing this on your own
scala> counts.toDebugString
res84: String =
(2) ShuffledRDD[296] at reduceByKey at <console>:17
 +-(3) MappedRDD[295] at map at <console>:17
    |  FilteredRDD[294] at filter at <console>:15
    |  MappedRDD[293] at map at <console>:15
    |  input.text MappedRDD[292] at textFile at <console>:13
    |  input.text HadoopRDD[291] at textFile at <console>:13
(indentation indicates a shuffle boundary)
Example: count() action
class RDD[T] {
  def count(): Long = {
    val results = sc.runJob(
      this,                     // 1. RDD = self
      0 until partitions.size,  // 2. Partitions = all partitions
      it => it.size             // 3. Function = size of the partition
    )
    results.sum
  }
}
Example: take(N) action
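class RDD[T] {
  def take(n: Int): Seq[T] = {
    val results = new ArrayBuffer[T]
    var partition = 0
    // Run one job per partition until enough records are collected
    while (results.size < n && partition < partitions.size) {
      results ++= sc.runJob(this, Seq(partition), it => it.toArray).flatten
      partition = partition + 1
    }
    results.take(n)
  }
}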
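The two actions differ only in which partitions and which function they hand to runJob. A minimal Python sketch, assuming an RDD is just a list of partitions and run_job is a hypothetical toy version of the simplified runJob from the slides:

```python
def run_job(partitions_data, partitions, func):
    """Toy runJob: apply func to an iterator over each requested partition."""
    return [func(iter(partitions_data[i])) for i in partitions]


def count(partitions_data):
    # All partitions; the function is "size of the partition".
    sizes = run_job(partitions_data, range(len(partitions_data)),
                    lambda it: sum(1 for _ in it))
    return sum(sizes)


def take(partitions_data, n, jobs_run):
    # One partition per job, stopping as soon as n records are collected.
    results = []
    partition = 0
    while len(results) < n and partition < len(partitions_data):
        results.extend(run_job(partitions_data, [partition], list)[0])
        jobs_run.append(partition)   # remember which partitions were scanned
        partition += 1
    return results[:n]


data = [["a", "b"], ["c"], ["d", "e", "f"]]
jobs = []
first_two = take(data, 2, jobs)
total = count(data)
```

In this sketch take(2) touches only the first partition, while count has to visit all of them.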
Putting it All Together
Named after action calling runJob
Named after last RDD in pipeline
Determinants of Performance in Spark
Quantity of Data Shuffled
In general, avoiding shuffle will make your program run faster.
1. Use the built-in aggregateByKey() operator instead of writing your own aggregations.
2. Filter input earlier in the program rather than later.
3. Go to this afternoon’s talk!
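One reason built-in aggregations help is that they can combine values inside each partition before the shuffle, like the partial sums written by stage 1 in the earlier example. The sketch below is plain Python with hypothetical function names, not Spark code; it simply counts how many records would cross the shuffle with and without that pre-aggregation.

```python
from collections import Counter


def shuffle_without_combine(partitions):
    """Every (key, 1) record is sent across the shuffle, then summed."""
    shuffled = [(word, 1) for part in partitions for word in part]
    totals = Counter()
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_with_combine(partitions):
    """Each partition pre-aggregates, so one record per key is shuffled."""
    shuffled = []
    for part in partitions:
        shuffled.extend(Counter(part).items())   # partial sums per partition
    totals = Counter()
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


levels = [["INFO", "INFO", "WARN"], ["INFO", "INFO", "INFO"], ["WARN", "INFO"]]
plain_totals, plain_records = shuffle_without_combine(levels)
combined_totals, combined_records = shuffle_with_combine(levels)
```

Both variants produce the same totals, but the pre-aggregated version shuffles at most one record per key per partition.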
Degree of Parallelism
>>> input = sc.textFile("s3n://log-files/2014/*.log.gz")  # matches thousands of files
>>> input.getNumPartitions()
35154
>>> lines = input.filter(lambda line: line.startswith("2014-10-17 08:"))  # selective
>>> lines.getNumPartitions()
35154
>>> lines = lines.coalesce(5).cache()  # We coalesce the lines RDD before caching
>>> lines.getNumPartitions()
5
>>> lines.count()  # occurs on coalesced RDD
Degree of Parallelism
If you have a huge number of mostly idle tasks (e.g. tens of thousands), then it’s often good to coalesce.
If you are not using all slots in your cluster, repartition can increase parallelism.
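Coalescing can be pictured as merging neighbouring partitions into fewer, fuller ones without redistributing records by key. The function below is a hypothetical pure-Python illustration of that idea; Spark's actual grouping policy may differ.

```python
def coalesce(partitions, num_partitions):
    """Merge neighbouring partitions into at most num_partitions groups."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    n = len(partitions)
    k = min(num_partitions, n)
    merged = []
    for i in range(k):
        start = (i * n) // k          # first parent partition in this group
        end = ((i + 1) * n) // k      # one past the last parent partition
        group = []
        for part in partitions[start:end]:
            group.extend(part)
        merged.append(group)
    return merged


# After a selective filter most partitions are empty.
sparse = [[], ["2014-10-17 08:01 GET /"], [], [],
          ["2014-10-17 08:59 GET /x"], [], []]
compact = coalesce(sparse, 2)
```

Seven mostly empty partitions become two non-empty ones, so a later action schedules two tasks instead of seven.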
Choice of Serializer
Serialization is sometimes a bottleneck when shuffling and caching data. Using the Kryo serializer is often faster.
val conf = new SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Be strict about class registration
conf.set("spark.kryo.registrationRequired", "true")
conf.registerKryoClasses(Array(classOf[MyClass], classOf[MyOtherClass]))
Cache Format
By default Spark will cache() data using the MEMORY_ONLY level (deserialized JVM objects)
MEMORY_ONLY_SER can help cut down on GC
MEMORY_AND_DISK can avoid expensive recomputations
Hardware
Spark scales horizontally, so more hardware is generally better
Disk/Memory/Network balance depends on workload: CPU intensive ML jobs vs IO intensive ETL jobs
Good to keep executor heap size to 64GB or less (you can run multiple executors on each node)
Other Performance Tweaks
Switching to LZF compression can improve shuffle performance (sacrifices some robustness for massive shuffles):
conf.set("spark.io.compression.codec", "lzf")
Turn on speculative execution to help prevent stragglers
conf.set("spark.speculation", "true")
Other Performance Tweaks
Make sure to give Spark as many disks as possible to allow striping shuffle output
SPARK_LOCAL_DIRS in Mesos/Standalone
In YARN mode, inherits YARN’s local directories
One Weird Trick for Great Performance
Use Higher-Level APIs!
DataFrame APIs for core processing
Works across Scala, Java, Python and R
Spark ML for machine learning
Spark SQL for structured query processing
See also
Chapter 8: Tuning and Debugging Spark.
Come to Spark Summit 2015!
June 15-17 in San Francisco
Other Spark Happenings Today
Spark team “Ask Us Anything” at 2:20 in 211 B
Tips for writing better Spark programs at 4:00 in 230C
I’ll be around the Databricks booth after this
Thank you.
Any questions?
Extra Slides
Internals of the RDD Interface
1) List of partitions
2) Set of dependencies on parent RDDs
3) Function to compute a partition, given parents
4) Optional partitioning info for k/v RDDs (Partitioner)
[Figure: an RDD made up of Partition 1, Partition 2, and Partition 3]
Example: Hadoop RDD
Partitions = 1 per HDFS block
Dependencies = None
compute(partition) = read corresponding HDFS block
Partitioner = None
> rdd = sc.hadoopFile("hdfs://click_logs/")
Example: Filtered RDD
Partitions = parent partitions
Dependencies = a single parent
compute(partition) = call parent.compute(partition) and filter
Partitioner = parent partitioner
> filtered = rdd.filter(lambda x: "ERROR" in x)
Example: Joined RDD
Partitions = number chosen by user or heuristics
Dependencies = ShuffleDependency on two or more parents
compute(partition) = read and join data from all parents
Partitioner = HashPartitioner(# partitions)
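The four-part interface and the Hadoop and Filtered examples above can be sketched as a small Python class hierarchy. ToyRDD, BlockRDD, and FilteredToyRDD are hypothetical names for this illustration and only loosely follow the slides; they are not Spark classes.

```python
from abc import ABC, abstractmethod


class ToyRDD(ABC):
    partitioner = None                  # 4) optional partitioning info

    @abstractmethod
    def partitions(self):               # 1) list of partition indices
        ...

    @abstractmethod
    def dependencies(self):             # 2) parent RDDs
        ...

    @abstractmethod
    def compute(self, partition):       # 3) iterator over one partition
        ...


class BlockRDD(ToyRDD):
    """Source RDD: one partition per block, no dependencies."""

    def __init__(self, blocks):
        self.blocks = blocks

    def partitions(self):
        return list(range(len(self.blocks)))

    def dependencies(self):
        return []

    def compute(self, partition):
        return iter(self.blocks[partition])


class FilteredToyRDD(ToyRDD):
    """Same partitions and partitioner as the parent; filters on compute."""

    def __init__(self, parent, pred):
        self.parent = parent
        self.pred = pred
        self.partitioner = parent.partitioner

    def partitions(self):
        return self.parent.partitions()

    def dependencies(self):
        return [self.parent]

    def compute(self, partition):
        return (x for x in self.parent.compute(partition) if self.pred(x))


logs = BlockRDD([["ERROR disk full", "INFO ok"], ["INFO ok"], ["ERROR timeout"]])
errors = FilteredToyRDD(logs, lambda line: "ERROR" in line)
error_lines = [list(errors.compute(p)) for p in errors.partitions()]
```

The filtered RDD stores no data of its own; it delegates to parent.compute(partition) and filters the result, as described in the Filtered RDD example.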
A More Complex DAG
[Figure: a DAG of Hadoop, JDBC, Mapped, Filtered, and Joined RDDs, each with two or three partitions, ending in a .count() action]
A More Complex DAG
[Figure: the DAG split into Stage 1 (2 tasks), Stage 2 (2 tasks), and Stage 3 (3 tasks), separated by a shuffle write and a shuffle read]
Narrow and Wide Transformations
[Figure: FilteredRDD (narrow): an RDD with 3 partitions and a single parent with 3 partitions. JoinedRDD (wide): an RDD with 3 partitions and two parents, Parent 1 and Parent 2, with 2 partitions each.]
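The difference between narrow and wide dependencies can be made concrete by recording which parent partitions each output partition reads. The sketch below is plain Python with hypothetical function names; key modulo the partition count is a toy stand-in for a hash partitioner.

```python
def narrow_filter(parent_partitions, pred):
    """Narrow: output partition i reads only parent partition i."""
    out = [[x for x in part if pred(x)] for part in parent_partitions]
    deps = {i: {i} for i in range(len(parent_partitions))}
    return out, deps


def wide_partition_by_key(parent_partitions, num_partitions):
    """Wide: output partition j may read from every parent partition."""
    out = [[] for _ in range(num_partitions)]
    deps = {j: set() for j in range(num_partitions)}
    for i, part in enumerate(parent_partitions):
        for key, value in part:
            j = key % num_partitions      # toy stand-in for a hash partitioner
            out[j].append((key, value))
            deps[j].add(i)                # partition j needed data from i
    return out, deps


parent = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")], [(5, "e"), (6, "f")]]
_, narrow_deps = narrow_filter(parent, lambda kv: kv[0] % 2 == 0)
by_key, wide_deps = wide_partition_by_key(parent, 2)
```

With a narrow dependency each task can run on its parent partition alone, so it can be pipelined; with a wide dependency every output partition needs data from all parent partitions, which is what forces a shuffle.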

Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
 

Recently uploaded

Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxnada99848
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 

Recently uploaded (20)

Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptx
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 

Tuning and Debugging in Apache Spark

  • 1. Tuning and Debugging in Apache Spark Patrick Wendell @pwendell February 20, 2015
  • 2. About Me: Apache Spark committer and PMC member, release manager. Worked on Spark at UC Berkeley when the project started. Today, managing Spark efforts at Databricks.
  • 3. About Databricks: Founded by the creators of Spark in 2013. Donated Spark to the ASF and remains the largest contributor. End-to-end hosted service: Databricks Cloud.
  • 4. Today’s Talk: Help you understand and debug Spark programs. Related talk this afternoon: Assumes you know Spark core API concepts; focused on internals.
  • 6. The key to tuning Spark apps is a sound grasp of Spark’s internal mechanisms.
  • 7. Key Question: How does a user program get translated into units of physical execution (jobs, stages, and tasks)?
  • 8. RDD API Refresher: An RDD is a distributed collection of records: rdd = spark.parallelize(range(10000), 10). Transformations create new RDDs from existing ones: errors = rdd.filter(lambda line: "ERROR" in line). Actions materialize a value in the user program: size = errors.count().
  • 9. RDD API Example (input.txt contains the lines "INFO Server started", "INFO Bound to port 8080", "WARN Cannot find srv.conf"):
      // Read input file
      val input = sc.textFile("input.txt")
      val tokenized = input
        .map(line => line.split(" "))
        .filter(words => words.size > 0) // remove empty lines
      val counts = tokenized // frequency of log levels
        .map(words => (words(0), 1))
        .reduceByKey((a, b) => a + b, 2)
  • 10. RDD API Example:
      // Read input file
      val input = sc.textFile( )
      val tokenized = input
        .map(line => line.split(" "))
        .filter(words => words.size > 0) // remove empty lines
      val counts = tokenized // frequency of log levels
        .map(words => (words(0), 1))
        .reduceByKey((a, b) => a + b)
  • 12. DAG View of RDDs: textFile() → map() → filter() → map() → reduceByKey(). [Diagram: Hadoop RDD (3 partitions) → Mapped RDD (3 partitions) → Filtered RDD (3 partitions) → Mapped RDD (3 partitions) → Shuffle RDD (2 partitions), labeled input, tokenized, counts.]
  • 13. Transformations build up a DAG, but don’t “do anything”.
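The point of slide 13 can be sketched with a toy model in plain Python. This is not Spark code: `LazyRDD`, `split_line`, and the `calls` log are hypothetical names used only to show that recording transformations does no work until an action runs.

```python
class LazyRDD:
    """Toy stand-in for an RDD: records a lineage, computes nothing until asked."""

    def __init__(self, source, ops=()):
        self.source = source      # any iterable of records
        self.ops = tuple(ops)     # recorded (kind, function) pairs: the lineage

    def map(self, f):
        # Transformation: returns a new object with a longer lineage.
        return LazyRDD(self.source, self.ops + (("map", f),))

    def filter(self, f):
        return LazyRDD(self.source, self.ops + (("filter", f),))

    def collect(self):
        """Action: only here are the recorded functions actually applied."""
        records = iter(self.source)
        for kind, f in self.ops:
            records = map(f, records) if kind == "map" else filter(f, records)
        return list(records)


calls = []  # side-effect log that lets us observe when work happens


def split_line(line):
    calls.append(line)
    return line.split(" ")


lines = ["INFO Server started", "INFO Bound to port 8080", "WARN Cannot find srv.conf"]
tokenized = LazyRDD(lines).map(split_line).filter(lambda words: len(words) > 0)
levels = tokenized.map(lambda words: words[0])

calls_before_action = len(calls)   # transformations alone ran nothing
result = levels.collect()          # the action triggers evaluation
calls_after_action = len(calls)
```

Before `collect()` the log is empty; afterwards it holds one entry per input line.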
  • 14. Evaluation of the DAG: We mentioned “actions” a few slides ago. Let’s forget them for a minute. DAGs are materialized through a method, sc.runJob:
      def runJob[T, U](
        rdd: RDD[T],              // 1. RDD to compute
        partitions: Seq[Int],     // 2. Which partitions
        func: (Iterator[T]) => U) // 3. Fn to produce results
  • 18. How runJob Works: runJob(counts) needs to compute the RDD’s parents, their parents, and so on, all the way back to an RDD with no dependencies (e.g. HadoopRDD). [Diagram: the lineage Hadoop RDD → Mapped RDD → Filtered RDD → Mapped RDD → Shuffle RDD, labeled input, tokenized, counts.]
  • 19. Physical Optimizations: 1. Certain types of transformations can be pipelined. 2. If dependent RDDs have already been cached (or persisted in a shuffle), the graph can be truncated. Once pipelining and truncation occur, Spark produces a set of stages; each stage is composed of tasks.
  • 23. Stage Graph: Stage 1 (Tasks 1-3, reading the input): each task will 1. read Hadoop input, 2. perform maps and filters, 3. write partial sums (shuffle write). Stage 2 (Tasks 1-2): each task will 1. read partial sums (shuffle read), 2. invoke the user function passed to runJob.
  • 24. Units of Physical Execution: Jobs: work required to compute an RDD in runJob. Stages: a wave of work within a job, corresponding to one or more pipelined RDDs. Tasks: a unit of work within a stage, corresponding to one RDD partition. Shuffle: the transfer of data between stages.
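The relationship between stages, tasks, and shuffles can be sketched as a toy planner in plain Python. This is not Spark's scheduler: `Step` and `plan_stages` are made-up names, and the model assumes a purely linear lineage in which a new stage begins at every shuffle and each stage runs one task per partition of its last RDD. For simplicity it places `reduceByKey` entirely in the second stage, whereas the slides note that the first stage also writes partial sums.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str              # transformation that produced this RDD
    num_partitions: int    # partitions in the resulting RDD
    shuffle: bool = False  # True if the step needs data from all parent partitions


def plan_stages(lineage):
    """Group pipelined steps into stages, cutting at every shuffle step."""
    stages = [[]]
    for step in lineage:
        if step.shuffle and stages[-1]:
            stages.append([])  # a shuffle boundary starts a new stage
        stages[-1].append(step)
    # One task per partition of the last RDD in each stage.
    return [
        {"steps": [s.name for s in stage], "tasks": stage[-1].num_partitions}
        for stage in stages
    ]


# The word-count lineage from the slides: 3 input partitions, 2 reduce partitions.
lineage = [
    Step("textFile", 3),
    Step("map", 3),
    Step("filter", 3),
    Step("map", 3),
    Step("reduceByKey", 2, shuffle=True),
]
stages = plan_stages(lineage)
for i, stage in enumerate(stages, start=1):
    print(f"Stage {i}: {stage['tasks']} tasks, pipeline = {stage['steps']}")
```

For this lineage the sketch yields two stages with three and two tasks, matching the stage graph on slide 23.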
  • 25. Seeing this on your own (indentations indicate a shuffle boundary):
      scala> counts.toDebugString
      res84: String =
      (2) ShuffledRDD[296] at reduceByKey at <console>:17
       +-(3) MappedRDD[295] at map at <console>:17
          |  FilteredRDD[294] at filter at <console>:15
          |  MappedRDD[293] at map at <console>:15
          |  input.text MappedRDD[292] at textFile at <console>:13
          |  input.text HadoopRDD[291] at textFile at <console>:13
  • 26. Example: count() action
      class RDD {
        def count(): Long = {
          val results = sc.runJob(
            this,                               // 1. RDD = self
            0 until partitions.size,            // 2. Partitions = all partitions
            (it: Iterator[T]) => it.size.toLong // 3. Function = size of the partition
          )
          results.sum
        }
      }
  • 27. Example: take(N) action
      class RDD {
        def take(n: Int): Seq[T] = {
          val results = new ArrayBuffer[T]
          var partition = 0
          // Scan one partition at a time; stop when n records are found or partitions run out
          while (results.size < n && partition < partitions.size) {
            results ++= sc.runJob(this, Seq(partition), (it: Iterator[T]) => it.toArray).head
            partition = partition + 1
          }
          results.take(n)
        }
      }
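The way count() and take(N) sit on top of runJob can be tried out with a small plain-Python model. This is a sketch, not Spark's implementation: `ToyRDD`, `run_job`, and the `partitions_computed` counter are hypothetical, and each partition is simply an in-memory list.

```python
class ToyRDD:
    """Toy RDD whose partitions are in-memory lists."""

    def __init__(self, partitions):
        self.partitions = partitions   # list of lists, one per partition
        self.partitions_computed = 0   # how many partitions have been evaluated

    def run_job(self, partition_ids, func):
        """Apply func to an iterator over each requested partition."""
        results = []
        for pid in partition_ids:
            self.partitions_computed += 1
            results.append(func(iter(self.partitions[pid])))
        return results

    def count(self):
        # One partial result per partition, summed on the "driver".
        sizes = self.run_job(range(len(self.partitions)), lambda it: sum(1 for _ in it))
        return sum(sizes)

    def take(self, n):
        # Scan one partition at a time and stop as soon as n records are found.
        results = []
        pid = 0
        while len(results) < n and pid < len(self.partitions):
            results.extend(self.run_job([pid], list)[0])
            pid += 1
        return results[:n]


rdd = ToyRDD([[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]])
total = rdd.count()                        # touches all 3 partitions
rdd.partitions_computed = 0
first_two = rdd.take(2)                    # satisfied by the first partition alone
partitions_for_take = rdd.partitions_computed
```

In this model `count()` evaluates every partition, while `take(2)` stops after the first one, which is why a small take can be much cheaper than a full count.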
  • 28. Putting it All Together: in the screenshot, a job is named after the action calling runJob, and a stage is named after the last RDD in its pipeline.
  • 30. Quantity of Data Shuffled: In general, avoiding shuffles will make your program run faster. 1. Use the built-in aggregateByKey() operator instead of writing your own aggregations. 2. Filter input earlier in the program rather than later. 3. Go to this afternoon’s talk!
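One reason built-in operators such as aggregateByKey() can shuffle less data is that they combine values per key inside each partition before anything is sent. The plain-Python sketch below illustrates that idea on toy data. The function names are hypothetical and no Spark API is involved.

```python
from collections import Counter


def shuffle_without_combine(partitions):
    """Every (key, 1) record crosses the shuffle boundary."""
    return [(word, 1) for part in partitions for word in part]


def shuffle_with_combine(partitions):
    """Each partition pre-aggregates, so it sends at most one record per key."""
    shuffled = []
    for part in partitions:
        shuffled.extend(Counter(part).items())
    return shuffled


def reduce_side(records):
    """Sum the values received for each key after the shuffle."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)


partitions = [
    ["INFO", "INFO", "WARN", "INFO"],
    ["INFO", "ERROR", "INFO", "INFO"],
]
plain = shuffle_without_combine(partitions)
combined = shuffle_with_combine(partitions)

print(len(plain), "records shuffled without combining")
print(len(combined), "records shuffled with map-side combining")
```

Both paths produce the same totals on the reduce side, but the combined path moves half as many records in this example.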
  • 31. Degree of Parallelism:
      > input = sc.textFile("s3n://log-files/2014/*.log.gz")  # matches thousands of files
      > input.getNumPartitions()
      35154
      > lines = input.filter(lambda line: line.startswith("2014-10-17 08:"))  # selective
      > lines.getNumPartitions()
      35154
      > lines = lines.coalesce(5).cache()  # We coalesce the lines RDD before caching
      > lines.getNumPartitions()
      5
      > lines.count()  # occurs on coalesced RDD
  • 32. Degree of Parallelism: If you have a huge number of mostly idle tasks (e.g. tens of thousands), it is often good to coalesce. If you are not using all slots in your cluster, repartition can increase parallelism.
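The effect of coalescing after a selective filter can be sketched in plain Python. The `coalesce` function here is a hypothetical stand-in that merges neighbouring partitions into contiguous groups. That grouping rule is an assumption made for illustration, and Spark's own coalesce may group partitions differently (for example, by taking data locality into account).

```python
def coalesce(partitions, num_partitions):
    """Merge neighbouring partitions into num_partitions groups without re-keying records."""
    n = len(partitions)
    if num_partitions >= n:
        return partitions
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # Partition i goes to group i * num_partitions // n (contiguous ranges).
        merged[i * num_partitions // n].extend(part)
    return merged


# 1,000 partitions after a selective filter: only every 100th still holds a record.
sparse = [[f"line-{i}"] if i % 100 == 0 else [] for i in range(1000)]
dense = coalesce(sparse, 5)

empty_before = sum(1 for part in sparse if not part)
sizes_after = [len(part) for part in dense]
```

Before coalescing, 990 of the 1,000 tasks would have nothing to do. Afterwards, five tasks each handle two records.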
  • 33. Choice of Serializer: Serialization is sometimes a bottleneck when shuffling and caching data. Using the Kryo serializer is often faster.
      val conf = new SparkConf()
      conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Be strict about class registration
      conf.set("spark.kryo.registrationRequired", "true")
      conf.registerKryoClasses(Array(classOf[MyClass], classOf[MyOtherClass]))
  • 34. Cache Format: By default Spark will cache() data using the MEMORY_ONLY level (deserialized JVM objects). MEMORY_ONLY_SER can help cut down on GC. MEMORY_AND_DISK can avoid expensive recomputations.
  • 35. Hardware: Spark scales horizontally, so more is better. Disk/memory/network balance depends on workload: CPU-intensive ML jobs vs. IO-intensive ETL jobs. Good to keep executor heap size to 64 GB or less (can run multiple executors on each node).
  • 36. Other Performance Tweaks: Switching to LZF compression can improve shuffle performance (sacrifices some robustness for massive shuffles): conf.set("spark.io.compression.codec", "lzf"). Turn on speculative execution to help prevent stragglers: conf.set("spark.speculation", "true").
  • 37. Other Performance Tweaks: Make sure to give Spark as many disks as possible to allow striping of shuffle output. Set SPARK_LOCAL_DIRS in Mesos/Standalone mode. In YARN mode, Spark inherits YARN’s local directories.
  • 38. One Weird Trick for Great Performance
  • 39. Use Higher-Level APIs! DataFrame APIs for core processing (works across Scala, Java, Python and R). Spark ML for machine learning. Spark SQL for structured query processing.
  • 40. See also Chapter 8: Tuning and Debugging Spark.
  • 41. Come to Spark Summit 2015! June 15-17 in San Francisco.
  • 42. Other Spark Happenings Today: Spark team “Ask Us Anything” at 2:20 in 211 B. Tips for writing better Spark programs at 4:00 in 230C. I’ll be around the Databricks booth after this.
  • 45. Internals of the RDD Interface: 1) List of partitions. 2) Set of dependencies on parent RDDs. 3) Function to compute a partition, given its parents. 4) Optional partitioning info for k/v RDDs (Partitioner). [Diagram: an RDD with Partitions 1-3.]
  • 46. Example: Hadoop RDD. Partitions = one per HDFS block. Dependencies = none. compute(partition) = read corresponding HDFS block. Partitioner = none. > rdd = spark.hadoopFile("hdfs://click_logs/")
  • 47. Example: Filtered RDD. Partitions = parent partitions. Dependencies = a single parent. compute(partition) = call parent.compute(partition) and filter. Partitioner = parent partitioner. > filtered = rdd.filter(lambda x: "ERROR" in x)
  • 48. Example: Joined RDD. Partitions = number chosen by user or heuristics. Dependencies = ShuffleDependency on two or more parents. compute(partition) = read and join data from all parents. Partitioner = HashPartitioner(# partitions).
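The Joined RDD description (hash-partition both parents, then join partition by partition) can be sketched in plain Python. This is a toy model, not Spark code: `hash_partition` and `join` are hypothetical helpers, and `zlib.crc32` is used only as a stable hash for the illustration, which is not a claim about how Spark's HashPartitioner hashes keys.

```python
import zlib


def hash_partition(records, num_partitions):
    """Place each (key, value) record in the partition chosen by a stable hash of the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[zlib.crc32(key.encode("utf-8")) % num_partitions].append((key, value))
    return parts


def join(left, right, num_partitions):
    left_parts = hash_partition(left, num_partitions)    # "shuffle" of parent 1
    right_parts = hash_partition(right, num_partitions)  # "shuffle" of parent 2
    joined = []
    for lpart, rpart in zip(left_parts, right_parts):
        # Equal keys hash to the same index, so each output partition
        # only needs the matching partition from each parent.
        lookup = {}
        for key, value in rpart:
            lookup.setdefault(key, []).append(value)
        joined.append([(k, (lv, rv)) for k, lv in lpart for rv in lookup.get(k, [])])
    return joined


clicks = [("alice", "/home"), ("bob", "/cart"), ("alice", "/checkout")]
users = [("alice", "US"), ("bob", "DE"), ("carol", "FR")]
joined = join(clicks, users, num_partitions=3)
flat = sorted(record for part in joined for record in part)
```

Every record of both parents may land in any output partition, which is what makes the join a wide (shuffle) dependency rather than a narrow one.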
  • 49. A More Complex DAG: [Diagram: Hadoop RDD, Mapped RDD, Filtered RDD, and JDBC RDD (2 partitions each); Joined RDD and a second Filtered RDD (3 partitions each); ending in .count().]
  • 50. A More Complex DAG: [Diagram: Stage 1 (2 tasks), Stage 2 (2 tasks), Stage 3 (3 tasks); labels: Shuffle Write, Shuffle Read.]
  • 51. Narrow and Wide Transformations: [Diagram: FilteredRDD, an RDD with 3 partitions and one parent with 3 partitions; JoinedRDD, an RDD with 3 partitions and two parents with 2 partitions each.]