Tuning and Debugging in Apache Spark
Patrick Wendell @pwendell
February 20, 2015
About Me
Apache Spark committer and PMC, release manager
Worked on Spark at UC Berkeley when the project started
Today, managing Spark efforts at Databricks
About Databricks
Founded by creators of Spark in 2013
Donated Spark to ASF and remain largest contributor
End-to-End hosted service: Databricks Cloud
Today’s Talk
Help you understand and debug Spark programs
Assumes you know Spark core API concepts; focused on internals
Spark’s Execution Model
The key to tuning Spark apps is a sound grasp of Spark’s internal mechanisms.
Key Question
How does a user program get translated into units of physical execution: jobs, stages, and tasks?
RDD API Refresher
RDDs are a distributed collection of records
rdd = spark.parallelize(range(10000), 10)
Transformations create new RDDs from existing ones
errors = rdd.filter(lambda line: "ERROR" in line)
Actions materialize a value in the user program
size = errors.count()
RDD API Example
// Read input file
val input = sc.textFile("input.txt")
val tokenized = input
  .map(line => line.split(" "))
  .filter(words => words.size > 0) // remove empty lines
val counts = tokenized // frequency of log levels
  .map(words => (words(0), 1))
  .reduceByKey((a, b) => a + b, 2) // 2 output partitions
input.txt:
INFO Server started
INFO Bound to port 8080
WARN Cannot find srv.conf
Transformations
sc.textFile().map().filter().map().reduceByKey()
DAG View of RDDs
textFile() → map() → filter() → map() → reduceByKey()
[Diagram: Hadoop RDD → Mapped RDD (input) → Filtered RDD (tokenized) → Mapped RDD → Shuffle RDD (counts); each RDD has 3 partitions, except the Shuffle RDD, which has 2.]
Transformations build up a DAG, but don’t “do anything”
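To see the laziness concretely, a minimal sketch (assuming a live SparkContext sc): the side effect inside the map function fires only once an action runs.

val nums = sc.parallelize(1 to 5)
val doubled = nums.map { x => println(s"computing $x"); x * 2 }
// Nothing has printed yet: map() only extended the DAG.
doubled.count() // Now the tasks run and the printlns fire (on the executors).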
Evaluation of the DAG
We mentioned "actions" a few slides ago. Let's forget them for a minute.
DAGs are materialized through a method sc.runJob:

def runJob[T, U](
  rdd: RDD[T],            // 1. RDD to compute
  partitions: Seq[Int],   // 2. Which partitions
  func: Iterator[T] => U  // 3. Fn to produce results
): Array[U]               // → results for each partition
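Actions are thin wrappers around this method. As an illustration, a hypothetical direct call, using the real two-argument overload that runs over all partitions (the three-argument form above is a simplification of the actual source):

val perPartitionCounts: Array[Int] =
  sc.runJob(tokenized, (it: Iterator[Array[String]]) => it.size)
// One entry per partition of tokenized: the number of records in it.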
How runJob Works
Needs to compute its parents, its parents' parents, and so on, all the way back to an RDD with no dependencies (e.g. HadoopRDD).

[Diagram: runJob(counts) walks the DAG backwards from the Shuffle RDD (counts) through the Mapped and Filtered RDDs (tokenized, input) to the Hadoop RDD.]
Physical Optimizations
1. Certain types of transformations can be pipelined.
2. If dependent RDDs have already been cached (or persisted in a shuffle), the graph can be truncated.
Once pipelining and truncation occur, Spark produces a set of stages; each stage is composed of tasks.
Stage Graph
[Diagram: Stage 1 (Task 1, Task 2, Task 3) shuffle-writes to Stage 2 (Task 1, Task 2).]
Each Stage 1 task will:
1. Read Hadoop input
2. Perform maps and filters
3. Write partial sums (shuffle write)
Each Stage 2 task will:
1. Read partial sums (shuffle read)
2. Invoke the user function passed to runJob.
Units of Physical Execution
Jobs: The work required to compute the RDD passed to runJob.
Stages: A wave of work within a job, corresponding to one or more pipelined RDDs.
Tasks: A unit of work within a stage, corresponding to one RDD partition.
Shuffle: The transfer of data between stages.
Seeing this on your own
scala> counts.toDebugString
res84: String =
(2) ShuffledRDD[296] at reduceByKey at <console>:17
+-(3) MappedRDD[295] at map at <console>:17
| FilteredRDD[294] at filter at <console>:15
| MappedRDD[293] at map at <console>:15
| input.text MappedRDD[292] at textFile at <console>:13
| input.text HadoopRDD[291] at textFile at <console>:13
(indentations indicate a shuffle boundary)
Example: count() action
class RDD[T] {
  def count(): Long = {
    val results = sc.runJob(
      this,                               // 1. RDD = self
      0 until partitions.size,            // 2. Partitions = all partitions
      (it: Iterator[T]) => it.size.toLong // 3. Function = size of the partition
    )
    results.sum
  }
}
Example: take(N) action
class RDD[T] {
  def take(n: Int): Array[T] = {
    val results = new ArrayBuffer[T]
    var partition = 0
    while (results.size < n) {
      // Run a job on one partition at a time until we have n results
      results ++= sc.runJob(this, Seq(partition), (it: Iterator[T]) => it.toArray).flatten
      partition = partition + 1
    }
    results.take(n).toArray
  }
}
Putting it All Together
[Spark UI screenshot: jobs are named after the action calling runJob; stages are named after the last RDD in their pipeline.]
Determinants of Performance in Spark
Quantity of Data Shuffled
In general, avoiding shuffle will make your program run faster.
1. Use the built-in aggregateByKey() operator instead of writing your own aggregations (see the sketch below).
2. Filter input earlier in the program rather than later.
3. Go to this afternoon's talk!
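A sketch of point 1 with hypothetical data (assuming a live sc): a per-key average via aggregateByKey, which combines values map-side before the shuffle instead of shipping every record.

val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))
val sumCounts = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1), // fold a value into (sum, count)
  (x, y) => (x._1 + y._1, x._2 + y._2)  // merge accumulators across partitions
)
val averages = sumCounts.mapValues { case (sum, n) => sum.toDouble / n }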
Degree of Parallelism
> input = sc.textFile("s3n://log-files/2014/*.log.gz") #matches thousands of files
> input.getNumPartitions()
35154
> lines = input.filter(lambda line: line.startswith("2014-10-17 08:")) # selective
> lines.getNumPartitions()
35154
> lines = lines.coalesce(5).cache() # We coalesce the lines RDD before caching
> lines.getNumPartitions()
5
> lines.count() # occurs on coalesced RDD
Degree of Parallelism
If you have a huge number of mostly idle tasks (e.g. tens of thousands), then it's often good to coalesce.
If you are not using all slots in your cluster, repartition can increase parallelism.
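A minimal sketch of both directions (paths and partition counts hypothetical):

val logs = sc.textFile("hdfs://logs/2014/*.gz") // one partition per input split
val narrower = logs.coalesce(10)  // merges mostly idle partitions, no shuffle
val wider = logs.repartition(200) // full shuffle, raises parallelism to fill slots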
Choice of Serializer
Serialization is sometimes a bottleneck when shuffling and caching data. Using the Kryo serializer is often faster.

val conf = new SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Be strict about class registration
conf.set("spark.kryo.registrationRequired", "true")
conf.registerKryoClasses(Array(classOf[MyClass], classOf[MyOtherClass]))
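The configured conf is then passed when constructing the context (the usual pattern), after which shuffled and cached records of the registered classes go through Kryo:

import org.apache.spark.SparkContext

val sc = new SparkContext(conf) // Kryo now applies to shuffle and cache serialization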
Cache Format
By default, Spark will cache() data using the MEMORY_ONLY level: deserialized JVM objects.
MEMORY_ONLY_SER can help cut down on GC.
MEMORY_AND_DISK can avoid expensive recomputations.
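A short sketch of selecting these levels explicitly (assuming a live sc; cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)):

import org.apache.spark.storage.StorageLevel

val parsed = sc.textFile("input.txt").map(_.split(" "))
parsed.persist(StorageLevel.MEMORY_ONLY_SER) // serialized in memory: less GC pressure
// or: parsed.persist(StorageLevel.MEMORY_AND_DISK) // spill instead of recomputing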
Hardware
Spark scales horizontally, so more is better.
Disk/memory/network balance depends on workload: CPU-intensive ML jobs vs. IO-intensive ETL jobs.
Good to keep executor heap size to 64GB or less (you can run multiple executors on each node).
Other Performance Tweaks
Switching to LZF compression can improve shuffle performance (sacrifices some robustness for massive shuffles):
conf.set("spark.io.compression.codec", "lzf")
Turn on speculative execution to help prevent stragglers:
conf.set("spark.speculation", "true")
Other Performance Tweaks
Make sure to give Spark as many disks as possible to allow striping shuffle output.
SPARK_LOCAL_DIRS in Mesos/Standalone.
In YARN mode, Spark inherits YARN's local directories.
One Weird Trick for Great Performance
Use Higher Level APIs!
DataFrame API for core processing
Works across Scala, Java, Python and R
Spark ML for machine learning
Spark SQL for structured query processing
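As a sketch (assuming Spark 1.3+, where the DataFrame API landed, and reusing the hypothetical input.txt): the earlier log-level count expressed declaratively, leaving the physical plan to the engine.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val levels = sc.textFile("input.txt")
  .map(line => Tuple1(line.split(" ")(0)))
  .toDF("level") // one-column DataFrame of log levels
val levelCounts = levels.groupBy("level").count()
levelCounts.show()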
See also Chapter 8: Tuning and Debugging Spark.
Come to Spark Summit 2015!
June 15-17 in San Francisco
Thank you.
Any questions?
Extra Slides
Internals of the RDD Interface
1) List of partitions
2) Set of dependencies on parent RDDs
3) Function to compute a partition, given parents
4) Optional partitioning info for k/v RDDs (Partitioner)
[Diagram: an RDD drawn as a box containing Partition 1, Partition 2, Partition 3.]
Example: Hadoop RDD
Partitions = 1 per HDFS block
Dependencies = None
compute(partition) = read corresponding HDFS block
Partitioner = None
> rdd = spark.hadoopFile("hdfs://click_logs/")
Example: Filtered RDD
Partitions = same as parent
Dependencies = a single parent
compute(partition) = call parent.compute(partition) and filter
Partitioner = parent's partitioner
> filtered = rdd.filter(lambda x: "ERROR" in x)
Example: Joined RDD
Partitions = number chosen by user or heuristics
Dependencies = ShuffleDependency on two or more parents
compute(partition) = read and join data from all parents
Partitioner = HashPartitioner(# partitions)
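A sketch of a join that yields such an RDD (hypothetical data, assuming a live sc):

val clicks = sc.parallelize(Seq((1, "click"), (2, "click")))
val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val joined = clicks.join(users, 4) // ShuffleDependency on both parents; HashPartitioner(4)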
A More Complex DAG
[Diagram: a Hadoop RDD and a JDBC RDD (2 partitions each) pass through narrow transformations (a Mapped RDD and a Filtered RDD) and are joined into a Joined RDD (3 partitions); a final Filtered RDD (3 partitions) feeds .count().]
A More Complex DAG
[Diagram: the same DAG split into stages. Stage 1 (Task 1, Task 2) and Stage 2 (Task 1, Task 2) each end in a shuffle write; Stage 3 (Task 1, Task 2, Task 3) begins with a shuffle read and performs the join and count.]
Narrow and Wide Transformations
[Diagram: FilteredRDD is narrow: each of its 3 partitions depends on exactly one partition of its parent. JoinedRDD is wide: each of its 3 partitions depends on all partitions of both Parent 1 and Parent 2.]
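A quick way to see the distinction from the API (a sketch assuming a live sc): inspect rdd.dependencies.

val parent = sc.parallelize(1 to 100, 4)
val narrow = parent.filter(_ % 2 == 0)                     // one-to-one: narrow
val wide = parent.map(x => (x % 10, x)).reduceByKey(_ + _) // shuffle: wide
println(narrow.dependencies) // e.g. List(org.apache.spark.OneToOneDependency@...)
println(wide.dependencies)   // e.g. List(org.apache.spark.ShuffleDependency@...)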
