Spark is already well-established in the world of Big Data, but when working with clients I have noticed a lack of understanding of how it works beyond its very nice API. This talk aims to give the audience a practical framework for diagnosing and fixing the most common performance problems in Spark applications.
@mszymani #Devoxx #TantusData #spark
Sizing executors
Executor: a JVM process able to run one or more tasks.
Example config: --executor-cores 3 --executor-memory 10g
[Diagram: a pool of executors picking up pending tasks and reporting completed tasks]
[Diagram: the same executor pool, with the driver coordinating the executors and tracking pending and completed tasks]
Sizing executors
[Diagram: the same total resources can be laid out differently across cluster nodes - one large executor per node vs several smaller executors per node - and the possible layouts are compared side by side]
Sizing executors
• Spark can benefit from running multiple tasks in the same JVM
• Too many cores per executor leads to problems: HDFS client I/O contention, long GC pauses
• 1-4 cores per executor is a good starting point
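A sizing along these lines might look like the following command-line sketch. The numbers, class name, and jar are illustrative assumptions, not a recommendation:

```shell
# Illustrative sizing: on a 16-core / 64 GB node, keep some cores and
# memory aside for the OS and HDFS/YARN daemons, then split the rest
# into executors that stay within the 1-4 cores guideline.
spark-submit \
  --num-executors 3 \
  --executor-cores 4 \
  --executor-memory 18g \
  --class com.example.MyJob \
  my-job.jar
```

The right split depends on the workload; the point is to compare a few layouts rather than assume one big executor per node is best.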
We are up & running
What to watch out for:
• Failing tasks
• GC heavy tasks
• Shuffle read/write sizes
• Long running tasks
We are up & running
• Too much data per task?
• Is the parallelism level high enough?
• Does the executor need more memory?
def groupByKey(numPartitions: Int)
def repartition(numPartitions: Int)
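When individual tasks process too much data, raising the number of partitions spreads the work out. A minimal sketch of both signatures in use, assuming an existing `SparkContext` named `sc` (the input path and partition counts are made up for illustration):

```scala
// Hypothetical input path; with too few partitions, each task gets too much data.
val events = sc.textFile("hdfs:///data/events")

// repartition(numPartitions) shuffles the data into more partitions,
// raising the parallelism level for everything downstream:
val spread = events.repartition(200)

// Shuffle operations also accept a partition count directly, which
// avoids a separate repartition step:
val pairs  = spread.map(line => (line.split(",")(0), 1))
val counts = pairs.reduceByKey(_ + _, 200)
```

Note that `repartition` is itself a shuffle, so passing `numPartitions` to the shuffle operation you already need is usually cheaper than repartitioning first.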
Caching
• A branch in the execution plan (an RDD consumed by more than one downstream computation) is a candidate for caching
• The Spark UI shows how much memory each cached RDD is taking
• You cannot control eviction priority - it's LRU
• Don't cache to disk if recomputation is cheap
• Caching with a replication factor - only when recreating the data is extremely costly
• Checkpointing (which truncates the lineage) vs caching (which keeps it)
• Shuffle data is automatically persisted
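A sketch of caching a branching RDD, assuming an existing `SparkContext` named `sc`; the path and log format are made up for illustration:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical input; `parsed` is consumed by two downstream actions,
// which makes it a branch in the plan and a candidate for caching.
val parsed = sc.textFile("hdfs:///data/logs").map(_.split("\t"))
parsed.persist(StorageLevel.MEMORY_ONLY)   // cache() is shorthand for this

val errors   = parsed.filter(_(1) == "ERROR").count()
val warnings = parsed.filter(_(1) == "WARN").count()

// A replicated level such as MEMORY_ONLY_2 keeps two copies per
// partition - only worth it when recomputation is extremely costly:
// parsed.persist(StorageLevel.MEMORY_ONLY_2)

parsed.unpersist()   // release the memory once both branches are done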
Optimize shuffle - recap
• Control number of partitions
• Use mapValues instead of map if you can
• Broadcast variables
• Filter before shuffle
• Avoid groupByKey, use reduceByKey
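The recap points can be sketched on a small pair RDD, assuming an existing `SparkContext` named `sc`; the data and the lookup table are made up for illustration:

```scala
val sales = sc.parallelize(Seq(("apples", 3), ("pears", 5), ("apples", 2)))

// Filter before the shuffle so less data crosses the network:
val nonZero = sales.filter { case (_, qty) => qty > 0 }

// reduceByKey combines values on the map side before shuffling;
// groupByKey would ship every individual value across the network first:
val totals = nonZero.reduceByKey(_ + _)

// mapValues leaves the keys (and therefore the partitioner) untouched,
// so a later shuffle on the same keys can be skipped; map would not:
val doubled = totals.mapValues(_ * 2)

// Broadcast a small lookup table instead of joining with a shuffle:
val prices  = sc.broadcast(Map("apples" -> 2.0, "pears" -> 1.5))
val revenue = doubled.map { case (fruit, qty) =>
  (fruit, qty * prices.value.getOrElse(fruit, 0.0))
}
```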
General notes
• Tests vs no tests? - Test your code!
• You are probably not the only user of the cluster
• What are you optimizing for?
• Share the knowledge
• Spark actually works :)
goo.gl/7eTtvH