Presentation from the Boulder/Denver Big Data Meetup on 2/20/2020 in Boulder, CO. Topics covered: troubleshooting Spark jobs (groupBy, shuffle) for big data, tuning AWS EMR Spark clusters, EMR cluster resource utilization, and writing scalable Scala for scanning S3 metadata.
[4DEV][Łódź] Ivan Vaskevych - InfluxDB and Grafana fighting together with IoT...PROIDEA
They promise that IoT (Internet of Things) will conquer the world. But what will tackle billions of bytes that flow into our servers every hour?
First released in 2013, InfluxDB is used by eBay, Cisco, IBM, and other big companies. It's production-proven time-series storage.
During this talk we're going to get acquainted with it and see how InfluxDB can help to solve your problems.
We’ll see how to quickly install it on Amazon Web Services platform and how it scales.
And for the dessert, we’re going to draw pretty Grafana graphs using InfluxDB data.
Paul Dix (Founder of InfluxDB) - Organising Metrics at #DOXLON (Outlyer)
Video:
Paul Dix (Founder of InfluxDB) talking about his awesome Open-Source projects for monitoring.
For more info visit: InfluxDB: www.influxdb.com
Join DevOps Exchange London here: http://www.meetup.com/DevOps-Exchange-London/
Follow DOXLON on twitter: twitter.com/doxlon
Biblia Hebraica Stuttgartensia Amstelodamensis. Coding the Hebrew Bible with an Open Science ethos: Text-Fabric.
Text-Fabric is several things: (1) a browser for ancient text corpora; (2) a Python3 package for processing ancient corpora
A corpus of ancient texts and linguistic annotations represents a large body of knowledge. Text-Fabric makes that knowledge accessible to non-programmers by means of a built-in search interface that runs in your browser.
From there, the step to programming your own analytics is not so big anymore, because you can call the Text-Fabric API from your Python programs, and it works really well in Jupyter notebooks.
Ali Asad Lotia (DevOps at Beamly) - Riemann Stream Processing at #DOXLONOutlyer
Video: http://youtu.be/a1r2bpGQbBQ
A talk about using Riemann for stream processing, currently the on-premises tool of choice for metrics aficionados. Ali will talk about how they are using it to process the metrics coming out of their cloud service.
For more info see : http://riemann.io
Join DevOps Exchange London here: http://www.meetup.com/DevOps-Exchange-London/
Follow DOXLON on twitter: twitter.com/doxlon
First impressions of SparkR: our own machine learning algorithm (InfoFarm)
In June 2015, SparkR was first integrated into Spark. At InfoFarm we strive to stay on top of new technologies, hence we have tried it out and implemented a few machine learning algorithms as well.
How LogicMonitor manages resources in AWS using Terraform to provide a reliable, repeatable way to both naturally grow our infrastructure and provide disaster recovery solutions.
Strata NYC 2015 - Supercharging R with Apache Spark (Databricks)
R is the favorite language of many data scientists. In addition to being a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases, from statistical inference to data visualization. However, handling large or distributed data with R is challenging, so most data scientists use R along with other frameworks and languages. In this mode, most of the friction is at the interface of R and the other systems. For example, when data is sampled by a big data platform, results need to be transferred to and imported into R as native data structures. In this talk we show an alternative, and complementary, approach to SparkR for integrating Spark and R.
Since SparkR was released in version 1.4 of Apache Spark, distributed data remains inside the JVM instead of in individual R processes running on the workers. This approach is more convenient when dealing with external data sources such as Cassandra, Hive, and Spark's own distributed DataFrames. We show two specific techniques to remove the data transfer friction between R and the JVM: collecting Spark DataFrames as R data frames, and user-space filesystems. We think this model complements and improves the day-to-day workload of many data scientists who use R. Spark's interactive query processing, especially with in-memory datasets, closely matches the R interactive session model. When integrated together, Spark and R can provide state-of-the-art tools for the entire end-to-end data science pipeline. We will show how such a pipeline works in real-world use cases in a live demo at the end of the talk.
Big Data Beyond the JVM - Strata San Jose 2018 (Holden Karau)
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to work with Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
Spark Summit EU 2015: Lessons from 300+ production users (Databricks)
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's... (Glenn K. Lockwood)
Comparing the burst buffers of today, such as the Cray DataWarp-based burst buffer implemented on NERSC Cori, to the proto-burst buffer deployed on SDSC's Gordon supercomputer in 2012.
Operating and Supporting Delta Lake in Production (Databricks)
Delta Lake is widely adopted, but there are things to be aware of when dealing with petabytes of data in Delta Lake. Smart decisions here give the best efficiency and increase the adoption of Delta. Best practices like OPTIMIZE and ZORDER have to be chosen wisely. We have support stories where we successfully resolved performance issues by applying the right performance strategy. There is a set of common issues and repeated questions that our strategic customers face when using Delta, and in this session we cover them and how to address them.
Building Apache Cassandra clusters for massive scale (Alex Thompson)
Covering theory and operational aspects of bringing up Apache Cassandra clusters, this presentation can be used as a field reference. Presented by Alex Thompson at the Sydney Cassandra Meetup.
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa... (Databricks)
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala
Talk given by Reynold Xin at Scala Days SF 2015
In this talk, Reynold covers the underlying techniques used to achieve high-performance sorting with Spark and Scala, including sun.misc.Unsafe, exploiting cache locality, and high-level resource pipelining.
Headaches and Breakthroughs in Building Continuous Applications (Databricks)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap... (Landon Robinson)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Presented by Landon Robinson and Jack Chapa
An over-ambitious introduction to Spark programming, testing, and deployment. This slide deck tries to cover most core technologies and design patterns used in SpookyStuff, the fastest query engine for data collection/mashup from the deep web.
For more information please follow: https://github.com/tribbloid/spookystuff
A bug in PowerPoint used to prevent transparent background colors from rendering properly; this has been fixed in a recent upload.
Breakthrough OLAP performance with Cassandra and Spark (Evan Chan)
Find out about breakthrough architectures for fast OLAP performance querying Cassandra data with Apache Spark, including a new open source project, FiloDB.
A comprehensive introduction to the big data world in the AWS cloud: Hadoop, streaming, batch, Kinesis, DynamoDB, HBase, EMR, Athena, Hive, Spark, Pig, Impala, Oozie, Data Pipeline, security, cost, and best practices.
Introduction to data processing using Hadoop and Pig (Ricardo Varela)
In this talk we give an introduction to data processing with big data and review the basic concepts of MapReduce programming with Hadoop. We also comment on the use of Pig to simplify the development of data processing applications.
YDN Tuesdays are geek meetups organized on the first Tuesday of each month by YDN in London.
Making the big data ecosystem work together with Python & Apache Arrow, Apach... (Holden Karau)
Slides from PyData London exploring how the big data ecosystem (currently) works together as well as how different parts of the ecosystem work with Python. Proof-of-concept examples are provided using nltk & spacy with Spark. Then we look to the future and how we can improve.
1. Spark Gotchas and Lessons Learned
Jen Waller, Ph.D.
Boulder/Denver Big Data Meetup
Feb 20, 2020
Boulder, CO
2. Overview
● Overall Dev Approach
● Useful Spark Built-Ins
● How to Fail at Scale
● Resource Utilization
3. “Strategery”
● Local machine; simulated cluster
● Spark-shell/spark-submit
● Tiny subset of data (even better: TDD w/ programmatically generated data!)
● Real cluster
● Start tiny: test functions/configs specific to the cloud
● Bigger cluster for load testing
● Spark-shell = handy for quick iteration on manual cluster configs, load testing one function at a time
5. What about notebooks?
By Sam Shere (1905–1982) - Zeppelin-ramp de Hindenburg / Hindenburg
zeppelin disaster, Public Domain,
https://commons.wikimedia.org/w/index.php?curid=19329337
7. Spark UI & Spark History Server
● Can access anywhere (local, cloud)
● Jobs/tasks, execution plans, memory usage, configs
● Maximizing utility of metrics data
○ Set labels for task groups and jobs using sparkContext
○ Break jobs and tasks apart by repartitioning, even dumping to disk
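The labeling tip above can be sketched as follows; the group name and description are hypothetical labels, and a local-mode session stands in for a real cluster:

```scala
import org.apache.spark.sql.SparkSession

object JobLabelDemo {
  // Run a small job under a named group so it is easy to find in the Spark UI.
  def labeledSum(spark: SparkSession): Double = {
    val sc = spark.sparkContext
    // "nightly-etl" and the description are made-up labels; use ones that match your pipeline
    sc.setJobGroup("nightly-etl", "Aggregate daily events", interruptOnCancel = true)
    val total = sc.parallelize(1 to 100).sum()
    sc.clearJobGroup() // stop labeling subsequent jobs
    total
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("job-label-demo").getOrCreate()
    println(labeledSum(spark)) // 5050.0
    spark.stop()
  }
}
```

Any job submitted between `setJobGroup` and `clearJobGroup` shows up in the UI under that group, which makes the Jobs tab far easier to read for multi-stage pipelines.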
8. REST API & Metrics Sink(s)
● REST API
○ curl http://localhost:4040/api/v1/applications
● Can configure a set of sinks for:
○ Master, applications, worker, executor, driver, shuffleService, applicationMaster (YARN)
● And send metrics to:
○ Console, CSV file, JMX console, within the Spark UI as JSON, Graphite node, slf4j, StatsD node
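The sinks above are wired up in conf/metrics.properties. A minimal sketch using the CSV sink; the 10-second period and the output directory are arbitrary choices:

```properties
# conf/metrics.properties: route metrics from all instances to a CSV sink
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics

# Ship the file to executors and point Spark at it, e.g.:
# spark-submit --files conf/metrics.properties \
#   --conf spark.metrics.conf=metrics.properties ...
```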
10. Don’t Overload Data Store APIs
Avoid full scans of all partitions (reading the whole table and filtering afterward still lists every partition on S3 first):
val df = spark
  .read
  .parquet("s3://mybucket/mydata")
  .filter(col("mycolumn").equalTo("someDate"))
You can still read in data as partitioned without scanning entire table:
val df = spark
  .read
  .option("basePath", "s3://mybucket/mydata")
  .parquet("s3://mybucket/mydata/someDate")
11. Use Built-In Optimizations for Reading Data
● Automatic detection of partitions and efficient data read
○ Provide the basePath when reading in partitions
○ Always provide a schema to prevent repeated schema checking
● Columnar data: Parquet/ORC reader
○ Projection pushdown = only read the columns you need
○ Predicate/filter pushdown = use metadata to read in only the rows you need
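The three bullets above combine into one reader, sketched here; the schema and column names are hypothetical, and the bucket path is the one from the previous slide:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object PartitionedRead {
  // Hypothetical schema; supplying it up front avoids repeated schema inference passes.
  val eventSchema: StructType = StructType(Seq(
    StructField("user_id", StringType),
    StructField("event_type", StringType),
    StructField("someDate", StringType) // partition column
  ))

  def readEvents(spark: SparkSession, basePath: String): DataFrame =
    spark.read
      .schema(eventSchema)            // skip schema inference/checking
      .option("basePath", basePath)   // keep partition columns when reading subdirectories
      .parquet(basePath)
      .select("user_id", "event_type")       // projection pushdown: only these columns are read
      .filter(col("event_type") === "click") // predicate pushdown via Parquet metadata
}
```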
12. Beware the Shuffle!
● GroupBy, Join, Distinct…
● Amazon suggests avoiding shuffles entirely.
● Do that! Find another way to aggregate your data (e.g., aggregate it upstream in Kafka/Kinesis/Flink, or index it in Elasticsearch; there are many good options)
13. If you must shuffle… Know your data.
● Check for repeated values and nulls on join columns
○ Joining data with repeated values on both sides → gigantic result
○ Joining columns with nulls → massive skew
■ Can "salt" nulls by pre-filling arbitrary values into empty cells
○ Cluster resource use could be throttling broadcast joins (check it!)
● Check for skew
○ Grouping by a skewed column → Spark naively assigns rows to executors based on the levels of the skewed column
■ Application == dead (out of memory, network timeouts, lost nodes, processes that never end)
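The checks above are cheap to automate before committing to a join; `joinKeyHealth` is a hypothetical helper name:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object JoinChecks {
  // Count null keys and keys that appear more than once; both can blow up a join.
  def joinKeyHealth(df: DataFrame, key: String): (Long, Long) = {
    val nullKeys = df.filter(col(key).isNull).count()
    val repeatedKeys = df.groupBy(key).count().filter(col("count") > 1).count()
    (nullKeys, repeatedKeys)
  }
}
```

Run it on both sides of a planned join; a large repeated-key count on both sides warns of a near-Cartesian result, and a large null count warns of skew.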
14. Controlling Spark Shuffles
● Partition your data so it's mapped evenly across the cluster
○ Partition by a unique ID
○ Avoid partitioning on columns with a lot of null, missing, or skewed values
● Partition data to match the job you're running
○ Parallel transforms on many datasets: 200 partitions
○ Billions of pairwise comparisons: 4-10k partitions
○ Tests on a single server/locally: 1 partition
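The partition counts above can be sketched like this; `user_id` stands in for whatever unique ID your data has, and 200 is the example count from the slide, not a universal setting:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object Partitioning {
  // Spread rows evenly by a high-cardinality ID before a wide transform.
  def forParallelTransforms(df: DataFrame): DataFrame =
    df.repartition(200, col("user_id"))

  // Collapse to one partition for local/single-server tests.
  def forLocalTests(df: DataFrame): DataFrame =
    df.coalesce(1)
}
```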
15. Hacks for Shuffling Skewed Data
● Limit the job to a single level of the skewed variable at a time (serialize).
● Manually set a small broadcast blockSize to fit the instance types in your cluster.
● Salt the data.
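One common way to salt a skewed join, sketched under assumptions: `facts` is the large skewed side, `dims` the small side, and `key` the join column (all hypothetical names). Each hot key is split into `numSalts` sub-keys so its rows spread across executors:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object Salting {
  def saltedJoin(facts: DataFrame, dims: DataFrame, key: String, numSalts: Int): DataFrame = {
    // Big side: append a random salt in [0, numSalts) to the join key.
    val salted = facts.withColumn("salt", (rand() * numSalts).cast("int"))
    // Small side: replicate each row once per salt value so every sub-key finds a match.
    val replicated = dims.withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))
    salted.join(replicated, Seq(key, "salt")).drop("salt")
  }
}
```

The result is row-for-row identical to the plain join; only the shuffle distribution changes. Replicating the small side numSalts times is the price paid, so this only pays off when one side is small.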
17. EMR Cluster Resource Gotchas
● By default, YARN assigns only one vCPU per executor.
● If maximizeResourceAllocation = true, you get only one executor on each node (i.e., one YARN container/executor per machine).
● Poor use of resources.
● Lack of parallelism = bad for things that benefit from parallelism, like broadcast joins.
18. This gets really messy if multiple applications are running on one cluster.
19. How to get Spark to use more than one vCPU per machine?
Manually change the memory allocated to executors/driver?
Nope.
20. How to get Spark to use more than one vCPU per machine?
Change the YARN configs!
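As a sketch of that YARN change: the capacity scheduler's default resource calculator considers only memory, so a commonly used fix on EMR is switching to the DominantResourceCalculator via a configuration classification so YARN actually schedules on cores. Verify the property name against your EMR release before relying on it:

```json
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]
```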
21. Great! Except…
● Unless you manually set the number of cores used by the driver, it's 1.
● Which is fine, unless you switch to larger instance types…
● Then you should manually configure cluster resources.
Image credit: https://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/
22. Summary
● Spark is awesome, but can be tricky.
● Read the docs! Use those helpful Spark built-ins.
● Avoid/manage shuffling.
● Use the Hadoop UI to check your resource utilization.