Sparklife - Life In The Trenches With Spark

Life in the Trenches with Apache Spark

Who? What?
I work for Kogentix, Inc
Big Data projects, often with Spark (Core, Streaming, ML)
@carsondial on Twitter

Who is the talk for?
People using Apache Spark
People deciding whether to use Apache Spark
People who love hearing about Apache Spark
Will it get technical? A bit! (but not too much)

Our itinerary:
Good Things!
Bad Things!
Tips! Tricks! (hopefully the useful section!)
Other Frameworks Are Available

Good Things!
Developer-friendly API for batch and streaming
So much faster than MapReduce (in general!)
Fits in well with Hadoop Ecosystem
Batteries included (ML / GraphX / etc.)
Tons of Enterprise support

Bad Things!
Spark is not magic pixie dust!
Distributed systems are hard!
And…well, let's talk about the elephant…

The (Apparent) Motto of Apache Spark Development:

In Memoriam
This slide is dedicated to all those that created a production
system using MLLib just as Spark added ML Pipelines

(and also…)
(btw, if you haven't upgraded to Spark 2.0 yet, be aware
that the change to MurmurHash3 for hashing means you'll
have to retrain all your models that use HashingTF…)

Keeping Up With The Joneses
The Spark development team moves pretty fast, especially for
a project in the Hadoop ecosystem
Sometimes Catalyst will fail on a query / job that worked
perfectly fine in the previous version
New features (e.g. KafkaDirectStream/Datasets) can be
pushed before they're entirely ready.
Sometimes you get big surprises (Hello, Structured
Streaming!)
Hadoop distributions can lag (WHERE'S MY SHINY? IT
SHOULD HAVE THE SHINY!?)

(cont…)
Not backporting fixes to earlier versions of Spark
(e.g. Spark < 2.0 not having support for Kafka 0.9+
clients, despite it being a rather useful enterprise-y thing to
have) [SPARK-12177]
Reading the release notes is essential. Yes, I know you do
it, but still…

(and yet more…)
API Stability is a big issue
"I want to create a Spark Streaming Solution with Kafka
and Spark 2.0! How should I do it?"
"Er…"
Hold off on Structured Streaming for now…

ENOUGH NEGATIVITY!
The Spark maintainers are acutely aware of these issues
Recent on-going discussions focussed on improving things
Hurrah!
Now, onto tips and tricks

Developers!
Use Scala if you can.
Java 8 will preserve your sanity if you have to use Java

Why Not Python?
You will likely take a performance hit with PySpark
Python interpreter memory overhead on top of the JVM
Pickling RDDs and serializing them to Python and back to
the JVM
Often lags behind Scala / Java in new features

More Tips
I'm legally mandated to mention: do not use collect() or
groupByKey() in your code.
Be careful with your closures!
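
A minimal sketch of the usual alternatives, assuming Spark 2.x on the classpath. The object and method names (AggregationSketch, countWords, countSeen) are made up for illustration: reduceByKey stands in for groupByKey, an accumulator stands in for a variable captured in a closure, and take() stands in for collect().

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AggregationSketch {
  // reduceByKey combines values on each partition before the shuffle,
  // whereas groupByKey ships every single value across the network first.
  def countWords(words: RDD[String]): RDD[(String, Int)] =
    words.map(w => (w, 1)).reduceByKey(_ + _)

  // A plain `var seen = 0` captured by the closure would be copied to each
  // executor and the driver's copy would never change; an accumulator is
  // the supported way to count across tasks.
  def countSeen(sc: SparkContext, words: RDD[String]): Long = {
    val seen = sc.longAccumulator("wordsSeen")
    words.foreach(_ => seen.add(1))
    seen.value
  }

  def main(args: Array[String]): Unit = {
    // local[2] is only for trying this out on a laptop.
    val sc = new SparkContext(
      new SparkConf().setAppName("aggregation-sketch").setMaster("local[2]"))
    val words = sc.parallelize(Seq("spark", "storm", "spark", "flink", "spark"))

    // take(n) bounds what comes back to the driver, unlike collect().
    countWords(words).take(3).foreach(println)
    println(s"words seen: ${countSeen(sc, words)}")
    sc.stop()
  }
}
```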

Which data structure to use?
Datasets (Catalyst and fancy Encoders!)
Dataframes (Catalyst!)
RDDs (you're on your own…)
Use Kryo serialization when you can if not Encoders
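
Turning on Kryo is a configuration change. The sketch below assumes a hypothetical Reading class as the type being shuffled; the two Spark calls are the standard SparkConf ones.

```scala
import org.apache.spark.SparkConf

// Hypothetical domain class, used only for illustration.
case class Reading(sensorId: String, value: Double)

object KryoConfSketch {
  val conf: SparkConf = new SparkConf()
    .setAppName("kryo-sketch")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Registering classes lets Kryo write a small ID instead of the full class name.
    .registerKryoClasses(Array(classOf[Reading]))
}
```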

How many partitions should I have?
Goldilocks Problem
Too few - not enough parallelism
Too many - too much parallelism, and you lose time in
scheduling
Remember repartition() is not a free operation

Partition Rule of Thumb
3x cores in your cluster as a baseline
Experiment until performance suffers (increase by a factor
of <2 each time)
gzipped files may not be your friend for partitions
Partitions in streaming often set by streaming source (e.g.
Kafka)
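
The rule of thumb is simple arithmetic. Here is a sketch in plain Scala (no Spark needed); the names PartitionSketch, baseline and candidates, and the default growth factor of 1.5, are assumptions chosen for illustration.

```scala
object PartitionSketch {
  // Baseline from the rule of thumb: 3 partitions per core in the cluster.
  def baseline(totalCores: Int): Int = 3 * totalCores

  // Partition counts to try in turn, each a factor (< 2) larger than the last.
  def candidates(totalCores: Int, factor: Double = 1.5, steps: Int = 4): Seq[Int] = {
    require(factor > 1.0 && factor < 2.0, "factor should be between 1 and 2")
    Iterator.iterate(baseline(totalCores).toDouble)(_ * factor)
      .take(steps)
      .map(x => math.round(x).toInt)
      .toList
  }
}
```

Each candidate would then be passed to repartition() (or set at the source) and timed, stopping once performance gets worse.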

Map vs. MapPartitions
.map() works on an element-by-element basis
.mapPartitions() works on a partition basis
Can be very helpful when working with heavy objects /
connections /etc
Don't accidentally consume the iterator!
(e.g. converting to a list, using size()/count(), etc)
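
The iterator pitfall can be shown with plain Scala iterators, since the function passed to mapPartitions() has the shape Iterator[T] => Iterator[U]. Connection is a hypothetical stand-in for a heavy object such as a database client.

```scala
object MapPartitionsSketch {
  // Hypothetical stand-in for a heavy object such as a database connection.
  final class Connection {
    def lookup(id: Int): String = s"record-$id"
  }

  // One Connection per partition; the iterator is transformed lazily.
  def enrich(partition: Iterator[Int]): Iterator[String] = {
    val conn = new Connection
    partition.map(conn.lookup)
  }

  // Anti-pattern: size walks the iterator to its end,
  // so the map that follows sees no elements.
  def enrichBroken(partition: Iterator[Int]): Iterator[String] = {
    val conn = new Connection
    val n = partition.size // consumes the iterator
    partition.map(conn.lookup)
  }
}
```

With an RDD[Int] named rdd (assumed), the first function would be used as rdd.mapPartitions(MapPartitionsSketch.enrich).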

One more thing
Beware Java!
mapPartitions() in Java may bring the entire partition
into memory
Embrace Scala, JOIIIIIN USSSSSSS

Streaming Tips
You need to get your processing done within the batch
duration.
Backpressure!
Prefer mapWithState() over updateStateByKey(),
as it allows timeouts, modifying state without having to
iterate over the entire state space
See last year's ATO talk for more streaming tips!
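
A sketch of backpressure and mapWithState() together, assuming Spark Streaming 1.6 or later. The running-count logic, the 30 minute timeout and the names (StreamingStateSketch, trackCount, spec) are illustrative choices, not anything prescribed by Spark.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, State, StateSpec}

object StreamingStateSketch {
  // Lets Spark Streaming adapt the ingestion rate to the observed processing rate.
  val conf: SparkConf = new SparkConf()
    .setAppName("streaming-state-sketch")
    .set("spark.streaming.backpressure.enabled", "true")

  // Called only for keys with new data in the batch (or keys that are timing out),
  // rather than for every key in the state as with updateStateByKey().
  val trackCount = (key: String, value: Option[Int], state: State[Long]) => {
    if (state.isTimingOut()) {
      // The state is about to be removed and cannot be updated here.
      (key, state.get())
    } else {
      val total = state.getOption().getOrElse(0L) + value.getOrElse(0)
      state.update(total)
      (key, total)
    }
  }

  // Keys that receive no data for 30 minutes are dropped from the state.
  val spec = StateSpec.function(trackCount).timeout(Minutes(30))
}
```

Given a DStream[(String, Int)] named pairs (assumed), this would be applied as pairs.mapWithState(StreamingStateSketch.spec).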

Streaming Sources
Use Apache Kafka if you have an Ops team that can
support it
Kinesis if you don't or you can live with the restrictions of
Kinesis over Kafka (and you're in AWS)

SparkSQL
Don't rely on JSON schema reflection if you can help it
Large JSON schemas may break Hive
(or at least require you to alter things deep in the Hive
DDL)
Try to push filters down into the source layer when possible
Parquet ALL THE THINGS
Custom UDFs are (currently) opaque to Catalyst (non-JVM
languages are even worse here!)
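
A sketch of several of these points at once, assuming Spark 2.x: an explicit schema instead of JSON inference, a Column filter that Catalyst can see into (unlike a UDF), and Parquet output. The field names and the jsonToParquet helper are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, LongType, StringType, StructField, StructType}

object ExplicitSchemaSketch {
  // Hypothetical event layout; replace with your own fields.
  val eventSchema: StructType = StructType(Seq(
    StructField("device", StringType),
    StructField("ts", LongType),
    StructField("reading", DoubleType)
  ))

  def jsonToParquet(spark: SparkSession, in: String, out: String): Unit = {
    import spark.implicits._
    spark.read
      .schema(eventSchema)      // no extra pass over the data to infer a schema
      .json(in)
      .filter($"reading" > 0.0) // a Column expression, visible to Catalyst
      .write
      .parquet(out)
  }
}
```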

Testing
It's hard! But do it anyway
spark-testing-base
Maintained by Holden Karau
Great set of tools for testing RDDs, Dataframes, Datasets,
and streaming operations
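
A minimal test in the style of the spark-testing-base README, assuming the library and ScalaTest are on the test classpath. The FunSuite import shown is the one from older ScalaTest releases and may differ in yours; WordCountSpec is a made-up name.

```scala
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class WordCountSpec extends FunSuite with SharedSparkContext {
  test("counts each word once per occurrence") {
    // sc is the SparkContext supplied by SharedSparkContext.
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .collectAsMap()

    assert(counts === Map("a" -> 2, "b" -> 1))
  }
}
```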

Test Strategies
Is it correct? (spark-testing-base provides approximate
matchers too!)
Is it correct under production level loads?
Consider a shadow cluster for streaming

Operations
The key to a successful Spark solution.
Don't ignore Ops
So many knobs to fiddle with!

Deploying Spark Jobs
Don't rely on spark-submit for too long (do you really
want users to have to log in to a production server to kick
off a new job?)
Use Livy or Spark-Job-Server as soon as possible to solve
this with another layer of indirection!

Upgrading Spark Streaming Applications
Yay! I've turned checkpointing on and I'm super-resilient!
Now I'm going to upgrade my app!
Why has everything exploded?
Checkpointing only works with the same code. Change
anything…and boom.

THIS IS FINE.
Delete checkpoint and it'll work
But…offsets for streaming?
Store them in Zookeeper and obtain on start
(do you actually need checkpointing in that case? Possibly
not!)
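
A sketch of the store-and-restore flow in plain Scala. OffsetStore and its in-memory implementation are hypothetical: a real version would read and write ZooKeeper nodes. With the Kafka direct stream, the offsets for each batch come from the RDD's HasOffsetRanges. The idea is to save the end offsets once a batch's output has succeeded and to load them at startup as the stream's starting offsets.

```scala
object OffsetStoreSketch {
  // Mirrors the identifying fields of a Kafka offset range.
  final case class TopicPartition(topic: String, partition: Int)

  // Hypothetical interface; a real one would talk to ZooKeeper.
  trait OffsetStore {
    def load(group: String): Map[TopicPartition, Long]
    def save(group: String, offsets: Map[TopicPartition, Long]): Unit
  }

  // In-memory stand-in so the flow can be exercised without ZooKeeper.
  final class InMemoryOffsetStore extends OffsetStore {
    private var data = Map.empty[String, Map[TopicPartition, Long]]

    def load(group: String): Map[TopicPartition, Long] =
      data.getOrElse(group, Map.empty)

    // Newer offsets for a partition replace older ones.
    def save(group: String, offsets: Map[TopicPartition, Long]): Unit =
      data = data.updated(group, load(group) ++ offsets)
  }
}
```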

Scheduling Jobs.
OR: We need to talk about Apache Oozie.
It can do anything you can throw at it.
Provided that "anything" is written in a Turing-complete DSL
embedded in XML
Which may or may not validate, even when written correctly.
And a web UI that sometimes rises to the level of
'tolerable'

Oozie. Poor Oozie.
But! Less hate!
It is not a sexy part of the Hadoop ecosystem
It really can handle almost any scenario
Also, what if you didn't have to write the XML?

Arbiter
Write your Oozie workflows in YAML
100% more hipster-compliant
Seriously, up to about 20% less typing and handles
dependencies for you.
Try it, and maybe you won't hate Oozie so much

Monitoring
WebUI is great, but perhaps it would be better fed into
your existing monitoring solution?
CODA HALE TO THE RESCUE!
Send metrics to CSV, JMX, slf4j, or Graphite

Graphite monitoring
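import org.apache.spark.{SparkConf, SparkContext}

// graphiteHostName: hostname of your Graphite server, defined elsewhere
val sparkConf = new SparkConf()
  .set("spark.metrics.conf.*.sink.graphite.class",
    "org.apache.spark.metrics.sink.GraphiteSink")
  .set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)
  // the Graphite sink also needs a port; 2003 is Graphite's usual plaintext port
  .set("spark.metrics.conf.*.sink.graphite.port", "2003")
val sc = new SparkContext(sparkConf)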

Monitoring
Also, you're directing all your logs/metrics from your
executors and drivers to a central logging system, aren't
you? And Kafka?
Splunk / Datadog / ELK (Elastic, Logstash, Kibana) are
your friend
Include OS metrics too!

Debugging Issues
Spark WebUI is a good place to start for drilling down into
tasks
The OS is still important (e.g. memory, OOM killer, Xen
hypervisor networking, etc)
Distributed systems are hard!

flaaaaaamegraphs
Invented by Brendan Gregg (Netflix, Joyent, Sun)
Most common type is on-CPU flamegraph
The width of each frame shows how often that stack was
sampled on-CPU

Spark-flame
Simple and dirty Ansible playbook
Attaches to a Spark cluster running on YARN
Generates perf data and pulls back down for flamegraphs
One flamegraph per executor, can be combined into one
graph
https://github.com/falloutdurham/spark-flame

[Figure: flame graph of a Spark executor. Legible frames include
Interpreter, [libjvm.so], [libjava.so], start_thread, call_stub,
scala/collection/AbstractIterator:::foreach,
scala/collection/Iterator$class:::foreach and
scala/collection/generic/Growable$class:::$plus$plus$eq]

Performance!
How Many Executors Should We Use?
How Much Memory Do We Need?
What About Garbage Collection?

Executors
Hello Goldilocks again!
Small numbers of large executors = long GC pauses, low
parallelism
Large number of small executors = Frequent GC, memory
errors, other Bad Things

So How Many?
Stay between 3-5 cores per executor
64GB is a decent upper memory limit per executor
Remember the driver needs memory and cores too!
Experiment, experiment, experiment
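
As a starting point for the experiments, the guidelines above can be turned into arithmetic. This is a sketch only: the names are made up, and reserving one core and 8 GB per node for the OS and daemons is an assumption rather than a Spark rule. The memory figure is a ceiling per executor, from which any YARN overhead still has to be deducted.

```scala
object ExecutorSizingSketch {
  final case class Layout(executorsPerNode: Int, coresPerExecutor: Int, memoryPerExecutorGb: Int)

  // coresPerExecutor defaults to 5, the top of the 3-5 range.
  // One core and reservedGb of memory per node are left for the OS and daemons.
  def layout(nodeCores: Int, nodeMemoryGb: Int,
             coresPerExecutor: Int = 5, reservedGb: Int = 8): Layout = {
    require(coresPerExecutor >= 3 && coresPerExecutor <= 5, "stay between 3 and 5 cores")
    val executors = math.max(1, (nodeCores - 1) / coresPerExecutor)
    // Cap at 64 GB, the suggested upper limit per executor.
    val memory = math.min(64, (nodeMemoryGb - reservedGb) / executors)
    Layout(executors, coresPerExecutor, memory)
  }
}
```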

GC
Use G1GC!
Use UI to spot time spent in GC and then turn GC logging
on (or have it on anyway!)
Too many major GCs? Increase spark.memory.fraction or
Executor memory
Lots of minor GCs? Increase Eden space
Try other approaches before digging into the GC weeds
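
Switching executors to G1 and turning on GC logging is a single setting. The flags below are the Java 8 style logging flags; newer JVMs use a different logging option, so treat the exact string as an assumption about your JVM.

```scala
import org.apache.spark.SparkConf

object GcConfSketch {
  val conf: SparkConf = new SparkConf()
    .setAppName("gc-sketch")
    // G1 collector plus GC logging on every executor.
    .set("spark.executor.extraJavaOptions",
      "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
}
```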

Other Frameworks Are Available
Apache Storm / Heron
Apache Flink
Apache Apex
Apache Beam
Apache Kafkaaaaaaaaa?

Apache Storm
Low-latency at a level Spark Streaming can't (currently)
touch
Lower-level API (build the DAG yourself, peasant!)
Deploying and HA story has not been wonderful in the
past, but is getting much better!
Mature and battle-tested at Twitter, Yahoo!, etc.
1.0.x series is very solid - lower memory use, much faster.
Has slightly undeserved reputation as 'old man' of stream
processing

Heron
Built by Twitter to work around issues with Storm
Storm-compatible API
Works with Apache Mesos (YARN support is coming)
Looks very promising as a next-gen Storm
(but Apache Storm has also solved a lot of the issues
Twitter had, so… shrug)

Apache Flink
Higher-level API like Spark
Based around streaming rather than Spark's 'batch' focus
Getting traction in places like Uber, Netflix
Definitely worth investigating

Apache Apex
Dark Horse of the DAG processing engines
Low-level API like Storm
FAST.
Comes with an amazing array of lego bricks to assemble
your pipelines (want to pipe data from FTP and Splunk into
HBase? Easy with Malhar!)
Documentation sometimes lacking
Used by Capital One

Apache Beam
One API to bring them all and in the darkness bind them.
Initiative from Google - write your code using the Apache
Beam API and then you can run that code on:
Google Cloud Dataflow
Apache Spark
Apache Flink
(more to come)

Apache Beam
In theory, this is great!
But…
The favoured API is, obviously, Google Cloud Dataflow.
Last time I checked, the Apache Spark runner operated
only in terms of RDDs, thus bypassing Catalyst/Datasets
and all the performance boosts associated with them
I'd recommend Beam if you're shopping around for a
framework!

Kafkaaaaaaaaaaaa?
Wait, wait, what?
Kafka Connect as an alternative to Spark Streaming for ETL
Kafka Streams for streaming processing
HA by using Kafka itself!
Streams is very new.
Should consider Connect for ETL rather than a Kafka /
Spark solution

Finally…
Apache Spark is great!
But can require stepping outside the box at scale
lots of tuning!
test things!
monitor things!
go do great things!

Zine
Inspired by Julia Evans (@b0rk), I've made a zine!
30 copies!
Includes bonus material!
PDF: http://snappishproductions.com/files/sparklife.pdf

Links
Apache Spark
MapWithState
Livy
Spark-Job-Server
Checkpoints
Spark Streaming ATO 2015
Arbiter
Spark-flame
