4. Who? What?
I work for Kogentix, Inc
Big Data projects, often with Spark (Core, Streaming, ML)
@carsondial on Twitter
5. Who is the talk for?
People using Apache Spark
People deciding whether to use Apache Spark
People who love hearing about Apache Spark
Will it get technical? A bit! (but not too much)
7. Good Things!
Developer-friendly API for batch and streaming
So much faster than MapReduce (in general!)
Fits in well with Hadoop Ecosystem
Batteries included (ML / GraphX / etc.)
Tons of Enterprise support
8. Bad Things!
Spark is not magic pixie dust!
Distributed systems are hard!
And…well, let's talk about the elephant…
10. In Memoriam
This slide is dedicated to all those that created a production
system using MLLib just as Spark added ML Pipelines
11. (and also…)
(btw, if you haven't upgraded to Spark 2.0 yet, be aware
that the change to MurmurHash3 for hashing means you'll
have to retrain all your models that use HashingTF…)
12. Keeping Up With The Joneses
The Spark development team moves pretty fast, especially for
a project in the Hadoop ecosystem
Sometimes Catalyst will fail on a query / job that worked
perfectly fine in the previous version
New features (e.g. KafkaDirectStream/Datasets) can be
pushed before they're entirely ready.
Sometimes you get big surprises (Hello, Structured
Streaming!)
Hadoop distributions can lag (WHERE'S MY SHINY? IT
SHOULD HAVE THE SHINY!?)
13. (cont…)
Not backporting fixes to earlier versions of Spark
(e.g. Spark < 2.0 not having support for Kafka 0.9+
clients, despite it being a rather useful enterprise-y thing to
have) [SPARK-12177]
Reading the release notes is essential. Yes, I know you do
it, but still…
14. (and yet more…)
API Stability is a big issue
"I want to create a Spark Streaming Solution with Kafka
and Spark 2.0! How should I do it?"
"Er…"
Hold off on Structured Streaming for now…
15. ENOUGH NEGATIVITY!
The Spark maintainers are acutely aware of these issues
Recent, ongoing discussions are focussed on improving things
Hurrah!
Now, onto tips and tricks
17. Why Not Python?
You will likely take a performance hit with PySpark
Python interpreter memory overhead on top of the JVM
Pickling RDDs and serializing them to Python and back to
the JVM
Often lags behind Scala / Java in new features
18. More Tips
I'm legally mandated to mention: do not use collect() or
groupByKey() in your code.
Be careful with your closures!
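A minimal sketch of what these two tips tend to look like in code. The object and class names (WordTotals, Scaler) are made up for illustration; the RDD calls themselves are standard Spark.

```scala
import org.apache.spark.rdd.RDD

object WordTotals {
  // groupByKey ships every single value across the network before summing.
  def viaGroupByKey(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.groupByKey().mapValues(_.sum)

  // reduceByKey combines values inside each partition first,
  // so far less data is shuffled for the same result.
  def viaReduceByKey(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.reduceByKey(_ + _)
}

// Hypothetical class that is NOT Serializable.
class Scaler(factor: Int) {
  def scale(numbers: RDD[Int]): RDD[Int] = {
    // Using `factor` directly inside the lambda would capture `this`,
    // and Spark would then try (and fail) to serialize the whole Scaler.
    // Copying it to a local val means the closure only captures an Int.
    val localFactor = factor
    numbers.map(_ * localFactor)
  }
}
```

The same "copy to a local val" trick applies to any field you reference inside map(), filter(), foreach() and friends.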
19. Which data structure to use?
Datasets (Catalyst and fancy Encoders!)
Dataframes (Catalyst!)
RDDs (you're on your own…)
If you can't use Encoders, use Kryo serialization when you can
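Switching on Kryo is mostly configuration. A sketch, assuming a hypothetical Click record type; the config key and registerKryoClasses() are standard SparkConf features.

```scala
import org.apache.spark.SparkConf

// Hypothetical record type, here only for illustration.
case class Click(userId: Long, url: String)

object KryoConf {
  def build(): SparkConf =
    new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registered classes are written as a small ID instead of the
      // full class name, which keeps serialized data compact.
      .registerKryoClasses(Array(classOf[Click]))
}
```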
20. How many partitions should I have?
Goldilocks Problem
Too few - not enough parallelism
Too many - too much parallelism, and you lose time in
scheduling
Remember repartition() is not a free operation
21. Partition Rule of Thumb
3x cores in your cluster as a baseline
Experiment until performance suffers (increase by a factor
of <2 each time)
gzipped files may not be your friend for partitions
Partitions in streaming often set by streaming source (e.g.
Kafka)
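The rule of thumb above can be sketched as a little arithmetic. The object name and the default growth factor of 1.5 are assumptions for illustration; the slide only says "a factor of <2".

```scala
object PartitionCandidates {
  /** Baseline from the rule of thumb: three partitions per core. */
  def baseline(totalCores: Int): Int = 3 * totalCores

  /** Partition counts to try, growing by `factor` (kept below 2) each step. */
  def candidates(totalCores: Int, steps: Int, factor: Double = 1.5): Seq[Int] = {
    require(factor > 1.0 && factor < 2.0, "growth factor should be between 1 and 2")
    Iterator.iterate(baseline(totalCores).toDouble)(_ * factor)
      .take(steps)
      .map(n => math.ceil(n).toInt)
      .toList
  }
}
```

Time the same job at each candidate count and stop when it starts getting slower; remember that each repartition() you add while experimenting costs a shuffle.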
22. Map vs. MapPartitions
.map() works on an element-by-element basis
.mapPartitions() works on a partition basis
Can be very helpful when working with heavy objects /
connections / etc.
Don't accidentally consume the iterator!
(e.g. converting to a list, using size()/count(), etc)
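A sketch of the per-partition pattern and the iterator trap, written against plain Scala iterators so it runs without a cluster. SimpleDateFormat stands in for any heavy or non-thread-safe object; the object and method names are made up for illustration.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

object PartitionParsing {
  // Shaped like the function you would hand to rdd.mapPartitions:
  // an Iterator comes in, an Iterator goes out.
  def parseDates(lines: Iterator[String]): Iterator[Long] = {
    // Built once per partition rather than once per element.
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    // Iterator.map is lazy, so elements stream through one at a time.
    lines.map(line => fmt.parse(line).getTime)
  }

  // Anti-pattern: size walks the iterator to the end,
  // so there is nothing left to map over afterwards.
  def parseDatesBroken(lines: Iterator[String]): Iterator[Long] = {
    val n = lines.size
    println(s"partition has $n lines")
    parseDates(lines) // comes back empty
  }
}
```

In a job this would be used as rdd.mapPartitions(PartitionParsing.parseDates).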
23. One more thing
Beware Java!
mapPartitions() in Java may bring the entire partition
into memory
Embrace Scala, JOIIIIIN USSSSSSS
24. Streaming Tips
You need to get your processing done within the batch
duration.
Backpressure!
Prefer mapWithState() over updateStateByKey(),
as it allows timeouts, modifying state without having to
iterate over the entire state space
See last year's ATO talk for more streaming tips!
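For the mapWithState() tip, a minimal sketch of a running total per key with an idle timeout. The names (RunningTotals, updateTotal) and the 30-minute timeout are assumptions; the StreamingContext also needs a checkpoint directory set for stateful operations.

```scala
import org.apache.spark.streaming.{Minutes, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

object RunningTotals {
  // Invoked only for keys seen in the current batch (or keys timing out),
  // not for every key in the state store.
  val updateTotal = (key: String, value: Option[Int], state: State[Long]) => {
    val total = state.getOption.getOrElse(0L) + value.getOrElse(0)
    // State cannot be updated while a key is being timed out.
    if (!state.isTimingOut) state.update(total)
    (key, total)
  }

  def totals(events: DStream[(String, Int)]): DStream[(String, Long)] =
    events.mapWithState(
      // Keys idle for longer than the timeout are dropped from state.
      StateSpec.function(updateTotal).timeout(Minutes(30)))
}
```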
25. Streaming Sources
Use Apache Kafka if you have an Ops team that can
support it
Kinesis if you don't or you can live with the restrictions of
Kinesis over Kafka (and you're in AWS)
26. SparkSQL
Don't rely on JSON schema reflection if you can help it
Large JSON schemas may break Hive
(or at least require you to alter things deep in the Hive
DDL)
Try to push filters down into the source layer when possible
Parquet ALL THE THINGS
Custom UDFs are (currently) opaque to Catalyst (non-JVM
languages are even worse here!)
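A sketch of several of these tips together: an explicit schema instead of JSON inference, Parquet as the storage format, and a filter written with built-in column expressions so Catalyst can see it. The schema fields and object name are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object EventsEtl {
  // Hypothetical schema: declaring it up front skips the extra pass over
  // the data that schema inference needs, and keeps types predictable.
  val eventSchema = StructType(Seq(
    StructField("userId", LongType, nullable = false),
    StructField("country", StringType, nullable = true),
    StructField("url", StringType, nullable = true)))

  def jsonToParquet(spark: SparkSession, in: String, out: String): Unit =
    spark.read.schema(eventSchema).json(in).write.parquet(out)

  def usEvents(spark: SparkSession, parquetPath: String): DataFrame =
    spark.read.parquet(parquetPath)
      // A built-in expression, so it is a candidate for pushdown into
      // the Parquet reader. The same test inside a UDF would not be.
      .filter(col("country") === "US")
      .select("userId", "url")
}
```

Calling explain() on the result of usEvents should show whether the filter actually reached the source.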
27. Testing
It's hard! But do it anyway
spark-testing-base
Maintained by Holden Karau
Great set of tools for testing RDDs, Dataframes, Datasets,
and streaming operations
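A minimal sketch of a test using spark-testing-base, assuming the library and ScalaTest are on the test classpath. SharedSparkContext supplies a local SparkContext as `sc`; the suite name and data are made up.

```scala
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class WordCountSuite extends FunSuite with SharedSparkContext {
  test("counts words per key") {
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect() // fine here: the test data is tiny
      .toMap

    assert(counts === Map("a" -> 2, "b" -> 1))
  }
}
```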
28. Test Strategies
Is it correct? (spark-testing-base provides approximate
matchers too!)
Is it correct under production level loads?
Consider a shadow cluster for streaming
29. Operations
The key to a successful Spark solution.
Don't ignore Ops
So many knobs to fiddle with!
30. Deploying Spark Jobs
Don't rely on spark-submit for too long (do you really
want users to have to log in to a production server to kick
off a new job?)
Use Livy or Spark-Job-Submit as soon as possible to solve
this with another layer of indirection!
31. Upgrading Spark Streaming Applications
Yay! I've turned checkpointing on and I'm super-resilient!
Now I'm going to upgrade my app!
Why has everything exploded?
Checkpointing only works with the same code. Change
anything…and boom.
32. THIS IS FINE.
Delete checkpoint and it'll work
But…offsets for streaming?
Store them in Zookeeper and obtain on start
(do you actually need checkpointing in that case? Possibly
not!)
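A sketch of tracking offsets yourself, assuming the Kafka 0.8 direct stream integration (package org.apache.spark.streaming.kafka). OffsetStore is a hypothetical interface; an implementation would write to ZooKeeper or any other durable store.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Hypothetical interface: back it with ZooKeeper, a database, etc.
trait OffsetStore {
  def save(ranges: Array[OffsetRange]): Unit
}

object OffsetTracking {
  // `stream` must be the DStream returned by KafkaUtils.createDirectStream
  // itself: the HasOffsetRanges cast fails once the RDD has been transformed.
  def processAndSave[T](stream: DStream[T], store: OffsetStore)
                       (process: RDD[T] => Unit): Unit =
    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      process(rdd)
      // Save only after the batch has succeeded, so a crash replays
      // data rather than skipping it (at-least-once).
      store.save(ranges)
    }
}
```

On startup, the saved offsets would be read back and passed as the starting offsets when creating the direct stream.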
33. Scheduling Jobs.
OR: We need to talk about Apache Oozie.
It can do anything you can throw at it.
Provided that "anything" is written in a Turing-complete DSL
embedded in XML
Which may or may not validate, even if written correctly.
And a web UI that sometimes rises to the level of
'tolerable'
34. Oozie. Poor Oozie.
But! Less hate!
It is not a sexy part of the Hadoop ecosystem
It really can handle almost any scenario
Also, what if you didn't have to write the XML?
35. Arbiter
Write your Oozie workflows in YAML
100% more hipster-compliant
Seriously, up to about 20% less typing and handles
dependencies for you.
Try it, and maybe you won't hate Oozie so much
36. Monitoring
WebUI is great, but perhaps it would be better fed into
your existing monitoring solution?
CODA HALE TO THE RESCUE!
Send metrics to CSV, JMX, slf4j, or Graphite
37. Graphite monitoring
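import org.apache.spark.{SparkConf, SparkContext}

// graphiteHostName is assumed to be defined elsewhere.
val sparkConf = new SparkConf()
  .set("spark.metrics.conf.*.sink.graphite.class",
    "org.apache.spark.metrics.sink.GraphiteSink")
  .set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)
  // The sink also needs a port; 2003 is Graphite's default plaintext port.
  .set("spark.metrics.conf.*.sink.graphite.port", "2003")
val sc = new SparkContext(sparkConf)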
38. Monitoring
Also, you're directing all your logs/metrics from your
executors and drivers to a central logging system, aren't
you? And Kafka?
Splunk / Datadog / ELK (Elasticsearch, Logstash, Kibana) are
your friends
Include OS metrics too!
39. Debugging Issues
Spark WebUI is a good place to start for drilling down into
tasks
The OS is still important (e.g. memory, OOM killer, Xen
hypervisor networking, etc)
Distributed systems are hard!
40. flaaaaaamegraphs
Invented by Brendan Gregg (Netflix, Joyent, Sun)
Most common type is on-CPU flamegraph
The width of each frame shows how often that stack was
sampled on CPU
41. Spark-flame
Simple and dirty Ansible playbook
Attaches to a Spark cluster running on YARN
Generates perf data and pulls back down for flamegraphs
One flamegraph per executor, can be combined into one
graph
https://github.com/falloutdurham/spark-flame
44. Executors
Hello Goldilocks again!
Small numbers of large executors = long GC pauses, low
parallelism
Large number of small executors = Frequent GC, memory
errors, other Bad Things
45. So How Many?
Stay between 3-5 cores per executor
64GB is a decent upper memory limit per executor
Remember the driver needs memory and cores too!
Experiment, experiment, experiment
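As a starting point for the experimenting, the guidelines above can be turned into rough arithmetic. This is a sketch under loud assumptions: the reserved core and reserved memory defaults are made-up placeholders for the OS and other daemons, the names are hypothetical, and memory overhead on YARN is ignored.

```scala
object ExecutorSizing {
  case class Layout(executorsPerNode: Int, coresPerExecutor: Int, memoryPerExecutorGb: Int)

  def layout(coresPerNode: Int,
             memoryPerNodeGb: Int,
             coresPerExecutor: Int = 4,      // inside the 3-5 range
             reservedCores: Int = 1,         // assumption: left for the OS
             reservedMemoryGb: Int = 8,      // assumption: left for the OS
             maxExecutorMemoryGb: Int = 64): Layout = {
    require(coresPerExecutor >= 3 && coresPerExecutor <= 5,
      "stay between 3 and 5 cores per executor")
    val executors = math.max(1, (coresPerNode - reservedCores) / coresPerExecutor)
    // Cap memory at the upper limit even if the node could give more.
    val memoryGb = math.min(maxExecutorMemoryGb,
      (memoryPerNodeGb - reservedMemoryGb) / executors)
    Layout(executors, coresPerExecutor, memoryGb)
  }
}
```

For a 16-core, 128GB node this suggests three 4-core executors with 40GB each. Leave room somewhere in the cluster for the driver, then measure and adjust.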
46. GC
Use G1GC!
Use UI to spot time spent in GC and then turn GC logging
on (or have it on anyway!)
Too many major GCs? Increase spark.memory.fraction or
Executor memory
Lots of minor GCs? Increase Eden space
Try other approaches before digging into the GC weeds
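A sketch of where these GC knobs live. The JVM flags shown are the Java 8 style GC logging options; newer JVMs use different logging flags, so treat the exact string as an assumption about your runtime.

```scala
import org.apache.spark.SparkConf

object GcConf {
  def build(): SparkConf =
    new SparkConf()
      .setAppName("gc-logging-example")
      // G1 plus GC logging on every executor (Java 8 flag names).
      .set("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
      // 0.6 is the Spark 2.x default; change it in small steps
      // and watch the GC time column in the UI.
      .set("spark.memory.fraction", "0.6")
}
```

The GC output ends up in each executor's stdout log, which is reachable from the Executors tab in the web UI.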
47. Other Frameworks Are Available
Apache Storm / Heron
Apache Flink
Apache Apex
Apache Beam
Apache Kafkaaaaaaaaa?
48. Apache Storm
Low-latency at a level Spark Streaming can't (currently)
touch
Lower-level API (build the DAG yourself, peasant!)
Deploying and HA story has not been wonderful in the
past, but is getting much better!
Mature and battle-tested at Twitter, Yahoo!, etc.
1.0.x series is very solid - lower memory use, much faster.
Has slightly undeserved reputation as 'old man' of stream
processing
49. Heron
Built by Twitter to work around issues with Storm
Storm-compatible API
Works with Apache Mesos (YARN support is coming)
Looks very promising as a next-gen Storm
(but Apache Storm has also solved a lot of the issues
Twitter did, so *shrug*)
50. Apache Flink
Higher-level API like Spark
Based around streaming rather than Spark's 'batch' focus
Getting traction in places like Uber, Netflix
Definitely worth investigating
51. Apache Apex
Dark Horse of the DAG processing engines
Low-level API like Storm
FAST.
Comes with an amazing array of lego bricks to assemble
your pipelines (want to pipe data from FTP and Splunk into
HBase? Easy with Malhar!)
Documentation sometimes lacking
Used by Capital One
52. Apache Beam
One API to bring them all and in the darkness bind them.
Initiative from Google - write your code using the Apache
Beam API and then you can run that code on:
Google Cloud Dataflow
Apache Spark
Apache Flink
(more to come)
53. Apache Beam
In theory, this is great!
But…
The favoured API is, obviously, Google Cloud Dataflow.
Last time I checked, the Apache Spark runner operated
only in terms of RDDs, thus bypassing Catalyst/Datasets
and all the performance boosts associated with them
I'd recommend Beam if you're shopping around for a
framework!
54. Kafkaaaaaaaaaaaa?
Wait, wait, what?
Kafka Connect as an alternative to Spark Streaming for ETL
Kafka Streams for streaming processing
HA by using Kafka itself!
Streams is very new.
Should consider Connect for ETL rather than a Kafka /
Spark solution
55. Finally…
Apache Spark is great!
But can require stepping outside the box at scale
lots of tuning!
test things!
monitor things!
go do great things!
56. Zine
Inspired by Julia Evans (@b0rk), I've made a zine!
30 copies!
Includes bonus material!
PDF: http://snappishproductions.com/files/sparklife.pdf