Life in the Trenches with Apache Spark
Who? What?
I work for Kogentix, Inc
Big Data projects, often with Spark (Core, Streaming, ML)
@carsondial on Twitter
Who is the talk for?
People using Apache Spark
People deciding whether to use Apache Spark
People who love hearing about Apache Spark
Will it get technical? A bit! (but not too much)
Our itinerary:
Good Things!
Bad Things!
Tips! Tricks! (hopefully the useful section!)
Other Frameworks Are Available
Good Things!
Developer-friendly API for batch and streaming
So much faster than MapReduce (in general!)
Fits in well with Hadoop Ecosystem
Batteries included (ML / GraphX / etc.)
Tons of Enterprise support
Bad Things!
Spark is not magic pixie dust!
Distributed systems are hard!
And…well, let's talk about the elephant…
The (Apparent) Motto of Apache
Spark Development:
In Memoriam
This slide is dedicated to all those who created a production
system using MLlib just as Spark added ML Pipelines
(and also…)
(btw, if you haven't upgraded to Spark 2.0 yet, be aware
that the change to MurmurHash3 for hashing means you'll
have to retrain all your models that use HashingTF…)
Keeping Up With The Joneses
The Spark development team moves pretty fast, especially for
a project in the Hadoop ecosystem
Sometimes Catalyst will fail on a query / job that worked
perfectly fine in the previous version
New features (e.g. KafkaDirectStream/Datasets) can be
pushed before they're entirely ready.
Sometimes you get big surprises (Hello, Structured
Streaming!)
Hadoop distributions can lag (WHERE'S MY SHINY? IT
SHOULD HAVE THE SHINY!?)
(cont…)
Not backporting fixes to earlier versions of Spark
(e.g. Spark < 2.0 not having support for Kafka 0.9+
clients, despite it being a rather useful enterprise-y thing to
have) [SPARK-12177]
Reading the release notes is essential. Yes, I know you do
it, but still…
(and yet more…)
API Stability is a big issue
"I want to create a Spark Streaming Solution with Kafka
and Spark 2.0! How should I do it?"
"Er…"
Hold off on Structured Streaming for now…
ENOUGH NEGATIVITY!
The Spark maintainers are acutely aware of these issues
Recent on-going discussions focussed on improving things
Hurrah!
Now, onto tips and tricks
Developers!
Use Scala if you can.
Java 8 will preserve your sanity if you have to use Java
Why Not Python?
You will likely take a performance hit with PySpark
Python interpreter memory overhead on top of the JVM
Pickling RDDs and serializing them to Python and back to
the JVM
Often lags behind Scala / Java in new features
More Tips
I'm legally mandated to tell you not to use collect() or
groupByKey() in your code.
Be careful with your closures!
Which data structure to use?
Datasets (Catalyst and fancy Encoders!)
Dataframes (Catalyst!)
RDDs (you're on your own…)
If you can't use Encoders, use Kryo serialization where you can
How many partitions should I have?
Goldilocks Problem
Too few - not enough parallelism
Too many - too much parallelism and lose time in
scheduling
Remember repartition() is not a free operation
Partition Rule of Thumb
3x cores in your cluster as a baseline
Experiment until performance suffers (increase by a factor
of <2 each time)
gzipped files may not be your friend for partitions
Partitions in streaming often set by streaming source (e.g.
Kafka)
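A minimal sketch of the rule of thumb above, in plain Scala so it runs without a cluster. The names PartitionBaseline, baseline, and candidates, and the default growth factor of 1.5, are illustrative assumptions and not part of Spark.

```scala
object PartitionBaseline {
  /** Baseline from the rule of thumb: three partitions per core in the cluster. */
  def baseline(totalCores: Int): Int = 3 * totalCores

  /** Partition counts to try in turn, growing by `factor` (kept below 2) each step. */
  def candidates(totalCores: Int, steps: Int, factor: Double = 1.5): List[Int] = {
    require(factor > 1.0 && factor < 2.0, "growth factor should stay between 1 and 2")
    Iterator
      .iterate(baseline(totalCores).toDouble)(_ * factor)
      .take(steps)
      .map(_.round.toInt)
      .toList
  }
}
```

For a 40-core cluster, candidates(40, 4) suggests trying 120, 180, 270, and 405 partitions. Stop at the last count before performance gets worse.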
Map vs. MapPartitions
.map() works on an element-by-element basis
.mapPartitions() works on a partition basis
Can be very helpful when working with heavy objects /
connections /etc
Don't accidentally consume the iterator!
(e.g. converting to a list, using size()/count(), etc)
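A small sketch of both points, using plain Scala iterators so it runs without Spark. The function passed to mapPartitions() takes an Iterator and returns an Iterator, which is the shape used here. HeavyClient and both function names are hypothetical stand-ins.

```scala
object PartitionFunctions {
  /** Hypothetical stand-in for an expensive resource such as a database connection. */
  final class HeavyClient {
    def lookup(id: Int): String = s"record-$id"
  }

  /** One client per partition; elements are transformed lazily as they are pulled. */
  def enrich(partition: Iterator[Int]): Iterator[String] = {
    val client = new HeavyClient
    partition.map(client.lookup)
  }

  /** The pitfall: toList walks the iterator, so the map that follows sees nothing. */
  def enrichBroken(partition: Iterator[Int]): Iterator[String] = {
    val client = new HeavyClient
    val seen = partition.toList                 // consumes every element
    println(s"partition has ${seen.size} rows") // harmless-looking logging
    partition.map(client.lookup)                // iterator is already exhausted
  }
}
```

Given an RDD[Int] called rdd (hypothetical), the first function would be used as rdd.mapPartitions(PartitionFunctions.enrich). If you really do need the whole partition, map over the list you built rather than the spent iterator, and accept the memory cost.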
One more thing
Beware Java!
mapPartitions() in Java may bring the entire partition
into memory
Embrace Scala, JOIIIIIN USSSSSSS
Streaming Tips
You need to get your processing done within the batch
duration.
Backpressure!
Prefer mapWithState() over updateStateByKey(),
as it supports timeouts and lets you modify state without
iterating over the entire state space
See last year's ATO talk for more streaming tips!
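A rough sketch of why the batch duration matters, plus the backpressure switch. It is plain Scala; the settings map is in the shape SparkConf.setAll accepts. The object and method names are made up for illustration, and the rate cap of 1000 is an arbitrary example value.

```scala
object StreamingBudget {
  /** Settings to hand to SparkConf.setAll; the rate cap is an illustrative value. */
  val settings: Map[String, String] = Map(
    "spark.streaming.backpressure.enabled"      -> "true",
    "spark.streaming.kafka.maxRatePerPartition" -> "1000"
  )

  /** Backlog built up after `batches` intervals when each batch overruns its slot. */
  def backlogMs(processingMs: Long, batchMs: Long, batches: Int): Long =
    math.max(0L, processingMs - batchMs) * batches
}
```

A job that takes 12 seconds to process each 10-second batch is a full minute behind after 30 batches, and it only gets worse from there.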
Streaming Sources
Use Apache Kafka if you have an Ops team that can
support it
Kinesis if you don't or you can live with the restrictions of
Kinesis over Kafka (and you're in AWS)
SparkSQL
Don't rely on JSON schema reflection if you can help it
Large JSON schemas may break Hive
(or at least require you to alter things deep in the Hive
DDL)
Try to push filters down into the source layer when possible
Parquet ALL THE THINGS
Custom UDFs are (currently) opaque to Catalyst (non-JVM
languages are even worse here!)
Testing
It's hard! But do it anyway
spark-testing-base
Maintained by Holden Karau
Great set of tools for testing RDDs, Dataframes, Datasets,
and streaming operations
Test Strategies
Is it correct? (spark-testing-base provides approximate
matchers too!)
Is it correct under production level loads?
Consider a shadow cluster for streaming
Operations
The key to a successful Spark solution.
Don't ignore Ops
So many knobs to fiddle with!
Deploying Spark Jobs
Don't rely on spark-submit for too long (do you really
want users to have to log in to a production server to kick
off a new job?)
Use Livy or Spark-Job-Server as soon as possible to solve
this with another layer of indirection!
Upgrading Spark Streaming Applications
Yay! I've turned checkpointing on and I'm super-resilient!
Now I'm going to upgrade my app!
Why has everything exploded?
Checkpointing only works with the same code. Change
anything…and boom.
THIS IS FINE.
Delete checkpoint and it'll work
But…offsets for streaming?
Store them in Zookeeper and obtain on start
(do you actually need checkpointing in that case? Possibly
not!)
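A sketch of the "store offsets yourself" idea, under heavy assumptions: OffsetStore is a hypothetical interface, and the in-memory version stands in for ZooKeeper so the example runs on its own. None of these names come from Spark or Kafka.

```scala
object OffsetTracking {
  /** Hypothetical storage interface; a real one might be backed by ZooKeeper. */
  trait OffsetStore {
    def load(topic: String): Map[Int, Long] // partition -> next offset to read
    def save(topic: String, offsets: Map[Int, Long]): Unit
  }

  /** In-memory stand-in so the sketch runs without external services. */
  final class InMemoryOffsetStore extends OffsetStore {
    private var saved = Map.empty[String, Map[Int, Long]]
    def load(topic: String): Map[Int, Long] = saved.getOrElse(topic, Map.empty)
    def save(topic: String, offsets: Map[Int, Long]): Unit =
      saved = saved.updated(topic, load(topic) ++ offsets)
  }

  /** Run the batch work first, then record where the next run should resume. */
  def processBatch(store: OffsetStore, topic: String, batchEnds: Map[Int, Long])(work: => Unit): Unit = {
    work
    store.save(topic, batchEnds)
  }
}
```

Because the offsets live outside the application, an upgraded build can call load() on start and carry on from there, with no checkpoint directory tied to the old code.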
Scheduling Jobs.
OR: We need to talk about Apache Oozie.
It can do anything you can throw at it.
Provided that 'anything' is written in a Turing-complete DSL
embedded in XML
Which may or may not validate, even if written correctly.
And a web UI that sometimes rises to the level of
'tolerable'
Oozie. Poor Oozie.
But! Less hate!
It is not a sexy part of the Hadoop ecosystem
It really can handle almost any scenario
Also, what if you didn't have to write the XML?
Arbiter
Write your Oozie workflows in YAML
100% more hipster-compliant
Seriously, up to about 20% less typing and handles
dependencies for you.
Try it, and maybe you won't hate Oozie so much
Monitoring
WebUI is great, but perhaps it would be better fed into
your existing monitoring solution?
CODA HALE TO THE RESCUE!
Send metrics to CSV, JMX, slf4j, or Graphite
Graphite monitoring
import org.apache.spark.{SparkConf, SparkContext}

// graphiteHostName is assumed to hold the address of your Graphite server
val sparkConf = new SparkConf()
  .set("spark.metrics.conf.*.sink.graphite.class",
       "org.apache.spark.metrics.sink.GraphiteSink")
  .set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)
val sc = new SparkContext(sparkConf)
Monitoring
Also, you're directing all your logs/metrics from your
executors and drivers to a central logging system, aren't
you? And Kafka?
Splunk / Datadog / ELK (Elasticsearch, Logstash, Kibana) are
your friends
Include OS metrics too!
Debugging Issues
Spark WebUI is a good place to start for drilling down into
tasks
The OS is still important (e.g. memory, OOM killer, Xen
hypervisor networking, etc)
Distributed systems are hard!
flaaaaaamegraphs
Invented by Brendan Gregg (Netflix, Joyent, Sun)
Most common type is on-CPU flamegraph
Width of a stack frame is how often that frame shows up
on CPU in the samples
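To make the "width" idea concrete, here is a toy sketch in plain Scala. It collapses sampled stacks into counts and works out what share of samples a frame appears in. The object name, the function names, and the sample data are invented for illustration; real tooling works from perf output.

```scala
object FoldStacks {
  /** Collapse sampled stacks (root frame first) into "frame;frame;frame" -> sample count. */
  def fold(samples: Seq[Seq[String]]): Map[String, Int] =
    samples.groupBy(_.mkString(";")).map { case (stack, hits) => stack -> hits.size }

  /** Share of all samples in which `frame` appears, i.e. its relative width. */
  def width(samples: Seq[Seq[String]], frame: String): Double =
    if (samples.isEmpty) 0.0
    else samples.count(_.contains(frame)).toDouble / samples.size
}
```

With four samples, of which three pass through foreach, that frame takes up three quarters of the graph's width. Wide frames near the top of the graph are where the CPU time is actually going.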
Spark-flame
Simple and dirty Ansible playbook
Attaches to a Spark cluster running on YARN
Generates perf data and pulls back down for flamegraphs
One flamegraph per executor, can be combined into one
graph
https://github.com/falloutdurham/spark-flame
[Figure: flame graph. Legible frames include Interpreter, [libjvm.so], [libjava.so], start_thread, call_stub, scala/collection/AbstractIterator:::foreach, scala/collection/Iterator$class:::foreach, and scala/collection/generic/Growable$class:::$plus$plus$eq]
Performance!
How Many Executors Should We Use?
How Much Memory Do We Need?
What About Garbage Collection?
Executors
Hello Goldilocks again!
Small numbers of large executors = long GC pauses, low
parallelism
Large number of small executors = Frequent GC, memory
errors, other Bad Things
So How Many?
Stay between 3-5 cores per executor
64GB is a decent upper memory limit per executor
Remember the driver needs memory and cores too!
Experiment, experiment, experiment
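A back-of-the-envelope sketch of these guidelines in plain Scala. The 3-5 core range and the 64GB cap come from the list above. Reserving 1 core and 8GB per node for the OS and other daemons is an assumption made for this example, as are all the names.

```scala
object ExecutorSizing {
  final case class Layout(executorsPerNode: Int, coresPerExecutor: Int, memoryPerExecutorGb: Int)

  /** Split one worker node into executors; reserved values are illustrative defaults. */
  def layout(nodeCores: Int,
             nodeMemoryGb: Int,
             coresPerExecutor: Int = 5,
             reservedCores: Int = 1,
             reservedMemoryGb: Int = 8): Layout = {
    require(coresPerExecutor >= 3 && coresPerExecutor <= 5, "stay between 3 and 5 cores per executor")
    val usableCores = nodeCores - reservedCores
    val executors   = math.max(1, usableCores / coresPerExecutor)
    // Cap each executor at 64GB, the upper limit suggested above.
    val memoryGb    = math.min(64, (nodeMemoryGb - reservedMemoryGb) / executors)
    Layout(executors, coresPerExecutor, memoryGb)
  }
}
```

A 16-core, 128GB node comes out as 3 executors with 5 cores and 40GB each. Treat that as a starting point for experiments, and remember to leave room for the driver somewhere.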
GC
Use G1GC!
Use UI to spot time spent in GC and then turn GC logging
on (or have it on anyway!)
Too many major GCs? Increase spark.memory.fraction or
Executor memory
Lots of minor GCs? Increase Eden space
Try other approaches before digging into the GC weeds
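One way to switch on G1 and GC logging for executors, sketched as a plain Scala map in the shape SparkConf.setAll accepts. The logging flags assume a Java 8 JVM (newer JVMs use a different logging syntax), and the object and value names are made up for this example.

```scala
object GcSettings {
  /** G1 plus GC logging; the logging flags assume a Java 8 JVM. */
  val executorJvmFlags: Seq[String] = Seq(
    "-XX:+UseG1GC",
    "-verbose:gc",
    "-XX:+PrintGCDetails",
    "-XX:+PrintGCTimeStamps"
  )

  /** Key/value pairs that can be handed to SparkConf.setAll. */
  val settings: Map[String, String] = Map(
    "spark.executor.extraJavaOptions" -> executorJvmFlags.mkString(" ")
  )
}
```

The GC output then lands in each executor's stdout, which is one more reason to ship executor logs somewhere central.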
Other Frameworks Are Available
Apache Storm / Heron
Apache Flink
Apache Apex
Apache Beam
Apache Kafkaaaaaaaaa?
Apache Storm
Low-latency at a level Spark Streaming can't (currently)
touch
Lower-level API (build the DAG yourself, peasant!)
Deploying and HA story has not been wonderful in the
past, but is getting much better!
Mature and battle-tested at Twitter, Yahoo!, etc.
1.0.x series is very solid - lower memory use, much faster.
Has a slightly undeserved reputation as the 'old man' of stream
processing
Heron
Built by Twitter to work around issues with Storm
Storm-compatible API
Works with Apache Mesos (YARN support is coming)
Looks very promising as a next-gen Storm
(but Apache Storm has also solved a lot of the issues
Twitter did, so, shrug)
Apache Flink
Higher-level API like Spark
Based around streaming rather than Spark's 'batch' focus
Getting traction in places like Uber, Netflix
Definitely worth investigating
Apache Apex
Dark Horse of the DAG processing engines
Low-level API like Storm
FAST.
Comes with an amazing array of lego bricks to assemble
your pipelines (want to pipe data from FTP and Splunk into
HBase? Easy with Malhar!)
Documentation sometimes lacking
Used by Capital One
Apache Beam
One API to bring them all and in the darkness bind them.
Initiative from Google - write your code using the Apache
Beam API and then you can run that code on:
Google Cloud Dataflow
Apache Spark
Apache Flink
(more to come)
Apache Beam
In theory, this is great!
But…
The favoured API is, obviously, Google Cloud Dataflow.
Last time I checked, the Apache Spark runner operated
only in terms of RDDs, thus bypassing Catalyst/Datasets
and all the performance boosts associated with them
I'd recommend Beam if you're shopping around for a
framework!
Kafkaaaaaaaaaaaa?
Wait, wait, what?
Kafka Connect as an alternative to Spark Streaming for ETL
Kafka Streams for streaming processing
HA by using Kafka itself!
Streams is very new.
Should consider Connect for ETL rather than a Kafka /
Spark solution
Finally…
Apache Spark is great!
But can require stepping outside the box at scale
lots of tuning!
test things!
monitor things!
go do great things!
Zine
Inspired by Julia Evans (@b0rk), I've made a zine!
30 copies!
Includes bonus material!
PDF: http://snappishproductions.com/files/sparklife.pdf
Links
Apache Spark
MapWithState
Livy
Spark-Job-Server
Checkpoints
Spark Streaming ATO 2015
Arbiter
Spark-flame

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 

Sparklife - Life In The Trenches With Spark

  • 1. Life in the Trenches with Apache Spark
  • 2.
  • 3.
  • 4. Who? What? I work for Kogentix, Inc Big Data projects, often with Spark (Core, Streaming, ML) @carsondial on Twitter
  • 5. Who is the talk for? People using Apache Spark People deciding whether to use Apache Spark People who love hearing about Apache Spark Will it get technical? A bit! (but not too much)
  • 6. Our itinerary: Good Things! Bad Things! Tips! Tricks! (hopefully the useful section!) Other Frameworks Are Available
  • 7. Good Things! Developer-friendly API for batch and streaming So much faster than MapReduce (in general!) Fits in well with Hadoop Ecosystem Batteries included (ML / GraphX / etc.) Tons of Enterprise support
  • 8. Bad Things! Spark is not magic pixie dust! Distributed systems are hard! And…well, let's talk about the elephant…
  • 9. The (Apparent) Motto of Apache Spark Development:
  • 10. In Memoriam This slide is dedicated to all those that created a production system using MLLib just as Spark added ML Pipelines
  • 11. (and also…) (btw, if you haven't upgraded to Spark 2.0 yet, be aware that the change to MurmurHash3 for hashing means you'll have to retrain all your models that use HashingTF…)
  • 12. Keeping Up With The Joneses The Spark development team moves pretty fast, especially for a project in the Hadoop ecosystem Sometimes Catalyst will fail on a query / job that worked perfectly fine in the previous version New features (e.g. KafkaDirectStream/Datasets) can be pushed before they're entirely ready. Sometimes you get big surprises (Hello, Structured Streaming!) Hadoop distributions can lag (WHERE'S MY SHINY? IT SHOULD HAVE THE SHINY!?)
  • 13. (cont…) Not backporting fixes to earlier versions of Spark (e.g. Spark < 2.0 not having support for Kafka 0.9+ clients, despite it being a rather useful enterprise-y thing to have) [SPARK-12177] Reading the release notes is essential. Yes, I know you do it, but still…
  • 14. (and yet more…) API Stability is a big issue "I want to create a Spark Streaming Solution with Kafka and Spark 2.0! How should I do it?" "Er…" Hold off on Structured Streaming for now…
  • 15. ENOUGH NEGATIVITY! The Spark maintainers are acutely aware of these issues Recent on-going discussions focussed on improving things Hurrah! Now, onto tips and tricks
  • 16. Developers! Use Scala if you can. Java 8 will preserve your sanity if you have to use Java
  • 17. Why Not Python? You will likely take a performance hit with PySpark Python interpreter memory overhead on top of the JVM Pickling RDDs and serializing them to Python and back to the JVM Often lags behind Scala / Java in new features
  • 18. More Tips I'm legally mandated to mention: do not use collect() or groupByKey() in your code. Be careful with your closures!
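The closure warning on slide 18 can be seen without a cluster, because Spark ships closures to executors using plain Java serialization. The following is a minimal sketch, assuming Scala 2.12 or later; `Scaler`, `ClosureDemo` and `serializes` are hypothetical names used only for illustration, not anything from the talk.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical helper class; note that it is NOT Serializable.
class Scaler(val factor: Int) {
  // Reads the field inside the lambda, so the closure captures `this`.
  def risky: Int => Int = (x: Int) => x * factor

  // Copies the field to a local val first, so only an Int is captured.
  def safe: Int => Int = {
    val f = factor
    (x: Int) => x * f
  }
}

object ClosureDemo {
  // True if the object survives Java serialization.
  def serializes(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Passing something like `risky` to `rdd.map` is the usual route to a "Task not serializable" error; copying the field into a local value, as `safe` does, is one common way around it.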
  • 19. Which data structure to use? Datasets (Catalyst and fancy Encoders!) Dataframes (Catalyst!) RDDs (you're on your own…) If you can't use Encoders, use Kryo serialization when you can
  • 20. How many partitions should I have? Goldilocks Problem Too few - not enough parallelism Too many - you lose time to scheduling overhead Remember repartition() is not a free operation
  • 21. Partition Rule of Thumb 3x cores in your cluster as a baseline Experiment until performance suffers (increase by a factor of <2 each time) gzipped files may not be your friend for partitions Partitions in streaming often set by streaming source (e.g. Kafka)
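The rule of thumb on slides 20 and 21 can be turned into a list of partition counts to benchmark. This is only a sketch: `PartitionSizing` and its methods are hypothetical names, and the growth factor of 1.5 is an arbitrary pick from the slide's "less than 2" range.

```scala
object PartitionSizing {
  // Baseline from the slide: roughly 3 partitions per core in the cluster.
  def baseline(totalCores: Int): Int = 3 * totalCores

  // Candidate partition counts to benchmark, growing by `factor` each step.
  // The slide suggests a factor below 2; 1.5 is an arbitrary choice here.
  def candidates(totalCores: Int, steps: Int, factor: Double = 1.5): Seq[Int] = {
    require(factor > 1.0 && factor < 2.0, "factor should be between 1 and 2")
    Iterator.iterate(baseline(totalCores).toDouble)(_ * factor)
      .take(steps)
      .map(_.round.toInt)
      .toList
  }
}
```

Each candidate would then be tried as the argument to `repartition(n)` (or as the partition count of a shuffle operation), stopping once the job starts to slow down.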
  • 22. Map vs. MapPartitions .map() works on an element-by-element basis .mapPartitions() works on a partition basis Can be very helpful when working with heavy objects / connections /etc Don't accidentally consume the iterator! (e.g. converting to a list, using size()/count(), etc)
  • 23. One more thing Beware Java! mapPartitions() in Java may bring the entire partition into memory Embrace Scala, JOIIIIIN USSSSSSS
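The iterator warning on slide 22 is a property of Scala iterators rather than of Spark, so it can be shown with plain Scala. In this sketch `HeavyParser` is a hypothetical stand-in for an expensive object such as a connection, and the function names are made up for illustration.

```scala
object PartitionFunctions {
  // Stand-in for something expensive to build, such as a parser or a
  // database connection (hypothetical, for illustration only).
  final class HeavyParser {
    def parse(s: String): Int = s.trim.length
  }

  // Good: one HeavyParser per partition, and the iterator stays lazy.
  def parseLazily(records: Iterator[String]): Iterator[Int] = {
    val parser = new HeavyParser
    records.map(parser.parse)
  }

  // Bad: `size` walks the whole iterator, so nothing is left to map over.
  def parseAfterCounting(records: Iterator[String]): Iterator[Int] = {
    val parser = new HeavyParser
    val count = records.size // consumes every element
    records.map(parser.parse)
  }
}
```

With Spark on the classpath, a function of this shape can be handed straight to `rdd.mapPartitions(PartitionFunctions.parseLazily)`. Building the heavy object inside the function means it is created on the executor, once per partition, instead of being serialized from the driver.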
  • 24. Streaming Tips You need to get your processing done within the batch duration. Backpressure! Prefer mapWithState() over updateStateByKey(), as it allows timeouts and lets you modify state without iterating over the entire state space See last year's ATO talk for more streaming tips!
  • 25. Streaming Sources Use Apache Kafka if you have an Ops team that can support it Kinesis if you don't or you can live with the restrictions of Kinesis over Kafka (and you're in AWS)
  • 26. SparkSQL Don't rely on JSON schema reflection if you can help it Large JSON schemas may break Hive (or at least require you to alter things deep in the Hive DDL) Try to push filters down into the source layer when possible Parquet ALL THE THINGS Custom UDFs are (currently) opaque to Catalyst (non-JVM languages are even worse here!)
  • 27. Testing It's hard! But do it anyway spark-testing-base Maintained by Holden Karau Great set of tools for testing RDDs, Dataframes, Datasets, and streaming operations
  • 28. Test Strategies Is it correct? (spark-testing-base provides approximate matchers too!) Is it correct under production level loads? Consider a shadow cluster for streaming
  • 29. Operations The key to a successful Spark solution. Don't ignore Ops So many knobs to fiddle with!
  • 30. Deploying Spark Jobs Don't rely on spark-submit for too long (do you really want users to have to log in to a production server to kick off a new job?) Use Livy or Spark-Job-Submit as soon as possible to solve this with another layer of indirection!
  • 31. Upgrading Spark Streaming Applications Yay! I've turned checkpointing on and I'm super-resilient! Now I'm going to upgrade my app! Why has everything exploded? Checkpointing only works with the same code. Change anything…and boom.
  • 32. THIS IS FINE. Delete checkpoint and it'll work But…offsets for streaming? Store them in Zookeeper and obtain on start (do you actually need checkpointing in that case? Possibly not!)
  • 33. Scheduling Jobs. OR: We need to talk about Apache Oozie. It can do anything you can throw at it. Provided 'anything' is in a Turing-complete DSL embedded in XML, which may or may not validate, even if written correctly. And a web UI that sometimes rises to the level of 'tolerable'
  • 34. Oozie. Poor Oozie. But! Less hate! It is not a sexy part of the Hadoop ecosystem It really can handle almost any scenario Also, what if you didn't have to write the XML?
  • 35. Arbiter Write your Oozie workflows in YAML 100% more hipster-compliant Seriously, up to about 20% less typing and handles dependencies for you. Try it, and maybe you won't hate Oozie so much
  • 36. Monitoring WebUI is great, but perhaps it would be better fed into your existing monitoring solution? CODA HALE TO THE RESCUE! Send metrics to CSV, JMX, slf4j, or Graphite
  • 37. Graphite monitoring
        import org.apache.spark
        // graphiteHostName is assumed to be defined elsewhere
        val sparkConf = new spark.SparkConf()
          .set("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
          .set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)
        val sc = new spark.SparkContext(sparkConf)
  • 38. Monitoring Also, you're directing all your logs/metrics from your executors and drivers to a central logging system, aren't you? And Kafka? Splunk / Datadog / ELK (Elasticsearch, Logstash, Kibana) are your friends Include OS metrics too!
  • 39. Debugging Issues Spark WebUI is a good place to start for drilling down into tasks The OS is still important (e.g. memory, OOM killer, Xen hypervisor networking, etc) Distributed systems are hard!
  • 40. flaaaaaamegraphs Invented by Brendan Gregg (Netflix, Joyent, Sun) Most common type is on-CPU flamegraph Width of a frame shows how often that stack was on CPU across the samples
  • 41. Spark-flame Simple and dirty Ansible playbook Attaches to a Spark cluster running on YARN Generates perf data and pulls back down for flamegraphs One flamegraph per executor, can be combined into one graph https://github.com/falloutdurham/spark-flame
  • 42. [Figure: example flame graph. Visible frames include JVM Interpreter and libjvm.so frames, scala/collection/AbstractIterator:::foreach, scala/collection/Iterator$class:::foreach and scala/collection/generic/Growable$class:::$plus$plus$eq]
  • 43. Performance! How Many Executors Should We Use? How Much Memory Do We Need? What About Garbage Collection?
  • 44. Executors Hello Goldilocks again! Small number of large executors = long GC pauses, low parallelism Large number of small executors = frequent GC, memory errors, other Bad Things
  • 45. So How Many? Stay between 3-5 cores per executor 64GB is a decent upper memory limit per executor Remember the driver needs memory and cores too! Experiment, experiment, experiment
  • 46. GC Use G1GC! Use UI to spot time spent in GC and then turn GC logging on (or have it on anyway!) Too many major GCs? Lower spark.memory.fraction (to cache less) or increase Executor memory Lots of minor GCs? Increase Eden space Try other approaches before digging into the GC weeds
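The executor and GC advice on slides 44 to 46 can be written down as a small settings calculation. This is a sketch under stated assumptions: the node size (16 cores, 128 GB) is hypothetical, the 10% memory margin is my assumption and not from the talk, and `ExecutorSizing` is an illustrative name. The configuration keys themselves are standard Spark properties.

```scala
object ExecutorSizing {
  // Hypothetical worker node: 16 cores and 128 GB available to containers.
  val coresPerNode = 16
  val memoryPerNodeGb = 128

  // Slide guidance: 3-5 cores per executor, at most about 64 GB each.
  val coresPerExecutor = 4
  val executorsPerNode: Int = coresPerNode / coresPerExecutor

  // Leave roughly 10% for off-heap overhead (an assumption, not from the talk).
  val executorMemoryGb: Int =
    math.min(64, (memoryPerNodeGb / executorsPerNode * 0.9).toInt)

  val settings: Seq[(String, String)] = Seq(
    "spark.executor.cores" -> coresPerExecutor.toString,
    "spark.executor.memory" -> s"${executorMemoryGb}g",
    // G1 plus GC logging; these logging flags are the JDK 8 spellings.
    "spark.executor.extraJavaOptions" ->
      "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
  )
}
```

The pairs can be applied with `new SparkConf().setAll(ExecutorSizing.settings)` or passed as `--conf` flags. They are a starting point for the "experiment, experiment, experiment" step, not a recommendation for any particular cluster.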
  • 47. Other Frameworks Are Available Apache Storm / Heron Apache Flink Apache Apex Apache Beam Apache Kafkaaaaaaaaa?
  • 48. Apache Storm Low-latency at a level Spark Streaming can't (currently) touch Lower-level API (build the DAG yourself, peasant!) Deploying and HA story has not been wonderful in the past, but is getting much better! Mature and battle-tested at Twitter, Yahoo!, etc. 1.0.x series is very solid - lower memory use, much faster. Has slightly undeserved reputation as 'old man' of stream processing
  • 49. Heron Built by Twitter to work around issues with Storm Storm-compatible API Works with Apache Mesos (YARN support is coming) Looks very promising as a next-gen Storm (but Apache Storm has also solved a lot of the issues Twitter had, so *shrug*)
  • 50. Apache Flink Higher-level API like Spark Based around streaming rather than Spark's 'batch' focus Getting traction in places like Uber, Netflix Definitely worth investigating
  • 51. Apache Apex Dark Horse of the DAG processing engines Low-level API like Storm FAST. Comes with an amazing array of lego bricks to assemble your pipelines (want to pipe data from FTP and Splunk into HBase? Easy with Malhar!) Documentation sometimes lacking Used by Capital One
  • 52. Apache Beam One API to bring them all and in the darkness bind them. Initiative from Google - write your code using the Apache Beam API and then you can run that code on: Google Cloud Dataflow Apache Spark Apache Flink (more to come)
  • 53. Apache Beam In theory, this is great! But… The favoured runner is, obviously, Google Cloud Dataflow. Last time I checked, the Apache Spark runner operated only in terms of RDDs, thus bypassing Catalyst/Datasets and all the performance boosts associated with them I'd recommend Beam if you're shopping around for a framework!
  • 54. Kafkaaaaaaaaaaaa? Wait, wait, what? Kafka Connect as an alternative to Spark Streaming for ETL Kafka Streams for streaming processing HA by using Kafka itself! Streams is very new. Should consider Connect for ETL rather than a Kafka / Spark solution
  • 55. Finally… Apache Spark is great! But it can require stepping outside the box at scale. Lots of tuning! Test things! Monitor things! Go do great things!
  • 56. Zine Inspired by Julia Evans (@b0rk), I've made a zine! 30 copies! Includes bonus material! PDF: http://snappishproductions.com/files/sparklife.pdf