"Out of the box, Spark provides rich and extensive APIs for performing in memory, large-scale computation across data. Once a system has been built and tuned with Spark Datasets/Dataframes/RDDs, have you ever been left wondering if you could push the limits of Spark even further? In this session, we will cover some of the tips learned while building retail-scale systems at Target to maximize the parallelization that you can achieve from Spark in ways that may not be obvious from current documentation. Specifically, we will cover multithreading the Spark driver with Scala Futures to enable parallel job submission. We will talk about developing custom partitioners to leverage the ability to apply operations across understood chunks of data and what tradeoffs that entails. We will also dive into strategies for parallelizing scripts with Spark that might have nothing to with Spark to support environments where peers work in multiple languages or perhaps a different language/library is just the best thing to get the job done. Come learn how to squeeze every last drop out of your Spark job with strategies for parallelization that go off the beaten path.
"
3. What This Talk is About
• Tips for parallelizing with Spark
• Lots of (Scala) code examples
• Focus on Scala programming constructs
4. Who am I
• Lead Data Engineer/Scientist at Target since 2016
• Deep love of all things Target
• Other Spark Summit talks:
o 2018: Extending Apache Spark APIs Without Going Near Spark Source Or A Compiler
o 2019: Lessons In Linear Algebra At Scale With Apache Spark: Let's Make The Sparse Details A Bit More Dense
9. Parallel Job Submission and Schedulers
Let's do some data exploration
• We have a system of Authors, Articles, and Comments on those Articles
• We would like to do some simple data exploration as part of a batch job
• We execute this code in a built jar through spark-submit on a cluster with 100 executors, 5 executor cores, 10GB for the driver, and 10GB per executor (a sketch of this submission follows below)
• What happens in Spark when we kick off the exploration?
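As a sketch, a submission of that shape might look like the following; the main class, jar name, and cluster manager are placeholders, not the actual job from the talk:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.ArticleExploration \
  --num-executors 100 \
  --executor-cores 5 \
  --driver-memory 10g \
  --executor-memory 10g \
  article-exploration-assembly.jar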
10. Parallel Job Submission and Schedulers
The Execution Starts
• One job is kicked off at a time.
• We asked a few independent questions in our exploration. Why can't they be running at the same time?
11. Parallel Job Submission and Schedulers
The Execution Completes
• All of our questions run as separate jobs.
• Examining the timing demonstrates that these jobs run serially (a minimal sketch of the exploration follows below).
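A minimal sketch of such an exploration; the input paths, schemas, and the specific questions are illustrative rather than the exact code from the talk:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

object SerialExploration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SerialExploration").getOrCreate()

    // Hypothetical inputs.
    val articles = spark.read.parquet("/data/articles")
    val authors  = spark.read.parquet("/data/authors")
    val comments = spark.read.parquet("/data/comments")

    // Three independent questions. Each action becomes its own Spark job, and
    // submitted from a single driver thread they run one after another.
    val articleCount = articles.count()
    val authorCount  = authors.select("authorId").distinct().count()
    val topCommented = comments.groupBy("articleId").count()
      .orderBy(desc("count")).limit(10).collect()

    println(s"$articleCount articles, $authorCount authors, " +
      s"most commented: ${topCommented.mkString(", ")}")
    spark.stop()
  }
}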
13. Parallel Job Submission and Schedulers
Can we potentially speed up our exploration?
• Spark turns our questions into 3 Jobs
• The Jobs run serially
• We notice that some of our questions are independent. Can they be run at the same time?
• The answer is yes. We can leverage Scala concurrency features and the Spark Scheduler to achieve this…
14. Parallel Job Submission and Schedulers
Scala Futures
• A placeholder for a value that may not exist
• Asynchronous
• Requires an ExecutionContext
• Use Await to block
• Extremely flexible syntax. Supports for-comprehension chaining to manage dependencies (a small Spark-free sketch follows below).
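A small, Spark-free sketch of those mechanics; the values and the timeout are arbitrary:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global // the implicit ExecutionContext

// A Future is a placeholder for a value that may not exist yet; its body runs
// asynchronously on the ExecutionContext.
val answer: Future[Int] = Future {
  Thread.sleep(500)
  42
}

// for-comprehensions chain dependent asynchronous steps.
val doubled: Future[Int] = for (a <- answer) yield a * 2

// Await blocks the current thread until the result is ready (or the timeout hits).
val result: Int = Await.result(doubled, 10.seconds)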
15. Parallel Job Submission and Schedulers
Let's rework our original code using Scala Futures to parallelize job submission
• We pull in a reference to an implicit ExecutionContext
• We wrap each of our questions in a Future block to be run asynchronously
• We block on all of our asynchronous questions completing
• We properly shut down the ExecutorService when the job is complete (all of these steps appear in the sketch below)
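A hedged sketch of that rework; the dataset names, paths, questions, and thread-pool size are assumptions rather than the talk's exact code:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParallelExploration").getOrCreate()
val articles = spark.read.parquet("/data/articles") // illustrative inputs
val authors  = spark.read.parquet("/data/authors")
val comments = spark.read.parquet("/data/comments")

// A dedicated pool for driver-side job submission, sized to the number of
// jobs we want in flight at once.
val pool = Executors.newFixedThreadPool(3)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// Each question is wrapped in a Future block, so the three actions are
// submitted to Spark concurrently instead of serially.
val articleCountF = Future { articles.count() }
val authorCountF  = Future { authors.select("authorId").distinct().count() }
val commentCountF = Future { comments.count() }

// Block until all of the asynchronous questions have completed.
val all = for {
  articleCount <- articleCountF
  authorCount  <- authorCountF
  commentCount <- commentCountF
} yield (articleCount, authorCount, commentCount)
val (articleCount, authorCount, commentCount) = Await.result(all, 1.hour)

// Shut down the ExecutorService once the work is complete.
pool.shutdown()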
16. Parallel Job Submission and Schedulers
Our questions are now asked concurrently
• All of our questions run as separate jobs.
• Examining the timing demonstrates that these jobs now run concurrently.
18. Parallel Job Submission and Schedulers
A note about Spark Schedulers
• The default scheduler is FIFO
• Starting in Spark 0.8, fair sharing became available, aka the Fair Scheduler
• Fair Scheduling makes resources available to all queued Jobs
• Turn on Fair Scheduling through SparkSession config and a supporting allocation pool config (see the sketch below)
• Threads that submit Spark Jobs should specify which scheduler pool to use if it's not the default
Reference: https://spark.apache.org/docs/2.2.0/job-scheduling.html
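A sketch of what that configuration can look like; the pool name and the allocation file path are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FairSchedulingExample")
  .config("spark.scheduler.mode", "FAIR")
  // Optional XML file defining named pools with their weights and minShares.
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

// Jobs submitted from this thread now go to the "exploration" pool instead of
// the default pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "exploration")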
20. Parallel Job Submission and Schedulers
Creating a DAG of Futures on the Driver
• Scala Futures syntax enables for-comprehensions to represent dependencies between asynchronous operations
• Spark code can be structured with Futures to represent a DAG of work on the Driver (as sketched below)
• When reworking all code into Futures, there will be some redundancy with Spark's own role in planning and optimizing, but Spark handles this without issue
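A sketch of such a driver-side DAG; the datasets and the dependency structure are illustrative:

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession

implicit val ec: ExecutionContext = ExecutionContext.global
val spark = SparkSession.builder().appName("FutureDag").getOrCreate()

// The two loads start immediately and run as concurrent jobs; the join step
// only runs once both of its inputs complete, forming a small DAG of work on
// the driver.
val articlesF = Future {
  val df = spark.read.parquet("/data/articles").cache()
  df.count() // materialize the cache as its own job
  df
}
val authorsF = Future {
  val df = spark.read.parquet("/data/authors").cache()
  df.count()
  df
}

val reportF = for {
  articles <- articlesF
  authors  <- authorsF
} yield articles.join(authors, "authorId").groupBy("authorId").count().collect()

val report = Await.result(reportF, 1.hour)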
21. Parallel Job Submission and Schedulers
Why use this strategy?
• To maximize resource utilization in your cluster
• To maximize the concurrency potential of your job (and thus speed/efficiency)
• Fair Scheduling pools can support different notions of priority of work in jobs
• Fair Scheduling pools can support multi-user environments to enable more even resource allocation in a shared cluster
Takeaways
• Actions trigger Spark to do things (i.e. create Jobs)
• Spark can certainly handle running multiple Jobs at once; you just have to tell it to
• This can be accomplished by multithreading the driver. In Scala, this can be done using Futures.
• The way tasks are executed when multiple jobs are running at once can be further configured through either Spark's FIFO or Fair Scheduler with configured supporting pools.
24. Partitioning Strategies
Getting started with partitioning
• .repartition() vs .coalesce()
• Custom partitioning is supported with the RDD API only (specifically through the implicitly added PairRDDFunctions)
• Spark supports the HashPartitioner and RangePartitioner out of the box
• One can create custom partitioners by extending Partitioner to enable custom strategies in grouping data (see the sketch below)
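A minimal custom Partitioner sketch; the author-id keying and partition count are illustrative:

import org.apache.spark.Partitioner

// Routes every record for a given author id to the same partition.
class AuthorIdPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val mod = (key.asInstanceOf[Long] % numPartitions).toInt
    if (mod < 0) mod + numPartitions else mod
  }
}

// Applied through the implicitly added PairRDDFunctions on a key/value RDD,
// e.g. articlesRdd.keyBy(_.authorId).partitionBy(new AuthorIdPartitioner(200))

A production partitioner would also typically override equals and hashCode so Spark can recognize when two RDDs are already co-partitioned.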
25. Partitioning Strategies
How can non-standard partitioning be useful?
#1: Collocating data for joins
• We are joining datasets of Articles and Authors together by the Author's id.
• When we pull the raw Article dataset, author ids are likely to be distributed somewhat randomly across partitions.
• Joins can be considered wide transformations depending on the underlying data and could result in full shuffles.
• We can cut down on the impact of the shuffle stage by collocating data within partitions by the id we join on, so there is less cross chatter during this phase (see the sketch below).
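A sketch of that co-location with the built-in HashPartitioner; the case classes and inputs are illustrative:

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

case class Article(articleId: Long, authorId: Long, title: String)
case class Author(id: Long, name: String)

val spark = SparkSession.builder().appName("CoLocatedJoin").getOrCreate()
val sc = spark.sparkContext

// Illustrative inputs; a real job would read these from storage.
val articlesRdd = sc.parallelize(Seq(Article(1L, 10L, "A"), Article(2L, 11L, "B")))
val authorsRdd  = sc.parallelize(Seq(Author(10L, "Ada"), Author(11L, "Grace")))

// Key both sides by the author id and give them the same partitioner, so the
// join sees co-partitioned inputs and avoids shuffling either side again.
val partitioner = new HashPartitioner(200)
val articlesByAuthor = articlesRdd.keyBy(_.authorId).partitionBy(partitioner)
val authorsById      = authorsRdd.keyBy(_.id).partitionBy(partitioner)

val joined = articlesByAuthor.join(authorsById)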
27. Partitioning Strategies
How can non-standard partitioning be useful?
#2: Grouping data to operate on partitions as a whole
• We need to calculate an Author Summary report that needs access to all Articles for an Author to generate meaningful overall metrics
• We could leverage .map and .reduceByKey to combine Articles for analysis in a pairwise fashion or by gathering groups for processing
• Operating on a whole partition grouped by Author also accomplishes this goal (see the sketch below)
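A sketch of that whole-partition approach, reusing the Article records and the AuthorIdPartitioner from the earlier sketches; the summary metrics are illustrative:

// After partitionBy, all Articles for an author land in one partition, so
// each partition can be summarized locally without a further shuffle.
val byAuthor = articlesRdd.keyBy(_.authorId).partitionBy(new AuthorIdPartitioner(200))

val authorSummaries = byAuthor.mapPartitions { iter =>
  // Note: this materializes the partition in memory, which is the tradeoff
  // for getting access to all of an author's Articles at once.
  iter.toSeq
    .groupBy { case (authorId, _) => authorId }
    .iterator
    .map { case (authorId, rows) =>
      val articles = rows.map { case (_, article) => article }
      (authorId, articles.size, articles.map(_.title).mkString("; "))
    }
}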
29. Partitioning Strategies
Takeaways
• Partitioning can help even out data skew for more reliable and performant processing.
• The RDD API supports more fine-grained partitioning with the Hash and Range Partitioners.
• One can implement a custom partitioner to gain even more control over how data is grouped, which creates opportunities for more performant joins and for operations on partitions as a whole.
• There is expense involved in repartitioning that has to be balanced against the cost of an operation on less organized data.
31. Distributing More Than Just Data
Typical Spark Usage Patterns
• Load data from a store into a Dataset/DataFrame/RDD
• Apply various transformations and actions to explore the data
• Build increasingly complex transformations by leveraging Spark's flexibility in accepting functions into API calls
• What are the limits of these transformations, and how can we move past them? Can we distribute more complex computation?
32. Distributing More Than Just Data
#1: Distributing Scripts
• It is often useful to support third-party libraries or scripts from peers who like or need to work in different languages to accomplish Data Science goals.
• Oftentimes, these scripts have nothing to do with Spark, and language bindings for libraries might not work as well as expected when called directly from Spark due to serialization constraints (among other things).
• One can distribute scripts to be executed within Spark by leveraging Scala's scala.sys.process package and Spark's file moving capability (see the sketch below).
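A hedged sketch of that approach; the script name, its input/output protocol, and the upstream keyed RDD (byAuthor, from the partitioning sketch) are assumptions:

import java.io.ByteArrayInputStream
import org.apache.spark.SparkFiles
import scala.sys.process._

// Ship the script with the job so every executor has a local copy of it.
spark.sparkContext.addFile("/jobs/scripts/author_summary.py")

val scriptOutput = byAuthor.mapPartitions { iter =>
  // Resolve the script's local path on this executor.
  val scriptPath = SparkFiles.get("author_summary.py")
  // Feed the whole partition to the script over stdin and capture its stdout.
  val partitionInput = iter
    .map { case (authorId, article) => s"$authorId\t${article.title}" }
    .mkString("\n")
  val stdin  = new ByteArrayInputStream(partitionInput.getBytes("UTF-8"))
  val stdout = (Seq("python3", scriptPath) #< stdin).!!
  stdout.split("\n").iterator
}

Running the process once per partition, rather than once per record, keeps the process launch overhead bounded by the number of partitions.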
33. Distributing More Than Just Data
scala.sys.process
• A package that handles the execution of external processes
• Provides a concise DSL for running and chaining processes (illustrated below)
• Blocks until the process is complete
• Scripts and commands can be interfaced with through all of the usual means (reading stdin, reading local files, writing stdout, writing local files)
Reference: https://www.scala-lang.org/api/2.11.7/#scala.sys.process.ProcessBuilder
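A small, Spark-free illustration of the DSL; the commands assume a Unix-like environment:

import scala.sys.process._

// .!! runs the command, blocks until it exits, and returns its stdout as a String.
val listing: String = "ls -l".!!

// Processes can be chained like a shell pipeline; .! blocks and returns the exit code.
val exitCode: Int = ("cat /etc/hosts" #| "grep localhost").!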
34. Distributing More Than Just Data
Gotchas
• Make sure the resources provisioned on executors are suitable to handle the external process to be run.
• Make sure the external process is built for the architecture that your cluster runs on and that all necessary dependencies are available.
• When running more than one executor core, make sure the external process can handle having multiple instances of itself running on the same container.
• When communicating with your process through the file system, watch out for file system collisions and be cognizant of cleaning up state.
35. Distributing More Than Just Data
#2: Distributing Data Gathering
• APIs have become a very prevalent way of providing data to systems
• A common operation is gathering data (often JSON) from an API for some entity
• It is possible to leverage Spark APIs to gather this data thanks to the flexibility of passing functions to transformations (see the sketch below)
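A hedged sketch of that pattern; the endpoint is a placeholder, and scala.io.Source is used only to keep the example dependency-free (a real job would likely use a proper HTTP client):

import scala.io.Source
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DistributedGather").getOrCreate()

// Illustrative set of entity ids to look up.
val authorIds = spark.sparkContext.parallelize(1L to 1000L)

// The partition count caps how many concurrent connections hit the API,
// since each task works through its partition sequentially.
val authorJson = authorIds.repartition(20).mapPartitions { ids =>
  ids.map { id =>
    val source = Source.fromURL(s"https://api.example.com/authors/$id")
    try (id, source.mkString) finally source.close()
  }
}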
36. Distributing More Than Just Data
Gotchas
• Make sure to carefully manage the number of concurrent connections being opened to the APIs being used.
• There are always going to be intermittent blips when hitting APIs at scale. Don't forget to use thorough error handling and retry logic (one possible shape is sketched below).
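One possible shape for that retry logic; the attempt count and backoff values are arbitrary:

import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Retries a by-name call a fixed number of times, backing off between attempts.
@tailrec
def withRetry[T](attempts: Int, delayMs: Long)(call: => T): T =
  Try(call) match {
    case Success(value)             => value
    case Failure(_) if attempts > 1 =>
      Thread.sleep(delayMs)
      withRetry(attempts - 1, delayMs * 2)(call)
    case Failure(error)             => throw error
  }

// e.g. wrap each API call: withRetry(3, 500)(Source.fromURL(url).mkString)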
37. Distributing More Than Just Data
Takeaways
• The flexibility of Spark's APIs allows the framework to be used for more than the typical workflow of applying relatively small functions to distributed data.
• One can distribute scripts to be run as external processes, using data contained in partitions to build inputs and subsequently piping outputs back into Spark APIs.
• One can distribute calls to APIs to gather data and use Spark mechanisms to control the load on these external sources.
39. Come Work At Target
• We are hiring in Data Science and Data Engineering
• Solve real-world problems in domains ranging from supply chain logistics to smart stores to personalization and more
• Offices in…
o Sunnyvale, CA
o Minneapolis, MN
o Pittsburgh, PA
o Bangalore, India
jobs.target.com
40. Target @ Spark+AI Summit
Check out our other talks…
2018
• Extending Apache Spark APIs Without Going Near Spark Source Or A Compiler (Anna Holschuh)
2019
• Apache Spark Data Validation (Doug Balog and Patrick Pisciuneri)
• Lessons In Linear Algebra At Scale With Apache Spark: Let's make the sparse details a bit more dense (Anna Holschuh)