"Out of the box, Spark provides rich and extensive APIs for performing in memory, large-scale computation across data. Once a system has been built and tuned with Spark Datasets/Dataframes/RDDs, have you ever been left wondering if you could push the limits of Spark even further? In this session, we will cover some of the tips learned while building retail-scale systems at Target to maximize the parallelization that you can achieve from Spark in ways that may not be obvious from current documentation. Specifically, we will cover multithreading the Spark driver with Scala Futures to enable parallel job submission. We will talk about developing custom partitioners to leverage the ability to apply operations across understood chunks of data and what tradeoffs that entails. We will also dive into strategies for parallelizing scripts with Spark that might have nothing to with Spark to support environments where peers work in multiple languages or perhaps a different language/library is just the best thing to get the job done. Come learn how to squeeze every last drop out of your Spark job with strategies for parallelization that go off the beaten path.
"
3. What This Talk is About
• Tips for parallelizing with Spark
• Lots of (Scala) code examples
• Focus on Scala programming constructs
4. Who am I
• Lead Data Engineer/Scientist at Target since 2016
• Deep love of all things Target
• Other Spark Summit talks:
o 2018: Extending Apache Spark APIs Without Going Near Spark Source Or A Compiler
o 2019: Lessons In Linear Algebra At Scale With Apache Spark: Let's Make The Sparse Details A Bit More Dense
9. Parallel Job Submission and Schedulers
Let's do some data exploration
• We have a system of Authors, Articles, and Comments on those Articles
• We would like to do some simple data exploration as part of a batch job
• We execute this code in a built jar through spark-submit on a cluster with 100 executors, 5 executor cores, 10GB for the driver, and 10GB per executor (a sketch of this submission follows below)
• What happens in Spark when we kick off the exploration?
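As a sketch, a submission of that shape might look like the following; the main class, jar name, and cluster manager are placeholders, not the actual job from the talk:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.ArticleExploration \
  --num-executors 100 \
  --executor-cores 5 \
  --driver-memory 10g \
  --executor-memory 10g \
  article-exploration-assembly.jar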
10. Parallel Job Submission and Schedulers
The Execution Starts
• One job is kicked off at a time.
• We asked a few independent questions in our exploration. Why can't they be running at the same time?
11. Parallel Job Submission and Schedulers
The Execution Completes
• All of our questions run as separate jobs.
• Examining the timing demonstrates that these jobs run serially (a minimal sketch of the exploration follows below).
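A minimal sketch of such an exploration; the input paths, schemas, and the specific questions are illustrative rather than the exact code from the talk:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

object SerialExploration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SerialExploration").getOrCreate()

    // Hypothetical inputs.
    val articles = spark.read.parquet("/data/articles")
    val authors  = spark.read.parquet("/data/authors")
    val comments = spark.read.parquet("/data/comments")

    // Three independent questions. Each action becomes its own Spark job, and
    // submitted from a single driver thread they run one after another.
    val articleCount = articles.count()
    val authorCount  = authors.select("authorId").distinct().count()
    val topCommented = comments.groupBy("articleId").count()
      .orderBy(desc("count")).limit(10).collect()

    println(s"$articleCount articles, $authorCount authors, " +
      s"most commented: ${topCommented.mkString(", ")}")
    spark.stop()
  }
}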
13. Parallel Job Submission and Schedulers
Can we potentially speed up our exploration?
• Spark turns our questions into 3 Jobs
• The Jobs run serially
• We notice that some of our questions are independent. Can they be run at the same time?
• The answer is yes. We can leverage Scala concurrency features and the Spark Scheduler to achieve this…
14. Parallel Job Submission and Schedulers
Scala Futures
• A placeholder for a value that may not exist
• Asynchronous
• Requires an ExecutionContext
• Use Await to block
• Extremely flexible syntax. Supports for-comprehension chaining to manage dependencies (a small Spark-free sketch follows below).
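A small, Spark-free sketch of those mechanics; the values and the timeout are arbitrary:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global // the implicit ExecutionContext

// A Future is a placeholder for a value that may not exist yet; its body runs
// asynchronously on the ExecutionContext.
val answer: Future[Int] = Future {
  Thread.sleep(500)
  42
}

// for-comprehensions chain dependent asynchronous steps.
val doubled: Future[Int] = for (a <- answer) yield a * 2

// Await blocks the current thread until the result is ready (or the timeout hits).
val result: Int = Await.result(doubled, 10.seconds)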
15. Parallel Job Submission and Schedulers
Let's rework our original code using Scala Futures to parallelize job submission
• We pull in a reference to an implicit ExecutionContext
• We wrap each of our questions in a Future block to be run asynchronously
• We block on all of our asynchronous questions completing
• We properly shut down the ExecutorService when the job is complete (all of these steps appear in the sketch below)
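A hedged sketch of that rework; the dataset names, paths, questions, and thread-pool size are assumptions rather than the talk's exact code:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParallelExploration").getOrCreate()
val articles = spark.read.parquet("/data/articles") // illustrative inputs
val authors  = spark.read.parquet("/data/authors")
val comments = spark.read.parquet("/data/comments")

// A dedicated pool for driver-side job submission, sized to the number of
// jobs we want in flight at once.
val pool = Executors.newFixedThreadPool(3)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// Each question is wrapped in a Future block, so the three actions are
// submitted to Spark concurrently instead of serially.
val articleCountF = Future { articles.count() }
val authorCountF  = Future { authors.select("authorId").distinct().count() }
val commentCountF = Future { comments.count() }

// Block until all of the asynchronous questions have completed.
val all = for {
  articleCount <- articleCountF
  authorCount  <- authorCountF
  commentCount <- commentCountF
} yield (articleCount, authorCount, commentCount)
val (articleCount, authorCount, commentCount) = Await.result(all, 1.hour)

// Shut down the ExecutorService once the work is complete.
pool.shutdown()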
16. Parallel Job Submission and Schedulers
Our questions are now asked concurrently
• All of our questions run as separate jobs.
• Examining the timing demonstrates that these jobs now run concurrently.
18. Parallel Job Submission and Schedulers
A note about Spark Schedulers
• The default scheduler is FIFO
• Starting in Spark 0.8, fair sharing became available, aka the Fair Scheduler
• Fair Scheduling makes resources available to all queued Jobs
• Turn on Fair Scheduling through SparkSession config and a supporting allocation pool config (see the sketch below)
• Threads that submit Spark Jobs should specify which scheduler pool to use if it's not the default
Reference: https://spark.apache.org/docs/2.2.0/job-scheduling.html
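A sketch of what that configuration can look like; the pool name and the allocation file path are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FairSchedulingExample")
  .config("spark.scheduler.mode", "FAIR")
  // Optional XML file defining named pools with their weights and minShares.
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

// Jobs submitted from this thread now go to the "exploration" pool instead of
// the default pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "exploration")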
20. Parallel Job Submission and Schedulers
Creating a DAG of Futures on the Driver
• Scala Futures syntax enables for-comprehensions to represent dependencies between asynchronous operations
• Spark code can be structured with Futures to represent a DAG of work on the Driver (as sketched below)
• When reworking all code into Futures, there will be some redundancy with Spark's own role in planning and optimizing, but Spark handles this without issue
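A sketch of such a driver-side DAG; the datasets and the dependency structure are illustrative:

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession

implicit val ec: ExecutionContext = ExecutionContext.global
val spark = SparkSession.builder().appName("FutureDag").getOrCreate()

// The two loads start immediately and run as concurrent jobs; the join step
// only runs once both of its inputs complete, forming a small DAG of work on
// the driver.
val articlesF = Future {
  val df = spark.read.parquet("/data/articles").cache()
  df.count() // materialize the cache as its own job
  df
}
val authorsF = Future {
  val df = spark.read.parquet("/data/authors").cache()
  df.count()
  df
}

val reportF = for {
  articles <- articlesF
  authors  <- authorsF
} yield articles.join(authors, "authorId").groupBy("authorId").count().collect()

val report = Await.result(reportF, 1.hour)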
21. Parallel Job Submission and Schedulers
Why use this strategy?
• To maximize resource utilization in your cluster
• To maximize the concurrency potential of your job (and thus speed/efficiency)
• Fair Scheduling pools can support different notions of priority of work in jobs
• Fair Scheduling pools can support multi-user environments to enable more even resource allocation in a shared cluster
Takeaways
• Actions trigger Spark to do things (i.e. create Jobs)
• Spark can certainly handle running multiple Jobs at once; you just have to tell it to
• This can be accomplished by multithreading the driver. In Scala, this can be done using Futures.
• The way tasks are executed when multiple jobs are running at once can be further configured through either Spark's FIFO or Fair Scheduler with configured supporting pools.
24. Partitioning Strategies
Getting started with partitioning
• .repartition() vs .coalesce()
• Custom partitioning is supported with the RDD API only (specifically through the implicitly added PairRDDFunctions)
• Spark supports the HashPartitioner and RangePartitioner out of the box
• One can create custom partitioners by extending Partitioner to enable custom strategies in grouping data (see the sketch below)
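A minimal custom Partitioner sketch; the author-id keying and partition count are illustrative:

import org.apache.spark.Partitioner

// Routes every record for a given author id to the same partition.
class AuthorIdPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val mod = (key.asInstanceOf[Long] % numPartitions).toInt
    if (mod < 0) mod + numPartitions else mod
  }
}

// Applied through the implicitly added PairRDDFunctions on a key/value RDD,
// e.g. articlesRdd.keyBy(_.authorId).partitionBy(new AuthorIdPartitioner(200))

A production partitioner would also typically override equals and hashCode so Spark can recognize when two RDDs are already co-partitioned.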
25. Partitioning Strategies
How can non-standard partitioning be useful?
#1: Collocating data for joins
• We are joining datasets of Articles and Authors together by the Author's id.
• When we pull the raw Article dataset, author ids are likely to be distributed somewhat randomly across partitions.
• Joins can be considered wide transformations depending on the underlying data and could result in full shuffles.
• We can cut down on the impact of the shuffle stage by collocating data within partitions by the id we join on, so there is less cross chatter during this phase (see the sketch below).
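A sketch of that co-location with the built-in HashPartitioner; the case classes and inputs are illustrative:

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

case class Article(articleId: Long, authorId: Long, title: String)
case class Author(id: Long, name: String)

val spark = SparkSession.builder().appName("CoLocatedJoin").getOrCreate()
val sc = spark.sparkContext

// Illustrative inputs; a real job would read these from storage.
val articlesRdd = sc.parallelize(Seq(Article(1L, 10L, "A"), Article(2L, 11L, "B")))
val authorsRdd  = sc.parallelize(Seq(Author(10L, "Ada"), Author(11L, "Grace")))

// Key both sides by the author id and give them the same partitioner, so the
// join sees co-partitioned inputs and avoids shuffling either side again.
val partitioner = new HashPartitioner(200)
val articlesByAuthor = articlesRdd.keyBy(_.authorId).partitionBy(partitioner)
val authorsById      = authorsRdd.keyBy(_.id).partitionBy(partitioner)

val joined = articlesByAuthor.join(authorsById)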
27. Partitioning Strategies
How can non-standard partitioning be useful?
#2: Grouping data to operate on partitions as a whole
• We need to calculate an Author Summary report that needs access to all Articles for an Author to generate meaningful overall metrics
• We could leverage .map and .reduceByKey to combine Articles for analysis in a pairwise fashion or by gathering groups for processing
• Operating on a whole partition grouped by Author also accomplishes this goal (see the sketch below)
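A sketch of that whole-partition approach, reusing the Article records and the AuthorIdPartitioner from the earlier sketches; the summary metrics are illustrative:

// After partitionBy, all Articles for an author land in one partition, so
// each partition can be summarized locally without a further shuffle.
val byAuthor = articlesRdd.keyBy(_.authorId).partitionBy(new AuthorIdPartitioner(200))

val authorSummaries = byAuthor.mapPartitions { iter =>
  // Note: this materializes the partition in memory, which is the tradeoff
  // for getting access to all of an author's Articles at once.
  iter.toSeq
    .groupBy { case (authorId, _) => authorId }
    .iterator
    .map { case (authorId, rows) =>
      val articles = rows.map { case (_, article) => article }
      (authorId, articles.size, articles.map(_.title).mkString("; "))
    }
}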
29. Partitioning Strategies
Takeaways
• Partitioning can help even out data skew for more reliable and performant processing.
• The RDD API supports more fine-grained partitioning with the Hash and Range Partitioners.
• One can implement a custom partitioner to gain even more control over how data is grouped, which creates opportunities for more performant joins and for operations on partitions as a whole.
• There is expense involved in repartitioning that has to be balanced against the cost of an operation on less organized data.
31. Distributing More Than Just Data
Typical Spark Usage Patterns
• Load data from a store into a Dataset/DataFrame/RDD
• Apply various transformations and actions to explore the data
• Build increasingly complex transformations by leveraging Spark's flexibility in accepting functions into API calls
• What are the limits of these transformations, and how can we move past them? Can we distribute more complex computation?
32. Distributing More Than Just Data
#1: Distributing Scripts
• It is often useful to support third-party libraries or scripts from peers who like or need to work in different languages to accomplish Data Science goals.
• Oftentimes, these scripts have nothing to do with Spark, and language bindings for libraries might not work as well as expected when called directly from Spark due to serialization constraints (among other things).
• One can distribute scripts to be executed within Spark by leveraging Scala's scala.sys.process package and Spark's file moving capability (see the sketch below).
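A hedged sketch of that approach; the script name, its input/output protocol, and the upstream keyed RDD (byAuthor, from the partitioning sketch) are assumptions:

import java.io.ByteArrayInputStream
import org.apache.spark.SparkFiles
import scala.sys.process._

// Ship the script with the job so every executor has a local copy of it.
spark.sparkContext.addFile("/jobs/scripts/author_summary.py")

val scriptOutput = byAuthor.mapPartitions { iter =>
  // Resolve the script's local path on this executor.
  val scriptPath = SparkFiles.get("author_summary.py")
  // Feed the whole partition to the script over stdin and capture its stdout.
  val partitionInput = iter
    .map { case (authorId, article) => s"$authorId\t${article.title}" }
    .mkString("\n")
  val stdin  = new ByteArrayInputStream(partitionInput.getBytes("UTF-8"))
  val stdout = (Seq("python3", scriptPath) #< stdin).!!
  stdout.split("\n").iterator
}

Running the process once per partition, rather than once per record, keeps the process launch overhead bounded by the number of partitions.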
33. Distributing More Than Just Data
scala.sys.process
• A package that handles the execution of external processes
• Provides a concise DSL for running and chaining processes (illustrated below)
• Blocks until the process is complete
• Scripts and commands can be interfaced with through all of the usual means (reading stdin, reading local files, writing stdout, writing local files)
Reference: https://www.scala-lang.org/api/2.11.7/#scala.sys.process.ProcessBuilder
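A small, Spark-free illustration of the DSL; the commands assume a Unix-like environment:

import scala.sys.process._

// .!! runs the command, blocks until it exits, and returns its stdout as a String.
val listing: String = "ls -l".!!

// Processes can be chained like a shell pipeline; .! blocks and returns the exit code.
val exitCode: Int = ("cat /etc/hosts" #| "grep localhost").!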
34. Distributing More Than Just Data
Gotchas
• Make sure the resources provisioned on executors are suitable to handle the external process to be run.
• Make sure the external process is built for the architecture that your cluster runs on and that all necessary dependencies are available.
• When running more than one executor core, make sure the external process can handle having multiple instances of itself running on the same container.
• When communicating with your process through the file system, watch out for file system collisions and be cognizant of cleaning up state.
35. Distributing More Than Just Data
#2: Distributing Data Gathering
• APIs have become a very prevalent way of providing data to systems
• A common operation is gathering data (often JSON) from an API for some entity
• It is possible to leverage Spark APIs to gather this data thanks to the flexibility of passing functions to transformations (see the sketch below)
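A hedged sketch of that pattern; the endpoint is a placeholder, and scala.io.Source is used only to keep the example dependency-free (a real job would likely use a proper HTTP client):

import scala.io.Source
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DistributedGather").getOrCreate()

// Illustrative set of entity ids to look up.
val authorIds = spark.sparkContext.parallelize(1L to 1000L)

// The partition count caps how many concurrent connections hit the API,
// since each task works through its partition sequentially.
val authorJson = authorIds.repartition(20).mapPartitions { ids =>
  ids.map { id =>
    val source = Source.fromURL(s"https://api.example.com/authors/$id")
    try (id, source.mkString) finally source.close()
  }
}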
36. Distributing More Than Just Data
Gotchas
• Make sure to carefully manage the number of concurrent connections being opened to the APIs being used.
• There are always going to be intermittent blips when hitting APIs at scale. Don't forget to use thorough error handling and retry logic (one possible shape is sketched below).
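One possible shape for that retry logic; the attempt count and backoff values are arbitrary:

import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Retries a by-name call a fixed number of times, backing off between attempts.
@tailrec
def withRetry[T](attempts: Int, delayMs: Long)(call: => T): T =
  Try(call) match {
    case Success(value)             => value
    case Failure(_) if attempts > 1 =>
      Thread.sleep(delayMs)
      withRetry(attempts - 1, delayMs * 2)(call)
    case Failure(error)             => throw error
  }

// e.g. wrap each API call: withRetry(3, 500)(Source.fromURL(url).mkString)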
37. Distributing More Than Just Data
Takeaways
• The flexibility of Spark's APIs allows the framework to be used for more than the typical workflow of applying relatively small functions to distributed data.
• One can distribute scripts to be run as external processes, using data contained in partitions to build inputs and subsequently piping outputs back into Spark APIs.
• One can distribute calls to APIs to gather data and use Spark mechanisms to control the load on these external sources.
39. Come Work At Target
• We are hiring in Data Science and Data Engineering
• Solve real-world problems in domains ranging from supply chain logistics to smart stores to personalization and more
• Offices in…
o Sunnyvale, CA
o Minneapolis, MN
o Pittsburgh, PA
o Bangalore, India
jobs.target.com
40. Target @ Spark+AI Summit
Check out our other talks…
2018
• Extending Apache Spark APIs Without Going Near Spark Source Or A Compiler (Anna Holschuh)
2019
• Apache Spark Data Validation (Doug Balog and Patrick Pisciuneri)
• Lessons In Linear Algebra At Scale With Apache Spark: Let's make the sparse details a bit more dense (Anna Holschuh)