WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Anna Holschuh, Target
Parallelizing With Apache
Spark In Unexpected Ways
#UnifiedAnalytics #SparkAISummit
What This Talk is About
• Tips for parallelizing with Spark
• Lots of (Scala) code examples
• Focus on Scala programming constructs
3#UnifiedAnalytics #SparkAISummit
4#UnifiedAnalytics #SparkAISummit
Who am I
• Lead Data Engineer/Scientist at Target since 2016
• Deep love of all things Target
• Other Spark Summit talks:
o 2018: Extending Apache Spark APIs Without Going Near Spark Source Or A
Compiler
o 2019: Lessons In Linear Algebra At Scale With Apache Spark: Let’s Make The Sparse Details A Bit More Dense
5#UnifiedAnalytics #SparkAISummit
Agenda
• Introduction
• Parallel Job Submission and Schedulers
• Partitioning Strategies
• Distributing More Than Just Data
6#UnifiedAnalytics #SparkAISummit
Agenda
• Introduction
• Parallel Job Submission and Schedulers
• Partitioning Strategies
• Distributing More Than Just Data
7#UnifiedAnalytics #SparkAISummit
Introduction
> Hello, Spark!_
Application
Job
Stage
Task
Driver
Executor
Dataset
Dataframe
RDD
Partition
Action
Transformation
Shuffle
8#UnifiedAnalytics #SparkAISummit
Agenda
• Introduction
• Parallel Job Submission and Schedulers
• Partitioning Strategies
• Distributing More Than Just Data
9#UnifiedAnalytics #SparkAISummit
Parallel Job Submission and Schedulers
Let’s do some data exploration
• We have a system of Authors, Articles, and
Comments on those Articles
• We would like to do some simple data
exploration as part of a batch job
• We execute this code as a built jar through
spark-submit on a cluster with 100
executors, 5 cores per executor, 10 GB for the
driver, and 10 GB per executor.
• What happens in Spark when we kick off
the exploration?
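The exploration code itself is not reproduced in this transcript, so the following is a minimal sketch of what such a serial batch job might look like. The case classes, the paths, and the three count-style questions are assumptions standing in for the real domain model.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case classes standing in for the real Authors/Articles/Comments model.
case class Author(id: Long, name: String)
case class Article(id: Long, authorId: Long, title: String)
case class Comment(id: Long, articleId: Long, body: String)

object SerialExploration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("serial-exploration").getOrCreate()
    import spark.implicits._

    // Assume these load from some store; the paths are placeholders.
    val authors: Dataset[Author]   = spark.read.parquet("/data/authors").as[Author]
    val articles: Dataset[Article] = spark.read.parquet("/data/articles").as[Article]
    val comments: Dataset[Comment] = spark.read.parquet("/data/comments").as[Comment]

    // Each action below triggers its own Job; they run one after another on the driver thread.
    val authorCount  = authors.count()
    val articleCount = articles.count()
    val commentCount = comments.count()

    println(s"authors=$authorCount articles=$articleCount comments=$commentCount")
    spark.stop()
  }
}
```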
10#UnifiedAnalytics #SparkAISummit
Parallel Job Submission and Schedulers
The Execution Starts
• One job is kicked off at a time.
• We asked a few independent questions
in our exploration. Why can’t they be
running at the same time?
11#UnifiedAnalytics #SparkAISummit
Parallel Job Submission and Schedulers
The Execution Completes
• All of our questions run as separate
jobs.
• Examining the timing demonstrates that
these jobs run serially.
12#UnifiedAnalytics #SparkAISummit
Parallel Job Submission and Schedulers
One more sanity check
• All of our questions, running serially.
13#UnifiedAnalytics #SparkAISummit
Parallel Job Submission and Schedulers
Can we potentially speed up our
exploration?
• Spark turns our questions into 3 Jobs
• The Jobs run serially
• We notice that some of our questions are
independent. Can they be run at the same
time?
• The answer is yes. We can leverage Scala
Concurrency features and the Spark
Scheduler to achieve this…
14#UnifiedAnalytics #SparkAISummit
Parallel Job Submission and Schedulers
Scala Futures
• A placeholder for a value that may not
exist.
• Asynchronous
• Requires an ExecutionContext
• Use Await to block
• Extremely flexible syntax. Supports for-comprehension
chaining to manage dependencies.
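A minimal, Spark-free sketch of the pieces listed above (an ExecutionContext backed by a thread pool, asynchronous Futures, for-comprehension chaining, and Await); the pool size and values are arbitrary.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object FuturesPrimer {
  def main(args: Array[String]): Unit = {
    // An ExecutionContext backed by a small fixed thread pool.
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    // Futures start running as soon as they are constructed.
    val a: Future[Int] = Future { 40 }
    val b: Future[Int] = Future { 2 }

    // A for-comprehension chains dependent asynchronous results.
    val sum: Future[Int] = for {
      x <- a
      y <- b
    } yield x + y

    // Await blocks the current thread until the value is ready (or the timeout hits).
    println(Await.result(sum, 10.seconds))
    pool.shutdown()
  }
}
```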
15#UnifiedAnalytics #SparkAISummit
Parallel Job Submission
and Schedulers
Let’s rework our original code using
Scala Futures to parallelize Job
Submission
• We pull in a reference to an implicit
ExecutionContext
• We wrap each of our questions in a Future
block to be run asynchronously
• We block on our asynchronous questions all
being completed
• (Not seen) We properly shut down the
ExecutorService when the job is complete
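The reworked code from the deck is not shown in this transcript, so here is a hedged sketch of the rework described above, reusing the hypothetical datasets and placeholder paths from the earlier serial sketch.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession

object ParallelExploration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-exploration").getOrCreate()

    // Thread pool used only to submit Spark Jobs concurrently from the driver.
    val executorService = Executors.newFixedThreadPool(3)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(executorService)

    val authors  = spark.read.parquet("/data/authors")    // placeholder paths
    val articles = spark.read.parquet("/data/articles")
    val comments = spark.read.parquet("/data/comments")

    // Each count() is an action; wrapping it in a Future lets the three Jobs be submitted at once.
    val authorCount  = Future { authors.count() }
    val articleCount = Future { articles.count() }
    val commentCount = Future { comments.count() }

    // Block until all three asynchronous questions have completed.
    val results = Await.result(
      Future.sequence(Seq(authorCount, articleCount, commentCount)),
      1.hour
    )
    println(s"counts: ${results.mkString(", ")}")

    // Shut down the ExecutorService once the work is done (the "(Not seen)" bullet above).
    executorService.shutdown()
    spark.stop()
  }
}
```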
16#UnifiedAnalytics #SparkAISummit
Parallel Job Submission and Schedulers
Our questions are now
asked concurrently
• All of our questions run as separate
jobs.
• Examining the timing demonstrates that
these jobs are now running concurrently.
17#UnifiedAnalytics #SparkAISummit
Parallel Job Submission and Schedulers
One more sanity check
• All of our questions, running concurrently.
18#UnifiedAnalytics #SparkAISummit
Parallel Job Submission and Schedulers
A note about Spark Schedulers
• The default scheduler is FIFO
• Starting in Spark 0.8, Fair sharing became
available, aka the Fair Scheduler
• Fair Scheduling makes resources available to
all queued Jobs
• Turn on Fair Scheduling through SparkSession
config and supporting allocation pool config
• Threads that submit Spark Jobs should
specify what scheduler pool to use if it’s not
the default
Reference: https://spark.apache.org/docs/2.2.0/job-scheduling.html
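A sketch of the configuration described above, following the linked job-scheduling docs; the allocation file path and the pool name "exploration" are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object FairSchedulingSetup {
  def main(args: Array[String]): Unit = {
    // Enable fair sharing when the session is built. Named pools (weight, minShare,
    // per-pool scheduling mode) are defined in an allocation file per the docs above.
    val spark = SparkSession.builder()
      .appName("fair-scheduling")
      .config("spark.scheduler.mode", "FAIR")
      .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // placeholder path
      .getOrCreate()

    // Each thread that submits Jobs picks its pool via a thread-local property.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "exploration") // hypothetical pool name
    // ... actions triggered on this thread now land in the "exploration" pool ...
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)          // revert to the default pool

    spark.stop()
  }
}
```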
19#UnifiedAnalytics #SparkAISummit
Parallel Job Submission and Schedulers
The Fair Scheduler is
enabled
20#UnifiedAnalytics #SparkAISummit
Parallel Job Submission and Schedulers
Creating a DAG of Futures on the
Driver
• Scala Futures syntax enables for-
comprehensions to represent
dependencies in asynchronous operations
• Spark code can be structured with Futures
to represent a DAG of work on the Driver
• When reworking all code into Futures, there
will be some redundancy with Spark’s own role
in planning and optimizing the work, but Spark
handles this overlap without issue
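A sketch of such a driver-side DAG, assuming the same hypothetical datasets; the filter expressions and join columns are made up for illustration.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.{DataFrame, SparkSession}

object DriverDag {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("driver-dag").getOrCreate()

    val executorService = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(executorService)

    val articles = spark.read.parquet("/data/articles")   // placeholder paths
    val comments = spark.read.parquet("/data/comments")

    // Two independent pieces of work, each submitting its own Jobs concurrently.
    val cleanedArticles: Future[DataFrame] = Future {
      val df = articles.filter("title is not null").cache()
      df.count()   // materialize the cache; a Job of its own
      df
    }
    val cleanedComments: Future[DataFrame] = Future {
      val df = comments.filter("body is not null").cache()
      df.count()
      df
    }

    // A third step that depends on both, expressed with a for-comprehension.
    val joinedCount: Future[Long] = for {
      a <- cleanedArticles
      c <- cleanedComments
    } yield a.join(c, a("id") === c("article_id")).count()   // hypothetical join columns

    println(Await.result(joinedCount, 1.hour))
    executorService.shutdown()
    spark.stop()
  }
}
```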
21#UnifiedAnalytics #SparkAISummit
Parallel Job Submission and Schedulers
Why use this strategy?
• To maximize resource utilization in your
cluster
• To maximize the concurrency potential of your
job (and thus speed/efficiency)
• Fair Scheduling pools can support different
notions of priority of work in jobs
• Fair Scheduling pools can support multi-user
environments to enable more even resource
allocation in a shared cluster
Takeaways
• Actions trigger Spark to do things (i.e. create
Jobs)
• Spark can certainly handle running multiple
Jobs at once; you just have to tell it to
• This can be accomplished by multithreading
the driver. In Scala, this can be accomplished
using Futures.
• The way tasks are executed when multiple
jobs are running at once can be further
configured through either Spark’s FIFO or Fair
Scheduler with configured supporting pools.
22#UnifiedAnalytics #SparkAISummit
Agenda
• Introduction
• Parallel Job Submission and Schedulers
• Partitioning Strategies
• Distributing More Than Just Data
23#UnifiedAnalytics #SparkAISummit
Partitioning Strategies
A first experience with partitioning
24#UnifiedAnalytics #SparkAISummit
Partitioning Strategies
Getting started with partitioning
• .repartition() vs .coalesce()
• Custom partitioning is supported with the
RDD API only (specifically through
implicitly added PairRDDFunctions)
• Spark supports the HashPartitioner and
RangePartitioner out of the box
• One can create custom partitioners by
extending Partitioner to enable custom
strategies in grouping data
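A short sketch of the APIs named above; the author_id column name and the partition counts are assumptions.

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}
import org.apache.spark.sql.SparkSession

object PartitioningBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioning-basics").getOrCreate()

    val articles = spark.read.parquet("/data/articles")   // placeholder path

    // Dataset/DataFrame level: repartition shuffles to exactly N partitions,
    // while coalesce narrows to fewer partitions without a full shuffle.
    val spread   = articles.repartition(200)
    val narrowed = spread.coalesce(50)

    // Custom partitioning lives in the RDD API, via implicitly added PairRDDFunctions.
    val byAuthor = articles.rdd.map(row => (row.getAs[Long]("author_id"), row))

    val hashed = byAuthor.partitionBy(new HashPartitioner(100))
    val ranged = byAuthor.partitionBy(new RangePartitioner(100, byAuthor))

    println(s"${narrowed.rdd.getNumPartitions} / ${hashed.getNumPartitions} / ${ranged.getNumPartitions}")
    spark.stop()
  }
}
```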
25#UnifiedAnalytics #SparkAISummit
Partitioning Strategies
How can non-standard partitioning be
useful?
#1: Collocating data for joins
• We are joining datasets of Articles and Authors
together by the Author’s id.
• When we pull the raw Article dataset, author ids
are likely to be distributed somewhat randomly
throughout partitions.
• Joins can be considered wide transformations
depending on underlying data and could result in
full shuffles.
• We can cut down on the impact of the shuffle
stage by collocating data by the id to join on
within partitions so there is less cross chatter
during this phase.
26#UnifiedAnalytics #SparkAISummit
Partitioning Strategies
#1: Collocating data for joins
[Diagram: four Articles partitions before repartitioning, each holding a mix of author_ids 1 through 5, alongside Authors records with ids 1 through 5; after repartitioning, Articles with the same author_id are collocated in the same partition as the matching Authors record, so the join shuffles far less data across partitions.]
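An RDD-level sketch of the collocation shown in the diagram, assuming author_id and id columns: keying both sides and applying the same partitioner means matching records already live in the same partitions when the join runs.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object CollocatedJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("collocated-join").getOrCreate()

    val articles = spark.read.parquet("/data/articles")   // placeholder paths
    val authors  = spark.read.parquet("/data/authors")

    val partitioner = new HashPartitioner(100)

    // Key both sides by the author id and apply the SAME partitioner to each.
    val articlesByAuthor = articles.rdd
      .map(row => (row.getAs[Long]("author_id"), row))
      .partitionBy(partitioner)

    val authorsById = authors.rdd
      .map(row => (row.getAs[Long]("id"), row))
      .partitionBy(partitioner)

    // Because both RDDs share the partitioner, the join can proceed
    // without re-shuffling either side across the cluster.
    val joined = articlesByAuthor.join(authorsById)
    println(joined.count())
    spark.stop()
  }
}
```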
27#UnifiedAnalytics #SparkAISummit
Partitioning Strategies
How can non-standard partitioning be
useful?
#2: Grouping data to operate on partitions
as a whole
• We need to calculate an Author Summary report
that requires access to all Articles for an
Author to generate meaningful overall metrics
• We could leverage .map and .reduceByKey to
combine Articles for analysis in a pairwise fashion
or by gathering groups for processing
• Operating on a whole partition grouped by an
Author also accomplishes this goal
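A sketch of that whole-partition approach, assuming author_id and title columns: after partitionBy, all of an author's articles sit in one partition, so a local groupBy inside mapPartitions is enough to build each summary.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Hypothetical summary record for one author.
case class AuthorSummary(authorId: Long, articleCount: Int, avgTitleLength: Double)

object AuthorSummaries {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("author-summaries").getOrCreate()

    val articles = spark.read.parquet("/data/articles")   // placeholder path

    // After partitionBy, every article for a given author lives in exactly one partition.
    val byAuthor = articles.rdd
      .map(row => (row.getAs[Long]("author_id"), row.getAs[String]("title")))
      .partitionBy(new HashPartitioner(100))

    // Operate on each partition as a whole: group locally, summarize, emit one record per author.
    // (toSeq materializes the partition, which is fine for modestly sized partitions.)
    val summaries = byAuthor.mapPartitions { iter =>
      iter.toSeq
        .groupBy { case (authorId, _) => authorId }
        .map { case (authorId, titles) =>
          AuthorSummary(authorId, titles.size, titles.map(_._2.length).sum.toDouble / titles.size)
        }
        .iterator
    }

    summaries.take(5).foreach(println)
    spark.stop()
  }
}
```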
28#UnifiedAnalytics #SparkAISummit
Partitioning Strategies
Implementing a Custom Partitioner
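The partitioner implementation on this slide appears as an image and is not reproduced in the transcript; what follows is a sketch of what extending Partitioner typically looks like, keyed on a Long author id.

```scala
import org.apache.spark.Partitioner

// A hypothetical custom partitioner: routes records by author id so that
// related data (articles, comments) for one author always lands together.
class AuthorPartitioner(override val numPartitions: Int) extends Partitioner {

  override def getPartition(key: Any): Int = key match {
    case authorId: Long => (authorId % numPartitions).toInt match {
      case p if p < 0 => p + numPartitions   // guard against negative keys
      case p          => p
    }
    case other => throw new IllegalArgumentException(s"Unexpected key: $other")
  }

  // Equality lets Spark recognize co-partitioned RDDs and skip shuffles.
  override def equals(other: Any): Boolean = other match {
    case p: AuthorPartitioner => p.numPartitions == numPartitions
    case _                    => false
  }

  override def hashCode(): Int = numPartitions
}

// Usage (assuming an RDD keyed by author id):
//   val partitioned = articlesByAuthor.partitionBy(new AuthorPartitioner(100))
```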
29#UnifiedAnalytics #SparkAISummit
Partitioning Strategies
Takeaways
• Partitioning can help even out data skew for
more reliable and performant processing.
• The RDD API supports more fine-grained
partitioning with Hash and Range Partitioners.
• One can implement a custom partitioner to
have even more control over how data is
grouped, which creates opportunity for more
performant joins and operations on partitions
as a whole.
• There is expense involved in repartitioning
that has to be balanced against the cost of an
operation on less organized data.
30#UnifiedAnalytics #SparkAISummit
Agenda
• Introduction
• Parallel Job Submission and Schedulers
• Partitioning Strategies
• Distributing More Than Just Data
31#UnifiedAnalytics #SparkAISummit
Distributing More Than Just Data
Typical Spark Usage Patterns
• Load data from a store into a
Dataset/Dataframe/RDD
• Apply various transformations and actions
to explore the data
• Build increasingly complex transformations
by leveraging Spark’s flexibility in
accepting functions into API calls
• What are the limits of these
transformations and how can we move
past them? Can we distribute more
complex computation?
32#UnifiedAnalytics #SparkAISummit
Distributing More Than Just Data
#1: Distributing Scripts
• It is often useful to support third party libraries
or scripts from peers who like or need to work
in different languages to accomplish Data
Science goals.
• Oftentimes, these scripts have nothing to do
with Spark, and language bindings for libraries
might not work as well as expected when
called directly from Spark due to serialization
constraints (among other things).
• One can distribute scripts to be executed
within Spark by leveraging Scala’s
scala.sys.process package and Spark’s file
moving capability.
33#UnifiedAnalytics #SparkAISummit
Distributing More Than Just Data
scala.sys.process
• A package that handles the execution of
external processes
• Provides a concise DSL for running and
chaining processes
• Blocks until the process is complete
• Scripts and commands can be interfaced
with through all of the usual means
(reading stdin, reading local files, writing
stdout, writing local files)
Reference: https://www.scala-lang.org/api/2.11.7/#scala.sys.process.ProcessBuilder
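A sketch that combines the pieces above: ship a script to the executors, resolve its location with SparkFiles, and invoke it with the scala.sys.process DSL. The script name, its language, and the column used are assumptions, and a real job would usually batch a whole partition into one invocation rather than running the process per record.

```scala
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
import scala.sys.process._

object DistributedScript {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("distributed-script").getOrCreate()
    val sc = spark.sparkContext

    // Ship the script to every executor (could also be done with spark-submit --files).
    sc.addFile("/local/path/score.py")   // hypothetical script

    val articles = spark.read.parquet("/data/articles")   // placeholder path

    val scored = articles.rdd.mapPartitions { rows =>
      // Resolve the script's location on this executor.
      val script = SparkFiles.get("score.py")

      rows.map { row =>
        val title = row.getAs[String]("title")
        // Run the external process and capture its stdout; !! blocks until the
        // process exits and throws if the exit code is non-zero.
        val output = Seq("python3", script, title).!!
        (title, output.trim)
      }
    }

    scored.take(5).foreach(println)
    spark.stop()
  }
}
```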
34#UnifiedAnalytics #SparkAISummit
Distributing More Than Just Data
Gotchas
• Make sure the resources provisioned on
executors are suitable to handle the external
process to be run.
• Make sure the external process is built for the
architecture that your cluster runs on and that all
necessary dependencies are available.
• When running more than one executor core,
make sure the external process can handle
having multiple instances of itself running on the
same container.
• When communicating with your process through
the file system, watch out for file system
collisions and be cognizant of cleaning up state.
35#UnifiedAnalytics #SparkAISummit
Distributing More Than Just Data
#2: Distributing Data Gathering
• APIs have become a very prevalent way of
providing data to systems
• A common operation is gathering data (often
in JSON) from an API for some entity
• It is possible to leverage Spark APIs to gather
this data due to the flexibility in design of
passing functions to transformations
36#UnifiedAnalytics #SparkAISummit
Distributing More Than Just Data
Gotchas
• Make sure to carefully manage the number
of concurrent connections being opened
to the APIs being used.
• There are always going to be intermittent
blips when hitting APIs at scale. Don’t
forget to use thorough error handling and
retry logic.
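A sketch of distributed data gathering with the gotchas above in mind; the endpoint URL, retry count, and partition count are made up, and the partition count is what bounds the number of concurrent connections opened against the API.

```scala
import java.net.{HttpURLConnection, URL}
import scala.annotation.tailrec
import scala.io.Source
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.SparkSession

object DistributedFetch {
  // Simple retry wrapper for intermittent blips when hitting APIs at scale.
  @tailrec
  def fetchWithRetry(url: String, attemptsLeft: Int): String = {
    Try {
      val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
      conn.setConnectTimeout(5000)
      conn.setReadTimeout(10000)
      try Source.fromInputStream(conn.getInputStream).mkString
      finally conn.disconnect()
    } match {
      case Success(body) => body
      case Failure(_) if attemptsLeft > 1 =>
        Thread.sleep(1000)                    // naive backoff
        fetchWithRetry(url, attemptsLeft - 1)
      case Failure(e) => throw e
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("distributed-fetch").getOrCreate()
    import spark.implicits._

    val authorIds = spark.read.parquet("/data/authors").select("id").as[Long]   // placeholder path

    // The number of partitions bounds how many connections hit the API at once:
    // with 20 partitions and one task per partition, at most 20 concurrent calls.
    val responses = authorIds.rdd
      .repartition(20)
      .map(id => (id, fetchWithRetry(s"https://api.example.com/authors/$id", attemptsLeft = 3)))

    responses.take(5).foreach(println)
    spark.stop()
  }
}
```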
37#UnifiedAnalytics #SparkAISummit
Distributing More Than Just Data
Takeaways
• The flexibility of Spark’s APIs allows the
framework to be used for more than a
typical workflow of applying relatively small
functions to distributed data.
• One can distribute scripts to be run as
external processes using data contained in
partitions to build inputs and subsequently
pipe outputs back into Spark APIs.
• One can distribute calls to APIs to gather
data and use Spark mechanisms to control
the load on these external sources.
38#UnifiedAnalytics #SparkAISummit
Agenda
• Introduction
• Parallel Job Submission and Schedulers
• Partitioning Strategies
• Distributing More Than Just Data
39#UnifiedAnalytics #SparkAISummit
Come Work At Target
• We are hiring in Data Science and Data Engineering
• Solve real-world problems in domains ranging from
supply chain logistics to smart stores to
personalization and more
• Offices in…
o Sunnyvale, CA
o Minneapolis, MN
o Pittsburgh, PA
o Bangalore, India
work somewhere you
jobs.target.com
40#UnifiedAnalytics #SparkAISummit
Target @ Spark+AI Summit
Check out our other talks…
2018
• Extending Apache Spark APIs Without Going Near Spark Source Or
A Compiler (Anna Holschuh)
2019
• Apache Spark Data Validation (Doug Balog and Patrick Pisciuneri)
• Lessons In Linear Algebra At Scale With Apache Spark: Let’s make
the sparse details a bit more dense (Anna Holschuh)
41#UnifiedAnalytics #SparkAISummit
Acknowledgements
• Thank you Spark Summit
• Thank you Target
• Thank you wonderful team members at Target
• Thank you vibrant Spark and Scala communities
42#UnifiedAnalytics #SparkAISummit
QUESTIONS
anna.holschuh@target.com
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT