Etti Gur from Israel, Senior Big Data Engineer @ Nielsen, will talk about Optimizing Spark-based data pipelines - are you up for it?
At Nielsen, we ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner. In this talk, we will discuss how we significantly optimized our Spark-based in-flight analytics daily pipeline, reducing its total execution time from over 20 hours down to 1 hour, resulting in a huge cost reduction.
Topics include:
* Ways to identify Spark optimization opportunities;
* Optimizing Spark resource allocation;
* Parallelizing Spark output phase with dynamic partition inserts;
* Running multiple Spark "jobs" in parallel within a single Spark application;
2. @ettigur
Introduction
Etti Gur
● Senior Big Data Developer @ Nielsen
Marketing Cloud
● Dealing with Big Data challenges since
2012, building data pipelines using Spark,
Kafka, Druid, Airflow and more
3. @ettigur
What will you learn?
How we optimized our Spark data pipelines by:
● Optimizing Spark resource allocation & utilization
● Parallelizing Spark output phase with dynamic partition inserts
● Running multiple Spark "jobs" within a single Spark application
4. @ettigur
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting audiences
● Business decisions
8. @ettigur
What is Spark?
● An analytics engine for large-scale data processing
● Distributed and highly scalable
● A unified framework for batch, streaming, machine learning,
etc.
9. @ettigur
Basic Spark terminology
● Driver
● Executor
● Cluster manager
○ Mesos, YARN or Standalone
● Managed Spark on public clouds
○ AWS EMR, GCP Dataproc, etc.
10. @ettigur
What are the logical phases of an advertising campaign?
The business use-case - measure campaigns' success in-flight!
11. @ettigur
What does a funnel look like?
Our use-case - measure campaigns' success in-flight:
● Awareness - AD EXPOSURE: 100M (85M drop-off)
● Consideration - HOMEPAGE: 15M (5M drop-off)
● Intent - PRODUCT PAGE: 10M (7M drop-off)
● Purchase - CHECKOUT: 3M
12. @ettigur
In-flight analytics pipeline - high-level architecture
Data Lake, partitioned by date (e.g. date=2020-12-06, date=2020-12-07, date=2020-12-08)
Mart Generator:
1. Read files of last day from the Data Lake
2. Write files by campaign,date to the Campaigns' marts
Enricher:
3. Read files per campaign from the Campaigns' marts
4. Write files by date,campaign to the Enriched data
5. Load data by campaign into the analytics DB
13. @ettigur
In-flight analytics pipeline - problems
Problem | Metric
Growing execution time | >24 hours/day
Stability | Sporadic failures
High costs | $33,000/month
Exhausting recovery ("babysitting") | Many hours/incident
14. @ettigur
In-flight analytics pipeline - Mart Generator
(Same architecture diagram as above, highlighting the Mart Generator: it reads the last day's files from the Data Lake and writes files by campaign,date to the Campaigns' marts.)
27. @ettigur
Mart Generator requirement - overwrite latest date only
(Diagram: the Mart Generator reads the last day's files from the Data Lake and writes files by campaign,date to the Campaigns' marts, overwriting only the latest date partition.)
28. @ettigur
Overwrite partitions - the "trivial" Spark implementation
dataframe.write
  .partitionBy("campaign", "date")
  .mode(SaveMode.Overwrite)
  .parquet(folderPath)
The result:
● Output written in parallel
● Overwriting the entire root folder - Data loss
29. @ettigur
Overwrite specific partitions - our "naive" implementation
// dataframesMap is of type Map[campaignCode, campaignDataframe]
dataframesMap.foreach { case (campaignCode, campaignDf) =>
  val outputPath = rootPath + "campaign=" + campaignCode + "/date=" + date
  campaignDf.write.mode(SaveMode.Overwrite).parquet(outputPath)
}
The result:
● Overwriting only relevant folders
● An extremely long tail (w.r.t execution time)
30. @ettigur
Overwrite specific partitions - Spark 2.3 implementation
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
dataframe.write
  .partitionBy("campaign", "date")
  .mode(SaveMode.Overwrite)
  .parquet(folderPath)
The result:
● Output written in parallel
● Overwriting only relevant folders
32. @ettigur
Mart Generator - summary
● Better resource allocation & utilization
● Execution time decreased from 7+ hours to ~30 minutes
● No sporadic OOM failures
● Overwriting only relevant folders (i.e. partitions)
33. @ettigur
In-flight analytics pipeline - Enricher
(Same architecture diagram as above, highlighting the Enricher: it reads files per campaign from the Campaigns' marts, writes files by date,campaign to the Enriched data, and the enriched data is then loaded by campaign into the analytics DB.)
34. @ettigur
Enricher problem - execution time
● Grew from 9 hours to 18 hours
● Sometimes took more than 20 hours
36. @ettigur
Running multiple Spark “jobs” within a single Spark application
● Create one Spark application with one SparkContext
● Create a parallelized process (such as a thread pool)
● Each thread executes a separate Spark "job" (i.e. an action) in parallel
● "Jobs" wait in a queue and are executed based on available resources
○ This is managed by Spark’s scheduler
37. @ettigur
Running multiple Spark “jobs” within a single Spark application
val campaigns: List[CampaignArguments]
...
val completeCampaigns = campaigns.par.map { campaign =>
  try {
    val ans = processCampaignSpark(campaign, appConf, log)
    Result(campaign.code, ans)
  } catch {
    case e: Exception =>
      log.info("Some thread caused an exception: " + e.getMessage)
      Result("", "", false, false)
  }
}
processCampaignSpark - (spark.read -> spark process -> spark.write)
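By default, a Scala parallel collection uses a thread pool sized to the number of cores on the driver. A minimal sketch (not from the original deck, assuming Scala 2.12 and the processCampaignSpark call shown above) of capping how many campaign "jobs" run concurrently:

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Cap concurrency so the Spark scheduler is not flooded with more
// simultaneous "jobs" than the cluster can actually serve.
val parCampaigns = campaigns.par
parCampaigns.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8)) // 8 is an illustrative value
val results = parCampaigns.map(campaign => processCampaignSpark(campaign, appConf, log))

The pool size is a tuning knob: too low and executors sit idle again, too high and the jobs starve each other for executors.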
41. @ettigur
Enricher - summary
● Running multiple Spark “jobs” within a single Spark app
● Better resource utilization
● Execution time decreased from 20+ hours to ~30 minutes
42. @ettigur
In-flight analytics pipeline - before & after
Problem | Before | After
Growing execution time | >24 hours/day | 1 hour/day
Stability | Sporadic failures | Improved
High costs | $33,000/month | $2,000/month
Exhausting recovery ("babysitting") | Many hours/incident | 30 min/incident
43. @ettigur
In-flight analytics pipeline - before & after
Problem | Before | After
Growing execution time | >24 hours/day | 1 hour/day
Stability | Sporadic failures | Improved
High costs | $33,000/month | $2,000/month
Exhausting recovery ("babysitting") | Many hours/incident | 30 min/incident
> 90% improvement - $372K / year!
44. @ettigur
What have we learned?
● You too can optimize Spark resource allocation & utilization
○ Leverage the tools at hand to deep-dive into your cluster
● Spark output phase can be parallelized even when overwriting specific partitions
○ Use dynamic partition inserts
● Running multiple Spark "jobs" within a single Spark application can be useful
● Optimizing data pipelines is an ongoing effort (not a one-off thing)
45. @ettigur
Want to know more?
● Women in Big Data
○ A world-wide program that aims:
■ To inspire, connect, grow, and champion success of women in the Big Data & analytics field
○ 30+ chapters and 17,000+ members world-wide
○ Everyone can join (regardless of gender), so find a chapter near you -
https://www.womeninbigdata.org/wibd-structure/
Talks:
● Funnel Analysis with Spark and Druid - tinyurl.com/y5qboqpj
● Spark Dynamic Partition Inserts part 1 - tinyurl.com/yd94ztz5
● Spark Dynamic Partition Inserts Part 2 - tinyurl.com/y8uembml
● Our Tech Blog - medium.com/nmc-techblog
Thank you for joining our session - optimizing ...
We will try to make it interesting and valuable for you
Questions - at the end of the session
Ways to identify Spark optimization opportunities;
How we optimized our Spark data pipelines using ideas such as:
Optimizing Spark resource allocation & utilization
Parallelizing Spark output phase with dynamic partition inserts
Running multiple Spark "jobs" within a single Spark application
Apache Spark is a cluster-computing framework.
Data items are distributed over a cluster of machines, held in distributed shared memory.
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
In-memory processing makes it faster than Hadoop MapReduce.
It requires a cluster manager and a distributed storage system (e.g. Hadoop YARN, HDFS).
A group inside Nielsen
A data company - we collect data on devices from our partners in various ways, online and offline
We process the data
We enrich the data using machine learning models
We help our clients:
Target the relevant audiences
Make better business decisions
We enrich the data using machine learning models in order to create more relevant, quality insights
A data company - which means that we get or buy data from our partners in various ways, online and offline
We enrich the data - which in our case means generating attributes
An attribute is something we assign to a device based on the data that we have, for example "Sports Fan", "Eats Organic Food", etc.
The enriched data that we generate helps support our clients' business decisions and also allows them to target the relevant audiences
Nielsen Marketing Cloud, or NMC in short
-A group inside Nielsen
-Born from eXelate, a company that was acquired by Nielsen in March 2015
-Nielsen is a data company and so are we; we had a strong business relationship until, at some point, Nielsen decided to go for it and acquired eXelate
-Data company meaning:
-Buying and onboarding data into NMC from data providers, customers and Nielsen data
-We have a huge, high-quality dataset
-We enrich the data using machine learning models in order to create more relevant, quality insights
-Categorize and sell data according to need
-Helping brands make intelligent business decisions
-E.g. targeting in the digital marketing world
-Meaning: helping fit ads to viewers
For example, a street sign can fit only a very small % of the people who see it, vs.
online ads that can fit the profile of the individual who sees them
-More interesting to the user
-Better chances they will click the ad
-Better ROI for the marketer
Streams on Kafka
Translated into Parquet files on our data lake on S3 (AWS cloud)
Processing with Spark - thousands of nodes around the clock
Analytical DB - Druid - ingesting 10s of TB/day
This presents us with many challenges
Such as - - - and more
These challenges are relevant when building a new pipeline in production,
but are also true for existing pipelines that have been running for a while
otherwise:...
In this presentation, I'll share with you some tips and guidelines on how to do it,
so YOU can do it too for your pipelines
We will get to the stage -
I optimized my pipeline and now I have to do it again because:
The data keeps growing
Code is changing
Infrastructure (e.g. hardware, frameworks, etc.) keeps evolving
Conclusion - Optimizing data pipelines is an iterative process (ongoing effort, not a one-off)
Since we want to talk about Spark optimizations, let's just mention:
Driver - your application’s main program
Executors - distributed worker processes responsible for running the individual tasks (i.e units of work)
The driver and each executor run in separate Java processes (a one-to-one relation)
A driver and its executors are together termed a Spark application.
A Spark application is launched on a set of machines using an external service called a cluster manager
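As a minimal illustration (not part of the original deck; the application name and S3 path below are hypothetical), this is roughly what the driver-side entry point looks like - the cluster manager then allocates the executors that run the tasks:

import org.apache.spark.sql.SparkSession

// Driver-side entry point; the master / cluster manager is usually supplied by spark-submit.
val spark = SparkSession.builder()
  .appName("in-flight-analytics")
  .getOrCreate()

// Reading and counting triggers a distributed Spark job executed by the executors.
val events = spark.read.parquet("s3://my-bucket/events/date=2020-12-08")
println(events.count())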
The data pipeline we will talk about today
Analysing advertising campaigns while they are active (in flight)
A campaign has 4 stages:
1. The user is aware of my product (sees an ad - an impression)
2. The user expresses interest - clicks the ad, goes to the home page
3. Shows intent - goes to the product page
4. If the advertising was successful - the user is convinced to make the purchase
In the ad-tech world this process is called a funnel. It measures advertising strategy effectiveness.
The number of people that express interest gets smaller at each stage.
In this example: 100M saw the ad, out of them 15M, … finally only 3M made the purchase - at each stage we get fewer people.
We want to measure the impact of a given campaign's different advertising strategies on users' conversion rate.
How many people that have seen the ad go forward in the funnel to intent + conversion (purchase)?
Looking at the funnel,
we can measure which ad was more successful.
How do we do that?
Our customers can define the meta-data for their campaigns.
A campaign includes one or more tactics (which take place in a form of ads).
A campaign also includes one or more stages. A stage is a specific place within a funnel (e.g. landing page, registration, etc.).
For each tactic, we measure the number of users that were exposed to this tactic and reached each stage in the funnel
High level - Architecture of the data pipeline:
S3 data lake partitioned by date
Mart Generator - Spark - reads data from the data lake
Creates a campaign mart for the latest date and writes campaign marts per campaign per date
A "data mart" can either be a subset of the raw data or a variation / transformation / aggregation of the data
2nd part - Enricher - Spark - for each campaign, reads the campaign's events and builds the funnel structure we talked about before.
Writes the enriched data to S3 for loading into the analytics database - Druid
We're going to focus on 2 components - the Mart Generator and the Enricher
This feature has been in production for a few years now, and as time went by,
the data grew and reached a point where we needed to optimize the resources and time consumed,
to make it sustainable
4 sec
Let's start with the first Spark process:
Mart Generator - as a reminder, it reads raw events and produces a mart per campaign per date on S3.
It runs as a Spark application on EMR
we will see a couple of examples in the next slides
When we want to investigate such problems, we can look at:
Spark UI - has several dashboards; one of them is the Executors tab, where we can see each executor's memory
Spark produces metrics, and we can sink them into tools such as JMX & Graphite and inspect trends over time
If you run on YARN, you can look at the YARN UI, which shows the resource manager
Cluster-wide tools like ...
Druid - database
2 secs - this is how it looks
The Executors tab displays summary information about
the executors that were created for the application,
including memory and disk usage and task and shuffle information.
The Storage Memory column shows the amount of memory used and reserved for caching data.
The Executors tab provides not only resource information (amount of memory, disk, and cores used by each executor)
but also performance information (GC time and shuffle information).
2 secs - this is how it looks
5 secs
We can see idle cores - 9/16 not used
This is an unrelated example (not from this specific pipeline)
2 sec
Graph of CPU usage
Very idle - let's see how we can figure out why the cluster was so under-utilized
Very big machines
The config was:
Overhead - default
2 unused cores per node
24 GB unused per node
Over the cluster - unused cores, and over 800 GB of memory not allocated
After you choose the right machine type for your cluster - optimized for your process needs (high memory, fast storage, etc.) -
we want to configure the number of cores so that all the cores are used
This applies to on-prem / any cloud as well
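A hedged sketch of what explicit executor sizing can look like - the numbers below are illustrative only (not the actual production values) and should be derived from your node type, leaving some cores and memory for YARN/OS overhead:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mart-generator")                      // hypothetical application name
  .config("spark.executor.cores", "5")            // cores per executor
  .config("spark.executor.memory", "34g")         // heap per executor
  .config("spark.executor.memoryOverhead", "4g")  // off-heap overhead (default is 10% of executor memory)
  .config("spark.executor.instances", "30")       // total executors, or use dynamic allocation
  .getOrCreate()

The goal is that cores-per-executor times executors-per-node covers (almost) all cores on each node, and that executor memory plus overhead covers (almost) all usable memory.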
On the YARN UI we can now see that the application uses the whole cluster
Changing the settings mentioned above fixed the utilization and the failures
and brought the cluster to this utilization.
Here you can see that the cluster is utilized during the main processing phase,
but we have a very long tail where the cluster is still significantly under-utilized - about 6 hours - which is a problem.
This long tail is the phase where output is written to S3,
and it was written sequentially (i.e. one campaign at a time); we'll explain why in the next slide.
Reminder
We needed to add/overwrite the latest date for each campaign
(we want to overwrite in case of a failure)
and keep the older dates.
Usually, when working with such a folder structure,
you would use this code to write the output of the complete dataset in parallel, with partitionBy.
But when using this method,
the whole root folder (the mart folder) was overwritten (including dates for which we did not have data in our current dataset).
What was running in production was the naive solution for this problem.
The naive solution was to write the output for each campaign+date separately (using the "overwrite" option).
This is the cause of the long tail we saw in the above screenshot (where the cluster spent 6 out of 7 hours of execution time on writing the output).
BTW, the old code created a map of datasets (filtering from the day that was read just the events that were related to a certain campaign).
To solve this problem, we wanted to go back to the better solution of partitionBy (very quick).
Explain what this new feature (in Spark 2.3) does.
Explain the side-effect (i.e. the driver moves the files in .spark-staging-XXX-XXX-XXX to the final destination).
What solved this issue
spark.sql.sources.partitionOverwriteMode (default: static) - enables dynamic partition inserts when set to dynamic.
When you INSERT OVERWRITE a partitioned data source table with dynamic partition columns, Spark SQL supports two modes (case-insensitive):
• static - Spark deletes all the partitions that match the partition specification (e.g. PARTITION(a=1,b)) in the INSERT statement before overwriting
• dynamic - Spark doesn't delete partitions ahead, and only overwrites those partitions that have data written into them
The default (static) keeps the same behavior as Spark prior to 2.3.
Note that this config doesn't affect Hive serde tables, as they are always overwritten with dynamic mode.
Use the SQLConf.partitionOverwriteMode method to access the current value.
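For illustration, the same dynamic overwrite expressed through Spark SQL - a hedged sketch in which "campaign_mart", "latest_day_events" and the selected columns are hypothetical names, not the pipeline's actual schema:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// With dynamic partition columns, the partition columns (campaign, date) must come last in the SELECT.
spark.sql(
  """INSERT OVERWRITE TABLE campaign_mart PARTITION (campaign, date)
    |SELECT event_id, user_id, campaign, date FROM latest_day_events
    |""".stripMargin)

Only the (campaign, date) partitions present in latest_day_events are rewritten; all other partitions are left untouched.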
Well, that was really great, but... we also have the second part of the pipeline; let's look at that now.
We've finished with the Mart Generator and now we're moving to the next part, which is the Enricher.
In order to run a Spark process for each campaign,
there was a Python script that sequentially submitted a Spark application for each campaign.
The Enricher takes the latest mart and all older marts per campaign
Creates pairs of tactic-stage with the tactic date (the tactic happened before the stage)
It takes the older data to match older tactics with new data
The cluster was also under-utilized
This used to be a Python script that sequentially submitted a Spark application for each campaign.
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.
By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
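When many such jobs are submitted from separate threads, the scheduler mode becomes relevant; a hedged sketch (an optional tweak, not something the original deck states was used) of switching from FIFO to FAIR scheduling so that long jobs don't monopolize the executors:

// Enable fair scheduling between jobs submitted from different threads.
val spark = SparkSession.builder()
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// Optionally, group a thread's jobs into a named scheduler pool
// (the pool name "campaigns" is just an example).
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "campaigns")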
Each process will read the campaign's latest & old marts and re-create an enriched snapshot of the campaign.