Etti Gur from Israel, Senior Big Data Engineer @ Nielsen, will talk about Optimizing Spark-based data pipelines - are you up for it?
At Nielsen, we ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner. In this talk, we will discuss how we significantly optimized our Spark-based in-flight analytics daily pipeline, reducing its total execution time from over 20 hours down to 1 hour, resulting in a huge cost reduction.
Topics include:
* Ways to identify Spark optimization opportunities;
* Optimizing Spark resource allocation;
* Parallelizing Spark output phase with dynamic partition inserts;
* Running multiple Spark "jobs" in parallel within a single Spark application;
2. @ettigur
Introduction
Etti Gur
● Senior Big Data Developer @ Nielsen
Marketing Cloud
● Dealing with Big Data challenges since
2012, building data pipelines using Spark,
Kafka, Druid, Airflow and more
3. @ettigur
What will you learn?
How we optimized our Spark data pipelines by:
● Optimizing Spark resource allocation & utilization
● Parallelizing Spark output phase with dynamic partition inserts
● Running multiple Spark "jobs" within a single Spark application
4. @ettigur
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting audiences
● Business decisions
8. @ettigur
What is Spark?
● An analytics engine for large-scale data processing
● Distributed and highly scalable
● A unified framework for batch, streaming, machine learning,
etc.
9. @ettigur
Basic Spark terminology
● Driver
● Executor
● Cluster manager
○ Mesos, YARN or Standalone
● Managed Spark on public clouds
○ AWS EMR, GCP Dataproc, etc.
10. @ettigur
What are the logical phases of an advertising campaign?
The business use-case - measure campaigns' success in-flight!
11. @ettigur
What does a funnel look like?
Our use-case - measure campaigns' success in-flight:
● Awareness - AD EXPOSURE: 100M (85M drop-off)
● Consideration - HOMEPAGE: 15M (5M drop-off)
● Intent - PRODUCT PAGE: 10M (7M drop-off)
● Purchase - CHECKOUT: 3M
12. @ettigur
In-flight analytics pipeline - high-level architecture
Data Lake, partitioned by date (e.g. date=2020-12-06, date=2020-12-07, date=2020-12-08)
Mart Generator:
1. Read files of last day from the Data Lake
2. Write files by campaign,date to the Campaigns' marts
Enricher:
3. Read files per campaign from the Campaigns' marts
4. Write files by date,campaign to the Enriched data
5. Load data by campaign into the analytics DB
13. @ettigur
In-flight analytics pipeline - problems
Problem | Metric
Growing execution time | >24 hours/day
Stability | Sporadic failures
High costs | $33,000/month
Exhausting recovery ("babysitting") | Many hours/incident
14. @ettigur
In-flight analytics pipeline - Mart Generator
(Same architecture diagram as above, highlighting the Mart Generator: it reads the last day's files from the Data Lake and writes files by campaign,date to the Campaigns' marts.)
27. @ettigur
Mart Generator requirement - overwrite latest date only
(Diagram: the Mart Generator reads the last day's files from the Data Lake and writes files by campaign,date to the Campaigns' marts, overwriting only the latest date partition.)
28. @ettigur
Overwrite partitions - the "trivial" Spark implementation
dataframe.write
  .partitionBy("campaign", "date")
  .mode(SaveMode.Overwrite)
  .parquet(folderPath)
The result:
● Output written in parallel
● Overwriting the entire root folder - Data loss
29. @ettigur
Overwrite specific partitions - our "naive" implementation
// dataframesMap is of type Map[campaignCode, campaignDataframe]
dataframesMap.foreach { case (campaignCode, campaignDf) =>
  val outputPath = rootPath + "campaign=" + campaignCode + "/date=" + date
  campaignDf.write.mode(SaveMode.Overwrite).parquet(outputPath)
}
The result:
● Overwriting only relevant folders
● An extremely long tail (w.r.t execution time)
30. @ettigur
Overwrite specific partitions - Spark 2.3 implementation
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
dataframe.write
  .partitionBy("campaign", "date")
  .mode(SaveMode.Overwrite)
  .parquet(folderPath)
The result:
● Output written in parallel
● Overwriting only relevant folders
32. @ettigur
Mart Generator - summary
● Better resource allocation & utilization
● Execution time decreased from 7+ hours to ~30 minutes
● No sporadic OOM failures
● Overwriting only relevant folders (i.e. partitions)
33. @ettigur
In-flight analytics pipeline - Enricher
(Same architecture diagram as above, highlighting the Enricher: it reads files per campaign from the Campaigns' marts, writes files by date,campaign to the Enriched data, and the enriched data is then loaded by campaign into the analytics DB.)
34. @ettigur
Enricher problem - execution time
● Grew from 9 hours to 18 hours
● Sometimes took more than 20 hours
36. @ettigur
Running multiple Spark “jobs” within a single Spark application
● Create one Spark application with one SparkContext
● Create a parallelized process (such as a thread pool)
● Each thread executes a separate Spark "job" (i.e. an action) in parallel
● "Jobs" wait in a queue and are executed based on available resources
○ This is managed by Spark’s scheduler
37. @ettigur
Running multiple Spark “jobs” within a single Spark application
val campaigns: List[CampaignArguments]
...
val completeCampaigns = campaigns.par.map { campaign =>
  try {
    val ans = processCampaignSpark(campaign, appConf, log)
    Result(campaign.code, ans)
  } catch {
    case e: Exception =>
      log.info("Some thread caused an exception: " + e.getMessage)
      Result("", "", false, false)
  }
}
processCampaignSpark - (spark.read -> spark process -> spark.write)
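By default, a Scala parallel collection uses a thread pool sized to the number of cores on the driver. A minimal sketch (not from the original deck, assuming Scala 2.12 and the processCampaignSpark call shown above) of capping how many campaign "jobs" run concurrently:

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Cap concurrency so the Spark scheduler is not flooded with more
// simultaneous "jobs" than the cluster can actually serve.
val parCampaigns = campaigns.par
parCampaigns.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8)) // 8 is an illustrative value
val results = parCampaigns.map(campaign => processCampaignSpark(campaign, appConf, log))

The pool size is a tuning knob: too low and executors sit idle again, too high and the jobs starve each other for executors.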
41. @ettigur
Enricher - summary
● Running multiple Spark “jobs” within a single Spark app
● Better resource utilization
● Execution time decreased from 20+ hours to ~30 minutes
42. @ettigur
In-flight analytics pipeline - before & after
Problem | Before | After
Growing execution time | >24 hours/day | 1 hour/day
Stability | Sporadic failures | Improved
High costs | $33,000/month | $2,000/month
Exhausting recovery ("babysitting") | Many hours/incident | 30 min/incident
43. @ettigur
In-flight analytics pipeline - before & after
Problem | Before | After
Growing execution time | >24 hours/day | 1 hour/day
Stability | Sporadic failures | Improved
High costs | $33,000/month | $2,000/month
Exhausting recovery ("babysitting") | Many hours/incident | 30 min/incident
> 90% improvement - $372K / year!
44. @ettigur
What have we learned?
● You too can optimize Spark resource allocation & utilization
○ Leverage the tools at hand to deep-dive into your cluster
● Spark output phase can be parallelized even when overwriting specific partitions
○ Use dynamic partition inserts
● Running multiple Spark "jobs" within a single Spark application can be useful
● Optimizing data pipelines is an ongoing effort (not a one-off thing)
45. @ettigur
Want to know more?
● Women in Big Data
○ A world-wide program that aims:
■ To inspire, connect, grow, and champion success of women in the Big Data & analytics field
○ 30+ chapters and 17,000+ members world-wide
○ Everyone can join (regardless of gender), so find a chapter near you -
https://www.womeninbigdata.org/wibd-structure/
Talks:
● Funnel Analysis with Spark and Druid - tinyurl.com/y5qboqpj
● Spark Dynamic Partition Inserts part 1 - tinyurl.com/yd94ztz5
● Spark Dynamic Partition Inserts Part 2 - tinyurl.com/y8uembml
● Our Tech Blog - medium.com/nmc-techblog
Thank you for joining our session - optimizing ...
We will try to make it interesting and valuable for you
Questions - at the end of the session
Ways to identify Spark optimization opportunities;
How we optimized our Spark data pipelines using ideas such as:
Optimizing Spark resource allocation & utilization
Parallelizing Spark output phase with dynamic partition inserts
Running multiple Spark "jobs" within a single Spark application
Apache Spark is a cluster-computing framework.
Data items are distributed over a cluster of machines, held in distributed shared memory.
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
In-memory processing makes it faster than Hadoop MapReduce.
It requires a cluster manager and a distributed storage system (e.g. Hadoop YARN, HDFS).
A group inside Nielsen
A data company - we collect data on devices from our partners in various ways, online and offline
We process the data
We enrich the data using machine learning models
We help our clients:
Target the relevant audiences
Make better business decisions
We enrich the data using machine learning models in order to create more relevant, quality insights
A data company - which means that we get or buy data from our partners in various ways, online and offline
We enrich the data - which in our case means generating attributes
An attribute is something we assign to a device based on the data that we have, for example "Sports Fan", "Eats Organic Food", etc.
The enriched data that we generate helps support our clients' business decisions and also allows them to target the relevant audiences
Nielsen Marketing Cloud, or NMC in short
-A group inside Nielsen
-Born from eXelate, a company that was acquired by Nielsen in March 2015
-Nielsen is a data company and so are we; we had a strong business relationship until, at some point, Nielsen decided to go for it and acquired eXelate
-Data company meaning:
-Buying and onboarding data into NMC from data providers, customers and Nielsen data
-We have a huge, high-quality dataset
-We enrich the data using machine learning models in order to create more relevant, quality insights
-Categorize and sell data according to need
-Helping brands make intelligent business decisions
-E.g. targeting in the digital marketing world
-Meaning: helping fit ads to viewers
For example, a street sign can fit only a very small % of the people who see it, vs.
online ads that can fit the profile of the individual who sees them
-More interesting to the user
-Better chances they will click the ad
-Better ROI for the marketer
Streams on Kafka
Translated into Parquet files on our data lake on S3 (AWS cloud)
Processing with Spark - thousands of nodes around the clock
Analytical DB - Druid - ingesting 10s of TB/day
This presents us with many challenges
Such as - - - and more
These challenges are relevant when building a new pipeline in production,
but are also true for existing pipelines that have been running for a while
otherwise:...
In this presentation, I'll share with you some tips and guidelines on how to do it,
so YOU can do it too for your pipelines
We will get to the stage -
I optimized my pipeline and now I have to do it again because:
The data keeps growing
Code is changing
Infrastructure (e.g. hardware, frameworks, etc.) keeps evolving
Conclusion - Optimizing data pipelines is an iterative process (ongoing effort, not a one-off)
Since we want to talk about Spark optimizations, let's just mention:
Driver - your application’s main program
Executors - distributed worker processes responsible for running the individual tasks (i.e units of work)
The driver and each executor run in separate Java processes (a one-to-one relation)
A driver and its executors are together termed a Spark application.
A Spark application is launched on a set of machines using an external service called a cluster manager
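As a minimal illustration (not part of the original deck; the application name and S3 path below are hypothetical), this is roughly what the driver-side entry point looks like - the cluster manager then allocates the executors that run the tasks:

import org.apache.spark.sql.SparkSession

// Driver-side entry point; the master / cluster manager is usually supplied by spark-submit.
val spark = SparkSession.builder()
  .appName("in-flight-analytics")
  .getOrCreate()

// Reading and counting triggers a distributed Spark job executed by the executors.
val events = spark.read.parquet("s3://my-bucket/events/date=2020-12-08")
println(events.count())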
The data pipeline we will talk about today
Analysing advertising campaigns while they are active (in flight)
A campaign has 4 stages:
1. The user is aware of my product (sees an ad - an impression)
2. The user expresses interest - clicks the ad, goes to the home page
3. Shows intent - goes to the product page
4. If the advertising was successful - the user is convinced to make the purchase
In the ad-tech world this process is called a funnel. It measures advertising strategy effectiveness.
The number of people that express interest gets smaller at each stage.
In this example: 100M saw the ad, out of them 15M, … finally only 3M made the purchase - at each stage we get fewer people.
We want to measure the impact of a given campaign's different advertising strategies on users' conversion rate.
How many people that have seen the ad go forward in the funnel to intent + conversion (purchase)?
Looking at the funnel,
we can measure which ad was more successful.
How do we do that?
Our customers can define the meta-data for their campaigns.
A campaign includes one or more tactics (which take place in a form of ads).
A campaign also includes one or more stages. A stage is a specific place within a funnel (e.g. landing page, registration, etc.).
For each tactic, we measure the number of users that were exposed to this tactic and reached each stage in the funnel
High level - Architecture of the data pipeline:
S3 data lake partitioned by date
Mart Generator - Spark - reads data from the data lake
Creates a campaign mart for the latest date and writes campaign marts per campaign per date
A "data mart" can either be a subset of the raw data or a variation / transformation / aggregation of the data
2nd part - Enricher - Spark - for each campaign, reads the campaign's events and builds the funnel structure we talked about before.
Writes the enriched data to S3 for loading into the analytics database - Druid
We're going to focus on 2 components - the Mart Generator and the Enricher
This feature has been in production for a few years now, and as time went by,
the data grew and reached a point where we needed to optimize the resources and time consumed,
to make it sustainable
4 sec
Let's start with the first Spark process:
Mart Generator - as a reminder, it reads raw events and produces a mart per campaign per date on S3.
It runs as a Spark application on EMR
we will see a couple of examples in the next slides
When we want to investigate such problems, we can look at:
Spark UI - has several dashboards; one of them is the Executors tab, where we can see each executor's memory
Spark produces metrics, and we can sink them into tools such as JMX & Graphite and inspect trends over time
If you run on YARN, you can look at the YARN UI, which shows the resource manager
Cluster-wide tools like ...
Druid - database
2 secs - this is how it looks
The Executors tab displays summary information about
the executors that were created for the application,
including memory and disk usage and task and shuffle information.
The Storage Memory column shows the amount of memory used and reserved for caching data.
The Executors tab provides not only resource information (amount of memory, disk, and cores used by each executor)
but also performance information (GC time and shuffle information).
2 secs - this is how it looks
5 secs
We can see idle cores - 9/16 not used
This is an unrelated example (not from this specific pipeline)
2 sec
Graph of CPU usage
Very idle - let's see how we can figure out why the cluster was so under-utilized
Very big machines
The config was:
Overhead - default
2 unused cores per node
24 GB unused per node
Over the cluster - unused cores, and over 800 GB of memory not allocated
After you choose the right machine type for your cluster - optimized for your process needs (high memory, fast storage, etc.) -
we want to configure the number of cores so that all the cores are used
This applies to on-prem / any cloud as well
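A hedged sketch of what explicit executor sizing can look like - the numbers below are illustrative only (not the actual production values) and should be derived from your node type, leaving some cores and memory for YARN/OS overhead:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mart-generator")                      // hypothetical application name
  .config("spark.executor.cores", "5")            // cores per executor
  .config("spark.executor.memory", "34g")         // heap per executor
  .config("spark.executor.memoryOverhead", "4g")  // off-heap overhead (default is 10% of executor memory)
  .config("spark.executor.instances", "30")       // total executors, or use dynamic allocation
  .getOrCreate()

The goal is that cores-per-executor times executors-per-node covers (almost) all cores on each node, and that executor memory plus overhead covers (almost) all usable memory.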
On the YARN UI we can now see that the application uses the whole cluster
Changing the settings mentioned above fixed the utilization and the failures
and brought the cluster to this utilization.
Here you can see that the cluster is utilized during the main processing phase,
but we have a very long tail where the cluster is still significantly under-utilized - about 6 hours - which is a problem.
This long tail is the phase where output is written to S3,
and it was written sequentially (i.e. one campaign at a time); we'll explain why in the next slide.
Reminder
We needed to add/overwrite the latest date for each campaign
(we want to overwrite in case of a failure)
and keep the older dates.
Usually, when working with such a folder structure,
you would use this code to write the output of the complete dataset in parallel, with partitionBy.
But when using this method,
the whole root folder (the mart folder) was overwritten (including dates for which we did not have data in our current dataset).
What was running in production was the naive solution for this problem.
The naive solution was to write the output for each campaign+date separately (using the "overwrite" option).
This is the cause of the long tail we saw in the above screenshot (where the cluster spent 6 out of 7 hours of execution time on writing the output).
BTW, the old code created a map of datasets (filtering from the day that was read just the events that were related to a certain campaign).
To solve this problem, we wanted to go back to the better solution of partitionBy (very quick).
Explain what this new feature (in Spark 2.3) does.
Explain the side-effect (i.e. the driver moves the files in .spark-staging-XXX-XXX-XXX to the final destination).
What solved this issue
spark.sql.sources.partitionOverwriteMode (default: static) - enables dynamic partition inserts when set to dynamic.
When you INSERT OVERWRITE a partitioned data source table with dynamic partition columns, Spark SQL supports two modes (case-insensitive):
• static - Spark deletes all the partitions that match the partition specification (e.g. PARTITION(a=1,b)) in the INSERT statement before overwriting
• dynamic - Spark doesn't delete partitions ahead, and only overwrites those partitions that have data written into them
The default (static) keeps the same behavior as Spark prior to 2.3.
Note that this config doesn't affect Hive serde tables, as they are always overwritten with dynamic mode.
Use the SQLConf.partitionOverwriteMode method to access the current value.
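For illustration, the same dynamic overwrite expressed through Spark SQL - a hedged sketch in which "campaign_mart", "latest_day_events" and the selected columns are hypothetical names, not the pipeline's actual schema:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// With dynamic partition columns, the partition columns (campaign, date) must come last in the SELECT.
spark.sql(
  """INSERT OVERWRITE TABLE campaign_mart PARTITION (campaign, date)
    |SELECT event_id, user_id, campaign, date FROM latest_day_events
    |""".stripMargin)

Only the (campaign, date) partitions present in latest_day_events are rewritten; all other partitions are left untouched.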
Well, that was really great, but... we also have the second part of the pipeline; let's look at that now.
We've finished with the Mart Generator and now we're moving to the next part, which is the Enricher.
In order to run a Spark process for each campaign,
there was a Python script that sequentially submitted a Spark application for each campaign.
The Enricher takes the latest mart and all older marts per campaign
Creates pairs of tactic-stage with the tactic date (the tactic happened before the stage)
It takes the older data to match older tactics with new data
The cluster was also under-utilized
This used to be a Python script that sequentially submitted a Spark application for each campaign.
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.
By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
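When many such jobs are submitted from separate threads, the scheduler mode becomes relevant; a hedged sketch (an optional tweak, not something the original deck states was used) of switching from FIFO to FAIR scheduling so that long jobs don't monopolize the executors:

// Enable fair scheduling between jobs submitted from different threads.
val spark = SparkSession.builder()
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// Optionally, group a thread's jobs into a named scheduler pool
// (the pool name "campaigns" is just an example).
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "campaigns")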
Each process will read the campaign's latest & old marts and re-create an enriched snapshot of the campaign.