Optimizing Spark-based data pipelines -
Are you up for it?
Etti Gur
Nielsen
@ettigur
Introduction
Etti Gur
● Senior Big Data Developer @ Nielsen Marketing Cloud
● Dealing with Big Data challenges since 2012, building data pipelines using Spark, Kafka, Druid, Airflow and more
@ettigur
What will you learn?
How we optimized our Spark data pipelines by:
● Optimizing Spark resource allocation & utilization
● Parallelizing Spark output phase with dynamic partition inserts
● Running multiple Spark "jobs" within a single Spark application
@ettigur
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting audiences
● Business decisions
@ettigur
Nielsen Marketing Cloud in numbers
● >10B events/day, 60TB/day landing on S3
● ~6,000 processing nodes/day
● 10s of TB/day ingested into Druid
@ettigur
The challenges
Scalability
Cost Efficiency
Fault-tolerance
@ettigur
What is Spark?
● An analytics engine for large-scale data processing
● Distributed and highly scalable
● A unified framework for batch, streaming, machine learning, etc.
@ettigur
Basic Spark terminology
● Driver
● Executor
● Cluster manager
○ Mesos, YARN or Standalone
● Managed Spark on public clouds
○ AWS EMR, GCP Dataproc, etc.
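To ground the terminology, here is a minimal, hypothetical sketch of a Spark application (not code from the talk): the main program below is the driver, and the cluster manager (e.g. YARN on EMR) launches executors for it when the application is submitted via spark-submit.

import org.apache.spark.sql.SparkSession

object MinimalApp {
  def main(args: Array[String]): Unit = {
    // The driver creates the SparkSession; master and deploy mode usually
    // come from spark-submit rather than being hard-coded here.
    val spark = SparkSession.builder()
      .appName("minimal-example")
      .getOrCreate()

    // The actual work (this sum) runs as tasks on the executors.
    spark.range(0, 1000000L).selectExpr("sum(id)").show()

    spark.stop()
  }
}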
@ettigur
The business use-case - measure campaigns' success in-flight
What are the logical phases of an advertising campaign?
@ettigur
Our use-case - measure campaigns' success in-flight
What does a funnel look like?
● Awareness - AD EXPOSURE: 100M (85M drop-off)
● Consideration - HOMEPAGE: 15M (5M drop-off)
● Intent - PRODUCT PAGE: 10M (7M drop-off)
● Purchase - CHECKOUT: 3M
@ettigur
In-flight analytics pipeline - high-level architecture
● Data Lake on S3, partitioned by date (date=2020-12-06, date=2020-12-07, date=2020-12-08, ...)
● Mart Generator:
  1. Reads the last day's files from the Data Lake
  2. Writes files by campaign and date into the Campaigns' marts
● Enricher:
  3. Reads files per campaign from the Campaigns' marts
  4. Writes files by date and campaign into the Enriched data store
  5. The Enriched data is loaded, campaign by campaign, into the analytics DB
@ettigur
In-flight analytics pipeline - problems
● Growing execution time: >24 hours/day
● Stability: sporadic failures
● High costs: $33,000/month
● Exhausting recovery ("babysitting"): many hours/incident
@ettigur
In-flight analytics pipeline - Mart Generator
● The Mart Generator reads the last day's files from the Data Lake (partitioned by date)
● It writes files by campaign and date into the Campaigns' marts, which the Enricher then consumes
@ettigur
Mart Generator problems
● Execution time: ran for over 7 hours
● Stability: experienced sporadic OOM failures
@ettigur
Digging deeper into resource allocation & utilization
There are various ways to examine Spark resource allocation and utilization:
● Spark UI (e.g. the Executors tab)
● Spark metrics system, e.g.:
○ JMX
○ Graphite
● YARN UI (if applicable)
● Cluster-wide monitoring tools, e.g. Ganglia
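As an illustration of the metrics option, here is a minimal sketch (not the exact configuration from the talk) of wiring Spark's metrics system to Graphite via Spark properties; the Graphite host and application name are placeholders, and the same keys can equivalently go into conf/metrics.properties without the spark.metrics.conf. prefix.

import org.apache.spark.sql.SparkSession

// Hypothetical example: push executor/JVM metrics to a Graphite server so
// utilization can be tracked over time. "graphite.internal.example" is a placeholder.
val spark = SparkSession.builder()
  .appName("mart-generator")  // hypothetical application name
  .config("spark.metrics.conf.*.sink.graphite.class",
          "org.apache.spark.metrics.sink.GraphiteSink")
  .config("spark.metrics.conf.*.sink.graphite.host", "graphite.internal.example")
  .config("spark.metrics.conf.*.sink.graphite.port", "2003")
  .config("spark.metrics.conf.*.sink.graphite.period", "10")
  .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
  .getOrCreate()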
@ettigur
Resource allocation - Spark UI
@ettigur
Resource allocation - YARN UI
@ettigur
Resource allocation - YARN UI
@ettigur
Resource utilization - Ganglia
@ettigur
Resource utilization - Ganglia
@ettigur
Mart Generator - initial resource allocation
● EMR cluster with 32 x i3.8xlarge worker nodes
○ Each with 32 cores, 244GB RAM and NVMe SSD
● spark.executor.cores=6
● spark.executor.memory=40g
● spark.executor.memoryOverhead=4g (0.10 * executorMemory)
● Executors per node = 32/6 = 5 (2 cores left over)
● Unused resources per node: 24GB memory, 2 cores
● Unused resources across the cluster: 768GB memory, 64 cores
○ Remember our OOM failures?
@ettigur
How to better allocate resources?
Number of executors per node = total cores / spark.executor.cores

| EC2 instance type | Optimized for | spark.executor.cores | Memory/executor | Overhead/executor | Executors/node |
| i3.8xlarge (32 vCore, 244 GiB mem - 236 usable, 4 x 1,900 NVMe SSD) | Memory & storage | 8 | 54g | 5g | 32/8 = 4 |
| i3.8xlarge | Memory & storage | 4 | 27g | 2.5g | 32/4 = 8 |
| r5.8xlarge (32 vCore, 256 GiB mem) | Memory | 8 | 60g | 8g | 32/8 = 4 |
| c5.9xlarge (36 vCore, 72 GiB mem) | Compute | 6 | 10g | 2g | 36/6 = 6 |
@ettigur
How to better allocate resources?
i3.8xlarge node - example
● Usable memory: 244GB - 8GB (reserved) = 236GB
● spark.executor.cores=6 (original): 5 executors x (40GB + 4GB) = 220GB, leaving 24GB unused
● spark.executor.cores=8: 236GB / 4 executors -> 54GB + 5GB overhead per executor
● spark.executor.cores=4: 236GB / 8 executors -> 27GB + 2.5GB overhead per executor
● Number of executors per node = total cores / spark.executor.cores
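For illustration, a minimal sketch of how the 8-cores-per-executor layout from the i3.8xlarge example above could be expressed in code; the application name is a placeholder, and in practice these values would typically be passed to spark-submit (--executor-cores, --executor-memory, --conf) rather than hard-coded.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mart-generator")                      // hypothetical application name
  .config("spark.executor.cores", "8")            // 32 cores / 8 = 4 executors per node
  .config("spark.executor.memory", "54g")         // ~236GB usable / 4 executors, minus overhead
  .config("spark.executor.memoryOverhead", "5g")  // ~10% of executor memory
  .getOrCreate()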
@ettigur
Mart Generator - better resource allocation
@ettigur
Mart Generator - better resource utilization, but...
@ettigur
Mart Generator requirement - overwrite latest date only
● Read the last day's files from the Data Lake (partitioned by date, e.g. date=2020-12-08)
● Write files by campaign and date into the Campaigns' marts, overwriting only the latest date partition
@ettigur
Overwrite partitions - the "trivial" Spark implementation
dataframe.write
.partitionBy("campaign", "date")
.mode(SaveMode.Overwrite)
.parquet(folderPath)
The result:
● Output written in parallel
● Overwriting the entire root folder - Data loss
@ettigur
Overwrite specific partitions - our "naive" implementation
// dataframesMap: Map[campaignCode, campaignDataframe]
dataframesMap.foreach { case (campaignCode, campaignDf) =>
  val outputPath = rootPath + "campaign=" + campaignCode + "/date=" + date
  campaignDf.write.mode(SaveMode.Overwrite).parquet(outputPath)
}
The result:
● Overwriting only relevant folders
● An extremely long tail (w.r.t execution time)
@ettigur
Overwrite specific partitions - Spark 2.3 implementation
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
dataframe.write
.partitionBy("campaign", "date")
.mode(SaveMode.Overwrite)
.parquet(folderPath)
The result:
● Output written in parallel
● Overwriting only relevant folders
@ettigur
Mart Generator - optimal resource utilization
@ettigur
Mart Generator - summary
● Better resource allocation & utilization
● Execution time decreased from 7+ hours to ~30 minutes
● No sporadic OOM failures
● Overwriting only relevant folders (i.e partitions)
@ettigur
In-flight analytics pipeline - Enricher
● The Enricher reads files per campaign from the Campaigns' marts
● It writes files by date and campaign into the Enriched data store
● The Enriched data is then loaded, campaign by campaign, into the analytics DB
@ettigur
Enricher problem - execution time
● Grew from 9 hours to 18 hours
● Sometimes took more than 20 hours
@ettigur
Enricher - initial resource utilization
@ettigur
Running multiple Spark “jobs” within a single Spark application
● Create one Spark application with one SparkContext
● Create a parallelized process (such as a thread pool)
● Each thread executes a separate Spark "job" (i.e. an action) in parallel
● "Jobs" wait in a queue and are executed based on available resources
○ This is managed by Spark's scheduler
@ettigur
Running multiple Spark “jobs” within a single Spark application
val campaigns: List[CampaignArguments]
...
val completeCampaigns = campaigns.par.map { campaign =>
  try {
    val ans = processCampaignSpark(campaign, appConf, log)
    Result(campaign.code, ans)
  } catch {
    case e: Exception =>
      log.info("Some thread caused an exception: " + e.getMessage)
      Result("", "", false, false)
  }
}
processCampaign - (spark.read -> spark process -> spark.write)
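The deck does not show processCampaignSpark itself; the following is only a hypothetical sketch of its read -> process -> write shape (the CampaignArguments fields, paths and enrichment step are placeholders, not the actual NMC code).

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Hypothetical sketch - the real enrichment logic is not shown in the deck.
case class CampaignArguments(code: String, martPath: String, outputPath: String)

def processCampaignSpark(campaign: CampaignArguments, spark: SparkSession): String = {
  // spark.read: load this campaign's mart (latest + historical dates)
  val mart: DataFrame = spark.read.parquet(campaign.martPath)

  // spark process: placeholder for the actual enrichment
  // (e.g. building the tactic/stage pairs of the funnel)
  val enriched: DataFrame = mart // .transform(buildFunnelPairs) - omitted

  // spark.write: each thread triggers its own write action, so multiple
  // "jobs" run concurrently inside the single Spark application
  enriched.write.mode(SaveMode.Overwrite).parquet(campaign.outputPath)
  campaign.outputPath // return value consumed by the caller's Result
}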
@ettigur
Spark UI - one Spark job example
@ettigur
Spark UI - multiple Spark “jobs” within a single Spark application
@ettigur
Enricher - optimal resource utilization
@ettigur
Enricher - summary
● Running multiple Spark “jobs” within a single Spark app
● Better resource utilization
● Execution time decreased from 20+ hours to ~30 minutes
@ettigur
In-flight analytics pipeline - before & after
● Growing execution time: >24 hours/day → 1 hour/day
● Stability: sporadic failures → improved
● High costs: $33,000/month → $2,000/month
● Exhausting recovery ("babysitting"): many hours/incident → 30 min/incident
@ettigur
In-flight analytics pipeline - before & after
● > 90% improvement across all metrics
● ~$372K saved per year!
@ettigur
What have we learned?
● You too can optimize Spark resource allocation & utilization
○ Leverage the tools at hand to deep-dive into your cluster
● Spark output phase can be parallelized even when overwriting specific partitions
○ Use dynamic partition inserts
● Running multiple Spark "jobs" within a single Spark application can be useful
● Optimizing data pipelines is an ongoing effort (not a one-off thing)
@ettigur
Want to know more?
● Women in Big Data
○ A worldwide program that aims:
■ To inspire, connect, grow, and champion the success of women in the Big Data & analytics field
○ 30+ chapters and 17,000+ members world-wide
○ Everyone can join (regardless of gender), so find a chapter near you -
https://www.womeninbigdata.org/wibd-structure/
Talks:
● Funnel Analysis with Spark and Druid - tinyurl.com/y5qboqpj
● Spark Dynamic Partition Inserts part 1 - tinyurl.com/yd94ztz5
● Spark Dynamic Partition Inserts Part 2 - tinyurl.com/y8uembml
● Our Tech Blog - medium.com/nmc-techblog
THANK YOU
Etti Gur
etti.gur@nielsen.com
Editor's Notes

  1. Thank you for joining our session - optimizing ... We will try to make it interesting and valuable for you
  2. Questions - at the end of the session
  3. Ways to identify Spark optimization opportunities, and how we optimized our Spark data pipelines using ideas such as: optimizing Spark resource allocation & utilization, parallelizing the Spark output phase with dynamic partition inserts, and running multiple Spark "jobs" within a single Spark application. Apache Spark is a cluster-computing framework: data items are distributed over a cluster of machines in distributed shared memory, and Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. In-memory processing makes it faster than Hadoop MapReduce. It requires a cluster manager and a distributed storage system (e.g. Hadoop YARN, HDFS).
  4. A group inside Nielsen, born from eXelate, which was acquired by Nielsen in March 2015. A data company: we collect or buy device data from our partners in various ways, online and offline, and we enrich it using machine learning models to create relevant, high-quality insights, categorized and sold according to need. "Enriching" in our case means generating attributes that we assign to a device based on the data we have, for example "Sports Fan" or "Eats Organic Food". The enriched data supports our clients' business decisions and allows them to target the relevant audiences - e.g. targeting in the digital marketing world, meaning fitting ads to viewers: a street sign fits only a small percentage of the people who see it, whereas online ads can fit the profile of the individual who sees them - more interesting to the user, more likely to be clicked, better ROI for the marketer.
  5. The data streams in on Kafka and is translated into Parquet files on our data lake on S3 (AWS cloud). The processing with Spark uses thousands of nodes around the clock. The analytical DB is Druid, which ingests 10s of TB per day.
  6. This presents us with many challenges, such as scalability, cost efficiency, fault tolerance and more. These challenges are relevant when building a new pipeline for production, but are also true for existing pipelines that have been running for a while. Otherwise...
  7. in this presentation, I'll share with you some tips and guidelines on how to do it, so YOU can do it too for your pipelines We will get to the stage - I optimized my pipeline and now I have to do it again because: The data keeps growing Code is changing Infrastructure (e.g hardware, frameworks, etc.) keeps evolving Conclusion - Optimizing data pipelines is an iterative process (ongoing effort, not a one-off)
  8. Since we want to talk about spark optimizations let’s just mention
  9. Driver - your application’s main program Executors - distributed worker processes responsible for running the individual tasks (i.e units of work) The driver and each executor are running on separate Java processes (one-to-one relation) A driver and its executors are together termed a Spark application. A Spark application is launched on a set of machines using an external service called a cluster manager
  10. The data pipeline we will talk about today Analysing advertising campaigns while they are active (in flight) a campaign has 4 stages 1 - User is aware of my product (seeing an ad - impression ) 2. User expresses interest - clicks the ad, goes into home page 3. Shows intent - goes into product page 4. If the Advertising was successful - the user is convinced to make the purchase
  11. In the AD Tech world this process is called funnel. Advertising strategy effectiveness Num of ppl that express interest in each one of the stages is getting smaller. In this example: 100M saw the ad, out of them 15M, … finally only 3M made the purchase. - on each stage we get less ppl We want to measure the impact of different advertising strategies of a given campaign, on users’ conversion rate. XXXX How many ppl that seen the ad - go forward in the funnel with intent + conversion (purchase) Look at the funnel we can measure which ad was more successful How do we do that? Our customers can define the meta-data for their campaigns. A campaign includes one or more tactics (which take place in a form of ads). A campaign also includes one or more stages. A stage is a specific place within a funnel (e.g landing page, registration, etc.) . For each tactic, we measure the number of users that were exposed to this tactic and reached each stage in the funnel
  12. High level - Architecture of the data pipeline: S3 data lake partitioned by date Mart generator - spark - reads data from the data lake Creates campaign mart for the latest date and Writes campaigns mart per campaign per date A “data mart” can either be a subset of the raw data or a variation / transformation / aggregation of the data 2nd part - Enricher - spark - for each campaign reads the campaigns events and builds the funnel structure we talked about before. Writes the enriched data to S3 for loading to the analytic database - druid We’re going to focus on 2 components - the Mart Generator and the Enricher
  13. This feature is in production for a few years now, and as time went by, the data grew and it reached to a point we needed to optimize the resources and time consumed, to make it sustainable
  14. (4 sec) Let's start with the first Spark process: the Mart Generator. As a reminder, it reads raw events and produces a mart per campaign per date on S3, running as a Spark application on EMR.
  15. We will see a couple of examples in the next slides. When we want to investigate such problems we can look at: the Spark UI, which has several dashboards - one of them is the Executors tab, where you can see each executor's memory; Spark metrics, which we can sink into tools such as JMX & Graphite and inspect trends over time; the YARN UI (if you run on YARN), which shows the resource manager; and cluster-wide tools like Ganglia. Druid is our analytics database.
  16. 2 secs - this is how it looks The Executors tab displays summary information about the executors that were created for the application, including memory and disk usage and task and shuffle information. The Storage Memory column shows the amount of memory used and reserved for caching data. The Executors tab provides not only resource information (amount of memory, disk, and cores used by each executor) but also performance information (GC time and shuffle information).
  17. 2 secs - this is how it looks
  18. 5 secs We can see idle cores - 9/16 not used This is an unrelated example (not from this specific pipeline)
  19. 2 sec Graph of cpu
  20. Very idle - let's see how we can figure out why the cluster was so under-utilized
  21. Very big machines. The config was as shown, with the default overhead. 2 unused cores and 24 GB unused memory per node; over the cluster, 64 unused cores and roughly 768 GB of memory not allocated.
  22. After you choose the right machine type for your cluster - optimized for your process needs (high memory, fast storage, etc.) - you want to configure the number of cores per executor so that all the cores are used. This applies to on-prem and any cloud as well.
  23. Applies to on-prem/any cloud as well - XXX visualize Consider duplicating row 1
  24. On yarn ui we can see now the application uses the whole cluster
  25. Changing the settings mentioned above fixed the utilization and the failures and brought the cluster to this utilization. Here you can see that the cluster is utilized during the main processing phase, but we have a very long tail where the cluster is still significantly under-utilized. - 6 hours - problem This long tail is the phase where output is written to S3, and it was written sequentially (i.e one campaign at a time), we’ll explain why in the next slide.
  26. Reminder We needed to add/overwrite the latest date for each campaign, We want to overwrite for case of failure and keep the older dates.
  27. Usually, when you work with such folder structure we would use this code to write the output in parallel, of the complete dataset + partition by. But when using this method, the whole root folder (mart folder) was overwritten (including dates that we did not have data in our current Dataset).
  28. What was running on production - is the naive solution for this problem The naive solution was to write the output for each campaign+date separately (using “overwrite” option). This is the cause of the long tail we saw in the above screenshot (where the cluster spent 6 out of 7 hours of execution time on writing the output). BTW, the old code created a map of datasets (filtered from red day just the events that were related to a certain campaign)
  29. To solve this problem - we wanted to go back to the better solution of the partition by (very quick) Explain what this new feature (in Spark 2.3) does Explain the side-effect (i.e the driver is moving the files in the .spark-staging-XXX-XXX-XXX to the final destination) What solved this issue Enables dynamic partition inserts when dynamic Default: static When INSERT OVERWRITE a partitioned data source table with dynamic partition columns, Spark SQL supports two modes (case-insensitive): • static - Spark deletes all the partitions that match the partition specification (e.g. PARTITION(a=1,b)) in the INSERT statement, before overwriting • dynamic - Spark doesn’t delete partitions ahead, and only overwrites those partitions that have data written into it The default (STATIC) is to keep the same behavior of Spark prior to 2.3. Note that this config doesn’t affect Hive serde tables, as they are always overwritten with dynamic mode. Use SQLConf.partitionOverwriteMode method to access the current value.
  30. well, that was really great, but... We also have the second part of the pipeline, let's look at that now
  31. We've finished with the Mart Generator and now we're moving to the next part, the Enricher. In order to run a Spark process for each campaign, there was a Python script that sequentially submitted a Spark application per campaign.
  32. Enricher takes the latest mart and all older marts per campaign Creates pairs of tactic - stage with tactic date (tactic happened before the stage) It takes the older data to match older tactics with new data
  33. Cluster was also under utilized This used to be a python script that sequentially submits a spark application for each campaign.
  34. Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users). By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
  35. Each process will read campaign latest & old marts and will re-create an enriched snapshot of the campaign XXX visualize the par of spark
  36. Check it out!