SlideShare a Scribd company logo
1 of 23
Download to read offline
Best Practices for Enabling
Speculative Execution on
Large Scale Platforms
Ron Hu, Venkata Sowrirajan
LinkedIn
Agenda
▪ Motivation
▪ Enhancements
▪ Configuration
▪ Metrics and analysis
▪ User guidance
▪ Future work
Speculative Execution
• A stage consists of many parallel
tasks and the stage can run as fast as
the slowest task runs.
• If one task is running very slowly in a
stage, Spark driver will re-launch a
speculation task for it on a different host.
• Between the regular task and the
speculation task, whichever task finishes
first is used. The slower task is killed.
3
❌
✅
Launch
speculative task
Time
Default Speculation Parameters
4
Configuration
Parameters
Default
value
Meaning
spark.speculation false If set to "true", performs speculative execution of
tasks.
spark.speculation.interval 100ms How often Spark will check for tasks to speculate.
spark.speculation.multiplier 1.5 How many times slower a task is than the median to
be considered for speculation.
spark.speculation.quantile 0.75 Fraction of tasks which must be complete before
speculation is enabled for a particular stage.
Motivation
• Speeds up straggler tasks - additional overhead.
• Default configs are generally too aggressive in most
cases.
• Speculating tasks that run for few seconds are
mostly wasteful.
• Investigate impact caused by data skews,
overloaded shuffle services etc with speculation
enabled.
• What is the impact if we enable speculation by
default in a multi-tenant large scale cluster?
5
Speculative Execution improvements
• Tasks run for few seconds gets speculated wasting
resources unnecessarily.
• Solution: Prevent tasks getting speculated which run
for few seconds
• Internally, introduced a new spark configuration
(spark.speculation.minRuntimeThreshold) which prevents
tasks from getting speculated that runs for less than the min threshold
time.
• Similar feature later got added to Apache Spark in
SPARK-33741
6
Speculative execution metrics
• Additional metrics are required to understand both the
usefulness and overhead introduced by speculative
execution.
• Existing onTaskEnd and onTaskStart event in
AppStatusListener is enriched to produce speculation
summary metrics for a stage.
7
Speculative execution metrics
• Added additional metrics for a stage with
speculative execution like:
• Number of speculated tasks
• Number of successful speculated tasks
• Number of killed speculated tasks
• Number of failed speculated tasks
8
Speculative execution metrics
• Speculation summary for a stage with additional metrics using
the existing events.
9
Updated Speculation Parameter Values
• Upstream Spark’s default speculation parameter values are not good for us.
• LinkedIn’s Spark jobs are mainly for batch off-line jobs plus some interactive analytics
workloads.
• We set speculation parameters to these default values for most LinkedIn’s applications.
Users can still overwrite per their individual needs.
10
Configuration Parameters
Upstream
Default
LinkedIn
Default
spark.speculation false true
spark.speculation.interval 100ms 1 sec
spark.speculation.multiplier 1.5 4.0
spark.speculation.quantile 0.75 0.90
spark.speculation.min.threshold N/A 30 sec
Metrics and Analysis
• We care about ROI (Return On Investment).
• We analyzed
• The return or performance gain, and
• The investment/overhead or additional cost
• We measured various metrics for one week on a large
cluster with 10K+ machines.
• A multi-tenant environment with 40K+ Spark applications running daily.
• Enabled dynamic allocations.
• With resource sharing and contention, performance varies due to transient
delays/congestions.
11
Task Level Statistics
12
1.24% 0.32% 60%
Duration
delta
Success
rate
Additional
tasks
Ratio of all
launched
speculation
tasks over
all tasks
Speculated
tasks success
rate
Ratio of
duration of all
speculation
tasks over
duration of all
regular tasks
1.65M
2.73M
Fruitful
tasks
Speculated
tasks
Total number
of the
launched
speculation
tasks
Total number
of fruitful
speculation
tasks
● A speculation task is
fruitful if it finishes
earlier than the
corresponding regular
task.
● The conservative values
in the config parameters
leads to high success
rate.
Stage Level Statistics
447K 184K 140K
Total
eligible
stages
Stages with
speculation
tasks
Stages with
fruitful
speculation
tasks
● A stage is eligible for
speculation if its duration > 30
seconds with at least 10 tasks.
● 41% of them launched
speculation tasks
● Among those stages that
launched speculation tasks,
76% of them received
performance benefits.
Stages Fruitful Stages
Speculated stages
Application Level Statistics
157K 59K 51K
Total
applications
Applications
with
speculation
tasks
Applications
with fruitful
speculation
tasks
● 38% of all Spark applications
launched speculation tasks.
● 87% of them benefit from the
speculative execution.
● Overall 32% of all Spark
applications benefit from the
speculation execution.
Applications Fruitful apps
Speculated apps
Case Study
• We analyzed the impact on a mission critical
application.
• It has a total of 29 Spark application flows.
• Some Spark flows run daily. Some run hourly.
• Each flow has a well defined SLA.
• We took measures of all the flows for
• two weeks before enabling speculation, and
• two weeks after enabling speculation.
15
Number in Minutes BEFORE enabling AFTER enabling After/Before ratio
Geometric mean of average
elapsed times of all flows
7.44 6.47 87% (or decreased by
13%)
Geometric mean of standard
deviation of elapsed times for
all flows
2.91 1.71 59%(or decreased by
41%)
Resource Consumption Impact
 
17
Decrease
by 24%
User Guidance: Where speculation can help
• A mapper task is slow because the running executor is too busy and/or
some system hangs due to hardware/software issues.
• We used to see ‘run-away’ tasks sometimes due to some system hang issues.
• After enabling speculation, we rarely see ‘run-away’ tasks.
• The ‘run-away’ tasks were later killed since their corresponding speculation tasks
finished earlier.
• The network route is congested somewhere.
• There exists another data copy.
• The regular task normally will reach the ‘NODE_LOCAL’/’RACK_LOCAL” copy. The
speculation task usually reaches the ‘ANY’ data copy
• If the initial task was launched suboptimally, its speculative task can have better
locality.
18
User Guidance: Where speculation cannot help
• Data skew
• Overload of shuffle services causing reducer task
delays
• Not enough memory causing tasks to spill.
• Spark driver does not know the root cause why a task
is slow when it launches a speculation task.
19
Summary
• At LinkedIn, we further enhanced Spark engine to monitor
speculation statistics.
• We shared our configuration settings to effectively manage
speculative executions.
• Depending on your performance goal, you need to decide how much
overhead you can tolerate.
• ROI if speculation parameters are properly set:
• I: small increase in network messages
• I: small overhead in Spark Driver
• R: good saving in executor resources
• R: good reduction in job elapsed times
• R: significant reduction in the variation of elapsed times, leading to a
more predictable/consistent performance.
Future Work
• Add intelligence to Spark driver to decide whether or
not to launch speculation tasks.
• Distinguish between the manageable/unmanageable causes.
• On the cloud, we may have unlimited resources.
However, we may need to factor in the money cost.
• What is the cost in launching additional executors?
21
Acknowledgement
We want to thank
▪ Eric Baldeschweiler
▪ Sunitha Beeram
▪ LinkedIn Spark Team
for their enlightening discussions
and insightful comments.
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

More Related Content

What's hot

Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdfAmit Raj
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...Spark Summit
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at FacebookDatabricks
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Edureka!
 

What's hot (20)

Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala
 

Similar to Best Practices for Enabling Speculative Execution on Large Scale Platforms

Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingDatabricks
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesEd Hunter
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsDataWorks Summit/Hadoop Summit
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDatabricks
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Etu Solution
 
November 2013 HUG: Real-time analytics with in-memory grid
November 2013 HUG: Real-time analytics with in-memory gridNovember 2013 HUG: Real-time analytics with in-memory grid
November 2013 HUG: Real-time analytics with in-memory gridYahoo Developer Network
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
 
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs
Flink Forward
 
Performance tuning Grails applications
Performance tuning Grails applicationsPerformance tuning Grails applications
Performance tuning Grails applicationsLari Hotari
 
Apache Spark and Oracle Stream Analytics
Apache Spark and Oracle Stream AnalyticsApache Spark and Oracle Stream Analytics
Apache Spark and Oracle Stream AnalyticsPrabhu Thukkaram
 
Deep Dive into Spark
Deep Dive into SparkDeep Dive into Spark
Deep Dive into SparkEric Xiao
 
Software engineering 9 software cost estimation
Software engineering 9 software cost estimationSoftware engineering 9 software cost estimation
Software engineering 9 software cost estimationVaibhav Khanna
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Performance tuning Grails applications
 Performance tuning Grails applications Performance tuning Grails applications
Performance tuning Grails applicationsGR8Conf
 
Benchmarking at Parse
Benchmarking at ParseBenchmarking at Parse
Benchmarking at ParseTravis Redman
 
Advanced Benchmarking at Parse
Advanced Benchmarking at ParseAdvanced Benchmarking at Parse
Advanced Benchmarking at ParseMongoDB
 

Similar to Best Practices for Enabling Speculative Execution on Large Scale Platforms (20)

Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Spark cep
Spark cepSpark cep
Spark cep
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data Platforms
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 
November 2013 HUG: Real-time analytics with in-memory grid
November 2013 HUG: Real-time analytics with in-memory gridNovember 2013 HUG: Real-time analytics with in-memory grid
November 2013 HUG: Real-time analytics with in-memory grid
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs

 
Performance tuning Grails applications
Performance tuning Grails applicationsPerformance tuning Grails applications
Performance tuning Grails applications
 
Apache Spark and Oracle Stream Analytics
Apache Spark and Oracle Stream AnalyticsApache Spark and Oracle Stream Analytics
Apache Spark and Oracle Stream Analytics
 
Deep Dive into Spark
Deep Dive into SparkDeep Dive into Spark
Deep Dive into Spark
 
Software engineering 9 software cost estimation
Software engineering 9 software cost estimationSoftware engineering 9 software cost estimation
Software engineering 9 software cost estimation
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Performance tuning Grails applications
 Performance tuning Grails applications Performance tuning Grails applications
Performance tuning Grails applications
 
Benchmarking at Parse
Benchmarking at ParseBenchmarking at Parse
Benchmarking at Parse
 
Advanced Benchmarking at Parse
Advanced Benchmarking at ParseAdvanced Benchmarking at Parse
Advanced Benchmarking at Parse
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 

Recently uploaded (20)

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 

Best Practices for Enabling Speculative Execution on Large Scale Platforms

  • 1. Best Practices for Enabling Speculative Execution on Large Scale Platforms Ron Hu, Venkata Sowrirajan LinkedIn
  • 2. Agenda ▪ Motivation ▪ Enhancements ▪ Configuration ▪ Metrics and analysis ▪ User guidance ▪ Future work
  • 3. Speculative Execution • A stage consists of many parallel tasks and the stage can run as fast as the slowest task runs. • If one task is running very slowly in a stage, Spark driver will re-launch a speculation task for it on a different host. • Between the regular task and the speculation task, whichever task finishes first is used. The slower task is killed. 3 ❌ ✅ Launch speculative task Time
  • 4. Default Speculation Parameters 4 Configuration Parameters Default value Meaning spark.speculation false If set to "true", performs speculative execution of tasks. spark.speculation.interval 100ms How often Spark will check for tasks to speculate. spark.speculation.multiplier 1.5 How many times slower a task is than the median to be considered for speculation. spark.speculation.quantile 0.75 Fraction of tasks which must be complete before speculation is enabled for a particular stage.
  • 5. Motivation • Speeds up straggler tasks - additional overhead. • Default configs are generally too aggressive in most cases. • Speculating tasks that run for few seconds are mostly wasteful. • Investigate impact caused by data skews, overloaded shuffle services etc with speculation enabled. • What is the impact if we enable speculation by default in a multi-tenant large scale cluster? 5
  • 6. Speculative Execution improvements • Tasks run for few seconds gets speculated wasting resources unnecessarily. • Solution: Prevent tasks getting speculated which run for few seconds • Internally, introduced a new spark configuration (spark.speculation.minRuntimeThreshold) which prevents tasks from getting speculated that runs for less than the min threshold time. • Similar feature later got added to Apache Spark in SPARK-33741 6
  • 7. Speculative execution metrics • Additional metrics are required to understand both the usefulness and overhead introduced by speculative execution. • Existing onTaskEnd and onTaskStart event in AppStatusListener is enriched to produce speculation summary metrics for a stage. 7
  • 8. Speculative execution metrics • Added additional metrics for a stage with speculative execution like: • Number of speculated tasks • Number of successful speculated tasks • Number of killed speculated tasks • Number of failed speculated tasks 8
  • 9. Speculative execution metrics • Speculation summary for a stage with additional metrics using the existing events. 9
  • 10. Updated Speculation Parameter Values • Upstream Spark’s default speculation parameter values are not good for us. • LinkedIn’s Spark jobs are mainly for batch off-line jobs plus some interactive analytics workloads. • We set speculation parameters to these default values for most LinkedIn’s applications. Users can still overwrite per their individual needs. 10 Configuration Parameters Upstream Default LinkedIn Default spark.speculation false true spark.speculation.interval 100ms 1 sec spark.speculation.multiplier 1.5 4.0 spark.speculation.quantile 0.75 0.90 spark.speculation.min.threshold N/A 30 sec
  • 11. Metrics and Analysis • We care about ROI (Return On Investment). • We analyzed • The return or performance gain, and • The investment/overhead or additional cost • We measured various metrics for one week on a large cluster with 10K+ machines. • A multi-tenant environment with 40K+ Spark applications running daily. • Enabled dynamic allocations. • With resource sharing and contention, performance varies due to transient delays/congestions. 11
  • 12. Task Level Statistics 12 1.24% 0.32% 60% Duration delta Success rate Additional tasks Ratio of all launched speculation tasks over all tasks Speculated tasks success rate Ratio of duration of all speculation tasks over duration of all regular tasks 1.65M 2.73M Fruitful tasks Speculated tasks Total number of the launched speculation tasks Total number of fruitful speculation tasks ● A speculation task is fruitful if it finishes earlier than the corresponding regular task. ● The conservative values in the config parameters leads to high success rate.
  • 13. Stage Level Statistics 447K 184K 140K Total eligible stages Stages with speculation tasks Stages with fruitful speculation tasks ● A stage is eligible for speculation if its duration > 30 seconds with at least 10 tasks. ● 41% of them launched speculation tasks ● Among those stages that launched speculation tasks, 76% of them received performance benefits. Stages Fruitful Stages Speculated stages
  • 14. Application Level Statistics 157K 59K 51K Total applications Applications with speculation tasks Applications with fruitful speculation tasks ● 38% of all Spark applications launched speculation tasks. ● 87% of them benefit from the speculative execution. ● Overall 32% of all Spark applications benefit from the speculation execution. Applications Fruitful apps Speculated apps
  • 15. Case Study • We analyzed the impact on a mission critical application. • It has a total of 29 Spark application flows. • Some Spark flows run daily. Some run hourly. • Each flow has a well defined SLA. • We took measures of all the flows for • two weeks before enabling speculation, and • two weeks after enabling speculation. 15
  • 16. Number in Minutes BEFORE enabling AFTER enabling After/Before ratio Geometric mean of average elapsed times of all flows 7.44 6.47 87% (or decreased by 13%) Geometric mean of standard deviation of elapsed times for all flows 2.91 1.71 59%(or decreased by 41%)
  • 18. User Guidance: Where speculation can help • A mapper task is slow because the running executor is too busy and/or some system hangs due to hardware/software issues. • We used to see ‘run-away’ tasks sometimes due to some system hang issues. • After enabling speculation, we rarely see ‘run-away’ tasks. • The ‘run-away’ tasks were later killed since their corresponding speculation tasks finished earlier. • The network route is congested somewhere. • There exists another data copy. • The regular task normally will reach the ‘NODE_LOCAL’/’RACK_LOCAL” copy. The speculation task usually reaches the ‘ANY’ data copy • If the initial task was launched suboptimally, its speculative task can have better locality. 18
  • 19. User Guidance: Where speculation cannot help • Data skew • Overload of shuffle services causing reducer task delays • Not enough memory causing tasks to spill. • Spark driver does not know the root cause why a task is slow when it launches a speculation task. 19
  • 20. Summary • At LinkedIn, we further enhanced Spark engine to monitor speculation statistics. • We shared our configuration settings to effectively manage speculative executions. • Depending on your performance goal, you need to decide how much overhead you can tolerate. • ROI if speculation parameters are properly set: • I: small increase in network messages • I: small overhead in Spark Driver • R: good saving in executor resources • R: good reduction in job elapsed times • R: significant reduction in the variation of elapsed times, leading to a more predictable/consistent performance.
  • 21. Future Work • Add intelligence to Spark driver to decide whether or not to launch speculation tasks. • Distinguish between the manageable/unmanageable causes. • On the cloud, we may have unlimited resources. However, we may need to factor in the money cost. • What is the cost in launching additional executors? 21
  • 22. Acknowledgement We want to thank ▪ Eric Baldeschweiler ▪ Sunitha Beeram ▪ LinkedIn Spark Team for their enlightening discussions and insightful comments.
  • 23. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.