Optimizing Spark

I'll provide guidelines for thinking about empirical performance evaluation of parallel programs in general and of Spark jobs in particular. It's easier to be systematic about this if you think in terms of "What's the effective network bandwidth we're getting?" instead of "How fast does this particular job run?" In addition, the figure of merit for parallel performance isn't necessarily obvious. If you want to minimize your AWS bill you should almost certainly run on a single node (but your job may take six months to finish). You may think you want answers as quickly as possible, but if you could make a job finish in 55 minutes instead of 60 minutes while doubling your AWS bill, would you do it? No? Then what exactly is the metric that you should optimize?

Proprietary and confidential 2
If you’ve thought about this at all, you won’t learn anything from me today.
If you haven’t thought about this, you’ll learn a few principles to organize your thinking.

Know what you want to measure
You don’t want to measure run times.
You want to measure effective performance of some machine characteristic: network bandwidth, file access latency, or CPU operations per second.

You do this with carefully constructed data sets
To measure network bandwidth, construct data sets with the same number of files (so file access latency is constant) and do the same operation on them (so that CPU operations are constant), but force some extra data of variable size (e.g. random 1-byte ints vs. random 8-byte ints) to come along for the ride.
Then take the difference of the run times: the extra bytes moved divided by the extra time is the effective bandwidth.
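
The differencing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's own tooling: the function name `effective_bandwidth`, the record count, and the run times below are all hypothetical, and the sketch assumes the two runs differ only in the number of bytes moved, so that file access latency and CPU time cancel in the subtraction.

```python
def effective_bandwidth(bytes_small, secs_small, bytes_large, secs_large):
    """Estimate effective bandwidth in bytes/second from two runs that are
    identical except for how much data they move."""
    extra_bytes = bytes_large - bytes_small
    extra_secs = secs_large - secs_small
    if extra_bytes <= 0:
        raise ValueError("the large run must move more data than the small run")
    if extra_secs <= 0:
        # The larger run was not slower, so noise dominates the measurement.
        raise ValueError("no measurable slowdown between the two runs")
    return extra_bytes / extra_secs


# Hypothetical measurements: the same number of records in both runs,
# carrying a 1-byte payload in one run and an 8-byte payload in the other.
records = 10**9
bw = effective_bandwidth(records * 1, 100.0, records * 8, 170.0)
print(f"effective bandwidth: {bw / 1e6:.0f} MB/s")
```

Repeating each run several times and differencing the medians would make the estimate less sensitive to noise; that refinement is left out here to keep the sketch short.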

Case Study: Effective Network Bandwidth
Everything seemed to run slowly under Spark 2.0...
Latency and CPU performance looked fine.
But we got terrible network bandwidth from Spark 2.0.
Not necessarily intrinsic to Spark 2.0… could have been some detail of our setup.
However, Spark 2.1 worked fine, so we just decommissioned our Spark 2.0 setup.

How do you know if you’re getting your money’s worth out of parallelization?

Run time vs. Number of Executors
Probably the first plot you draw… but doesn’t really tell you what you want to know.

Overall Cost (in dollars if possible) vs. executors
In a perfect world (linear speed-ups) cost is independent of parallelism.
In the real world costs generally rise with parallelism.
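
A rough sketch of why this happens, assuming the bill is simply executors × wall time × price. Everything named here is an assumption made for illustration: the price, the single-executor run time, and the use of Amdahl's law with a made-up serial fraction as a stand-in for "the real world". Measured run times would replace `amdahl_walltime` in practice.

```python
def job_cost(executors, walltime_hours, price_per_executor_hour):
    """Dollar cost if you pay per executor for the whole duration of the job."""
    return executors * walltime_hours * price_per_executor_hour


def ideal_walltime(t1_hours, executors):
    """Perfect linear speed-up: n executors finish n times faster."""
    return t1_hours / executors


def amdahl_walltime(t1_hours, executors, serial_fraction):
    """Amdahl's law (an assumed model): the serial part does not speed up."""
    return t1_hours * (serial_fraction + (1.0 - serial_fraction) / executors)


PRICE = 0.5     # hypothetical dollars per executor-hour
T1 = 6.0        # hypothetical run time on one executor, in hours
SERIAL = 0.25   # hypothetical fraction of the job that cannot be parallelized

executor_counts = [1, 2, 4, 8, 16]
ideal_costs = [job_cost(n, ideal_walltime(T1, n), PRICE) for n in executor_counts]
real_costs = [job_cost(n, amdahl_walltime(T1, n, SERIAL), PRICE) for n in executor_counts]

for n, ideal, real in zip(executor_counts, ideal_costs, real_costs):
    print(f"{n:>2} executors: ideal ${ideal:.2f}, with serial work ${real:.2f}")
```

Under linear speed-up the executor count cancels out of the product, so the ideal column stays flat. Any work that does not parallelize is paid for on every executor, so the second column climbs.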

Benefit: 1/walltime = answers per hour
1 hour vs. 2 hours: Probably not a big deal.
1 week vs. 2 weeks: Probably is a big deal.
1 minute vs. 10 minutes is a huge deal: Too easy to get distracted if your debug cycle is 10 minutes.

Once you are crisp on the costs and benefits, you will be in a position to say things like:
“If I double the amount of parallelism for this job, my AWS bill will rise by 30 pct and the job will run in 45 minutes instead of 60 minutes. Does that seem worth it to me?”
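
The arithmetic behind a statement like that can be laid out explicitly. The sketch below uses only the numbers from the quote (60 minutes, 45 minutes, 30 percent); the function name `tradeoff` and the particular summary figures are assumptions chosen for illustration, not a metric the talk prescribes.

```python
def tradeoff(old_minutes, new_minutes, cost_increase_pct):
    """Summarize the cost and benefit of a change in parallelism."""
    minutes_saved = old_minutes - new_minutes
    return {
        # Benefit, measured as 1/walltime in answers per hour.
        "answers_per_hour_before": 60.0 / old_minutes,
        "answers_per_hour_after": 60.0 / new_minutes,
        "speedup": old_minutes / new_minutes,
        # Cost, relative to the current bill.
        "cost_ratio": 1.0 + cost_increase_pct / 100.0,
        "minutes_saved": minutes_saved,
        "extra_cost_pct_per_minute_saved": cost_increase_pct / minutes_saved,
    }


summary = tradeoff(old_minutes=60.0, new_minutes=45.0, cost_increase_pct=30.0)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With these inputs the job delivers about a third more answers per hour for 30 percent more money, or 2 percent of the current bill per minute saved. Whether that is worth it remains a judgment call; the point is that the question is now posed in concrete terms.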

Recap
Focus on measuring performance of intrinsic machine characteristics like network bandwidth to characterize performance.
Use carefully constructed data sets that change one and only one thing to do it.
Be crisp on costs (dollars) and benefits (essentially debug cycles per hour) of parallelism to make informed choices about whether you want more or less of it.

1. Optimizing Spark Greg Novak
