This document discusses cost efficiency strategies for managed Apache Spark services. It begins by outlining the motivations for focusing on costs and introduces tools for cost analysis and optimization. It then describes the Azure Databricks platform and how it can be used to run Spark workloads more cost efficiently compared to infrastructure as a service options. The document details various Databricks pricing plans and units. Finally, it provides several cost optimization strategies, such as pre-purchasing plans, selecting efficient runtimes and frameworks, avoiding unnecessary storage, setting spending limits, and enabling auto-scaling.
2. Find me on Social Media – Adi Polak
Twitter @adipolak
Medium – https://medium.com/@adipolak
Dev.to - dev.to/adipolak
LinkedIn - https://www.linkedin.com/in/adi-polak-68548365/
5. Start from the Beginning
Business idea (Product Manager)
Prioritization Process (R&D)
Design & Build (Software Dev HLD)
6. HLD
▪ Requirements
▪ Features
▪ Architecture
▪ Test Plans
▪ Security
▪ Deployments
▪ Monitoring /Audit Trails
▪ Maintenance
High Level Design
7. Yeah, but why should I care about costs?!
- Understand how Budget works – P&L
- Be able to influence Technical Decisions
- Culture of Financial Accountability
13. Organize Resources for Cost Awareness
▪ Reporting and billing – Azure Cost Management
▪ Organize – Resource Groups and/or subscriptions for control, reporting, and attribution of costs
25. Databricks Light does NOT support:
▪ Delta Lake
▪ Autopilot features such as autoscaling
▪ Highly concurrent, all-purpose clusters
▪ Notebooks, dashboards, and collaboration features
▪ Connectors to various data sources and BI tools
▪ Databricks Light is a runtime environment for jobs (or “automated workloads”).
26. DBU: Standard vs. Premium
DBU rate (per hour)   Standard   Premium
Analytics               0.40       0.55
Engineering             0.15       0.50
Engineering Light       0.07       0.22
https://bit.ly/2Tp5Zkh
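As a rough sketch of how the DBU rates above translate into spend, the helper below multiplies the per-hour DBU rate by cluster size and runtime. It assumes 1 DBU per node per hour for simplicity (real DBU consumption depends on the VM instance type), and the VM rate used in the example is a hypothetical placeholder, not an Azure price:

```python
# Rough Databricks cost estimate: DBU charges plus VM charges.
# DBU rates per hour are taken from the Standard/Premium table above.

DBU_RATES = {
    ("Analytics", "Standard"): 0.40, ("Analytics", "Premium"): 0.55,
    ("Engineering", "Standard"): 0.15, ("Engineering", "Premium"): 0.50,
    ("Engineering Light", "Standard"): 0.07, ("Engineering Light", "Premium"): 0.22,
}

def estimate_cost(workload, tier, nodes, hours, vm_rate_per_hour):
    """Return (dbu_cost, vm_cost, total) for a cluster run.

    Assumes 1 DBU per node per hour; actual DBU consumption
    varies with the chosen VM instance type.
    """
    dbu_cost = DBU_RATES[(workload, tier)] * nodes * hours
    vm_cost = vm_rate_per_hour * nodes * hours
    return dbu_cost, vm_cost, dbu_cost + vm_cost

# Example: a 4-node Engineering job on Standard for 2 hours,
# with a hypothetical VM rate of $0.50/hour per node.
dbu, vm, total = estimate_cost("Engineering", "Standard", 4, 2, 0.50)
print(f"DBU: ${dbu:.2f}, VM: ${vm:.2f}, total: ${total:.2f}")
```

Note how the VM bill dominates here; switching the same job to Engineering Light would cut only the DBU portion, which is why right-sizing the VMs matters as much as the plan.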
27. Workloads Examples
▪ Scheduled Job – Data Engineer
▪ On-Demand Job (triggered) – Data Engineer / BI / Analytics
▪ Exploratory (interactive) – BI / ML
28. VMs and DBUs
Prosenjit Chakraborty - https://medium.com/@cprosenjit/azure-databricks-cost-optimizations-5e1e17b39125
34. 2 – Select the right runtime & frameworks
▪ Delta Lake
▪ PySpark Pandas UDF
▪ Photon Engine
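As a minimal sketch of the Pandas UDF idea: the core function operates on whole pandas Series at once rather than row by row, which is what makes Spark's `pandas_udf` cheaper to run than a plain Python UDF. The discount logic and column names are hypothetical examples; only the `pandas_udf` wrapper shown in the comment is PySpark API:

```python
import pandas as pd

def apply_discount(price: pd.Series) -> pd.Series:
    """Vectorized transform: 10% discount on prices over 100."""
    # Series.where keeps values where the condition holds,
    # and substitutes the discounted value elsewhere.
    return price.where(price <= 100, price * 0.9)

# Locally this is plain pandas:
prices = pd.Series([50.0, 200.0, 120.0])
print(apply_discount(prices).tolist())  # [50.0, 180.0, 108.0]

# On Databricks you would register the same function as a Pandas UDF:
#
#   from pyspark.sql.functions import pandas_udf
#   discount_udf = pandas_udf(apply_discount, returnType="double")
#   df.withColumn("discounted", discount_udf("price"))
```

Because the function is ordinary pandas, it can be unit-tested off-cluster, which also keeps exploratory iterations off the billed cluster.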
35. 3 – Don’t use tmp/local file system storage
▪ dbutils storage is RA-GRS (read-access geo-redundant storage) – you might not need this type of storage!
https://bit.ly/2TdXsAi