This document discusses cost efficiency strategies for managed Apache Spark services. It begins by outlining the motivations for focusing on costs and introduces tools for cost analysis and optimization. It then describes the Azure Databricks platform and how it can be used to run Spark workloads more cost efficiently compared to infrastructure as a service options. The document details various Databricks pricing plans and units. Finally, it provides several cost optimization strategies, such as pre-purchasing plans, selecting efficient runtimes and frameworks, avoiding unnecessary storage, setting spending limits, and enabling auto-scaling.
2. Find me on Social Media – Adi Polak
Twitter @adipolak
Medium – https://medium.com/@adipolak
Dev.to - dev.to/adipolak
LinkedIn - https://www.linkedin.com/in/adi-polak-68548365/
5. Start from the Beginning
Business idea (Product Manager)
Prioritization Process (R&D)
Design & Build (Software Dev HLD)
6. HLD
▪ Requirements
▪ Features
▪ Architecture
▪ Test Plans
▪ Security
▪ Deployments
▪ Monitoring /Audit Trails
▪ Maintenance
High Level Design
7. Yeah, but why should I care about costs?!
- Understand how Budget works – P&L
- Be able to influence Technical Decisions
- Culture of Financial Accountability
13. Organize Resources for Cost Awareness
▪ Reporting and billing – Azure Cost Management
▪ Organize – Resource Groups and/or subscriptions for control, reporting, and attribution of costs
25. Databricks Light does NOT support:
▪ Delta Lake
▪ Autopilot features such as autoscaling
▪ Highly concurrent, all-purpose clusters
▪ Notebooks, dashboards, and collaboration features
▪ Connectors to various data sources and BI tools
▪ Databricks Light is a runtime environment for jobs (or “automated workloads”).
26. DBU: Standard vs. Premium
DBU rate (per hour)   Standard   Premium
Analytics               0.40       0.55
Engineering             0.15       0.50
Engineering Light       0.07       0.22
https://bit.ly/2Tp5Zkh
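As a rough sketch of how the DBU rates above translate into spend, the helper below multiplies the per-hour DBU rate by cluster size and runtime. It assumes 1 DBU per node per hour for simplicity (real DBU consumption depends on the VM instance type), and the VM rate used in the example is a hypothetical placeholder, not an Azure price:

```python
# Rough Databricks cost estimate: DBU charges plus VM charges.
# DBU rates per hour are taken from the Standard/Premium table above.

DBU_RATES = {
    ("Analytics", "Standard"): 0.40, ("Analytics", "Premium"): 0.55,
    ("Engineering", "Standard"): 0.15, ("Engineering", "Premium"): 0.50,
    ("Engineering Light", "Standard"): 0.07, ("Engineering Light", "Premium"): 0.22,
}

def estimate_cost(workload, tier, nodes, hours, vm_rate_per_hour):
    """Return (dbu_cost, vm_cost, total) for a cluster run.

    Assumes 1 DBU per node per hour; actual DBU consumption
    varies with the chosen VM instance type.
    """
    dbu_cost = DBU_RATES[(workload, tier)] * nodes * hours
    vm_cost = vm_rate_per_hour * nodes * hours
    return dbu_cost, vm_cost, dbu_cost + vm_cost

# Example: a 4-node Engineering job on Standard for 2 hours,
# with a hypothetical VM rate of $0.50/hour per node.
dbu, vm, total = estimate_cost("Engineering", "Standard", 4, 2, 0.50)
print(f"DBU: ${dbu:.2f}, VM: ${vm:.2f}, total: ${total:.2f}")
```

Note how the VM bill dominates here; switching the same job to Engineering Light would cut only the DBU portion, which is why right-sizing the VMs matters as much as the plan.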
27. Workloads Examples
▪ Scheduled Job – Data Engineer
▪ On-Demand Job (triggered) – Data Engineer / BI / Analytics
▪ Exploratory (interactive) – BI / ML
28. VMs and DBUs
Prosenjit Chakraborty - https://medium.com/@cprosenjit/azure-databricks-cost-optimizations-5e1e17b39125
34. 2 – Select the right runtime & frameworks
▪ Delta Lake
▪ PySpark Pandas UDF
▪ Photon Engine
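As a minimal sketch of the Pandas UDF idea: the core function operates on whole pandas Series at once rather than row by row, which is what makes Spark's `pandas_udf` cheaper to run than a plain Python UDF. The discount logic and column names are hypothetical examples; only the `pandas_udf` wrapper shown in the comment is PySpark API:

```python
import pandas as pd

def apply_discount(price: pd.Series) -> pd.Series:
    """Vectorized transform: 10% discount on prices over 100."""
    # Series.where keeps values where the condition holds,
    # and substitutes the discounted value elsewhere.
    return price.where(price <= 100, price * 0.9)

# Locally this is plain pandas:
prices = pd.Series([50.0, 200.0, 120.0])
print(apply_discount(prices).tolist())  # [50.0, 180.0, 108.0]

# On Databricks you would register the same function as a Pandas UDF:
#
#   from pyspark.sql.functions import pandas_udf
#   discount_udf = pandas_udf(apply_discount, returnType="double")
#   df.withColumn("discounted", discount_udf("price"))
```

Because the function is ordinary pandas, it can be unit-tested off-cluster, which also keeps exploratory iterations off the billed cluster.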
35. 3 – Don’t use tmp/local file system storage
▪ dbutils storage is RA-GRS (read-access geo-redundant storage) – you might not need this type of storage!
https://bit.ly/2TdXsAi