SlideShare a Scribd company logo
1 of 40
Download to read offline
Cost Efficiency Strategies for
Managed Apache Spark Services
Adi Polak
Microsoft
Find me on Social Media – Adi Polak
Twitter @adipolak
Medium – https://medium.com/@adipolak
Dev.to - dev.to/adipolak
LinkedIn - https://www.linkedin.com/in/adi-polak-68548365/
Agenda
▪ Motivation
▪ Tools
▪ Azure Databricks
▪ Cost Optimizations Strategies
▪ Wrap-up
Motivation
Start from the Beginning
Business idea ( Product Manager )
Prioritization Process (R&D)
Design & Build (Software Dev HLD)
HDL
▪ Requirements
▪ Features
▪ Architecture
▪ Test Plans
▪ Security
▪ Deployments
▪ Monitoring /Audit Trails
▪ Maintenance
High Level Design
Yeah, but why should I care about costs ?!
- Understand how Budget works – P&L
- Be able to influence Technical Decisions
- Culture of Financial Accountability
https://www.insightpartners.com/blog/product-leaders-are-rd-costs-part-of-your-strategic-discussions/
Tools
Many, many Services
Apache Spark & Cloud Computing delivery model
IaaS vs. PaaS vs. SaaS
Cloud Pricing Calculator
Azure Pricing Calculator - https://azure.microsoft.com/pricing/calculator/
AWS Pricing Calculator - https://calculator.aws/
GCP Pricing Calculator - https://cloud.google.com/products/calculator
Organize Resources for Cost Awareness
▪ Report and billings - Azure Cost Management
▪ Organize – Resources Groups and/or subscriptions
control, reporting, and attribution of costs
Subscription and Billing models
• Pay as you go
• Enterprise Agreements
• …
Azure Databricks
Where to run Spark Workloads
▪ Small - Mid-size Team
▪ Spark expertise
▪ Optimizations
▪ Machines –VMs
▪ Network
▪ Storage
▪ DBU
Kubernetes/IaaS vs. Azure Databricks ( Managed Spark Service )
▪ Bigger Team
▪ K8s + Spark expertise
▪ Optimizations Experts
▪ Machines – VMs
▪ Network
▪ Storage
Resource Consumed:
Plan Tiers
Premium vs Standard
Performance
Security
Monitoring
Databricks Units
▪ DATA ENGINEERING LIGHT
▪ DATA ENGINEERING
▪ DATA ANALYTICS
Three levels of service, AWS + Azure have the same levels
Databricks Data Engineering Light supports:
▪ scheduled JAR, Python, or spark-submit job
▪ Only.
Databricks Light does NOT support:
▪ Delta Lake
▪ Autopilot features such as autoscaling
▪ Highly concurrent, all-purpose clusters
▪ Notebooks, dashboards, and collaboration features
▪ Connectors to various data sources and BI tools
▪ Databricks Light is a runtime environment for jobs (or “automated
workloads”).
DBU: Standard vs. Premium
DBU Standard Premium
Analytics 0.4 0.55
Engineering 0.15 0.5
Engineering Light 0.07 0.22
https://bit.ly/2Tp5Zkh
Workloads Examples
▪ Scheduled Job - Data Engineer
▪ On Demand Job – triggered - Data Engineer / BI / Analytics
▪ Exploratory – Interactive - BI/ML
VMs and DBUs
Prosenjit Chakraborty - https://medium.com/@cprosenjit/azure-databricks-cost-optimizations-5e1e17b39125
Scenario Breakdown
▪ # VMs = 400
▪ Hours run = 1
▪ Cores in VM = 4
▪ General workload
Cost - Engineering VS. Engineering Light - Standart
$$$ = (#VMs*VMs $/hour + $RuntimeType*#DBUs)*(1-performance factor)
156.6
140.94
125.28
109.62
93.96
78.3
132.6 132.6 132.6 132.6 132.6 132.6
0
20
40
60
80
100
120
140
160
180
1 0.9 0.8 0.7 0.6 0.5
Engineering Engineering Light
1- Performance factor
Cost
#VMs = 400
VMs $/hour = 0.279
#DBUs = #VMs*0.75
$EngineeringLight = 0.07
$Engineering = 0.15
Cost - Engineering VS. Engineering Light - Premium
$$$ = (#VMs*VMs $/hour + $RuntimeType*#DBUs)*(1-performance factor)
#VMs = 400
VMs $/hour = 0.279
#DBUs = #VMs*0.75
$EngineeringLight = 0.22
$Engineering = 0. 5
261.6
235.44
209.28
183.12
156.96
130.8
177.6 177.6 177.6 177.6 177.6 177.6
0
50
100
150
200
250
300
1 0.9 0.8 0.7 0.6 0.5
Premium: Engineering VS. Engineering Light
Engineering Engineering Light
Cost 1- Performance factor
Cost Optimizations strategies
1 - Pre Purchase Plan – 1 & 3 years
2 – Select the right runtime & frameworks
• DeltaLake
▪ PySpark Pandas UDF
▪ Photon Engine
3 – Don’t use tmp/local files system storage
▪ dbutils storage is RA-GRS (read-access geo-
redundant storage) - you might not need this type
of storage!
https://bit.ly/2TdXsAi
Cost Optimizations tips
Manage Spending limit
▪ Per subscription
▪ Per management group
▪ Per resource group
▪ Enable Quota alerts
Enable AutoScale
▪ Scaling machines up and down automatically
https://bit.ly/3dMOROK
VMs
▪ Think about your needs
Thank You!
@adipolak

More Related Content

What's hot

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudMark Kromer
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksKnoldus Inc.
 
Getting Started with Databricks SQL Analytics
Getting Started with Databricks SQL AnalyticsGetting Started with Databricks SQL Analytics
Getting Started with Databricks SQL AnalyticsDatabricks
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsThomas Sykes
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?James Serra
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on AzureTrivadis
 
Welcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewWelcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewAmazon Web Services
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
 
Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)Sprinkle Data Inc
 

What's hot (20)

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
 
Getting Started with Databricks SQL Analytics
Getting Started with Databricks SQL AnalyticsGetting Started with Databricks SQL Analytics
Getting Started with Databricks SQL Analytics
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data Flows
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
HDInsight for Architects
HDInsight for ArchitectsHDInsight for Architects
HDInsight for Architects
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
Welcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewWelcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution Overview
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Snowflake Datawarehouse Architecturing
Snowflake Datawarehouse ArchitecturingSnowflake Datawarehouse Architecturing
Snowflake Datawarehouse Architecturing
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)
 

Similar to Cost Efficiency Strategies for Managed Apache Spark Service

Sunrun slide for informatica summit - Harish Ramachandraiah
Sunrun slide for informatica summit - Harish RamachandraiahSunrun slide for informatica summit - Harish Ramachandraiah
Sunrun slide for informatica summit - Harish RamachandraiahHarish Ramachandraiah
 
20180701 - 1st Meeting - Data Science Orientation
20180701 - 1st Meeting - Data Science Orientation20180701 - 1st Meeting - Data Science Orientation
20180701 - 1st Meeting - Data Science OrientationDuc Lai Trung Minh
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j
 
Why the Microsoft 365 Administrator should care about the Power Platform Gove...
Why the Microsoft 365 Administrator should care about the Power Platform Gove...Why the Microsoft 365 Administrator should care about the Power Platform Gove...
Why the Microsoft 365 Administrator should care about the Power Platform Gove...Sara Barbosa
 
Using Power BI and Azure as analytics engine for business applications
Using Power BI and Azure as analytics engine for business applicationsUsing Power BI and Azure as analytics engine for business applications
Using Power BI and Azure as analytics engine for business applicationsDigital Illustrated
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
(PUGDV04) Embedding Powerapps into a Power BI Dashboard.pptx
(PUGDV04) Embedding Powerapps into a Power BI Dashboard.pptx(PUGDV04) Embedding Powerapps into a Power BI Dashboard.pptx
(PUGDV04) Embedding Powerapps into a Power BI Dashboard.pptxaditya555320
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksDatabricks
 
Shop talk - Project Server 2013
Shop talk - Project Server 2013Shop talk - Project Server 2013
Shop talk - Project Server 2013Chris Givens
 
The Microsoft Well Architected Framework For Data Analytics
The Microsoft Well Architected Framework For Data AnalyticsThe Microsoft Well Architected Framework For Data Analytics
The Microsoft Well Architected Framework For Data AnalyticsStephanie Locke
 
Solution Design & Architecture.pptx
Solution Design & Architecture.pptxSolution Design & Architecture.pptx
Solution Design & Architecture.pptxNikhileshSathyavarap
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Streamlining Workflows: Unleashing Automation with Azure and Power Automate
Streamlining Workflows: Unleashing Automation with Azure and Power AutomateStreamlining Workflows: Unleashing Automation with Azure and Power Automate
Streamlining Workflows: Unleashing Automation with Azure and Power AutomateHamida Rebai Trabelsi
 
Measure and increase developer productivity with help of Severless by Kazulki...
Measure and increase developer productivity with help of Severless by Kazulki...Measure and increase developer productivity with help of Severless by Kazulki...
Measure and increase developer productivity with help of Severless by Kazulki...Vadym Kazulkin
 
Microsoft Flow advanced: tips, pitfalls, problems and warnings to be known be...
Microsoft Flow advanced: tips, pitfalls, problems and warnings to be known be...Microsoft Flow advanced: tips, pitfalls, problems and warnings to be known be...
Microsoft Flow advanced: tips, pitfalls, problems and warnings to be known be...BIWUG
 
Microsoft for BI and DW: Using the Right Tool for the Job
Microsoft for BI and DW: Using the Right Tool for the JobMicrosoft for BI and DW: Using the Right Tool for the Job
Microsoft for BI and DW: Using the Right Tool for the JobSenturus
 
How to Build TOGAF Architectures With System Architect (2).ppt
How to Build TOGAF Architectures With System Architect (2).pptHow to Build TOGAF Architectures With System Architect (2).ppt
How to Build TOGAF Architectures With System Architect (2).pptStevenShing
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy Neo4j
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 

Similar to Cost Efficiency Strategies for Managed Apache Spark Service (20)

Sunrun slide for informatica summit - Harish Ramachandraiah
Sunrun slide for informatica summit - Harish RamachandraiahSunrun slide for informatica summit - Harish Ramachandraiah
Sunrun slide for informatica summit - Harish Ramachandraiah
 
20180701 - 1st Meeting - Data Science Orientation
20180701 - 1st Meeting - Data Science Orientation20180701 - 1st Meeting - Data Science Orientation
20180701 - 1st Meeting - Data Science Orientation
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael Moore
 
Why the Microsoft 365 Administrator should care about the Power Platform Gove...
Why the Microsoft 365 Administrator should care about the Power Platform Gove...Why the Microsoft 365 Administrator should care about the Power Platform Gove...
Why the Microsoft 365 Administrator should care about the Power Platform Gove...
 
Using Power BI and Azure as analytics engine for business applications
Using Power BI and Azure as analytics engine for business applicationsUsing Power BI and Azure as analytics engine for business applications
Using Power BI and Azure as analytics engine for business applications
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
(PUGDV04) Embedding Powerapps into a Power BI Dashboard.pptx
(PUGDV04) Embedding Powerapps into a Power BI Dashboard.pptx(PUGDV04) Embedding Powerapps into a Power BI Dashboard.pptx
(PUGDV04) Embedding Powerapps into a Power BI Dashboard.pptx
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
 
Shop talk - Project Server 2013
Shop talk - Project Server 2013Shop talk - Project Server 2013
Shop talk - Project Server 2013
 
The Microsoft Well Architected Framework For Data Analytics
The Microsoft Well Architected Framework For Data AnalyticsThe Microsoft Well Architected Framework For Data Analytics
The Microsoft Well Architected Framework For Data Analytics
 
Solution Design & Architecture.pptx
Solution Design & Architecture.pptxSolution Design & Architecture.pptx
Solution Design & Architecture.pptx
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Streamlining Workflows: Unleashing Automation with Azure and Power Automate
Streamlining Workflows: Unleashing Automation with Azure and Power AutomateStreamlining Workflows: Unleashing Automation with Azure and Power Automate
Streamlining Workflows: Unleashing Automation with Azure and Power Automate
 
Measure and increase developer productivity with help of Severless by Kazulki...
Measure and increase developer productivity with help of Severless by Kazulki...Measure and increase developer productivity with help of Severless by Kazulki...
Measure and increase developer productivity with help of Severless by Kazulki...
 
Microsoft Flow advanced: tips, pitfalls, problems and warnings to be known be...
Microsoft Flow advanced: tips, pitfalls, problems and warnings to be known be...Microsoft Flow advanced: tips, pitfalls, problems and warnings to be known be...
Microsoft Flow advanced: tips, pitfalls, problems and warnings to be known be...
 
Microsoft for BI and DW: Using the Right Tool for the Job
Microsoft for BI and DW: Using the Right Tool for the JobMicrosoft for BI and DW: Using the Right Tool for the Job
Microsoft for BI and DW: Using the Right Tool for the Job
 
How to Build TOGAF Architectures With System Architect (2).ppt
How to Build TOGAF Architectures With System Architect (2).pptHow to Build TOGAF Architectures With System Architect (2).ppt
How to Build TOGAF Architectures With System Architect (2).ppt
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 

Recently uploaded (20)

What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 

Cost Efficiency Strategies for Managed Apache Spark Service

  • 1. Cost Efficiency Strategies for Managed Apache Spark Services Adi Polak Microsoft
  • 2. Find me on Social Media – Adi Polak Twitter @adipolak Medium – https://medium.com/@adipolak Dev.to - dev.to/adipolak LinkedIn - https://www.linkedin.com/in/adi-polak-68548365/
  • 3. Agenda ▪ Motivation ▪ Tools ▪ Azure Databricks ▪ Cost Optimizations Strategies ▪ Wrap-up
  • 5. Start from the Beginning Business idea ( Product Manager ) Prioritization Process (R&D) Design & Build (Software Dev HLD)
  • 6. HDL ▪ Requirements ▪ Features ▪ Architecture ▪ Test Plans ▪ Security ▪ Deployments ▪ Monitoring /Audit Trails ▪ Maintenance High Level Design
  • 7. Yeah, but why should I care about costs ?! - Understand how Budget works – P&L - Be able to influence Technical Decisions - Culture of Financial Accountability
  • 11. Apache Spark & Cloud Computing delivery model IaaS vs. PaaS vs. SaaS
  • 12. Cloud Pricing Calculator Azure Pricing Calculator - https://azure.microsoft.com/pricing/calculator/ AWS Pricing Calculator - https://calculator.aws/ GCP Pricing Calculator - https://cloud.google.com/products/calculator
  • 13. Organize Resources for Cost Awareness ▪ Report and billings - Azure Cost Management ▪ Organize – Resources Groups and/or subscriptions control, reporting, and attribution of costs
  • 14.
  • 15. Subscription and Billing models • Pay as you go • Enterprise Agreements • …
  • 17. Where to run Spark Workloads ▪ Small - Mid-size Team ▪ Spark expertise ▪ Optimizations ▪ Machines –VMs ▪ Network ▪ Storage ▪ DBU Kubernetes/IaaS vs. Azure Databricks ( Managed Spark Service ) ▪ Bigger Team ▪ K8s + Spark expertise ▪ Optimizations Experts ▪ Machines – VMs ▪ Network ▪ Storage
  • 20.
  • 21.
  • 23. Databricks Units ▪ DATA ENGINEERING LIGHT ▪ DATA ENGINEERING ▪ DATA ANALYTICS Three levels of service, AWS + Azure have the same levels
  • 24. Databricks Data Engineering Light supports: ▪ scheduled JAR, Python, or spark-submit job ▪ Only.
  • 25. Databricks Light does NOT support: ▪ Delta Lake ▪ Autopilot features such as autoscaling ▪ Highly concurrent, all-purpose clusters ▪ Notebooks, dashboards, and collaboration features ▪ Connectors to various data sources and BI tools ▪ Databricks Light is a runtime environment for jobs (or “automated workloads”).
  • 26. DBU: Standard vs. Premium DBU Standard Premium Analytics 0.4 0.55 Engineering 0.15 0.5 Engineering Light 0.07 0.22 https://bit.ly/2Tp5Zkh
  • 27. Workloads Examples ▪ Scheduled Job - Data Engineer ▪ On Demand Job – triggered - Data Engineer / BI / Analytics ▪ Exploratory – Interactive - BI/ML
  • 28. VMs and DBUs Prosenjit Chakraborty - https://medium.com/@cprosenjit/azure-databricks-cost-optimizations-5e1e17b39125
  • 29. Scenario Breakdown ▪ # VMs = 400 ▪ Hours run = 1 ▪ Cores in VM = 4 ▪ General workload
  • 30. Cost - Engineering VS. Engineering Light - Standart $$$ = (#VMs*VMs $/hour + $RuntimeType*#DBUs)*(1-performance factor) 156.6 140.94 125.28 109.62 93.96 78.3 132.6 132.6 132.6 132.6 132.6 132.6 0 20 40 60 80 100 120 140 160 180 1 0.9 0.8 0.7 0.6 0.5 Engineering Engineering Light 1- Performance factor Cost #VMs = 400 VMs $/hour = 0.279 #DBUs = #VMs*0.75 $EngineeringLight = 0.07 $Engineering = 0.15
  • 31. Cost - Engineering VS. Engineering Light - Premium $$$ = (#VMs*VMs $/hour + $RuntimeType*#DBUs)*(1-performance factor) #VMs = 400 VMs $/hour = 0.279 #DBUs = #VMs*0.75 $EngineeringLight = 0.22 $Engineering = 0. 5 261.6 235.44 209.28 183.12 156.96 130.8 177.6 177.6 177.6 177.6 177.6 177.6 0 50 100 150 200 250 300 1 0.9 0.8 0.7 0.6 0.5 Premium: Engineering VS. Engineering Light Engineering Engineering Light Cost 1- Performance factor
  • 33. 1 - Pre Purchase Plan – 1 & 3 years
  • 34. 2 – Select the right runtime & frameworks • DeltaLake ▪ PySpark Pandas UDF ▪ Photon Engine
  • 35. 3 – Don’t use tmp/local files system storage ▪ dbutils storage is RA-GRS (read-access geo- redundant storage) - you might not need this type of storage! https://bit.ly/2TdXsAi
  • 37. Manage Spending limit ▪ Per subscription ▪ Per management group ▪ Per resource group ▪ Enable Quota alerts
  • 38. Enable AutoScale ▪ Scaling machines up and down automatically https://bit.ly/3dMOROK
  • 39. VMs ▪ Think about your needs