SlideShare a Scribd company logo
Scaling Databricks to Run Data & AI
Workloads on Millions of VMs
Matei Zaharia
Cofounder and Chief Technologist
@matei_zaharia
2
With Contributions From
Jeff Pang Nanxi Kang Haogang Chen
Michael Armbrust Ihor Leshko Ali Ghodsi
3
Outline
About Databricks
Challenges developing large-scale cloud services
Four lessons from our experience
4
Outline
About Databricks
Challenges developing large-scale cloud services
Four lessons from our experience
5
About Databricks
Founded in 2013 by the Apache Spark group at UC Berkeley
Cloud-based data and ML platform for >7000 customers
§ Millions of VMs run per day, processing exabytes of data/day
§ 100,000s of users
1500 employees, 300 engineers, >$350M ARR
6
Security policies
Our Product
Built around
open source:
Data science
workspace
SQL analytics
Scheduled jobs
Data scientists
Data engineers
Business users
Cloud Storage
Compute Clusters
ML platform
Databricks Runtime
Data catalog
Customer’s Cloud AccountDatabricks Services
inside
7
Some of Our Customers
Financial Services Healthcare & Pharma Media & Entertainment Technology
Public Sector Retail & CPG Consumer Services Energy & IndustrialMarketing & AdTech
Data & Analytics Services
Nike
8
Example Use Cases
Correlate 500,000 patient records with DNA to design therapies
Optimize inventory management using simulations and ML
Identify securities fraud via machine learning on 30 PB of data
9
Usage Growth: VMs/day
10
What’s Challenging About This?
All the challenges of B2B cloud services
§ Availability, security, multitenancy, reliable updates, etc
+
All the challenges of large-scale data processing
§ Scalability, reliability, failures, etc
11
Traditional Software Cloud SoftwareVendorCustomers
Dev Team
Release
6-12 months
Users
Ops
Users
Ops
Users
Ops
Users
Ops
Dev + Ops Team
1-2 weeks
Users
Ops
Users
Ops
Users
Ops
Users
Ops
6-12 months
vs.
12
Benefits of Cloud Software for Users
Management built-in: much more value than the software bits
alone (security, high availability, etc)
Elasticity: pay-as-you-go, scale on demand
Better features released faster
13
Challenges in Building Cloud Software
Building a multitenant service: significant scaling, security and
performance isolation work that you don’t need on-prem
Operating the service: security, availability, monitoring, etc
(but customers would have to do it themselves on-prem)
Upgrading without regressions: very hard, but critical for users to
trust your cloud (on-prem apps don’t require this)
Includes API, semantics, and performance regressions
14
Challenges of Large Scale
Each customer app runs on up to thousands of VMs, so systems
must tolerate failures, stragglers, etc among those
Even one customer app could potentially overload our services
Weird rare failure modes that are only seen at scale
15
Four Lessons
1
2
3
What goes wrong in cloud services?
Testing for scalability & stability
Developing control planes
4 Evolving big data systems for the cloud
16
Four Lessons
1
2
3
What goes wrong in cloud services?
Testing for scalability & stability
Developing control planes
4 Evolving big data systems for the cloud
17
Causes of Significant Outages
Scaling problem
in our services
Scaling problem in
underlying cloud services
Insufficient
user isolation
Deployment
misconfiguration
Other
30%
20%20%
20%
10%
18
Causes of Significant Outages
Scaling problem
in our services
Scaling problem in
underlying cloud services
Insufficient
user isolation
Deployment
misconfiguration
Other
30%
20%20%
20%
10% 70% scale
related
19
Some Issues We Experienced
Cloud networks: limits, partitions, slow DHCP, hung connections
Automated apps creating large load
Very large requests, results, etc
Slow VM launches/shutdowns, lack of VM capacity
Data corruption writing to cloud storage
20
Example Outage: Aborted Jobs
Jobs Service launches & tracks jobs on clusters
1 customer running many jobs/sec on same cluster
Cloud’s network reaches a limit of 1000 connections/VM
between Jobs Service & clusters
§ After this limit, new connections hang in state SYN_SENT
Resource usage from hanging connections causes
memory pressure and GC
Health checks to some jobs time out, so we abort them
Jobs
Service
Cloud
Network Customer
Clusters
21
Surprisingly Rare Issues
1 cloud-wide VM restart on AWS (Xen patch)
1 misreported security scan on customer VM
1 significant S3 outage
2 Linux kernel bugs (TCP-related)
22
Lessons
Cloud services must handle load that varies on many dimensions,
and rely on other services with varying limits & failure modes
Problems get worse in a “cloud service economy” where each
service depends on others (maybe from other vendors)
23
Four Lessons
1
2
3
What goes wrong in cloud services?
Testing for scalability & stability
Developing control planes
4 Evolving big data systems for the cloud
24
Testing for Scalability & Stability
Software correctness is a Boolean property: does your software
give the right output on a given input?
Scalability and stability are a matter of degree
§ What load will your system fail at? (any system with limited resources will)
§ What failure behavior will you have? (crash all clients, drop some, etc)
25
Example Scalability Problems
Large result: can crash browser,
notebook service, driver or Spark
Large record in file
Large # of tasks
Code that freezes a worker
+ All these affect other users!
User Browser Notebook
Service
Driver
App
Workers
Other Users
??
26
Databricks Stress Test Infrastructure
1. Identify dimensions for a system to scale in (e.g. # of users, number
of output rows, size of each row, latency of an RPC, etc)
2. Grow load in each dimension until a failure occurs
3. Record failure type and impact on system
§ Error message, timeout, wrong result?
§ Are other clients affected?
§ Does the system auto-recover? How fast?
4. Compare over time and on changes
27
Example Output
28
Four Lessons
1
2
3
What goes wrong in cloud services?
Testing for scalability & stability
Developing control planes
4 Evolving big data systems for the cloud
29
Developing Control Planes
Cloud software consists of interacting, independently updated
services, many of which call other services
What is the right programming model for this software?
Billing
Service
Notebook
Service
Jobs
Service
Cluster
Service …
30
Service Framework
We designed a shared service framework that handles:
§ Cross-cloud deployment: AWS, Azure, local, special environments
§ Storage: databases, schema updates, etc
§ Security tokens & roles
§ Monitoring
§ API routing & limiting
§ Feature flagging
Our service stack:
32
Best Practices
Isolate state: relational DB is usually enough with org sharding
Isolate components that scale differently: allows separate scaling
Manage changes through feature flags: fastest, safest way
Watch key metrics: most outages could be predicted from one of
CPU load, memory load, DB load or thread pool exhaustion
Test pyramid: 70% unit tests, 20% integration, 10% end-to-end
33
Cluster manager v1 Cluster manager v2
Example: Cluster Manager
Cloud
VM API
Cluster
Manager
Customer Clusters
Cloud
VM API
CM Master
Customer Clusters
Delegate Delegate
Usage, billing, etc
VM launch, setup,
monitoring, etc
35
Four Lessons
1
2
3
What goes wrong in cloud services?
Testing for scalability & stability
Developing control planes
4 Evolving big data systems for the cloud
36
Evolving Big Data Systems for the Cloud
MapReduce, Spark, etc were designed for on-premise datacenters
How can we evolve these leverage the benefits of the cloud?
§ Availability, elasticity, scale, multitenancy, etc
Two examples from Databricks:
§ Delta Lake: ACID on cloud object stores
§ “Cloudifying” Apache Spark
37
Delta Lake Motivation
Cloud object stores (S3, Azure blob, etc) are the largest storage
systems on the planet
§ Unmatched availability, parallel I/O bandwidth, and cost-efficiency
Open source big data stack was designed for on-prem world
§ Filesystem API for storage
§ RDBMS for table metadata (Hive metastore)
§ Other distributed systems, e.g. ZooKeeper
How can big data systems fully leverage cloud object stores?
Stronger consistency model
Scale & management complexity
38
Spark on HDFS Spark on S3 (Naïve)
Example: Atomic Parallel Writes
Input Files Output Partitions
/tmp-job-1
/my-output
Atomic
Rename
Input Files Output Partitions
/my-output/part-1
/my-output/part-2
/my-output/part-3
/my-output/part-4
part-1
part-2
part-3
part-4
/_DONE
Full object names
(no cheap rename)
Real cases are harder (e.g. appending to a table)
39
Delta Lake Design
1. Track metadata that says which objects are part of a dataset
2. Store this metadata itself in a cloud object store
§ Write-ahead log in S3, compressed using Apache Parquet
Before Delta Lake: 50% of Spark support
issues were about cloud storage
After: fewer issues, increased perf
Input Files Output Partitions
/my-output/part-X
/my-output/part-Y
/my-output/part-Z
/my-output/part-W
/my-output/_delta_log
Commit
Protocol https://delta.io
10x faster metadata
ops than Hive on S3!
41
Major Benefits of Delta Lake
Once we had transactions over S3, we could build much more:
§ UPSERT, DELETE, etc (GDPR)
§ Time travel
§ Caching
§ Multidimensional indexing
§ Audit logging
§ Background optimization
0
0.2
0.4
0.6
0.8
1
Parquet
Parquet+
order
Delta
Z-order
Runningtime
Filter on 2 Fields
Result: greatly simplified
customers’ data architectures
Before and After Delta
Typical Architecture with Multiple Data Systems
ETL Job ETL Job
ETL Job
Delta Lake
Table 1
Delta Lake
Table 2
Delta Lake
Table 3
Streaming
Analytics
Data
Scientists
BI Users
Cloud Object Store
input
Cloud Object Store
ETL Job ETL Job
ETL Job
Message
Queue
Parquet
Table 1
Parquet
Table 2
Parquet
Table 3
Data
Warehouse
Data
Warehouse
Streaming
Analytics
Data
Scientists
BI Users
input
Using Delta Lake for All Storage
Fewer copies of the data, fewer ETL steps,
no divergence & faster results!
Delta Lake Adoption
Used by thousands of companies to process exabytes of data/day
Grew from zero to ~50% of the Databricks workload in 3 years
Largest deployments: exabyte tables and 1000s of users
45
Launched Today: SQL Analytics Product
New native SQL engine
running over Delta Lake
Queries cloud storage with
data warehouse performance
SQL security control model
and easy access from BI tools
0
10000
20000
30000
40000
DW1 DW2 DW3 Delta
Engine
TPC-DS 30TB Benchmark
Time (s)
46
“Cloudifying” Apache Spark
“High-concurrency” clusters with strong user isolation
§ Give users a “serverless” experience for ad hoc analytics queries
Scheduler-integrated cluster autoscaling
Autoscaling storage volumes
…
47
Conclusion
The cloud is eating software by enabling much better products
§ Self-managing, elastic, and highly available
But building reliable, large-scale cloud products is hard
Many ideas can help, from software engineering best practices to
new testing methods and system designs for large scale
48
We’re Hiring!
databricks.com/careers

More Related Content

What's hot

Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Enterprise Data Architecture Deliverables
Enterprise Data Architecture DeliverablesEnterprise Data Architecture Deliverables
Enterprise Data Architecture Deliverables
Lars E Martinsson
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?
DATAVERSITY
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock
 
Data Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced AnalyticsData Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced Analytics
DATAVERSITY
 
Data Governance
Data GovernanceData Governance
Data Governance
Rob Lux
 
3 Keys To Successful Master Data Management - Final Presentation
3 Keys To Successful Master Data Management - Final Presentation3 Keys To Successful Master Data Management - Final Presentation
3 Keys To Successful Master Data Management - Final Presentation
James Chi
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
Dmitry Anoshin
 
Data Migration Strategies PowerPoint Presentation Slides
Data Migration Strategies PowerPoint Presentation SlidesData Migration Strategies PowerPoint Presentation Slides
Data Migration Strategies PowerPoint Presentation Slides
SlideTeam
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
Jeffrey T. Pollock
 
Customer Event Hub - the modern Customer 360° view
Customer Event Hub - the modern Customer 360° viewCustomer Event Hub - the modern Customer 360° view
Customer Event Hub - the modern Customer 360° view
Guido Schmutz
 
Chapter 4: Data Architecture Management
Chapter 4: Data Architecture ManagementChapter 4: Data Architecture Management
Chapter 4: Data Architecture Management
Ahmed Alorage
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
Wasm1953
 
Reference master data management
Reference master data managementReference master data management
Reference master data management
Dr. Hamdan Al-Sabri
 
Lessons in Data Modeling: Data Modeling & MDM
Lessons in Data Modeling: Data Modeling & MDMLessons in Data Modeling: Data Modeling & MDM
Lessons in Data Modeling: Data Modeling & MDM
DATAVERSITY
 
Master Data Management - Practical Strategies for Integrating into Your Data ...
Master Data Management - Practical Strategies for Integrating into Your Data ...Master Data Management - Practical Strategies for Integrating into Your Data ...
Master Data Management - Practical Strategies for Integrating into Your Data ...
DATAVERSITY
 
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
HostedbyConfluent
 
Data Quality Best Practices
Data Quality Best PracticesData Quality Best Practices
Data Quality Best Practices
DATAVERSITY
 

What's hot (20)

Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Enterprise Data Architecture Deliverables
Enterprise Data Architecture DeliverablesEnterprise Data Architecture Deliverables
Enterprise Data Architecture Deliverables
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Data Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced AnalyticsData Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced Analytics
 
Data Governance
Data GovernanceData Governance
Data Governance
 
3 Keys To Successful Master Data Management - Final Presentation
3 Keys To Successful Master Data Management - Final Presentation3 Keys To Successful Master Data Management - Final Presentation
3 Keys To Successful Master Data Management - Final Presentation
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Data Migration Strategies PowerPoint Presentation Slides
Data Migration Strategies PowerPoint Presentation SlidesData Migration Strategies PowerPoint Presentation Slides
Data Migration Strategies PowerPoint Presentation Slides
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Customer Event Hub - the modern Customer 360° view
Customer Event Hub - the modern Customer 360° viewCustomer Event Hub - the modern Customer 360° view
Customer Event Hub - the modern Customer 360° view
 
Chapter 4: Data Architecture Management
Chapter 4: Data Architecture ManagementChapter 4: Data Architecture Management
Chapter 4: Data Architecture Management
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
Reference master data management
Reference master data managementReference master data management
Reference master data management
 
Lessons in Data Modeling: Data Modeling & MDM
Lessons in Data Modeling: Data Modeling & MDMLessons in Data Modeling: Data Modeling & MDM
Lessons in Data Modeling: Data Modeling & MDM
 
Master Data Management - Practical Strategies for Integrating into Your Data ...
Master Data Management - Practical Strategies for Integrating into Your Data ...Master Data Management - Practical Strategies for Integrating into Your Data ...
Master Data Management - Practical Strategies for Integrating into Your Data ...
 
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
 
Data Quality Best Practices
Data Quality Best PracticesData Quality Best Practices
Data Quality Best Practices
 

Similar to Scaling Databricks to Run Data and ML Workloads on Millions of VMs

Lessons from Large-Scale Cloud Software at Databricks
Lessons from Large-Scale Cloud Software at DatabricksLessons from Large-Scale Cloud Software at Databricks
Lessons from Large-Scale Cloud Software at Databricks
Matei Zaharia
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
Alicja Sieminska
 
Technology Overview
Technology OverviewTechnology Overview
Technology Overview
Liran Zelkha
 
Oracle Keynote Cloud Expo 11-04-09
Oracle Keynote Cloud Expo 11-04-09Oracle Keynote Cloud Expo 11-04-09
Oracle Keynote Cloud Expo 11-04-09
Rex Wang
 
Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)
Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)
Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)
Manoj Kumar
 
Using Grid Technologies in the Cloud for High Scalability
Using Grid Technologies in the Cloud for High ScalabilityUsing Grid Technologies in the Cloud for High Scalability
Using Grid Technologies in the Cloud for High Scalability
mabuhr
 
Accelerate Digital Transformation with IBM Cloud Private
Accelerate Digital Transformation with IBM Cloud PrivateAccelerate Digital Transformation with IBM Cloud Private
Accelerate Digital Transformation with IBM Cloud Private
Michael Elder
 
Database as a Service - Tutorial @ICDE 2010
Database as a Service - Tutorial @ICDE 2010Database as a Service - Tutorial @ICDE 2010
Database as a Service - Tutorial @ICDE 2010
DBIS @ Ilmenau University of Technology
 
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Prolifics
 
An introduction to the cloud 11 v1
An introduction to the cloud 11 v1An introduction to the cloud 11 v1
An introduction to the cloud 11 v1
charan7575
 
Jjm cloud computing
Jjm cloud computingJjm cloud computing
Jjm cloud computing
Manali Bagrecha
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
Mathews Job
 
VAS - VMware CMP
VAS - VMware CMPVAS - VMware CMP
Cloud On-Ramp Project Briefing
Cloud On-Ramp Project BriefingCloud On-Ramp Project Briefing
Cloud On-Ramp Project Briefing
Robert McDermott
 
Cloud Computing Networks
Cloud Computing NetworksCloud Computing Networks
Cloud Computing Networks
jayapal385
 
Cloud Migration.pdf
Cloud Migration.pdfCloud Migration.pdf
Cloud Migration.pdf
Zen Bit Tech
 
mysql_pn_heatwave.pdf
mysql_pn_heatwave.pdfmysql_pn_heatwave.pdf
mysql_pn_heatwave.pdf
RavishPatel19
 
A perspective on cloud computing and enterprise saa s applications
A perspective on cloud computing and enterprise saa s applicationsA perspective on cloud computing and enterprise saa s applications
A perspective on cloud computing and enterprise saa s applications
George Milliken
 
introduction to distributed computing.pptx
introduction to distributed computing.pptxintroduction to distributed computing.pptx
introduction to distributed computing.pptx
ApthiriSurekha
 
cloud computing models
cloud computing modelscloud computing models
cloud computing models
Kiran Kumar Anumandla
 

Similar to Scaling Databricks to Run Data and ML Workloads on Millions of VMs (20)

Lessons from Large-Scale Cloud Software at Databricks
Lessons from Large-Scale Cloud Software at DatabricksLessons from Large-Scale Cloud Software at Databricks
Lessons from Large-Scale Cloud Software at Databricks
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Technology Overview
Technology OverviewTechnology Overview
Technology Overview
 
Oracle Keynote Cloud Expo 11-04-09
Oracle Keynote Cloud Expo 11-04-09Oracle Keynote Cloud Expo 11-04-09
Oracle Keynote Cloud Expo 11-04-09
 
Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)
Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)
Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)
 
Using Grid Technologies in the Cloud for High Scalability
Using Grid Technologies in the Cloud for High ScalabilityUsing Grid Technologies in the Cloud for High Scalability
Using Grid Technologies in the Cloud for High Scalability
 
Accelerate Digital Transformation with IBM Cloud Private
Accelerate Digital Transformation with IBM Cloud PrivateAccelerate Digital Transformation with IBM Cloud Private
Accelerate Digital Transformation with IBM Cloud Private
 
Database as a Service - Tutorial @ICDE 2010
Database as a Service - Tutorial @ICDE 2010Database as a Service - Tutorial @ICDE 2010
Database as a Service - Tutorial @ICDE 2010
 
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
 
An introduction to the cloud 11 v1
An introduction to the cloud 11 v1An introduction to the cloud 11 v1
An introduction to the cloud 11 v1
 
Jjm cloud computing
Jjm cloud computingJjm cloud computing
Jjm cloud computing
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
VAS - VMware CMP
VAS - VMware CMPVAS - VMware CMP
VAS - VMware CMP
 
Cloud On-Ramp Project Briefing
Cloud On-Ramp Project BriefingCloud On-Ramp Project Briefing
Cloud On-Ramp Project Briefing
 
Cloud Computing Networks
Cloud Computing NetworksCloud Computing Networks
Cloud Computing Networks
 
Cloud Migration.pdf
Cloud Migration.pdfCloud Migration.pdf
Cloud Migration.pdf
 
mysql_pn_heatwave.pdf
mysql_pn_heatwave.pdfmysql_pn_heatwave.pdf
mysql_pn_heatwave.pdf
 
A perspective on cloud computing and enterprise saa s applications
A perspective on cloud computing and enterprise saa s applicationsA perspective on cloud computing and enterprise saa s applications
A perspective on cloud computing and enterprise saa s applications
 
introduction to distributed computing.pptx
introduction to distributed computing.pptxintroduction to distributed computing.pptx
introduction to distributed computing.pptx
 
cloud computing models
cloud computing modelscloud computing models
cloud computing models
 

Recently uploaded

Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
The Third Creative Media
 
Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !
Marcin Chrost
 
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLESINTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
anfaltahir1010
 
DevOps Consulting Company | Hire DevOps Services
DevOps Consulting Company | Hire DevOps ServicesDevOps Consulting Company | Hire DevOps Services
DevOps Consulting Company | Hire DevOps Services
seospiralmantra
 
Modelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - AmsterdamModelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - Amsterdam
Alberto Brandolini
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid
 
Benefits of Artificial Intelligence in Healthcare!
Benefits of  Artificial Intelligence in Healthcare!Benefits of  Artificial Intelligence in Healthcare!
Benefits of Artificial Intelligence in Healthcare!
Prestware
 
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptxOperational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
sandeepmenon62
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
dakas1
 
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
Bert Jan Schrijver
 
Kubernetes at Scale: Going Multi-Cluster with Istio
Kubernetes at Scale:  Going Multi-Cluster  with IstioKubernetes at Scale:  Going Multi-Cluster  with Istio
Kubernetes at Scale: Going Multi-Cluster with Istio
Severalnines
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
Reetu63
 
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
OnePlan Solutions
 
14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision
ShulagnaSarkar2
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
brainerhub1
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
ervikas4
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
kalichargn70th171
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
Remote DBA Services
 
Preparing Non - Technical Founders for Engaging a Tech Agency
Preparing Non - Technical Founders for Engaging  a  Tech AgencyPreparing Non - Technical Founders for Engaging  a  Tech Agency
Preparing Non - Technical Founders for Engaging a Tech Agency
ISH Technologies
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
Peter Muessig
 

Recently uploaded (20)

Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
 
Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !
 
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLESINTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
 
DevOps Consulting Company | Hire DevOps Services
DevOps Consulting Company | Hire DevOps ServicesDevOps Consulting Company | Hire DevOps Services
DevOps Consulting Company | Hire DevOps Services
 
Modelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - AmsterdamModelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - Amsterdam
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
 
Benefits of Artificial Intelligence in Healthcare!
Benefits of  Artificial Intelligence in Healthcare!Benefits of  Artificial Intelligence in Healthcare!
Benefits of Artificial Intelligence in Healthcare!
 
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptxOperational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
 
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
 
Kubernetes at Scale: Going Multi-Cluster with Istio
Kubernetes at Scale:  Going Multi-Cluster  with IstioKubernetes at Scale:  Going Multi-Cluster  with Istio
Kubernetes at Scale: Going Multi-Cluster with Istio
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
 
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
 
14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
 
Preparing Non - Technical Founders for Engaging a Tech Agency
Preparing Non - Technical Founders for Engaging  a  Tech AgencyPreparing Non - Technical Founders for Engaging  a  Tech Agency
Preparing Non - Technical Founders for Engaging a Tech Agency
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
 

Scaling Databricks to Run Data and ML Workloads on Millions of VMs

  • 1. Scaling Databricks to Run Data & AI Workloads on Millions of VMs Matei Zaharia Cofounder and Chief Technologist @matei_zaharia
  • 2. 2 With Contributions From Jeff Pang Nanxi Kang Haogang Chen Michael Armbrust Ihor Leshko Ali Ghodsi
  • 3. 3 Outline About Databricks Challenges developing large-scale cloud services Four lessons from our experience
  • 4. 4 Outline About Databricks Challenges developing large-scale cloud services Four lessons from our experience
  • 5. 5 About Databricks Founded in 2013 by the Apache Spark group at UC Berkeley Cloud-based data and ML platform for >7000 customers § Millions of VMs run per day, processing exabytes of data/day § 100,000s of users 1500 employees, 300 engineers, >$350M ARR
  • 6. 6 Security policies Our Product Built around open source: Data science workspace SQL analytics Scheduled jobs Data scientists Data engineers Business users Cloud Storage Compute Clusters ML platform Databricks Runtime Data catalog Customer’s Cloud AccountDatabricks Services inside
  • 7. 7 Some of Our Customers Financial Services Healthcare & Pharma Media & Entertainment Technology Public Sector Retail & CPG Consumer Services Energy & IndustrialMarketing & AdTech Data & Analytics Services Nike
  • 8. 8 Example Use Cases Correlate 500,000 patient records with DNA to design therapies Optimize inventory management using simulations and ML Identify securities fraud via machine learning on 30 PB of data
  • 10. 10 What’s Challenging About This? All the challenges of B2B cloud services § Availability, security, multitenancy, reliable updates, etc + All the challenges of large-scale data processing § Scalability, reliability, failures, etc
  • 11. 11 Traditional Software Cloud SoftwareVendorCustomers Dev Team Release 6-12 months Users Ops Users Ops Users Ops Users Ops Dev + Ops Team 1-2 weeks Users Ops Users Ops Users Ops Users Ops 6-12 months vs.
  • 12. 12 Benefits of Cloud Software for Users Management built-in: much more value than the software bits alone (security, high availability, etc) Elasticity: pay-as-you-go, scale on demand Better features released faster
  • 13. 13 Challenges in Building Cloud Software Building a multitenant service: significant scaling, security and performance isolation work that you don’t need on-prem Operating the service: security, availability, monitoring, etc (but customers would have to do it themselves on-prem) Upgrading without regressions: very hard, but critical for users to trust your cloud (on-prem apps don’t require this) Includes API, semantics, and performance regressions
  • 14. 14 Challenges of Large Scale Each customer app runs on up to thousands of VMs, so systems must tolerate failures, stragglers, etc among those Even one customer app could potentially overload our services Weird rare failure modes that are only seen at scale
  • 15. 15 Four Lessons 1 2 3 What goes wrong in cloud services? Testing for scalability & stability Developing control planes 4 Evolving big data systems for the cloud
  • 16. 16 Four Lessons 1 2 3 What goes wrong in cloud services? Testing for scalability & stability Developing control planes 4 Evolving big data systems for the cloud
  • 17. 17 Causes of Significant Outages Scaling problem in our services Scaling problem in underlying cloud services Insufficient user isolation Deployment misconfiguration Other 30% 20%20% 20% 10%
  • 18. 18 Causes of Significant Outages Scaling problem in our services Scaling problem in underlying cloud services Insufficient user isolation Deployment misconfiguration Other 30% 20%20% 20% 10% 70% scale related
  • 19. 19 Some Issues We Experienced Cloud networks: limits, partitions, slow DHCP, hung connections Automated apps creating large load Very large requests, results, etc Slow VM launches/shutdowns, lack of VM capacity Data corruption writing to cloud storage
  • 20. 20 Example Outage: Aborted Jobs Jobs Service launches & tracks jobs on clusters 1 customer running many jobs/sec on same cluster Cloud’s network reaches a limit of 1000 connections/VM between Jobs Service & clusters § After this limit, new connections hang in state SYN_SENT Resource usage from hanging connections causes memory pressure and GC Health checks to some jobs time out, so we abort them Jobs Service Cloud Network Customer Clusters
  • 21. 21 Surprisingly Rare Issues 1 cloud-wide VM restart on AWS (Xen patch) 1 misreported security scan on customer VM 1 significant S3 outage 2 Linux kernel bugs (TCP-related)
  • 22. 22 Lessons Cloud services must handle load that varies on many dimensions, and rely on other services with varying limits & failure modes Problems get worse in a “cloud service economy” where each service depends on others (maybe from other vendors)
  • 23. 23 Four Lessons 1 2 3 What goes wrong in cloud services? Testing for scalability & stability Developing control planes 4 Evolving big data systems for the cloud
  • 24. 24 Testing for Scalability & Stability Software correctness is a Boolean property: does your software give the right output on a given input? Scalability and stability are a matter of degree § What load will your system fail at? (any system with limited resources will) § What failure behavior will you have? (crash all clients, drop some, etc)
  • 25. 25 Example Scalability Problems Large result: can crash browser, notebook service, driver or Spark Large record in file Large # of tasks Code that freezes a worker + All these affect other users! User Browser Notebook Service Driver App Workers Other Users ??
  • 26. 26 Databricks Stress Test Infrastructure 1. Identify dimensions for a system to scale in (e.g. # of users, number of output rows, size of each row, latency of an RPC, etc) 2. Grow load in each dimension until a failure occurs 3. Record failure type and impact on system § Error message, timeout, wrong result? § Are other clients affected? § Does the system auto-recover? How fast? 4. Compare over time and on changes
  • 28. 28 Four Lessons 1 2 3 What goes wrong in cloud services? Testing for scalability & stability Developing control planes 4 Evolving big data systems for the cloud
  • 29. 29 Developing Control Planes Cloud software consists of interacting, independently updated services, many of which call other services What is the right programming model for this software? Billing Service Notebook Service Jobs Service Cluster Service …
  • 30. 30 Service Framework We designed a shared service framework that handles: § Cross-cloud deployment: AWS, Azure, local, special environments § Storage: databases, schema updates, etc § Security tokens & roles § Monitoring § API routing & limiting § Feature flagging Our service stack:
  • 31. 32 Best Practices Isolate state: relational DB is usually enough with org sharding Isolate components that scale differently: allows separate scaling Manage changes through feature flags: fastest, safest way Watch key metrics: most outages could be predicted from one of CPU load, memory load, DB load or thread pool exhaustion Test pyramid: 70% unit tests, 20% integration, 10% end-to-end
  • 32. 33 Cluster manager v1 Cluster manager v2 Example: Cluster Manager Cloud VM API Cluster Manager Customer Clusters Cloud VM API CM Master Customer Clusters Delegate Delegate Usage, billing, etc VM launch, setup, monitoring, etc
  • 33. 35 Four Lessons 1 2 3 What goes wrong in cloud services? Testing for scalability & stability Developing control planes 4 Evolving big data systems for the cloud
  • 34. 36 Evolving Big Data Systems for the Cloud MapReduce, Spark, etc were designed for on-premise datacenters How can we evolve these leverage the benefits of the cloud? § Availability, elasticity, scale, multitenancy, etc Two examples from Databricks: § Delta Lake: ACID on cloud object stores § “Cloudifying” Apache Spark
  • 35. 37 Delta Lake Motivation Cloud object stores (S3, Azure blob, etc) are the largest storage systems on the planet § Unmatched availability, parallel I/O bandwidth, and cost-efficiency Open source big data stack was designed for on-prem world § Filesystem API for storage § RDBMS for table metadata (Hive metastore) § Other distributed systems, e.g. ZooKeeper How can big data systems fully leverage cloud object stores? Stronger consistency model Scale & management complexity
  • 36. 38 Spark on HDFS Spark on S3 (Naïve) Example: Atomic Parallel Writes Input Files Output Partitions /tmp-job-1 /my-output Atomic Rename Input Files Output Partitions /my-output/part-1 /my-output/part-2 /my-output/part-3 /my-output/part-4 part-1 part-2 part-3 part-4 /_DONE Full object names (no cheap rename) Real cases are harder (e.g. appending to a table)
  • 37. 39 Delta Lake Design 1. Track metadata that says which objects are part of a dataset 2. Store this metadata itself in a cloud object store § Write-ahead log in S3, compressed using Apache Parquet Before Delta Lake: 50% of Spark support issues were about cloud storage After: fewer issues, increased perf Input Files Output Partitions /my-output/part-X /my-output/part-Y /my-output/part-Z /my-output/part-W /my-output/_delta_log Commit Protocol https://delta.io 10x faster metadata ops than Hive on S3!
  • 38. 41 Major Benefits of Delta Lake Once we had transactions over S3, we could build much more: § UPSERT, DELETE, etc (GDPR) § Time travel § Caching § Multidimensional indexing § Audit logging § Background optimization 0 0.2 0.4 0.6 0.8 1 Parquet Parquet+ order Delta Z-order Runningtime Filter on 2 Fields Result: greatly simplified customers’ data architectures
  • 39. Before and After Delta Typical Architecture with Multiple Data Systems ETL Job ETL Job ETL Job Delta Lake Table 1 Delta Lake Table 2 Delta Lake Table 3 Streaming Analytics Data Scientists BI Users Cloud Object Store input Cloud Object Store ETL Job ETL Job ETL Job Message Queue Parquet Table 1 Parquet Table 2 Parquet Table 3 Data Warehouse Data Warehouse Streaming Analytics Data Scientists BI Users input Using Delta Lake for All Storage Fewer copies of the data, fewer ETL steps, no divergence & faster results!
  • 40. Delta Lake Adoption Used by thousands of companies to process exabytes of data/day Grew from zero to ~50% of the Databricks workload in 3 years Largest deployments: exabyte tables and 1000s of users
  • 41. 45 Launched Today: SQL Analytics Product New native SQL engine running over Delta Lake Queries cloud storage with data warehouse performance SQL security control model and easy access from BI tools 0 10000 20000 30000 40000 DW1 DW2 DW3 Delta Engine TPC-DS 30TB Benchmark Time (s)
  • 42. 46 “Cloudifying” Apache Spark “High-concurrency” clusters with strong user isolation § Give users a “serverless” experience for ad hoc analytics queries Scheduler-integrated cluster autoscaling Autoscaling storage volumes …
  • 43. 47 Conclusion The cloud is eating software by enabling much better products § Self-managing, elastic, and highly available But building reliable, large-scale cloud products is hard Many ideas can help, from software engineering best practices to new testing methods and system designs for large scale