Scaling Databricks to Run Data & AI
Workloads on Millions of VMs
Matei Zaharia
Cofounder and Chief Technologist
@matei_zaharia
2
With Contributions From
Jeff Pang, Nanxi Kang, Haogang Chen,
Michael Armbrust, Ihor Leshko, Ali Ghodsi
3
Outline
About Databricks
Challenges developing large-scale cloud services
Four lessons from our experience
5
About Databricks
Founded in 2013 by the Apache Spark group at UC Berkeley
Cloud-based data and ML platform for >7000 customers
§ Millions of VMs run per day, processing exabytes of data/day
§ 100,000s of users
1500 employees, 300 engineers, >$350M ARR
6
Our Product
Built around open source
Users: data scientists, data engineers, business users
Databricks Services: data science workspace, ML platform, SQL analytics, scheduled jobs, data catalog, security policies
Customer’s Cloud Account: compute clusters (Databricks Runtime inside), cloud storage
7
Some of Our Customers
Financial Services, Healthcare & Pharma, Media & Entertainment, Technology,
Public Sector, Retail & CPG, Consumer Services, Energy & Industrial,
Marketing & AdTech, Data & Analytics Services
Nike
8
Example Use Cases
Correlate 500,000 patient records with DNA to design therapies
Optimize inventory management using simulations and ML
Identify securities fraud via machine learning on 30 PB of data
9
Usage Growth: VMs/day
10
What’s Challenging About This?
All the challenges of B2B cloud services
§ Availability, security, multitenancy, reliable updates, etc
+
All the challenges of large-scale data processing
§ Scalability, reliability, failures, etc
11
Traditional Software vs. Cloud Software
[Diagram: in traditional software, the vendor's dev team ships a release to customers every 6-12 months, and each customer has its own users and ops. In cloud software, the vendor runs a combined dev + ops team and releases every 1-2 weeks.]
12
Benefits of Cloud Software for Users
Management built-in: much more value than the software bits
alone (security, high availability, etc)
Elasticity: pay-as-you-go, scale on demand
Better features released faster
13
Challenges in Building Cloud Software
Building a multitenant service: significant scaling, security and
performance isolation work that you don’t need on-prem
Operating the service: security, availability, monitoring, etc
(but customers would have to do it themselves on-prem)
Upgrading without regressions: very hard, but critical for users to
trust your cloud (on-prem apps don’t require this)
Includes API, semantics, and performance regressions
14
Challenges of Large Scale
Each customer app runs on up to thousands of VMs, so systems
must tolerate failures, stragglers, etc among those
Even one customer app could potentially overload our services
Weird rare failure modes that are only seen at scale
15
Four Lessons
1. What goes wrong in cloud services?
2. Testing for scalability & stability
3. Developing control planes
4. Evolving big data systems for the cloud
17
Causes of Significant Outages
Scaling problem in our services: 30%
Scaling problem in underlying cloud services: 20%
Insufficient user isolation: 20%
Deployment misconfiguration: 20%
Other: 10%
70% scale related
19
Some Issues We Experienced
Cloud networks: limits, partitions, slow DHCP, hung connections
Automated apps creating large load
Very large requests, results, etc
Slow VM launches/shutdowns, lack of VM capacity
Data corruption writing to cloud storage
20
Example Outage: Aborted Jobs
Jobs Service launches & tracks jobs on clusters
1 customer running many jobs/sec on same cluster
Cloud’s network reaches a limit of 1000 connections/VM
between Jobs Service & clusters
§ After this limit, new connections hang in state SYN_SENT
Resource usage from hanging connections causes
memory pressure and GC
Health checks to some jobs time out, so we abort them
[Diagram: Jobs Service, Cloud Network, Customer Clusters]
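The outage above started when connections piled up past a per-VM limit and then hung. One common defensive pattern is to cap in-flight connections on the client side and reject the excess immediately instead of letting it hang. The sketch below illustrates that idea only; the talk does not say this is what Databricks did, and `ConnectionGate` and the limit of 3 are hypothetical names and values chosen for the example.

```python
import threading


class ConnectionGate:
    """Caps in-flight connections to one destination and rejects the excess immediately."""

    def __init__(self, max_connections: int):
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False instead of waiting when the cap is reached.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


gate = ConnectionGate(max_connections=3)           # hypothetical cap, far below any cloud limit
outcomes = [gate.try_acquire() for _ in range(5)]  # five jobs try to connect at once
gate.release()                                     # one job finishes
after_release = gate.try_acquire()                 # a waiting job can now connect
print(outcomes, after_release)
```

Failing fast turns a silent hang into an explicit error the caller can retry or report, which also avoids the memory pressure from connections that never complete.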
21
Surprisingly Rare Issues
1 cloud-wide VM restart on AWS (Xen patch)
1 misreported security scan on customer VM
1 significant S3 outage
2 Linux kernel bugs (TCP-related)
22
Lessons
Cloud services must handle load that varies on many dimensions,
and rely on other services with varying limits & failure modes
Problems get worse in a “cloud service economy” where each
service depends on others (maybe from other vendors)
24
Testing for Scalability & Stability
Software correctness is a Boolean property: does your software
give the right output on a given input?
Scalability and stability are a matter of degree
§ What load will your system fail at? (any system with limited resources will)
§ What failure behavior will you have? (crash all clients, drop some, etc)
25
Example Scalability Problems
Large result: can crash browser,
notebook service, driver or Spark
Large record in file
Large # of tasks
Code that freezes a worker
+ All these affect other users!
[Diagram: User, Browser, Notebook Service, Driver App, Workers; other users share these components]
26
Databricks Stress Test Infrastructure
1. Identify dimensions for a system to scale in (e.g. # of users, number
of output rows, size of each row, latency of an RPC, etc)
2. Grow load in each dimension until a failure occurs
3. Record failure type and impact on system
§ Error message, timeout, wrong result?
§ Are other clients affected?
§ Does the system auto-recover? How fast?
4. Compare over time and on changes
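The four steps above can be sketched as a small harness. This is a minimal illustration, not the Databricks stress test infrastructure: `grow_until_failure`, `StressResult`, and `fake_service` are hypothetical names, and the system under test is a stand-in that fails above 5,000 output rows.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StressResult:
    dimension: str
    last_ok_load: Optional[int]   # largest load that succeeded, if any
    failed_load: Optional[int]    # first load that failed, if any
    failure_type: Optional[str]   # exception class name, e.g. "TimeoutError"


def grow_until_failure(dimension: str,
                       run_at_load: Callable[[int], None],
                       start: int = 1,
                       factor: int = 2,
                       max_load: int = 1_000_000) -> StressResult:
    """Multiply the load by `factor` until `run_at_load` raises or `max_load` is passed."""
    load, last_ok = start, None
    while load <= max_load:
        try:
            run_at_load(load)
        except Exception as exc:  # record the failure type rather than crashing the harness
            return StressResult(dimension, last_ok, load, type(exc).__name__)
        last_ok = load
        load *= factor
    return StressResult(dimension, last_ok, None, None)


def fake_service(rows: int) -> None:
    """Hypothetical system under test: fails above 5,000 output rows."""
    if rows > 5_000:
        raise MemoryError("result too large")


result = grow_until_failure("output_rows", fake_service)
print(result)
```

Storing one `StressResult` per dimension per build is one way to support step 4: a drop in `last_ok_load` or a change in `failure_type` between builds flags a scalability regression.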
27
Example Output
29
Developing Control Planes
Cloud software consists of interacting, independently updated
services, many of which call other services
What is the right programming model for this software?
[Diagram: Billing Service, Notebook Service, Jobs Service, Cluster Service, …]
30
Service Framework
We designed a shared service framework that handles:
§ Cross-cloud deployment: AWS, Azure, local, special environments
§ Storage: databases, schema updates, etc
§ Security tokens & roles
§ Monitoring
§ API routing & limiting
§ Feature flagging
Our service stack:
32
Best Practices
Isolate state: relational DB is usually enough with org sharding
Isolate components that scale differently: allows separate scaling
Manage changes through feature flags: fastest, safest way
Watch key metrics: most outages could be predicted from one of
CPU load, memory load, DB load or thread pool exhaustion
Test pyramid: 70% unit tests, 20% integration, 10% end-to-end
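As an illustration of managing changes through feature flags, here is a minimal sketch of a percentage rollout keyed by organization. The slides do not describe how the Databricks flagging system works, so the hashing scheme, the function names, and the flag name `new-scheduler` are all assumptions made for the example.

```python
import hashlib


def bucket(flag: str, org_id: str) -> int:
    """Map (flag, org) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{org_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, org_id: str, rollout_percent: int) -> bool:
    """True when the org falls inside the rollout percentage for this flag."""
    return bucket(flag, org_id) < rollout_percent


orgs = [f"org-{i}" for i in range(1000)]  # hypothetical organization ids
enabled_at_10 = {o for o in orgs if is_enabled("new-scheduler", o, 10)}
enabled_at_50 = {o for o in orgs if is_enabled("new-scheduler", o, 50)}
print(len(enabled_at_10), len(enabled_at_50))
```

Because each org hashes to a fixed bucket, raising the percentage only adds orgs and never flips one back off, and setting it to 0 acts as an immediate rollback without a redeploy.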
33
Example: Cluster Manager
Cluster manager v1: a single Cluster Manager calls the cloud VM API and manages all customer clusters
Cluster manager v2: a CM Master handles usage, billing, etc; Delegates handle VM launch, setup, monitoring, etc for customer clusters through the cloud VM API
36
Evolving Big Data Systems for the Cloud
MapReduce, Spark, etc were designed for on-premises datacenters
How can we evolve these to leverage the benefits of the cloud?
§ Availability, elasticity, scale, multitenancy, etc
Two examples from Databricks:
§ Delta Lake: ACID on cloud object stores
§ “Cloudifying” Apache Spark
37
Delta Lake Motivation
Cloud object stores (S3, Azure blob, etc) are the largest storage
systems on the planet
§ Unmatched availability, parallel I/O bandwidth, and cost-efficiency
Open source big data stack was designed for on-prem world
§ Filesystem API for storage
§ RDBMS for table metadata (Hive metastore)
§ Other distributed systems, e.g. ZooKeeper
How can big data systems fully leverage cloud object stores?
Stronger consistency model
Scale & management complexity
38
Example: Atomic Parallel Writes
Spark on HDFS: tasks read input files and write output partitions part-1 to part-4 under /tmp-job-1, then an atomic rename publishes them as /my-output
Spark on S3 (Naïve): tasks write /my-output/part-1 to /my-output/part-4 under their full object names (no cheap rename), then a /_DONE marker
Real cases are harder (e.g. appending to a table)
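The two commit schemes on this slide can be contrasted with a small sketch. It uses the local filesystem as a stand-in for both HDFS and an object store, so it only illustrates the control flow; the function names are hypothetical, and real object stores have no directories or rename to begin with.

```python
import os
import tempfile


def write_with_atomic_rename(root: str, parts: dict) -> str:
    """HDFS-style commit: write everything under a temp dir, then rename once."""
    tmp_dir = os.path.join(root, "tmp-job-1")
    final_dir = os.path.join(root, "my-output")
    os.makedirs(tmp_dir)
    for name, data in parts.items():
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write(data)
    os.rename(tmp_dir, final_dir)  # readers see all parts or none
    return final_dir


def write_with_done_marker(root: str, parts: dict) -> str:
    """Object-store-style commit: write final names, then a _DONE marker last."""
    final_dir = os.path.join(root, "my-output")
    os.makedirs(final_dir)
    for name, data in parts.items():
        with open(os.path.join(final_dir, name), "w") as f:
            f.write(data)
    open(os.path.join(final_dir, "_DONE"), "w").close()
    return final_dir


def is_committed(output_dir: str) -> bool:
    """Readers of the marker scheme must check for _DONE before trusting the parts."""
    return os.path.exists(os.path.join(output_dir, "_DONE"))


parts = {"part-1": "a", "part-2": "b"}
with tempfile.TemporaryDirectory() as root:
    renamed = write_with_atomic_rename(root, parts)
    renamed_files = sorted(os.listdir(renamed))
    tmp_left_behind = os.path.exists(os.path.join(root, "tmp-job-1"))
with tempfile.TemporaryDirectory() as root:
    marked = write_with_done_marker(root, parts)
    marker_seen = is_committed(marked)
print(renamed_files, tmp_left_behind, marker_seen)
```

The marker scheme depends on every reader remembering to check `_DONE`, and it does not extend to appends, where old and new parts share one directory. That is the gap the slide's last line points at.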
39
Delta Lake Design
1. Track metadata that says which objects are part of a dataset
2. Store this metadata itself in a cloud object store
§ Write-ahead log in S3, compressed using Apache Parquet
Before Delta Lake: 50% of Spark support
issues were about cloud storage
After: fewer issues, increased perf
10x faster metadata ops than Hive on S3!
[Diagram: tasks read input files and write output partitions /my-output/part-X, /my-output/part-Y, /my-output/part-Z, /my-output/part-W; a commit protocol records them in /my-output/_delta_log]
https://delta.io
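The two design points above (track which objects belong to a dataset, and keep that log next to the data) can be illustrated with a toy log on the local filesystem. This is a simplified sketch, not the Delta Lake protocol: the `op`/`path` action format and the function names are hypothetical, and exclusive file creation stands in for whatever put-if-absent mechanism a given object store offers.

```python
import json
import os
import tempfile
from typing import Optional


def log_path(table: str, version: int) -> str:
    return os.path.join(table, "_delta_log", f"{version:020d}.json")


def try_commit(table: str, version: int, actions: list) -> bool:
    """Create log entry `version` only if it does not exist yet (put-if-absent)."""
    os.makedirs(os.path.join(table, "_delta_log"), exist_ok=True)
    try:
        # O_EXCL makes creation fail if another writer already claimed this version.
        fd = os.open(log_path(table, version), os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)
    return True


def snapshot(table: str, as_of: Optional[int] = None) -> set:
    """Replay the log in order to find which objects belong to the table."""
    files, version = set(), 0
    while os.path.exists(log_path(table, version)) and (as_of is None or version <= as_of):
        with open(log_path(table, version)) as f:
            for action in json.load(f):
                if action["op"] == "add":
                    files.add(action["path"])
                elif action["op"] == "remove":
                    files.discard(action["path"])
        version += 1
    return files


with tempfile.TemporaryDirectory() as table:
    first = try_commit(table, 0, [{"op": "add", "path": "part-X"},
                                  {"op": "add", "path": "part-Y"}])
    conflict = try_commit(table, 0, [{"op": "add", "path": "part-Q"}])  # loses the race
    second = try_commit(table, 1, [{"op": "remove", "path": "part-X"},
                                   {"op": "add", "path": "part-Z"}])
    latest = snapshot(table)
    original = snapshot(table, as_of=0)
print(first, conflict, second, sorted(latest), sorted(original))
```

Data objects written by a job stay invisible until a log entry names them, so a failed job leaves no partial output for readers. Replaying the log only up to an earlier version gives a basic form of time travel.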
41
Major Benefits of Delta Lake
Once we had transactions over S3, we could build much more:
§ UPSERT, DELETE, etc (GDPR)
§ Time travel
§ Caching
§ Multidimensional indexing
§ Audit logging
§ Background optimization
[Chart: "Filter on 2 Fields", relative running time (0 to 1) for Parquet, Parquet + order, and Delta Z-order]
Result: greatly simplified
customers’ data architectures
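The multidimensional indexing benefit and the "Z-order" label in the benchmark refer to ordering data along a Z-order (Morton) curve, which interleaves the bits of several columns so that rows close in any of those columns tend to land in the same files. The slides do not show how Delta Lake implements this, so the sketch below covers only the general technique for two integer columns.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-order) key: bit i of x goes to position 2i, bit i of y to 2i+1."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


# Sort a 4x4 grid of (x, y) values by their Z-order key.
points = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
print(z_sorted[:4])
```

Sorting by a single column clusters only that column. Sorting by the interleaved key keeps both columns roughly clustered, so a filter on either field, or on both, can skip more data.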
Before and After Delta
Typical Architecture with Multiple Data Systems
[Diagram: input from a cloud object store and a message queue flows through ETL jobs into Parquet tables 1-3 and data warehouses, which serve streaming analytics, data scientists, and BI users]
Using Delta Lake for All Storage
[Diagram: input flows through ETL jobs into Delta Lake tables 1-3 on a cloud object store, which serve streaming analytics, data scientists, and BI users]
Fewer copies of the data, fewer ETL steps,
no divergence & faster results!
Delta Lake Adoption
Used by thousands of companies to process exabytes of data/day
Grew from zero to ~50% of the Databricks workload in 3 years
Largest deployments: exabyte tables and 1000s of users
45
Launched Today: SQL Analytics Product
New native SQL engine
running over Delta Lake
Queries cloud storage with
data warehouse performance
SQL security control model
and easy access from BI tools
[Chart: TPC-DS 30TB Benchmark, time (s) on a 0 to 40,000 axis for DW1, DW2, DW3, and Delta Engine]
46
“Cloudifying” Apache Spark
“High-concurrency” clusters with strong user isolation
§ Give users a “serverless” experience for ad hoc analytics queries
Scheduler-integrated cluster autoscaling
Autoscaling storage volumes
…
47
Conclusion
The cloud is eating software by enabling much better products
§ Self-managing, elastic, and highly available
But building reliable, large-scale cloud products is hard
Many ideas can help, from software engineering best practices to
new testing methods and system designs for large scale
48
We’re Hiring!
databricks.com/careers

 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 

Scaling Databricks to Run Data and ML Workloads on Millions of VMs

  • 1. Scaling Databricks to Run Data & AI Workloads on Millions of VMs. Matei Zaharia, Cofounder and Chief Technologist, @matei_zaharia
  • 2. With Contributions From: Jeff Pang, Nanxi Kang, Haogang Chen, Michael Armbrust, Ihor Leshko, Ali Ghodsi
  • 3. Outline: About Databricks; Challenges developing large-scale cloud services; Four lessons from our experience
  • 5. About Databricks. Founded in 2013 by the Apache Spark group at UC Berkeley. Cloud-based data and ML platform for >7000 customers: millions of VMs run per day, processing exabytes of data/day; 100,000s of users. 1500 employees, 300 engineers, >$350M ARR.
  • 6. Our Product (diagram labels). Built around open source. Databricks Services and customer’s cloud account: data science workspace, SQL analytics, scheduled jobs, ML platform, data catalog, security policies, Databricks Runtime, compute clusters, cloud storage. Users: data scientists, data engineers, business users.
  • 7. Some of Our Customers. Industries: Financial Services; Healthcare & Pharma; Media & Entertainment; Technology; Public Sector; Retail & CPG; Consumer Services; Energy & Industrial; Marketing & AdTech; Data & Analytics Services. Nike.
  • 8. Example Use Cases. Correlate 500,000 patient records with DNA to design therapies. Optimize inventory management using simulations and ML. Identify securities fraud via machine learning on 30 PB of data.
  • 10. What’s Challenging About This? All the challenges of B2B cloud services (availability, security, multitenancy, reliable updates, etc.) plus all the challenges of large-scale data processing (scalability, reliability, failures, etc.).
  • 11. Traditional Software vs. Cloud Software (diagram labels). Traditional software: vendor dev team, release every 6-12 months, customers each with their own users and ops, 6-12 months. Cloud software: vendor dev + ops team, release every 1-2 weeks, customers with users and ops.
  • 12. Benefits of Cloud Software for Users. Management built-in: much more value than the software bits alone (security, high availability, etc.). Elasticity: pay-as-you-go, scale on demand. Better features released faster.
  • 13. Challenges in Building Cloud Software. Building a multitenant service: significant scaling, security and performance isolation work that you don’t need on-prem. Operating the service: security, availability, monitoring, etc. (but customers would have to do it themselves on-prem). Upgrading without regressions: very hard, but critical for users to trust your cloud (on-prem apps don’t require this); includes API, semantics, and performance regressions.
  • 14. Challenges of Large Scale. Each customer app runs on up to thousands of VMs, so systems must tolerate failures, stragglers, etc. among those. Even one customer app could potentially overload our services. Weird, rare failure modes that are only seen at scale.
  • 15. Four Lessons: (1) What goes wrong in cloud services? (2) Testing for scalability & stability. (3) Developing control planes. (4) Evolving big data systems for the cloud.
  • 17. Causes of Significant Outages (pie chart). Categories: scaling problem in our services; scaling problem in underlying cloud services; insufficient user isolation; deployment misconfiguration; other. Slice values as extracted: 30%, 20%, 20%, 20%, 10%.
  • 18. Causes of Significant Outages (same pie chart, with callout). Categories: scaling problem in our services; scaling problem in underlying cloud services; insufficient user isolation; deployment misconfiguration; other. Slice values as extracted: 30%, 20%, 20%, 20%, 10%. Callout: 70% scale related.
  • 19. Some Issues We Experienced. Cloud networks: limits, partitions, slow DHCP, hung connections. Automated apps creating large load. Very large requests, results, etc. Slow VM launches/shutdowns, lack of VM capacity. Data corruption writing to cloud storage.
  • 20. Example Outage: Aborted Jobs. Jobs Service launches & tracks jobs on clusters. 1 customer running many jobs/sec on same cluster. Cloud’s network reaches a limit of 1000 connections/VM between Jobs Service & clusters; after this limit, new connections hang in state SYN_SENT. Resource usage from hanging connections causes memory pressure and GC. Health checks to some jobs time out, so we abort them. (Diagram: Jobs Service, Cloud Network, Customer Clusters.)
  • 21. Surprisingly Rare Issues. 1 cloud-wide VM restart on AWS (Xen patch). 1 misreported security scan on customer VM. 1 significant S3 outage. 2 Linux kernel bugs (TCP-related).
  • 22. Lessons. Cloud services must handle load that varies on many dimensions, and rely on other services with varying limits & failure modes. Problems get worse in a “cloud service economy” where each service depends on others (maybe from other vendors).
  • 23. Four Lessons: (1) What goes wrong in cloud services? (2) Testing for scalability & stability. (3) Developing control planes. (4) Evolving big data systems for the cloud.
  • 24. Testing for Scalability & Stability. Software correctness is a Boolean property: does your software give the right output on a given input? Scalability and stability are a matter of degree. What load will your system fail at? (Any system with limited resources will.) What failure behavior will you have? (Crash all clients, drop some, etc.)
  • 25. Example Scalability Problems. Large result: can crash browser, notebook service, driver or Spark. Large record in file. Large # of tasks. Code that freezes a worker. All these affect other users! (Diagram: User Browser, Notebook Service, Driver App, Workers, Other Users.)
  • 26. Databricks Stress Test Infrastructure. (1) Identify dimensions for a system to scale in (e.g. # of users, number of output rows, size of each row, latency of an RPC, etc.). (2) Grow load in each dimension until a failure occurs. (3) Record failure type and impact on system: error message, timeout, or wrong result? Are other clients affected? Does the system auto-recover? How fast? (4) Compare over time and on changes.
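Steps 2 and 3 of the stress-test procedure can be sketched in a few lines of Python. This is only an illustration, not the Databricks infrastructure: `ramp_until_failure`, `StressResult`, and the toy system `fake_result_fetch` (which rejects results above 10,000 rows) are all hypothetical names invented here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StressResult:
    dimension: str
    last_ok_load: Optional[int]   # largest load that succeeded, if any
    failing_load: Optional[int]   # first load that failed, if any
    failure_type: Optional[str]   # exception class name, e.g. "TimeoutError"


def ramp_until_failure(
    dimension: str,
    run_at_load: Callable[[int], None],
    start: int = 1,
    factor: int = 2,
    max_load: int = 1_000_000,
) -> StressResult:
    """Grow load geometrically in one dimension until the system fails."""
    last_ok = None
    load = start
    while load <= max_load:
        try:
            run_at_load(load)
        except Exception as exc:
            # Step 3: record what kind of failure occurred and at which load.
            return StressResult(dimension, last_ok, load, type(exc).__name__)
        last_ok = load
        load *= factor
    return StressResult(dimension, last_ok, None, None)


def fake_result_fetch(rows: int) -> None:
    # Hypothetical system under test: rejects results above 10,000 rows.
    if rows > 10_000:
        raise MemoryError(f"result of {rows} rows is too large")


result = ramp_until_failure("output_rows", fake_result_fetch)
print(result)
```

Storing one `StressResult` per dimension per build would give the "compare over time and on changes" view from step 4; checking impact on other clients and auto-recovery would need additional probes that this sketch leaves out.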
  • 28. Four Lessons: (1) What goes wrong in cloud services? (2) Testing for scalability & stability. (3) Developing control planes. (4) Evolving big data systems for the cloud.
  • 29. Developing Control Planes. Cloud software consists of interacting, independently updated services, many of which call other services. What is the right programming model for this software? (Diagram: Billing Service, Notebook Service, Jobs Service, Cluster Service, …)
  • 30. Service Framework. We designed a shared service framework that handles: cross-cloud deployment (AWS, Azure, local, special environments); storage (databases, schema updates, etc.); security tokens & roles; monitoring; API routing & limiting; feature flagging. Our service stack: [image not captured in transcript]
  • 31. Best Practices. Isolate state: relational DB is usually enough with org sharding. Isolate components that scale differently: allows separate scaling. Manage changes through feature flags: fastest, safest way. Watch key metrics: most outages could be predicted from one of CPU load, memory load, DB load or thread pool exhaustion. Test pyramid: 70% unit tests, 20% integration, 10% end-to-end.
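One common way to "manage changes through feature flags" is a percentage rollout keyed on the organization ID, so that a given org gets a stable decision while the rollout widens. The slides do not describe how Databricks implements flags; the sketch below is an assumption, and `rollout_bucket`, `is_enabled`, and the flag name `new_scheduler` are hypothetical.

```python
import hashlib


def rollout_bucket(flag: str, org_id: str) -> int:
    """Map (flag, org) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{org_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, org_id: str, rollout_percent: int) -> bool:
    """Enable the flag for roughly rollout_percent of orgs.

    An org's bucket never changes, so raising rollout_percent only adds orgs
    and never flips an already-enabled org back off.
    """
    return rollout_bucket(flag, org_id) < rollout_percent


orgs = [f"org-{i}" for i in range(1000)]
enabled_at_10 = {o for o in orgs if is_enabled("new_scheduler", o, 10)}
enabled_at_50 = {o for o in orgs if is_enabled("new_scheduler", o, 50)}
print(len(enabled_at_10), len(enabled_at_50))
```

Hashing the flag name together with the org ID keeps different flags from always selecting the same early-adopter orgs.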
  • 32. Example: Cluster Manager (diagram labels). Cluster manager v1: Cloud VM API, Cluster Manager, Customer Clusters. Cluster manager v2: Cloud VM API, CM Master, Delegates, Customer Clusters. Annotations: usage, billing, etc.; VM launch, setup, monitoring, etc.
  • 33. Four Lessons: (1) What goes wrong in cloud services? (2) Testing for scalability & stability. (3) Developing control planes. (4) Evolving big data systems for the cloud.
  • 34. Evolving Big Data Systems for the Cloud. MapReduce, Spark, etc. were designed for on-premise datacenters. How can we evolve these to leverage the benefits of the cloud (availability, elasticity, scale, multitenancy, etc.)? Two examples from Databricks: Delta Lake (ACID on cloud object stores) and “Cloudifying” Apache Spark.
  • 35. Delta Lake Motivation. Cloud object stores (S3, Azure Blob, etc.) are the largest storage systems on the planet: unmatched availability, parallel I/O bandwidth, and cost-efficiency. The open source big data stack was designed for the on-prem world: filesystem API for storage; RDBMS for table metadata (Hive metastore); other distributed systems, e.g. ZooKeeper. How can big data systems fully leverage cloud object stores? (Slide callouts: stronger consistency model; scale & management complexity.)
  • 36. Example: Atomic Parallel Writes (diagram). Spark on HDFS: input files are written as output partitions part-1 to part-4 under /tmp-job-1, followed by an atomic rename to /my-output. Spark on S3 (naïve): output partitions are written as /my-output/part-1 to /my-output/part-4 plus a /_DONE marker, using full object names (no cheap rename). Real cases are harder (e.g. appending to a table).
  • 37. Delta Lake Design. (1) Track metadata that says which objects are part of a dataset. (2) Store this metadata itself in a cloud object store: write-ahead log in S3, compressed using Apache Parquet. Before Delta Lake: 50% of Spark support issues were about cloud storage. After: fewer issues, increased perf. 10x faster metadata ops than Hive on S3! (Diagram: input files; output partitions /my-output/part-X, /my-output/part-Y, /my-output/part-Z, /my-output/part-W; commit protocol writing to /my-output/_delta_log.) https://delta.io
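The idea of a log that says which objects belong to a dataset can be illustrated with a toy in-memory object store. This is a minimal sketch, not Delta Lake's actual protocol or file format: `InMemoryObjectStore`, its `put_if_absent` primitive, the JSON entry layout, and the helper names are assumptions made for illustration.

```python
import json
from typing import Dict, List


class InMemoryObjectStore:
    """Toy stand-in for a cloud object store (no rename, whole-object puts)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def put_if_absent(self, key: str, data: bytes) -> bool:
        # Assumed primitive: create the object only if the key is unused.
        if key in self._objects:
            return False
        self._objects[key] = data
        return True

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str) -> List[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


LOG_PREFIX = "my-output/_delta_log/"


def commit(store: InMemoryObjectStore, added_files: List[str]) -> int:
    """Publish files by appending the next numbered log entry.

    A real protocol would also re-check for conflicting changes before
    retrying; this sketch simply takes the next free version number.
    """
    version = len(store.list_keys(LOG_PREFIX))
    entry = json.dumps({"add": added_files}).encode("utf-8")
    while not store.put_if_absent(f"{LOG_PREFIX}{version:020d}.json", entry):
        version += 1
    return version


def visible_files(store: InMemoryObjectStore) -> List[str]:
    """Readers trust only the log, never a listing of the data directory."""
    files: List[str] = []
    for key in store.list_keys(LOG_PREFIX):
        files.extend(json.loads(store.get(key))["add"])
    return files


store = InMemoryObjectStore()
store.put("my-output/part-X", b"rows")
store.put("my-output/part-Y", b"rows")
first_version = commit(store, ["my-output/part-X", "my-output/part-Y"])
store.put("my-output/part-Z", b"partial rows")  # uploaded but never committed
print(first_version, visible_files(store))
```

Because readers consult only the log, the uncommitted `part-Z` object stays invisible, which is the property the naïve S3 layout with a `_DONE` marker cannot provide for appends.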
  • 38. Major Benefits of Delta Lake. Once we had transactions over S3, we could build much more: UPSERT, DELETE, etc. (GDPR); time travel; caching; multidimensional indexing; audit logging; background optimization. Result: greatly simplified customers’ data architectures. (Chart: running time for a filter on 2 fields, axis 0 to 1, comparing Parquet, Parquet + order, and Delta Z-order.)
  • 39. Before and After Delta (diagram). Typical architecture with multiple data systems: input, message queue, ETL jobs, Parquet tables 1-3 on a cloud object store, and data warehouses, serving streaming analytics, data scientists, and BI users. Using Delta Lake for all storage: input, ETL jobs, and Delta Lake tables 1-3 on a cloud object store, serving streaming analytics, data scientists, and BI users. Fewer copies of the data, fewer ETL steps, no divergence & faster results!
  • 40. Delta Lake Adoption. Used by thousands of companies to process exabytes of data/day. Grew from zero to ~50% of the Databricks workload in 3 years. Largest deployments: exabyte tables and 1000s of users.
  • 41. Launched Today: SQL Analytics Product. New native SQL engine running over Delta Lake. Queries cloud storage with data warehouse performance. SQL security control model and easy access from BI tools. (Chart: TPC-DS 30TB benchmark time in seconds, axis 0 to 40,000, comparing DW1, DW2, DW3, and Delta Engine.)
  • 42. “Cloudifying” Apache Spark. “High-concurrency” clusters with strong user isolation: give users a “serverless” experience for ad hoc analytics queries. Scheduler-integrated cluster autoscaling. Autoscaling storage volumes. …
  • 43. Conclusion. The cloud is eating software by enabling much better products: self-managing, elastic, and highly available. But building reliable, large-scale cloud products is hard. Many ideas can help, from software engineering best practices to new testing methods and system designs for large scale.