Lessons from Large-Scale
Cloud Software at Databricks
Matei Zaharia
@matei_zaharia
2
Outline
The cloud is eating software, but why?
About Databricks
Challenges, solutions and research questions
4
Traditional Software vs. Cloud Software
[Diagram: two columns, Vendor and Customers. Traditional software: a Dev Team releases every 6-12 months to customers, each with its own Users and Ops. Cloud software: a Dev + Ops Team releases every 1-2 weeks.]
5
Why Use Cloud Software?
1. Management built-in: much more value than the software bits alone (security, availability, etc)
2. Elasticity: pay-as-you-go, scale on demand
3. Better features released faster
6
Differences in Building Cloud Software
+ Release cycle: send to users faster, get feedback faster
+ Only need to maintain 2 software versions (current & next),
in fewer configurations than you’d have on-prem
– Upgrading without regressions: very hard, but critical for users
to trust your cloud (on-prem apps don’t need this)
§ Includes API, semantics, and performance regressions
7
Many of these challenges aren’t studied in research
Differences in Building Cloud Software
– Building a multitenant service: significant scaling, security and
performance isolation work that you won’t need on-prem
(customers install separate instances)
– Operating the service: security, availability, monitoring, etc
(but customers would have to do it themselves on-prem)
+ Monitoring: see usage live for ops & product analytics
8
About Databricks
Founded in 2013 by the Apache Spark team at UC Berkeley
Data and ML platform on AWS and Azure for >5000 customers
§ Millions of VMs launched/day, processing exabytes of data
§ 100,000s of users
1000 employees, 200 engineers, >$200M ARR
9
[Chart: VMs managed per day]
10
Some of Our Customers
Industries: Financial Services, Healthcare & Pharma, Media & Entertainment, Technology, Public Sector, Retail & CPG, Consumer Services, Energy & Industrial IoT, Marketing & AdTech, Data & Analytics Services
Example use cases:
§ Identify fraud using machine learning on 30 PB of trade data
§ Correlate 500,000 patients’ records with their DNA to design therapies
§ Curb abusive behavior in the world’s largest online game
14
Our Product
[Diagram labels: Databricks Service; Customer’s Cloud Account; users: data scientists, data engineers, business users; interfaces: interactive data science, scheduled jobs, SQL frontend; components: compute clusters, Databricks Runtime, cloud storage, ML platform, data catalog, security policies. Built around open source.]
15
Our Specific Challenges
All the usual challenges of SaaS:
§ Availability, security, multitenancy, updates, etc
Plus, the workloads themselves are large-scale!
§ One user job could easily overload control services
§ Millions of VMs ⇒ many weird failures
16
Four Lessons
1. What goes wrong in cloud systems?
2. Testing for scalability & stability
3. Developing control planes
4. Evolving big data systems for the cloud
18
What Goes Wrong in the Cloud?
Academic research studies many kinds of failures:
§ Software bugs, network config, crash failures, etc
These matter, but other problems often have larger impact:
§ Scaling and resource limits
§ Workload isolation
§ Updates & regressions
19
Causes of Significant Outages
[Pie chart. Categories as listed: scaling problem in our services; scaling problem in underlying cloud services; insufficient user isolation; deployment misconfiguration; other. Values as listed: 30%, 20%, 20%, 20%, 10%. Callout: 70% scale related.]
21
Some Issues We Experienced
Cloud networks: limits, partitions, slow DHCP, hung connections
Automated apps creating large load
Very large requests, results, etc
Slow VM launches/shutdowns, lack of VM capacity
Data corruption writing to cloud storage
22
Example Outage: Aborted Jobs
Jobs Service launches & tracks jobs on clusters
1 customer running many jobs/sec on same cluster
Cloud’s network reaches a limit of 1000 connections/VM
between Jobs Service & clusters
§ After this limit, new connections hang in state SYN_SENT
Resource usage from hanging connections causes
memory pressure and GC
Health checks to some jobs time out, so we abort them
[Diagram: Jobs Service, Cloud Network, Customer Clusters]
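One common mitigation for this class of outage is to cap outbound connections in the client and fail fast rather than let new connections hang. The sketch below is only an illustration under stated assumptions, not the fix described in the talk: the class name `ConnectionBudget`, its parameters, and the injectable `opener` are all hypothetical.

```python
import socket
import threading
from contextlib import contextmanager


class ConnectionBudget:
    """Hypothetical helper: caps concurrent outbound connections and
    fails fast instead of letting callers pile up on hung connects."""

    def __init__(self, max_connections, acquire_timeout_s, connect_timeout_s,
                 opener=socket.create_connection):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._acquire_timeout_s = acquire_timeout_s
        self._connect_timeout_s = connect_timeout_s
        self._opener = opener  # injectable so the logic can be tested offline

    @contextmanager
    def connect(self, host, port):
        # Shed load once the budget is used up, well below the network's limit.
        if not self._slots.acquire(timeout=self._acquire_timeout_s):
            raise RuntimeError("connection budget exhausted; shedding load")
        try:
            # A connect timeout bounds how long a connection can sit unanswered.
            sock = self._opener((host, port), timeout=self._connect_timeout_s)
            try:
                yield sock
            finally:
                sock.close()
        finally:
            self._slots.release()
```

A caller would wrap each request in `with budget.connect(host, port) as sock:` and treat the `RuntimeError` as a signal to queue or reject work, so that memory pressure does not build up behind stalled connections.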
23
Surprisingly Rare Issues
1 cloud-wide VM restart on AWS (Xen patch)
1 misreported security scan on customer VM
1 significant S3 outage
1 kernel bug (hung TCP connections due to SACK fix)
24
Lessons
Cloud services must handle load that varies on many dimensions,
and rely on other services with varying limits & failure modes
§ Problems likely to get worse in a “cloud service economy”
End-to-end issues remain hard to prevent
The usual factors of MTTR, monitoring, testing, etc help
26
Testing for Scalability & Stability
Software correctness is a Boolean property: does your software
give the right output on a given input?
Scalability and stability are a matter of degree
§ What load will your system fail at? (any system with limited resources will)
§ What failure behavior will you have? (crash all clients, drop some, etc)
27
Example Scalability Problems
Large result: can crash browser,
notebook service, driver or Spark
Large record in file
Large # of tasks
Code that freezes a worker
+ All these affect other users!
[Diagram: User Browser, Notebook Service, Driver App, Workers; Other Users marked "??"]
28
Databricks Stress Test Infrastructure
1. Identify dimensions for a system to scale in (e.g. # of users, number
of output rows, size of each output row, etc)
2. Grow load in each dimension until a failure occurs
3. Record failure type and impact on system
§ Error message, timeout, wrong result?
§ Are other clients affected?
§ Does the system auto-recover? How fast?
4. Compare over time and on changes
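The four steps above can be sketched as a small harness. This is a minimal illustration, not the Databricks infrastructure: the names `StressResult`, `grow_until_failure` and `regressions` are hypothetical, load grows geometrically as an assumption, and only the failure type and load are recorded (impact on other clients and recovery time are left out).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StressResult:
    dimension: str
    last_ok_load: Optional[int]   # largest load that succeeded, if any
    failing_load: Optional[int]   # first load that failed, if any
    failure_type: Optional[str]   # exception class name at the failing load


def grow_until_failure(dimension: str, run_at_load: Callable[[int], None],
                       start: int = 1, factor: int = 2,
                       max_load: int = 1_000_000) -> StressResult:
    """Steps 2-3: grow load in one dimension until run_at_load raises."""
    load, last_ok = start, None
    while load <= max_load:
        try:
            run_at_load(load)
        except Exception as exc:
            return StressResult(dimension, last_ok, load, type(exc).__name__)
        last_ok = load
        load *= factor
    return StressResult(dimension, last_ok, None, None)


def regressions(baseline: List[StressResult],
                current: List[StressResult]) -> List[str]:
    """Step 4: dimensions whose tolerated load dropped versus the baseline."""
    base_by_dim = {r.dimension: r for r in baseline}
    return [
        cur.dimension
        for cur in current
        if cur.dimension in base_by_dim
        and (cur.last_ok_load or 0) < (base_by_dim[cur.dimension].last_ok_load or 0)
    ]
```

Here `run_at_load` stands in for whatever drives the system under test at a given load, for example submitting a query that returns that many output rows. Saving each run's results makes it possible to compare them against a later build.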
29
Example Output
31
Developing Control Planes
Cloud software consists of interacting, independently updated
services, many of which call other services
What should be the programming model for this software?
32
Examples
Cluster manager service:
§ API: requests to launch, scale and shut down clusters
§ Behavior: request VMs, set up clusters, reuse VMs in pools
§ State: requests, running VMs, etc
Jobs service:
§ API: scheduled or API-triggered jobs to execute
§ Behavior: acquire a cluster, run job, monitor state, retry
§ State: jobs to be run, what’s currently active, where is it, etc
[Diagram: the two services above alongside Cloud VM Service, IAM Service, Usage Service, Notebook Service, ...]
34
Control Plane Infrastructure
Our Platform Team develops a service framework that handles:
§ Deployment: AWS, Azure, local, special environments
§ Storage: databases, schema updates, etc
§ Security tokens & roles
§ Monitoring
§ API routing & limiting
§ Feature flagging
Our service stack includes Jsonnet
35
Best Practices
Isolate state: relational DB is usually enough with org sharding
Isolate components that scale differently: allows separate scaling
Manage changes through feature flags: fastest, safest way
Watch key metrics: most outages could be predicted from one of
CPU load, memory load, DB load or thread pool exhaustion
Test pyramid: 70% unit tests, 20% integration, 10% end-to-end
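Two of these practices, org sharding and feature flags, can be illustrated with a stable hash of the organization ID. The sketch below is an assumption about one simple way to do this, not the framework described in the talk; the function names and the percentage-based rollout are hypothetical.

```python
import hashlib


def _bucket(key: str, buckets: int) -> int:
    """Stable bucket for a key. hashlib is used because Python's built-in
    hash() for strings is salted per process and would not be stable."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def shard_for_org(org_id: str, num_shards: int) -> int:
    """Org sharding: all state for one org lives on one database shard."""
    return _bucket(org_id, num_shards)


def flag_enabled(flag: str, org_id: str, rollout_percent: int) -> bool:
    """Feature flag with gradual rollout: an org stays enabled as the
    percentage grows, because its bucket for this flag never changes."""
    return _bucket(f"{flag}:{org_id}", 100) < rollout_percent
```

Rolling a change out then means raising `rollout_percent` in configuration, and rolling it back means lowering it, with no redeploy. Note that plain modulo sharding reshuffles orgs when `num_shards` changes, so a real system would need a migration plan or a lookup table.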
36
Example: Cluster Manager
[Diagram: Cluster manager v1 has a single Cluster Manager between the Cloud VM API and Customer Clusters. Cluster manager v2 has a CM Master and multiple Delegates between the Cloud VM API and Customer Clusters. Annotations: "Usage, billing, etc" and "VM launch, setup, monitoring, etc".]
37
Challenges in Control Planes
Fine-grained isolation within a service
Non-standard failure modes (e.g. network conn. exhaustion)
Transitioning between architectures
39
Evolving Big Data Systems for the Cloud
MapReduce, Spark, etc were designed for on-premise datacenters
How can we evolve these to leverage the benefits of the cloud?
§ Availability, elasticity, scale, multitenancy, etc
Two examples from Databricks:
§ Delta Lake: ACID on cloud object stores
§ Cloudifying Apache Spark
40
Delta Lake Motivation
Cloud object stores (S3, Azure blob, etc) are the largest storage
systems on the planet
§ Unmatched availability, parallel I/O bandwidth, and cost-efficiency
Open source big data stack was designed for on-prem world
§ Filesystem API for storage
§ RDBMS for table metadata (Hive metastore)
§ Other distributed systems, e.g. ZooKeeper
Slide callouts on these components: stronger consistency model; scale & management complexity
How can big data systems fully leverage cloud object stores?
41
Example: Atomic Parallel Writes
[Diagram: Spark on HDFS: tasks read Input Files and write Output Partitions part-1 to part-4 under /tmp-job-1, then an atomic rename publishes them as /my-output. Spark on S3 (naïve): tasks write full object names /my-output/part-1 to /my-output/part-4 plus a /_DONE marker, because there is no cheap rename.]
Real cases are harder (e.g. appending to a table)
42
Delta Lake Design
1. Track metadata that says which objects are part of a dataset
2. Store this metadata itself in a cloud object store
§ Write-ahead log in S3, compressed using Apache Parquet
Before Delta Lake: 50% of Spark support issues were about cloud storage
After: fewer issues, increased perf
10x faster metadata ops than Hive on S3!
[Diagram: Input Files, Output Partitions (/my-output/part-X, /my-output/part-Y, /my-output/part-Z, /my-output/part-W), Commit Manager, /my-output/_delta_log]
https://delta.io
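The idea of tracking which objects belong to a dataset in a log can be sketched in a few lines. This is a deliberately simplified illustration and not Delta Lake's actual protocol: a local directory stands in for the object store, exclusive file creation stands in for an atomic put-if-absent, entries are plain JSON rather than Parquet, and the class name `TinyLog` is hypothetical.

```python
import json
import os


class TinyLog:
    """Toy transaction log: each commit is one numbered entry listing the
    data objects added to and removed from the dataset."""

    def __init__(self, table_dir):
        self.log_dir = os.path.join(table_dir, "_delta_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _path(self, version):
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def commit(self, version, add=(), remove=()):
        """Try to write entry `version`; return False if another writer
        already committed that version (the caller should re-read and retry)."""
        try:
            # Mode "x" fails if the file exists, emulating put-if-absent.
            with open(self._path(version), "x") as f:
                json.dump({"add": list(add), "remove": list(remove)}, f)
        except FileExistsError:
            return False
        return True

    def snapshot(self, as_of=None):
        """Replay the log in order to find the objects in the dataset,
        optionally as of an older version."""
        versions = sorted(
            int(name.split(".")[0])
            for name in os.listdir(self.log_dir)
            if name.endswith(".json")
        )
        files = set()
        for v in versions:
            if as_of is not None and v > as_of:
                break
            with open(self._path(v)) as f:
                entry = json.load(f)
            files -= set(entry["remove"])
            files |= set(entry["add"])
        return files
```

Readers only trust objects named in the log, so partially written outputs stay invisible until their commit lands, and replaying the log up to an earlier version gives a view of the dataset at that point.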
43
Major Benefits of Delta Lake
Once we had transactions over S3, we could build much more:
§ UPSERT, DELETE, etc (GDPR)
§ Caching
§ Multidimensional indexing
§ Audit logging
§ Time travel
§ Background optimization
Result: greatly simplified customers’ data architectures
[Chart: Filter on 2 Fields. Running time (axis 0 to 1) for Parquet, Parquet + order, and Delta Z-order.]
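The slide names Z-order without describing how it is computed. As background, the standard Z-order (Morton) curve interleaves the bits of two or more keys so that sorting by the combined value keeps records close together in every dimension. The sketch below shows that general technique for two non-negative integer fields; the function name `z_value` and the 16-bit default are assumptions, and nothing here is taken from Delta Lake's implementation.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y.
    Bit i of x lands at position 2*i, bit i of y at position 2*i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_sorted(records, key_x, key_y):
    """Sort records along the Z-order curve of two integer fields."""
    return sorted(records, key=lambda r: z_value(key_x(r), key_y(r)))
```

Sorting on a single field clusters only that field, while sorting by `z_value` clusters both. Data files written in that order then cover small ranges of each field, so a filter on either one can skip more files.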
44
Other Cloud Features
Scheduler-integrated autoscaling for Apache Spark
Autoscaling local storage volumes
User isolation for high-concurrency Spark clusters
§ Serverless experience for users inside an org
§ Separate library envs, IAM roles, performance & fault isolation
45
Conclusion
The cloud is eating software by enabling much better products
§ Self-managing, elastic, more reliable & scalable
But building cloud products is understudied and hard
§ Come see what’s involved in an internship!
Many opportunities, from service fabrics to cloud-native systems
We’re hiring in SF, Amsterdam & Toronto: databricks.com/jobs

 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 

Lessons from Building Large-Scale Cloud Software

  • 1. Lessons from Large-Scale Cloud Software at Databricks Matei Zaharia @matei_zaharia
  • 2. Outline: The cloud is eating software, but why? About Databricks. Challenges, solutions and research questions.
  • 4. [Diagram: Traditional Software vs. Cloud Software, showing vendor and customers. Labels: Dev Team, Release, 6-12 months, Users, Ops; Dev + Ops Team, 1-2 weeks.]
  • 5. Why Use Cloud Software? (1) Management built-in: much more value than the software bits alone (security, availability, etc.). (2) Elasticity: pay-as-you-go, scale on demand. (3) Better features released faster.
  • 6. Differences in Building Cloud Software: + Release cycle: send to users faster, get feedback faster. + Only need to maintain 2 software versions (current & next), in fewer configurations than you’d have on-prem. – Upgrading without regressions: very hard, but critical for users to trust your cloud (on-prem apps don’t need this); includes API, semantics, and performance regressions.
  • 7. Differences in Building Cloud Software: – Building a multitenant service: significant scaling, security and performance isolation work that you won’t need on-prem (customers install separate instances). – Operating the service: security, availability, monitoring, etc. (but customers would have to do it themselves on-prem). + Monitoring: see usage live for ops & product analytics. Callout: many of these challenges aren’t studied in research.
  • 8. About Databricks: Founded in 2013 by the Apache Spark team at UC Berkeley. Data and ML platform on AWS and Azure for >5000 customers (millions of VMs launched/day, processing exabytes of data; 100,000s of users). 1000 employees, 200 engineers, >$200M ARR.
  • 10. Some of Our Customers: Financial Services, Healthcare & Pharma, Media & Entertainment, Technology, Public Sector, Retail & CPG, Consumer Services, Energy & Industrial IoT, Marketing & AdTech, Data & Analytics Services.
  • 11. Some of Our Customers: Financial Services, Healthcare & Pharma, Media & Entertainment, Technology, Public Sector, Retail & CPG, Consumer Services, Energy & Industrial IoT, Marketing & AdTech, Data & Analytics Services. Example: identify fraud using machine learning on 30 PB of trade data.
  • 12. Some of Our Customers: Financial Services, Healthcare & Pharma, Media & Entertainment, Technology, Public Sector, Retail & CPG, Consumer Services, Energy & Industrial IoT, Marketing & AdTech, Data & Analytics Services. Example: correlate 500,000 patients’ records with their DNA to design therapies.
  • 13. Some of Our Customers: Financial Services, Healthcare & Pharma, Media & Entertainment, Technology, Public Sector, Retail & CPG, Consumer Services, Energy & Industrial IoT, Marketing & AdTech, Data & Analytics Services. Example: curb abusive behavior in the world’s largest online game.
  • 14. Our Product (built around open source). [Diagram labels: Data scientists, Data engineers, Business users; Interactive data science, Scheduled jobs, SQL frontend; ML platform, Data catalog, Security policies; Databricks Runtime, Compute Clusters, Cloud Storage; Databricks Service, Customer’s Cloud Account]
  • 15. Our Specific Challenges: All the usual challenges of SaaS (availability, security, multitenancy, updates, etc.). Plus, the workloads themselves are large-scale: one user job could easily overload control services; millions of VMs ⇒ many weird failures.
  • 16. Four Lessons: (1) What goes wrong in cloud systems? (2) Testing for scalability & stability. (3) Developing control planes. (4) Evolving big data systems for the cloud.
  • 18. What Goes Wrong in the Cloud? Academic research studies many kinds of failures: software bugs, network config, crash failures, etc. These matter, but other problems often have larger impact: scaling and resource limits; workload isolation; updates & regressions.
  • 19. Causes of Significant Outages [pie chart]. Categories: scaling problem in our services; scaling problem in underlying cloud services; insufficient user isolation; deployment misconfiguration; other. Slice values: 30%, 20%, 20%, 20%, 10%.
  • 20. Causes of Significant Outages [pie chart]. Categories: scaling problem in our services; scaling problem in underlying cloud services; insufficient user isolation; deployment misconfiguration; other. Slice values: 30%, 20%, 20%, 20%, 10%. Annotation: 70% scale related.
  • 21. Some Issues We Experienced: Cloud networks (limits, partitions, slow DHCP, hung connections); automated apps creating large load; very large requests, results, etc.; slow VM launches/shutdowns, lack of VM capacity; data corruption writing to cloud storage.
  • 22. Example Outage: Aborted Jobs. Jobs Service launches & tracks jobs on clusters. 1 customer running many jobs/sec on same cluster. Cloud’s network reaches a limit of 1000 connections/VM between Jobs Service & clusters; after this limit, new connections hang in state SYN_SENT. Resource usage from hanging connections causes memory pressure and GC. Health checks to some jobs time out, so we abort them. [Diagram: Jobs Service, Cloud Network, Customer Clusters]
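The failure chain on slide 22 starts when new connections hang rather than fail. As a minimal sketch of one possible guard (the slides do not say how Databricks addressed this), a client could keep its own budget below the network limit and reject excess connections immediately. The class names and the fail-fast policy here are illustrative assumptions.

```python
import threading
from contextlib import contextmanager


class ConnectionBudgetExceeded(RuntimeError):
    """Raised when the caller-side connection budget is exhausted."""


class ConnectionBudget:
    """Hypothetical client-side cap on concurrent connections to one target.

    The idea: stay below the network's per-VM limit and fail fast,
    instead of letting extra connections hang and pile up memory.
    """

    def __init__(self, max_connections: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)

    @contextmanager
    def connection(self):
        # Non-blocking acquire: reject immediately when the budget is used up.
        if not self._slots.acquire(blocking=False):
            raise ConnectionBudgetExceeded("connection budget exhausted")
        try:
            yield
        finally:
            self._slots.release()
```

A caller would wrap each outbound connection in `with budget.connection():` and treat `ConnectionBudgetExceeded` as a signal to queue or shed the request.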
  • 23. Surprisingly Rare Issues: 1 cloud-wide VM restart on AWS (Xen patch); 1 misreported security scan on customer VM; 1 significant S3 outage; 1 kernel bug (hung TCP connections due to SACK fix).
  • 24. Lessons: Cloud services must handle load that varies on many dimensions, and rely on other services with varying limits & failure modes (problems likely to get worse in a “cloud service economy”). End-to-end issues remain hard to prevent. The usual factors of MTTR, monitoring, testing, etc. help.
  • 25. Four Lessons: (1) What goes wrong in cloud systems? (2) Testing for scalability & stability. (3) Developing control planes. (4) Evolving big data systems for the cloud.
  • 26. Testing for Scalability & Stability: Software correctness is a Boolean property: does your software give the right output on a given input? Scalability and stability are a matter of degree. What load will your system fail at? (Any system with limited resources will.) What failure behavior will you have? (Crash all clients, drop some, etc.)
  • 27. Example Scalability Problems: Large result (can crash browser, notebook service, driver or Spark); large record in file; large # of tasks; code that freezes a worker. All these affect other users! [Diagram: User Browser, Notebook Service, Driver App, Workers, Other Users]
  • 28. Databricks Stress Test Infrastructure: 1. Identify dimensions for a system to scale in (e.g. # of users, number of output rows, size of each output row, etc.). 2. Grow load in each dimension until a failure occurs. 3. Record failure type and impact on system (error message, timeout, wrong result? Are other clients affected? Does the system auto-recover? How fast?). 4. Compare over time and on changes.
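The four steps on slide 28 can be sketched as a small harness. This is an illustrative sketch only, not Databricks' actual infrastructure: `find_breaking_point`, `regressed`, `StressResult`, and the fake service are hypothetical names, and doubling the load each round is an assumed growth policy.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StressResult:
    dimension: str                # step 1: the dimension being scaled
    last_ok: Optional[int]        # largest load that succeeded
    failed_at: Optional[int]      # first load that failed, if any
    failure_type: Optional[str]   # step 3: how it failed


def find_breaking_point(dimension: str,
                        run_at_load: Callable[[int], None],
                        start: int = 1,
                        factor: int = 2,
                        max_load: int = 1_000_000) -> StressResult:
    """Step 2: grow load along one dimension until a failure occurs."""
    load, last_ok = start, None
    while load <= max_load:
        try:
            run_at_load(load)
        except Exception as exc:  # step 3: record the failure type
            return StressResult(dimension, last_ok, load, type(exc).__name__)
        last_ok = load
        load *= factor
    return StressResult(dimension, last_ok, None, None)


def regressed(before: StressResult, after: StressResult) -> bool:
    """Step 4: does the system now fail at a lower load than it used to?"""
    if after.failed_at is None:
        return False
    return before.failed_at is None or after.failed_at < before.failed_at


def make_fake_service(row_limit: int) -> Callable[[int], None]:
    """Stand-in for a real system: fails once the result gets too large."""
    def run(output_rows: int) -> None:
        if output_rows > row_limit:
            raise MemoryError("result too large")
    return run


baseline = find_breaking_point("output_rows", make_fake_service(50_000))
candidate = find_breaking_point("output_rows", make_fake_service(10_000))
print(baseline, candidate, regressed(baseline, candidate))
```

A real harness would also record whether other clients were affected and how quickly the system recovered, which this sketch leaves out.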
  • 30. Four Lessons: (1) What goes wrong in cloud systems? (2) Testing for scalability & stability. (3) Developing control planes. (4) Evolving big data systems for the cloud.
  • 31. Developing Control Planes: Cloud software consists of interacting, independently updated services, many of which call other services. What should be the programming model for this software?
  • 33. Examples. Cluster manager service: API: requests to launch, scale and shut down clusters; Behavior: request VMs, set up clusters, reuse VMs in pools; State: requests, running VMs, etc. Jobs service: API: scheduled or API-triggered jobs to execute; Behavior: acquire a cluster, run job, monitor state, retry; State: jobs to be run, what’s currently active, where it is, etc. [Diagram: Cloud VM Service, IAM Service, Usage Service, Notebook Service, ...]
  • 34. Control Plane Infrastructure: Our Platform Team develops a service framework that handles: deployment (AWS, Azure, local, special environments); storage (databases, schema updates, etc.); security tokens & roles; monitoring; API routing & limiting; feature flagging. Our service stack: Jsonnet.
  • 35. Best Practices: Isolate state: relational DB is usually enough with org sharding. Isolate components that scale differently: allows separate scaling. Manage changes through feature flags: fastest, safest way. Watch key metrics: most outages could be predicted from one of CPU load, memory load, DB load or thread pool exhaustion. Test pyramid: 70% unit tests, 20% integration, 10% end-to-end.
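To make the feature-flag practice on slide 35 concrete, here is a minimal sketch of a percentage rollout keyed by organization. The slides do not describe how Databricks implements flags; the `FeatureFlags` class, the hash-based bucketing, and the flag name below are all assumptions made for illustration.

```python
import hashlib


def bucket(flag: str, org_id: str) -> int:
    """Map (flag, org) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{org_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


class FeatureFlags:
    """Hypothetical in-memory flag store with per-org percentage rollout."""

    def __init__(self) -> None:
        self._rollout_percent: dict[str, int] = {}

    def set_rollout(self, flag: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout_percent[flag] = percent

    def is_enabled(self, flag: str, org_id: str) -> bool:
        # Unknown flags default to off; an org stays enabled as the
        # percentage grows because its bucket never changes.
        return bucket(flag, org_id) < self._rollout_percent.get(flag, 0)


flags = FeatureFlags()
flags.set_rollout("new-cluster-manager", 10)   # hypothetical flag name
if flags.is_enabled("new-cluster-manager", "org-42"):
    pass  # take the new code path for this org
```

Because the bucket is derived from a hash rather than stored, raising the percentage only ever adds organizations, and setting it back to 0 turns the change off without a deploy.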
  • 36. Example: Cluster Manager [diagram]. Cluster manager v1: Cloud VM API, Cluster Manager, Customer Clusters. Cluster manager v2: Cloud VM API, CM Master, Delegate, Delegate, Customer Clusters. Annotations: usage, billing, etc.; VM launch, setup, monitoring, etc.
  • 37. Challenges in Control Planes: Fine-grained isolation within a service; non-standard failure modes (e.g. network connection exhaustion); transitioning between architectures.
  • 38. Four Lessons: (1) What goes wrong in cloud systems? (2) Testing for scalability & stability. (3) Developing control planes. (4) Evolving big data systems for the cloud.
  • 39. Evolving Big Data Systems for the Cloud: MapReduce, Spark, etc. were designed for on-premise datacenters. How can we evolve these to leverage the benefits of the cloud (availability, elasticity, scale, multitenancy, etc.)? Two examples from Databricks: Delta Lake (ACID on cloud object stores) and cloudifying Apache Spark.
  • 40. Delta Lake Motivation: Cloud object stores (S3, Azure blob, etc.) are the largest storage systems on the planet, with unmatched availability, parallel I/O bandwidth, and cost-efficiency. Open source big data stack was designed for on-prem world: filesystem API for storage; RDBMS for table metadata (Hive metastore); other distributed systems, e.g. ZooKeeper. Annotations: stronger consistency model; scale & management complexity. How can big data systems fully leverage cloud object stores?
  • 41. Example: Atomic Parallel Writes [diagram]. Spark on HDFS: Input Files, Output Partitions (part-1, part-2, part-3, part-4), /tmp-job-1, Atomic Rename, /my-output. Spark on S3 (naïve): Input Files, Output Partitions, /my-output/part-1, /my-output/part-2, /my-output/part-3, /my-output/part-4, /_DONE; full object names (no cheap rename). Real cases are harder (e.g. appending to a table).
  • 42. Delta Lake Design: 1. Track metadata that says which objects are part of a dataset. 2. Store this metadata itself in a cloud object store (write-ahead log in S3, compressed using Apache Parquet). Before Delta Lake: 50% of Spark support issues were about cloud storage. After: fewer issues, increased perf. 10x faster metadata ops than Hive on S3! https://delta.io [Diagram: Input Files, Output Partitions, /my-output/part-X, /my-output/part-Y, /my-output/part-Z, /my-output/part-W, /my-output/_delta_log, Commit Manager]
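The two design points on slide 42 (track which objects belong to a dataset, and keep that metadata as a log in the object store) can be illustrated with a toy model. This is a sketch under stated assumptions, not Delta Lake's implementation: the in-memory store, the `put_if_absent` primitive, the JSON entry format, and the zero-padded log file names are all chosen here for illustration (the slide mentions Parquet compression, which the sketch skips).

```python
import json
import threading

LOG_PREFIX = "/my-output/_delta_log/"


class InMemoryObjectStore:
    """Stand-in for a cloud object store with an atomic put-if-absent."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}
        self._lock = threading.Lock()

    def put(self, key: str, data: bytes) -> None:
        with self._lock:
            self._objects[key] = data

    def put_if_absent(self, key: str, data: bytes) -> bool:
        with self._lock:
            if key in self._objects:
                return False
            self._objects[key] = data
            return True

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list(self, prefix: str) -> list[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


def commit(store: InMemoryObjectStore, added, removed=()) -> int:
    """Append one log entry; the entry's creation is the atomic commit point."""
    entry = json.dumps({"add": list(added), "remove": list(removed)}).encode()
    version = len(store.list(LOG_PREFIX))
    # If another writer took this version, try the next one. A real system
    # would also check whether the concurrent commit conflicts with ours.
    while not store.put_if_absent(f"{LOG_PREFIX}{version:020d}.json", entry):
        version += 1
    return version


def snapshot(store: InMemoryObjectStore) -> set[str]:
    """Replay the log in order to find the objects that make up the table."""
    live: set[str] = set()
    for key in store.list(LOG_PREFIX):
        entry = json.loads(store.get(key))
        live.update(entry["add"])
        live.difference_update(entry["remove"])
    return live
```

Readers consult only the log, so data objects written by a job that never commits stay invisible, which is the property the atomic rename provided on HDFS.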
  • 43. Major Benefits of Delta Lake: Once we had transactions over S3, we could build much more: UPSERT, DELETE, etc. (GDPR); caching; multidimensional indexing; audit logging; time travel; background optimization. Result: greatly simplified customers’ data architectures. [Chart: running time, filter on 2 fields; series: Parquet, Parquet + order, Delta Z-order; axis 0 to 1]
  • 44. Other Cloud Features: Scheduler-integrated autoscaling for Apache Spark. Autoscaling local storage volumes. User isolation for high-concurrency Spark clusters: serverless experience for users inside an org; separate library envs, IAM roles, performance & fault isolation.
  • 45. Conclusion: The cloud is eating software by enabling much better products (self-managing, elastic, more reliable & scalable). But building cloud products is understudied and hard (come see what’s involved in an internship!). Many opportunities, from service fabrics to cloud-native systems. We’re hiring in SF, Amsterdam & Toronto: databricks.com/jobs