SlideShare a Scribd company logo
1 of 44
Download to read offline
Scaling Databricks to Run Data & AI
Workloads on Millions of VMs
Matei Zaharia
Cofounder and Chief Technologist
@matei_zaharia
2
With Contributions From
Jeff Pang Nanxi Kang Haogang Chen
Michael Armbrust Ihor Leshko Ali Ghodsi
3
Outline
About Databricks
Challenges developing large-scale cloud services
Four lessons from our experience
4
Outline
About Databricks
Challenges developing large-scale cloud services
Four lessons from our experience
5
About Databricks
Founded in 2013 by the Apache Spark group at UC Berkeley
Cloud-based data and ML platform for >7000 customers
§ Millions of VMs run per day, processing exabytes of data/day
§ 100,000s of users
1500 employees, 300 engineers, >$350M ARR
6
Security policies
Our Product
Built around
open source:
Data science
workspace
SQL analytics
Scheduled jobs
Data scientists
Data engineers
Business users
Cloud Storage
Compute Clusters
ML platform
Databricks Runtime
Data catalog
Customer’s Cloud AccountDatabricks Services
inside
7
Some of Our Customers
Financial Services Healthcare & Pharma Media & Entertainment Technology
Public Sector Retail & CPG Consumer Services Energy & IndustrialMarketing & AdTech
Data & Analytics Services
Nike
8
Example Use Cases
Correlate 500,000 patient records with DNA to design therapies
Optimize inventory management using simulations and ML
Identify securities fraud via machine learning on 30 PB of data
9
Usage Growth: VMs/day
10
What’s Challenging About This?
All the challenges of B2B cloud services
§ Availability, security, multitenancy, reliable updates, etc
+
All the challenges of large-scale data processing
§ Scalability, reliability, failures, etc
11
Traditional Software Cloud SoftwareVendorCustomers
Dev Team
Release
6-12 months
Users
Ops
Users
Ops
Users
Ops
Users
Ops
Dev + Ops Team
1-2 weeks
Users
Ops
Users
Ops
Users
Ops
Users
Ops
6-12 months
vs.
12
Benefits of Cloud Software for Users
Management built-in: much more value than the software bits
alone (security, high availability, etc)
Elasticity: pay-as-you-go, scale on demand
Better features released faster
13
Challenges in Building Cloud Software
Building a multitenant service: significant scaling, security and
performance isolation work that you don’t need on-prem
Operating the service: security, availability, monitoring, etc
(but customers would have to do it themselves on-prem)
Upgrading without regressions: very hard, but critical for users to
trust your cloud (on-prem apps don’t require this)
Includes API, semantics, and performance regressions
14
Challenges of Large Scale
Each customer app runs on up to thousands of VMs, so systems
must tolerate failures, stragglers, etc among those
Even one customer app could potentially overload our services
Weird rare failure modes that are only seen at scale
15
Four Lessons
1
2
3
What goes wrong in cloud services?
Testing for scalability & stability
Developing control planes
4 Evolving big data systems for the cloud
16
Four Lessons
1
2
3
What goes wrong in cloud services?
Testing for scalability & stability
Developing control planes
4 Evolving big data systems for the cloud
17
Causes of Significant Outages
Scaling problem
in our services
Scaling problem in
underlying cloud services
Insufficient
user isolation
Deployment
misconfiguration
Other
30%
20%20%
20%
10%
18
Causes of Significant Outages
Scaling problem
in our services
Scaling problem in
underlying cloud services
Insufficient
user isolation
Deployment
misconfiguration
Other
30%
20%20%
20%
10% 70% scale
related
19
Some Issues We Experienced
Cloud networks: limits, partitions, slow DHCP, hung connections
Automated apps creating large load
Very large requests, results, etc
Slow VM launches/shutdowns, lack of VM capacity
Data corruption writing to cloud storage
20
Example Outage: Aborted Jobs
Jobs Service launches & tracks jobs on clusters
1 customer running many jobs/sec on same cluster
Cloud’s network reaches a limit of 1000 connections/VM
between Jobs Service & clusters
§ After this limit, new connections hang in state SYN_SENT
Resource usage from hanging connections causes
memory pressure and GC
Health checks to some jobs time out, so we abort them
Jobs
Service
Cloud
Network Customer
Clusters
21
Surprisingly Rare Issues
1 cloud-wide VM restart on AWS (Xen patch)
1 misreported security scan on customer VM
1 significant S3 outage
2 Linux kernel bugs (TCP-related)
22
Lessons
Cloud services must handle load that varies on many dimensions,
and rely on other services with varying limits & failure modes
Problems get worse in a “cloud service economy” where each
service depends on others (maybe from other vendors)
23
Four Lessons
1
2
3
What goes wrong in cloud services?
Testing for scalability & stability
Developing control planes
4 Evolving big data systems for the cloud
24
Testing for Scalability & Stability
Software correctness is a Boolean property: does your software
give the right output on a given input?
Scalability and stability are a matter of degree
§ What load will your system fail at? (any system with limited resources will)
§ What failure behavior will you have? (crash all clients, drop some, etc)
25
Example Scalability Problems
Large result: can crash browser,
notebook service, driver or Spark
Large record in file
Large # of tasks
Code that freezes a worker
+ All these affect other users!
User Browser Notebook
Service
Driver
App
Workers
Other Users
??
26
Databricks Stress Test Infrastructure
1. Identify dimensions for a system to scale in (e.g. # of users, number
of output rows, size of each row, latency of an RPC, etc)
2. Grow load in each dimension until a failure occurs
3. Record failure type and impact on system
§ Error message, timeout, wrong result?
§ Are other clients affected?
§ Does the system auto-recover? How fast?
4. Compare over time and on changes
27
Example Output
28
Four Lessons
1
2
3
What goes wrong in cloud services?
Testing for scalability & stability
Developing control planes
4 Evolving big data systems for the cloud
29
Developing Control Planes
Cloud software consists of interacting, independently updated
services, many of which call other services
What is the right programming model for this software?
Billing
Service
Notebook
Service
Jobs
Service
Cluster
Service …
30
Service Framework
We designed a shared service framework that handles:
§ Cross-cloud deployment: AWS, Azure, local, special environments
§ Storage: databases, schema updates, etc
§ Security tokens & roles
§ Monitoring
§ API routing & limiting
§ Feature flagging
Our service stack:
32
Best Practices
Isolate state: relational DB is usually enough with org sharding
Isolate components that scale differently: allows separate scaling
Manage changes through feature flags: fastest, safest way
Watch key metrics: most outages could be predicted from one of
CPU load, memory load, DB load or thread pool exhaustion
Test pyramid: 70% unit tests, 20% integration, 10% end-to-end
33
Cluster manager v1 Cluster manager v2
Example: Cluster Manager
Cloud
VM API
Cluster
Manager
Customer Clusters
Cloud
VM API
CM Master
Customer Clusters
Delegate Delegate
Usage, billing, etc
VM launch, setup,
monitoring, etc
35
Four Lessons
1
2
3
What goes wrong in cloud services?
Testing for scalability & stability
Developing control planes
4 Evolving big data systems for the cloud
36
Evolving Big Data Systems for the Cloud
MapReduce, Spark, etc were designed for on-premise datacenters
How can we evolve these leverage the benefits of the cloud?
§ Availability, elasticity, scale, multitenancy, etc
Two examples from Databricks:
§ Delta Lake: ACID on cloud object stores
§ “Cloudifying” Apache Spark
37
Delta Lake Motivation
Cloud object stores (S3, Azure blob, etc) are the largest storage
systems on the planet
§ Unmatched availability, parallel I/O bandwidth, and cost-efficiency
Open source big data stack was designed for on-prem world
§ Filesystem API for storage
§ RDBMS for table metadata (Hive metastore)
§ Other distributed systems, e.g. ZooKeeper
How can big data systems fully leverage cloud object stores?
Stronger consistency model
Scale & management complexity
38
Spark on HDFS Spark on S3 (Naïve)
Example: Atomic Parallel Writes
Input Files Output Partitions
/tmp-job-1
/my-output
Atomic
Rename
Input Files Output Partitions
/my-output/part-1
/my-output/part-2
/my-output/part-3
/my-output/part-4
part-1
part-2
part-3
part-4
/_DONE
Full object names
(no cheap rename)
Real cases are harder (e.g. appending to a table)
39
Delta Lake Design
1. Track metadata that says which objects are part of a dataset
2. Store this metadata itself in a cloud object store
§ Write-ahead log in S3, compressed using Apache Parquet
Before Delta Lake: 50% of Spark support
issues were about cloud storage
After: fewer issues, increased perf
Input Files Output Partitions
/my-output/part-X
/my-output/part-Y
/my-output/part-Z
/my-output/part-W
/my-output/_delta_log
Commit
Protocol https://delta.io
10x faster metadata
ops than Hive on S3!
41
Major Benefits of Delta Lake
Once we had transactions over S3, we could build much more:
§ UPSERT, DELETE, etc (GDPR)
§ Time travel
§ Caching
§ Multidimensional indexing
§ Audit logging
§ Background optimization
0
0.2
0.4
0.6
0.8
1
Parquet
Parquet+
order
Delta
Z-order
Runningtime
Filter on 2 Fields
Result: greatly simplified
customers’ data architectures
Before and After Delta
Typical Architecture with Multiple Data Systems
ETL Job ETL Job
ETL Job
Delta Lake
Table 1
Delta Lake
Table 2
Delta Lake
Table 3
Streaming
Analytics
Data
Scientists
BI Users
Cloud Object Store
input
Cloud Object Store
ETL Job ETL Job
ETL Job
Message
Queue
Parquet
Table 1
Parquet
Table 2
Parquet
Table 3
Data
Warehouse
Data
Warehouse
Streaming
Analytics
Data
Scientists
BI Users
input
Using Delta Lake for All Storage
Fewer copies of the data, fewer ETL steps,
no divergence & faster results!
Delta Lake Adoption
Used by thousands of companies to process exabytes of data/day
Grew from zero to ~50% of the Databricks workload in 3 years
Largest deployments: exabyte tables and 1000s of users
45
Launched Today: SQL Analytics Product
New native SQL engine
running over Delta Lake
Queries cloud storage with
data warehouse performance
SQL security control model
and easy access from BI tools
0
10000
20000
30000
40000
DW1 DW2 DW3 Delta
Engine
TPC-DS 30TB Benchmark
Time (s)
46
“Cloudifying” Apache Spark
“High-concurrency” clusters with strong user isolation
§ Give users a “serverless” experience for ad hoc analytics queries
Scheduler-integrated cluster autoscaling
Autoscaling storage volumes
…
47
Conclusion
The cloud is eating software by enabling much better products
§ Self-managing, elastic, and highly available
But building reliable, large-scale cloud products is hard
Many ideas can help, from software engineering best practices to
new testing methods and system designs for large scale
48
We’re Hiring!
databricks.com/careers

More Related Content

What's hot

Migrating Oracle database to PostgreSQL
Migrating Oracle database to PostgreSQLMigrating Oracle database to PostgreSQL
Migrating Oracle database to PostgreSQLUmair Mansoob
 
re:Invent 2022 DAT326 Deep dive into Amazon Aurora and its innovations
re:Invent 2022  DAT326 Deep dive into Amazon Aurora and its innovationsre:Invent 2022  DAT326 Deep dive into Amazon Aurora and its innovations
re:Invent 2022 DAT326 Deep dive into Amazon Aurora and its innovationsGrant McAlister
 
Oracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsOracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsGokhan Atil
 
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...Sandesh Rao
 
What to Expect From Oracle database 19c
What to Expect From Oracle database 19cWhat to Expect From Oracle database 19c
What to Expect From Oracle database 19cMaria Colgan
 
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Edureka!
 
TFA Collector - what can one do with it
TFA Collector - what can one do with it TFA Collector - what can one do with it
TFA Collector - what can one do with it Sandesh Rao
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Azure Security Fundamentals
Azure Security FundamentalsAzure Security Fundamentals
Azure Security FundamentalsLorenzo Barbieri
 
Monitoring Apache Kafka
Monitoring Apache KafkaMonitoring Apache Kafka
Monitoring Apache Kafkaconfluent
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
 
Migrating from Oracle to Postgres
Migrating from Oracle to PostgresMigrating from Oracle to Postgres
Migrating from Oracle to PostgresEDB
 
Whats new in Autonomous Database in 2022
Whats new in Autonomous Database in 2022Whats new in Autonomous Database in 2022
Whats new in Autonomous Database in 2022Sandesh Rao
 
Oracle db performance tuning
Oracle db performance tuningOracle db performance tuning
Oracle db performance tuningSimon Huang
 
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...Amazon Web Services Korea
 
Oracle GoldenGate Performance Tuning
Oracle GoldenGate Performance TuningOracle GoldenGate Performance Tuning
Oracle GoldenGate Performance TuningBobby Curtis
 
Ditching the overhead - Moving Apache Kafka workloads into Amazon MSK - ADB30...
Ditching the overhead - Moving Apache Kafka workloads into Amazon MSK - ADB30...Ditching the overhead - Moving Apache Kafka workloads into Amazon MSK - ADB30...
Ditching the overhead - Moving Apache Kafka workloads into Amazon MSK - ADB30...Amazon Web Services
 
The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019Amit Banerjee
 

What's hot (20)

Migrating Oracle database to PostgreSQL
Migrating Oracle database to PostgreSQLMigrating Oracle database to PostgreSQL
Migrating Oracle database to PostgreSQL
 
re:Invent 2022 DAT326 Deep dive into Amazon Aurora and its innovations
re:Invent 2022  DAT326 Deep dive into Amazon Aurora and its innovationsre:Invent 2022  DAT326 Deep dive into Amazon Aurora and its innovations
re:Invent 2022 DAT326 Deep dive into Amazon Aurora and its innovations
 
Introduction to Amazon Aurora
Introduction to Amazon AuroraIntroduction to Amazon Aurora
Introduction to Amazon Aurora
 
Oracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsOracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAs
 
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
 
(STG402) Amazon EBS Deep Dive
(STG402) Amazon EBS Deep Dive(STG402) Amazon EBS Deep Dive
(STG402) Amazon EBS Deep Dive
 
What to Expect From Oracle database 19c
What to Expect From Oracle database 19cWhat to Expect From Oracle database 19c
What to Expect From Oracle database 19c
 
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
 
TFA Collector - what can one do with it
TFA Collector - what can one do with it TFA Collector - what can one do with it
TFA Collector - what can one do with it
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Azure Security Fundamentals
Azure Security FundamentalsAzure Security Fundamentals
Azure Security Fundamentals
 
Monitoring Apache Kafka
Monitoring Apache KafkaMonitoring Apache Kafka
Monitoring Apache Kafka
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Migrating from Oracle to Postgres
Migrating from Oracle to PostgresMigrating from Oracle to Postgres
Migrating from Oracle to Postgres
 
Whats new in Autonomous Database in 2022
Whats new in Autonomous Database in 2022Whats new in Autonomous Database in 2022
Whats new in Autonomous Database in 2022
 
Oracle db performance tuning
Oracle db performance tuningOracle db performance tuning
Oracle db performance tuning
 
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
 
Oracle GoldenGate Performance Tuning
Oracle GoldenGate Performance TuningOracle GoldenGate Performance Tuning
Oracle GoldenGate Performance Tuning
 
Ditching the overhead - Moving Apache Kafka workloads into Amazon MSK - ADB30...
Ditching the overhead - Moving Apache Kafka workloads into Amazon MSK - ADB30...Ditching the overhead - Moving Apache Kafka workloads into Amazon MSK - ADB30...
Ditching the overhead - Moving Apache Kafka workloads into Amazon MSK - ADB30...
 
The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019
 

Similar to Scaling Databricks to Run Data & AI Workloads on Millions of VMs

Lessons from Large-Scale Cloud Software at Databricks
Lessons from Large-Scale Cloud Software at DatabricksLessons from Large-Scale Cloud Software at Databricks
Lessons from Large-Scale Cloud Software at DatabricksMatei Zaharia
 
Technology Overview
Technology OverviewTechnology Overview
Technology OverviewLiran Zelkha
 
Oracle Keynote Cloud Expo 11-04-09
Oracle Keynote Cloud Expo 11-04-09Oracle Keynote Cloud Expo 11-04-09
Oracle Keynote Cloud Expo 11-04-09Rex Wang
 
Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)
Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)
Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)Manoj Kumar
 
Using Grid Technologies in the Cloud for High Scalability
Using Grid Technologies in the Cloud for High ScalabilityUsing Grid Technologies in the Cloud for High Scalability
Using Grid Technologies in the Cloud for High Scalabilitymabuhr
 
Accelerate Digital Transformation with IBM Cloud Private
Accelerate Digital Transformation with IBM Cloud PrivateAccelerate Digital Transformation with IBM Cloud Private
Accelerate Digital Transformation with IBM Cloud PrivateMichael Elder
 
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Prolifics
 
An introduction to the cloud 11 v1
An introduction to the cloud 11 v1An introduction to the cloud 11 v1
An introduction to the cloud 11 v1charan7575
 
Cloud On-Ramp Project Briefing
Cloud On-Ramp Project BriefingCloud On-Ramp Project Briefing
Cloud On-Ramp Project BriefingRobert McDermott
 
Cloud Computing Networks
Cloud Computing NetworksCloud Computing Networks
Cloud Computing Networksjayapal385
 
Cloud Migration.pdf
Cloud Migration.pdfCloud Migration.pdf
Cloud Migration.pdfZen Bit Tech
 
mysql_pn_heatwave.pdf
mysql_pn_heatwave.pdfmysql_pn_heatwave.pdf
mysql_pn_heatwave.pdfRavishPatel19
 
A perspective on cloud computing and enterprise saa s applications
A perspective on cloud computing and enterprise saa s applicationsA perspective on cloud computing and enterprise saa s applications
A perspective on cloud computing and enterprise saa s applicationsGeorge Milliken
 
Availability Considerations for SQL Server
Availability Considerations for SQL ServerAvailability Considerations for SQL Server
Availability Considerations for SQL ServerBob Roudebush
 

Similar to Scaling Databricks to Run Data & AI Workloads on Millions of VMs (20)

Lessons from Large-Scale Cloud Software at Databricks
Lessons from Large-Scale Cloud Software at DatabricksLessons from Large-Scale Cloud Software at Databricks
Lessons from Large-Scale Cloud Software at Databricks
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Technology Overview
Technology OverviewTechnology Overview
Technology Overview
 
Oracle Keynote Cloud Expo 11-04-09
Oracle Keynote Cloud Expo 11-04-09Oracle Keynote Cloud Expo 11-04-09
Oracle Keynote Cloud Expo 11-04-09
 
Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)
Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)
Cloud Computing – Opportunities, Definitions, Options, and Risks (Part-1)
 
Using Grid Technologies in the Cloud for High Scalability
Using Grid Technologies in the Cloud for High ScalabilityUsing Grid Technologies in the Cloud for High Scalability
Using Grid Technologies in the Cloud for High Scalability
 
Accelerate Digital Transformation with IBM Cloud Private
Accelerate Digital Transformation with IBM Cloud PrivateAccelerate Digital Transformation with IBM Cloud Private
Accelerate Digital Transformation with IBM Cloud Private
 
Database as a Service - Tutorial @ICDE 2010
Database as a Service - Tutorial @ICDE 2010Database as a Service - Tutorial @ICDE 2010
Database as a Service - Tutorial @ICDE 2010
 
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
 
An introduction to the cloud 11 v1
An introduction to the cloud 11 v1An introduction to the cloud 11 v1
An introduction to the cloud 11 v1
 
Jjm cloud computing
Jjm cloud computingJjm cloud computing
Jjm cloud computing
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
VAS - VMware CMP
VAS - VMware CMPVAS - VMware CMP
VAS - VMware CMP
 
Cloud On-Ramp Project Briefing
Cloud On-Ramp Project BriefingCloud On-Ramp Project Briefing
Cloud On-Ramp Project Briefing
 
Cloud Computing Networks
Cloud Computing NetworksCloud Computing Networks
Cloud Computing Networks
 
Cloud Migration.pdf
Cloud Migration.pdfCloud Migration.pdf
Cloud Migration.pdf
 
mysql_pn_heatwave.pdf
mysql_pn_heatwave.pdfmysql_pn_heatwave.pdf
mysql_pn_heatwave.pdf
 
A perspective on cloud computing and enterprise saa s applications
A perspective on cloud computing and enterprise saa s applicationsA perspective on cloud computing and enterprise saa s applications
A perspective on cloud computing and enterprise saa s applications
 
cloud computing models
cloud computing modelscloud computing models
cloud computing models
 
Availability Considerations for SQL Server
Availability Considerations for SQL ServerAvailability Considerations for SQL Server
Availability Considerations for SQL Server
 

Recently uploaded

A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 

Recently uploaded (20)

A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 

Scaling Databricks to Run Data & AI Workloads on Millions of VMs

  • 1. Scaling Databricks to Run Data & AI Workloads on Millions of VMs Matei Zaharia Cofounder and Chief Technologist @matei_zaharia
  • 2. 2 With Contributions From Jeff Pang Nanxi Kang Haogang Chen Michael Armbrust Ihor Leshko Ali Ghodsi
  • 3. 3 Outline About Databricks Challenges developing large-scale cloud services Four lessons from our experience
  • 4. 4 Outline About Databricks Challenges developing large-scale cloud services Four lessons from our experience
  • 5. 5 About Databricks Founded in 2013 by the Apache Spark group at UC Berkeley Cloud-based data and ML platform for >7000 customers § Millions of VMs run per day, processing exabytes of data/day § 100,000s of users 1500 employees, 300 engineers, >$350M ARR
  • 6. 6 Security policies Our Product Built around open source: Data science workspace SQL analytics Scheduled jobs Data scientists Data engineers Business users Cloud Storage Compute Clusters ML platform Databricks Runtime Data catalog Customer’s Cloud AccountDatabricks Services inside
  • 7. 7 Some of Our Customers Financial Services Healthcare & Pharma Media & Entertainment Technology Public Sector Retail & CPG Consumer Services Energy & IndustrialMarketing & AdTech Data & Analytics Services Nike
  • 8. 8 Example Use Cases Correlate 500,000 patient records with DNA to design therapies Optimize inventory management using simulations and ML Identify securities fraud via machine learning on 30 PB of data
  • 10. 10 What’s Challenging About This? All the challenges of B2B cloud services § Availability, security, multitenancy, reliable updates, etc + All the challenges of large-scale data processing § Scalability, reliability, failures, etc
  • 11. 11 Traditional Software Cloud SoftwareVendorCustomers Dev Team Release 6-12 months Users Ops Users Ops Users Ops Users Ops Dev + Ops Team 1-2 weeks Users Ops Users Ops Users Ops Users Ops 6-12 months vs.
  • 12. 12 Benefits of Cloud Software for Users Management built-in: much more value than the software bits alone (security, high availability, etc) Elasticity: pay-as-you-go, scale on demand Better features released faster
  • 13. 13 Challenges in Building Cloud Software Building a multitenant service: significant scaling, security and performance isolation work that you don’t need on-prem Operating the service: security, availability, monitoring, etc (but customers would have to do it themselves on-prem) Upgrading without regressions: very hard, but critical for users to trust your cloud (on-prem apps don’t require this) Includes API, semantics, and performance regressions
  • 14. 14 Challenges of Large Scale Each customer app runs on up to thousands of VMs, so systems must tolerate failures, stragglers, etc among those Even one customer app could potentially overload our services Weird rare failure modes that are only seen at scale
  • 15. 15 Four Lessons 1 2 3 What goes wrong in cloud services? Testing for scalability & stability Developing control planes 4 Evolving big data systems for the cloud
  • 16. 16 Four Lessons 1 2 3 What goes wrong in cloud services? Testing for scalability & stability Developing control planes 4 Evolving big data systems for the cloud
  • 17. 17 Causes of Significant Outages Scaling problem in our services Scaling problem in underlying cloud services Insufficient user isolation Deployment misconfiguration Other 30% 20%20% 20% 10%
  • 18. 18 Causes of Significant Outages Scaling problem in our services Scaling problem in underlying cloud services Insufficient user isolation Deployment misconfiguration Other 30% 20%20% 20% 10% 70% scale related
  • 19. 19 Some Issues We Experienced Cloud networks: limits, partitions, slow DHCP, hung connections Automated apps creating large load Very large requests, results, etc Slow VM launches/shutdowns, lack of VM capacity Data corruption writing to cloud storage
  • 20. 20 Example Outage: Aborted Jobs Jobs Service launches & tracks jobs on clusters 1 customer running many jobs/sec on same cluster Cloud’s network reaches a limit of 1000 connections/VM between Jobs Service & clusters § After this limit, new connections hang in state SYN_SENT Resource usage from hanging connections causes memory pressure and GC Health checks to some jobs time out, so we abort them Jobs Service Cloud Network Customer Clusters
  • 21. 21 Surprisingly Rare Issues 1 cloud-wide VM restart on AWS (Xen patch) 1 misreported security scan on customer VM 1 significant S3 outage 2 Linux kernel bugs (TCP-related)
  • 22. 22 Lessons Cloud services must handle load that varies on many dimensions, and rely on other services with varying limits & failure modes Problems get worse in a “cloud service economy” where each service depends on others (maybe from other vendors)
  • 23. 23 Four Lessons 1 2 3 What goes wrong in cloud services? Testing for scalability & stability Developing control planes 4 Evolving big data systems for the cloud
  • 24. 24 Testing for Scalability & Stability Software correctness is a Boolean property: does your software give the right output on a given input? Scalability and stability are a matter of degree § What load will your system fail at? (any system with limited resources will) § What failure behavior will you have? (crash all clients, drop some, etc)
  • 25. 25 Example Scalability Problems Large result: can crash browser, notebook service, driver or Spark Large record in file Large # of tasks Code that freezes a worker + All these affect other users! User Browser Notebook Service Driver App Workers Other Users ??
  • 26. 26 Databricks Stress Test Infrastructure 1. Identify dimensions for a system to scale in (e.g. # of users, number of output rows, size of each row, latency of an RPC, etc) 2. Grow load in each dimension until a failure occurs 3. Record failure type and impact on system § Error message, timeout, wrong result? § Are other clients affected? § Does the system auto-recover? How fast? 4. Compare over time and on changes
  • 28. 28 Four Lessons 1 2 3 What goes wrong in cloud services? Testing for scalability & stability Developing control planes 4 Evolving big data systems for the cloud
  • 29. 29 Developing Control Planes Cloud software consists of interacting, independently updated services, many of which call other services What is the right programming model for this software? Billing Service Notebook Service Jobs Service Cluster Service …
  • 30. 30 Service Framework We designed a shared service framework that handles: § Cross-cloud deployment: AWS, Azure, local, special environments § Storage: databases, schema updates, etc § Security tokens & roles § Monitoring § API routing & limiting § Feature flagging Our service stack:
  • 31. 32 Best Practices Isolate state: relational DB is usually enough with org sharding Isolate components that scale differently: allows separate scaling Manage changes through feature flags: fastest, safest way Watch key metrics: most outages could be predicted from one of CPU load, memory load, DB load or thread pool exhaustion Test pyramid: 70% unit tests, 20% integration, 10% end-to-end
  • 32. 33 Cluster manager v1 Cluster manager v2 Example: Cluster Manager Cloud VM API Cluster Manager Customer Clusters Cloud VM API CM Master Customer Clusters Delegate Delegate Usage, billing, etc VM launch, setup, monitoring, etc
  • 33. 35 Four Lessons 1 2 3 What goes wrong in cloud services? Testing for scalability & stability Developing control planes 4 Evolving big data systems for the cloud
  • 34. 36 Evolving Big Data Systems for the Cloud MapReduce, Spark, etc were designed for on-premise datacenters How can we evolve these leverage the benefits of the cloud? § Availability, elasticity, scale, multitenancy, etc Two examples from Databricks: § Delta Lake: ACID on cloud object stores § “Cloudifying” Apache Spark
  • 35. 37 Delta Lake Motivation Cloud object stores (S3, Azure blob, etc) are the largest storage systems on the planet § Unmatched availability, parallel I/O bandwidth, and cost-efficiency Open source big data stack was designed for on-prem world § Filesystem API for storage § RDBMS for table metadata (Hive metastore) § Other distributed systems, e.g. ZooKeeper How can big data systems fully leverage cloud object stores? Stronger consistency model Scale & management complexity
  • 36. 38 Spark on HDFS Spark on S3 (Naïve) Example: Atomic Parallel Writes Input Files Output Partitions /tmp-job-1 /my-output Atomic Rename Input Files Output Partitions /my-output/part-1 /my-output/part-2 /my-output/part-3 /my-output/part-4 part-1 part-2 part-3 part-4 /_DONE Full object names (no cheap rename) Real cases are harder (e.g. appending to a table)
  • 37. 39 Delta Lake Design 1. Track metadata that says which objects are part of a dataset 2. Store this metadata itself in a cloud object store § Write-ahead log in S3, compressed using Apache Parquet Before Delta Lake: 50% of Spark support issues were about cloud storage After: fewer issues, increased perf Input Files Output Partitions /my-output/part-X /my-output/part-Y /my-output/part-Z /my-output/part-W /my-output/_delta_log Commit Protocol https://delta.io 10x faster metadata ops than Hive on S3!
  • 38. 41 Major Benefits of Delta Lake Once we had transactions over S3, we could build much more: § UPSERT, DELETE, etc (GDPR) § Time travel § Caching § Multidimensional indexing § Audit logging § Background optimization 0 0.2 0.4 0.6 0.8 1 Parquet Parquet+ order Delta Z-order Runningtime Filter on 2 Fields Result: greatly simplified customers’ data architectures
  • 39. Before and After Delta Typical Architecture with Multiple Data Systems ETL Job ETL Job ETL Job Delta Lake Table 1 Delta Lake Table 2 Delta Lake Table 3 Streaming Analytics Data Scientists BI Users Cloud Object Store input Cloud Object Store ETL Job ETL Job ETL Job Message Queue Parquet Table 1 Parquet Table 2 Parquet Table 3 Data Warehouse Data Warehouse Streaming Analytics Data Scientists BI Users input Using Delta Lake for All Storage Fewer copies of the data, fewer ETL steps, no divergence & faster results!
  • 40. Delta Lake Adoption Used by thousands of companies to process exabytes of data/day Grew from zero to ~50% of the Databricks workload in 3 years Largest deployments: exabyte tables and 1000s of users
  • 41. 45 Launched Today: SQL Analytics Product New native SQL engine running over Delta Lake Queries cloud storage with data warehouse performance SQL security control model and easy access from BI tools 0 10000 20000 30000 40000 DW1 DW2 DW3 Delta Engine TPC-DS 30TB Benchmark Time (s)
  • 42. 46 “Cloudifying” Apache Spark “High-concurrency” clusters with strong user isolation § Give users a “serverless” experience for ad hoc analytics queries Scheduler-integrated cluster autoscaling Autoscaling storage volumes …
  • 43. 47 Conclusion The cloud is eating software by enabling much better products § Self-managing, elastic, and highly available But building reliable, large-scale cloud products is hard Many ideas can help, from software engineering best practices to new testing methods and system designs for large scale