SlideShare a Scribd company logo
1 of 28
DATA WAREHOUSE
BUILT FOR THE CLOUD
QCON San Francisco, November 2019
Thierry Cruanes, Co-Founder & CTO
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
snowflake-architecture/
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
THE DREAM DATA WAREHOUSE
(CIRCA2012)
No management
tasks, offered as a
service
Fast out-of-box with
no tuning knobs
Structured and
semi-structured
Petabyte scale
at very low cost
Full support for ACID
transactions with
read consistency
ANSI SQL, RBAC
No data silos
10x faster for the
same price, no
over provisioning
Extreme
simplicity
Store all
your data
No compromises
full fledge Data
Warehouse
Unlimited and
Instant Scaling
WHY THEN?
OURVIEWOFTHECLOUD…
20x
§ Storage became dirt cheap
§ Flat network offered uniform
bandwidth
§ Single core performance
stalled
§ Data warehouse and analytic
workload are mostly CPU
bound
Design for
abundance
and not
scarcity of
resources
THREE PILLARS
20x
Multi-Tenant
Service
Multi-cluster shared
data Architecture
Immutable Scalable
Storage
Leverage cloud elasticity
and pay only what you use
Instant scale
Performance isolation
Real-time Data sharing
Extremely fast response
time at scale
Fine grain vertical and
horizontal pruning on any
column
Automatically applied to any
data (structured and semi-
structured)
Self-tuning, self-healing
Transparent upgrade
Service architecture
designed for availability,
durability and security
ARCHITECTURE
AN ARCHITECTURE BUILT FOR THE CLOUD
Traditional Architectures
Shared storage
Single cluster
Shared-disk
Decentralized, local storage
Single cluster
Shared-nothing Multi-cluster, shared data
Centralized, scale-out storage
Multiple, independent compute clusters
MULTI-CLUSTER, SHARED DATAARCHITECTURE
No data silos
Storage decoupled from compute
Any data
Native for structured & semi-
structured
Unlimited scalability
Along many dimensions
Low cost
Compute on demand
Instantly cloning
Isolate prod from dev & qa
Highly available
11 9’s durability, 4 9’s availability
Databases
Clone
Data Science
Virtual
Warehouse
ETL & Data
Loading
Virtual
Warehouse
Finance
Virtual
Warehouse
Dev, Test,
QA
Virtual
Warehouse
Dashboards
Virtual
Warehouse
Marketing
Virtual
Warehouse
Virtual
Warehouse
VIRTUAL WAREHOUSE
How to allow concurrent workloads run without impacting each other?
One or more MPP compute cluster
Unit of fault and performance isolation
Use multiple warehouses to segregate
workload
Resizable on the fly
Able to access data in any database
Transparently caches data accessed
Transaction manager synchronizes
data access
Automatic suspend when idle and
resume when needed
SSD/RAM Cache SSD/RAM Cache SSD/RAM Cache SSD/RAM Cache
Virtual
warehouse A
Virtual
warehouse B
Virtual
warehouse C
Virtual
warehouse D
ETL Transformation SQL BI
MULTI-CLUSTER WAREHOUSE
LEVERAGEABUNDANCEOFCOMPUTERESOURCES
Automatically scales compute
resources based on concurrent usage
Single virtual warehouse of multiple
compute clusters
Queries are load balanced across the
clusters in a virtual warehouse
Split across availability zones for high
availability
Cluster 1 Cluster 2 Cluster 3
Virtual
Warehouse
Group
Query
Query
Query
Query scheduler
IN THE REAL-WORLD
Continuous
Loading (4TB/day)
S3
<5min SLA
Virtual Warehouse
Medium
ETL &
Maintenance
Virtual Warehouse
Large
Virtual Warehouse
2X-Large
Reporting
(Segmented)
Interactive
Dashboard
50% < 1s
85% < 2s
95% < 5s
Virtual Warehouse
Auto Scale – X-Large x 5
4 trillion rows
3+ petabyte raw
8x compression ratio
25M+ micro-partitions
Prod DB
SCALABLE IMMUTABLE
STORAGE
STORAGE IMMUTABILITY
Accumulates immutable data over time
Well supported by all cloud vendor object stores
Allow separation of storage and compute resources
Enable workload scalability
Heavily optimized for read mostly workload
Natural fit for analytic systems
Transaction management becomes a metadata problem
Multi-version concurrency control and Snapshot isolation semantic
Transaction coordination separated from storage and compute
Allow for consistent access across compute resources
SCALABLE STORAGE
AUTOMATICMICRO-PARTITIONING
Columnar
Partitions
Data is automatically partitioned at load time
Storage decoupled from compute
Columnar organization in each micro-partition
Enable both horizontal vertical pruning
Micro partition – only few 10MBs
Fine grain pruning, no skew
Metadata structure tracks data distribution
Very fast pruning at optimization time
Applied to both structured and semi-structured data
Very fast response time for both
AUTOMATICALLYAPPLIED TO SEMI-STRUCTURED DATA
> SELECT … FROM
…
Semi-structured data
(JSON, Avro, XML, Parquet, ORC)
Structured data
(e.g., CSV, TSV, …)
Optimized storage
Optimized data type, no fixed schema or
transformation required
Optimized SQL
querying
Full benefit of database optimizations
(pruning, filtering, …)
Native support
Loaded in raw form (e.g. JSON, Avro, XML)
EXAMPLE
Compute
Client Application
Web UIODBC Driver JDBC Driver
Storage
S3
Sale
s
MarketingData
Cloud
Services
Security
Optimization
Warehouse
Mgmt
Query
Mgmt
MetadataMetadataMetadata
DDL
Custom Reports
Node Node Node Node
Node Node Node Node
Node Node Node Node
Node Node Node Node
Custom
Reports
XL
Node Node Node Node
Node Node Node Node
Campaign Analysis
Campaign
Analysts
L
Storage 1
9
H
2
A
I
3
B
J
4
C
K
5
D
L
6
E
M
7
F
N
8
G
O
1 3 6 8
2 B D F
1B6F
38JG
K7O4
H2CL
P
T
Q
U
R
V
S
W
Node Node Node Node
Node Node Node Node
Loading
WH
L
Loading WH
P Q R S
T U V W
P
Q
R
S
T
U
VW
HTTPS (JDBC/ODBC/Python)
ENABLE DATA SHARING
Data Consumers
Secure and integrated
Snowflake’s access control
model
Only pay normal storage costs
for shared data
No limit to the number of
consumer accounts with which
a dataset may be shared
Providers
Get access to the data
without any need to move
or transform it.
Query and combine shared
data with existing data or
join together data from
multiple publishers
Consumers
Data Providers
ENABLE GLOBAL REPLICATION
AWS
(US West)
Azure
(US East)
AWS
(Ireland)
AWS
(Sydney)
AWS
Azure
AWS
(US East)
Azure
(Frankfurt)
AWS
(Frankfurt)
MULTI-TENANT SERVICE
DATA WAREHOUSE AS A SERVICE
DurabilityMulti-Tenant Service
No administration, self-tuning
and healing,
Transparent upgrade
Service architecture designed
for high availability and
durability
Security is at the core
Availability
All tier distributed over multiple
datacenters with active-active
data replication
No maintenance downtime,
fully transparent software &
hardware upgrade
Automatic repair of any failed
servers with transparent re-
execution of any failed queries
Persistent session for load-
balancing and transparent
fail-over
Synchronous replication of
data over multiple data centers
Automatic data retention and
fail safe technology to guard
against any data removal
SNOWFLAKE SERVICE
Three independent layers
Cache Cache Cache Cache
Authentication & Access Control
Infrastructure
manager Optimizer
Transaction
manager Security
Metadata
Cloud services
Compilation and Management
Data processing
Virtual warehouses
Storage
Databases
MANAGED SERVICE
BUILT-INDISASTERRECOVERYANDHIGHAVAILABILITY
Scale-out of all tiers
metadata, compute, storage
Resiliency across multiple
availability zones
geographic separation
separate power grids
built for synchronous replication
Fully online updates & patches
zero downtime
Back pressure and throttling
all the way back to the client
Cloud
services
Virtual
warehouses
Database
storage
Services
Metadata
ADAPTIVE ALL THE WAY TO THE CORE
SELFTUNING&SELFHEALINGINTERNALS
Adaptive
Self-tuning
Do no harm!
Automatic
Default
Automatic
Memory
Management
Automatic
Workload
Management
Automatic
Distribution
Method
Automatic
Degree of
Parallelism
Automatic
Fault
Handling No Statistics
No Vacuuming
EXAMPLE: AUTOMATIC SKEW AVOIDANCE
Detect popular values on the build side of the join
popular values detected at runtime
number of values
no performance degradation
kicks in when needed
enabled by default for all joins
Adaptive
Self-tuning
Do no harm!
Automatic
Default
Execution Plan
scan
join
scan
filter
1
2 Use broadcast for those and directed join for the others
21
WHAT’S NEXT?
SERVERLESS DATASERVICES
Target predictable well-identified database workloads
Horizontal scaling is automatic
Fine grain unit of work allow for degree of parallelism to be arbitrarily small or large
Secure since handled by the service
Transparent retry on failures
Service state entirely managed by the service
Monitoring and observability of the service
CLOUD NATIVE
ARCHITECTURE
A GIFT THAT KEEPS ON GIVING
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
snowflake-architecture/

More Related Content

More from C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinC4Media
 

More from C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Recently uploaded

Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 

Recently uploaded (20)

Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 

Snowflake Architecture: Building a Data Warehouse for the Cloud

  • 1. DATA WAREHOUSE BUILT FOR THE CLOUD QCON San Francisco, November 2019 Thierry Cruanes, Co-Founder & CTO
  • 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ snowflake-architecture/
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. THE DREAM DATA WAREHOUSE (CIRCA2012) No management tasks, offered as a service Fast out-of-box with no tuning knobs Structured and semi-structured Petabyte scale at very low cost Full support for ACID transactions with read consistency ANSI SQL, RBAC No data silos 10x faster for the same price, no over provisioning Extreme simplicity Store all your data No compromises full fledge Data Warehouse Unlimited and Instant Scaling
  • 5. WHY THEN? OURVIEWOFTHECLOUD… 20x § Storage became dirt cheap § Flat network offered uniform bandwidth § Single core performance stalled § Data warehouse and analytic workload are mostly CPU bound Design for abundance and not scarcity of resources
  • 6. THREE PILLARS 20x Multi-Tenant Service Multi-cluster shared data Architecture Immutable Scalable Storage Leverage cloud elasticity and pay only what you use Instant scale Performance isolation Real-time Data sharing Extremely fast response time at scale Fine grain vertical and horizontal pruning on any column Automatically applied to any data (structured and semi- structured) Self-tuning, self-healing Transparent upgrade Service architecture designed for availability, durability and security
  • 8. AN ARCHITECTURE BUILT FOR THE CLOUD Traditional Architectures Shared storage Single cluster Shared-disk Decentralized, local storage Single cluster Shared-nothing Multi-cluster, shared data Centralized, scale-out storage Multiple, independent compute clusters
  • 9. MULTI-CLUSTER, SHARED DATAARCHITECTURE No data silos Storage decoupled from compute Any data Native for structured & semi- structured Unlimited scalability Along many dimensions Low cost Compute on demand Instantly cloning Isolate prod from dev & qa Highly available 11 9’s durability, 4 9’s availability Databases Clone Data Science Virtual Warehouse ETL & Data Loading Virtual Warehouse Finance Virtual Warehouse Dev, Test, QA Virtual Warehouse Dashboards Virtual Warehouse Marketing Virtual Warehouse Virtual Warehouse
  • 10. VIRTUAL WAREHOUSE How to allow concurrent workloads run without impacting each other? One or more MPP compute cluster Unit of fault and performance isolation Use multiple warehouses to segregate workload Resizable on the fly Able to access data in any database Transparently caches data accessed Transaction manager synchronizes data access Automatic suspend when idle and resume when needed SSD/RAM Cache SSD/RAM Cache SSD/RAM Cache SSD/RAM Cache Virtual warehouse A Virtual warehouse B Virtual warehouse C Virtual warehouse D ETL Transformation SQL BI
  • 11. MULTI-CLUSTER WAREHOUSE LEVERAGEABUNDANCEOFCOMPUTERESOURCES Automatically scales compute resources based on concurrent usage Single virtual warehouse of multiple compute clusters Queries are load balanced across the clusters in a virtual warehouse Split across availability zones for high availability Cluster 1 Cluster 2 Cluster 3 Virtual Warehouse Group Query Query Query Query scheduler
  • 12. IN THE REAL-WORLD Continuous Loading (4TB/day) S3 <5min SLA Virtual Warehouse Medium ETL & Maintenance Virtual Warehouse Large Virtual Warehouse 2X-Large Reporting (Segmented) Interactive Dashboard 50% < 1s 85% < 2s 95% < 5s Virtual Warehouse Auto Scale – X-Large x 5 4 trillion rows 3+ petabyte raw 8x compression ratio 25M+ micro-partitions Prod DB
  • 14. STORAGE IMMUTABILITY Accumulates immutable data over time Well supported by all cloud vendor object stores Allow separation of storage and compute resources Enable workload scalability Heavily optimized for read mostly workload Natural fit for analytic systems Transaction management becomes a metadata problem Multi-version concurrency control and Snapshot isolation semantic Transaction coordination separated from storage and compute Allow for consistent access across compute resources
  • 15. SCALABLE STORAGE AUTOMATICMICRO-PARTITIONING Columnar Partitions Data is automatically partitioned at load time Storage decoupled from compute Columnar organization in each micro-partition Enable both horizontal vertical pruning Micro partition – only few 10MBs Fine grain pruning, no skew Metadata structure tracks data distribution Very fast pruning at optimization time Applied to both structured and semi-structured data Very fast response time for both
  • 16. AUTOMATICALLYAPPLIED TO SEMI-STRUCTURED DATA > SELECT … FROM … Semi-structured data (JSON, Avro, XML, Parquet, ORC) Structured data (e.g., CSV, TSV, …) Optimized storage Optimized data type, no fixed schema or transformation required Optimized SQL querying Full benefit of database optimizations (pruning, filtering, …) Native support Loaded in raw form (e.g. JSON, Avro, XML)
  • 17. EXAMPLE Compute Client Application Web UIODBC Driver JDBC Driver Storage S3 Sale s MarketingData Cloud Services Security Optimization Warehouse Mgmt Query Mgmt MetadataMetadataMetadata DDL Custom Reports Node Node Node Node Node Node Node Node Node Node Node Node Node Node Node Node Custom Reports XL Node Node Node Node Node Node Node Node Campaign Analysis Campaign Analysts L Storage 1 9 H 2 A I 3 B J 4 C K 5 D L 6 E M 7 F N 8 G O 1 3 6 8 2 B D F 1B6F 38JG K7O4 H2CL P T Q U R V S W Node Node Node Node Node Node Node Node Loading WH L Loading WH P Q R S T U V W P Q R S T U VW HTTPS (JDBC/ODBC/Python)
  • 18. ENABLE DATA SHARING Data Consumers Secure and integrated Snowflake’s access control model Only pay normal storage costs for shared data No limit to the number of consumer accounts with which a dataset may be shared Providers Get access to the data without any need to move or transform it. Query and combine shared data with existing data or join together data from multiple publishers Consumers Data Providers
  • 19. ENABLE GLOBAL REPLICATION AWS (US West) Azure (US East) AWS (Ireland) AWS (Sydney) AWS Azure AWS (US East) Azure (Frankfurt) AWS (Frankfurt)
  • 21. DATA WAREHOUSE AS A SERVICE DurabilityMulti-Tenant Service No administration, self-tuning and healing, Transparent upgrade Service architecture designed for high availability and durability Security is at the core Availability All tier distributed over multiple datacenters with active-active data replication No maintenance downtime, fully transparent software & hardware upgrade Automatic repair of any failed servers with transparent re- execution of any failed queries Persistent session for load- balancing and transparent fail-over Synchronous replication of data over multiple data centers Automatic data retention and fail safe technology to guard against any data removal
  • 22. SNOWFLAKE SERVICE Three independent layers Cache Cache Cache Cache Authentication & Access Control Infrastructure manager Optimizer Transaction manager Security Metadata Cloud services Compilation and Management Data processing Virtual warehouses Storage Databases
  • 23. MANAGED SERVICE BUILT-INDISASTERRECOVERYANDHIGHAVAILABILITY Scale-out of all tiers metadata, compute, storage Resiliency across multiple availability zones geographic separation separate power grids built for synchronous replication Fully online updates & patches zero downtime Back pressure and throttling all the way back to the client Cloud services Virtual warehouses Database storage Services Metadata
  • 24. ADAPTIVE ALL THE WAY TO THE CORE SELFTUNING&SELFHEALINGINTERNALS Adaptive Self-tuning Do no harm! Automatic Default Automatic Memory Management Automatic Workload Management Automatic Distribution Method Automatic Degree of Parallelism Automatic Fault Handling No Statistics No Vacuuming
  • 25. EXAMPLE: AUTOMATIC SKEW AVOIDANCE Detect popular values on the build side of the join popular values detected at runtime number of values no performance degradation kicks in when needed enabled by default for all joins Adaptive Self-tuning Do no harm! Automatic Default Execution Plan scan join scan filter 1 2 Use broadcast for those and directed join for the others 21
  • 26. WHAT’S NEXT? SERVERLESS DATASERVICES Target predictable well-identified database workloads Horizontal scaling is automatic Fine grain unit of work allow for degree of parallelism to be arbitrarily small or large Secure since handled by the service Transparent retry on failures Service state entirely managed by the service Monitoring and observability of the service
  • 27. CLOUD NATIVE ARCHITECTURE A GIFT THAT KEEPS ON GIVING
  • 28. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ snowflake-architecture/