How to Design a Modern Data
Warehouse in BigQuery
...or why I needed to forget everything I learned in data
modeling school
Dan Sullivan
March 11, 2021
Architecture
Ahead
Datastore Options
➤ Relational
➢ Highly structured and transactional
➢ Difficult to scale
➤ NoSQL
➢ Semi-structured, eventual consistency, scalable
➤ Analytical
➢ Structured, scalable, not transactional
Data Warehouse (early 2000s)
➤ Few servers
➤ Tightly coupled storage and
compute
➤ Scale vertically
➤ Built on same relational database
management systems used for
OLTP
BigQuery
➤ Serverless data warehouse
➤ Petabyte scale
➤ Uses SQL but is not a relational database
➤ Analytical database
➤ Other features
➢ BigQuery ML
➢ BigQuery BI Engine
➢ BigQuery GIS
So What’s Different
about BigQuery?
Source: https://cloud.google.com/blog/products/data-analytics/cloud-data-warehouse-bigquery-4-9s-sla
Dremel
➤ Multi-tenant cluster
➤ SQL queries to execution trees
➢ Leaves are called slots; read data and perform computation
➢ Inner nodes perform aggregation
➤ Dynamically allocate slots to queries
➤ Maintains fairness
➤ A single user could get 1,000s of slots
Source: https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview
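The execution tree on this slide can be illustrated with a small plain-Python sketch. Nothing here comes from Dremel itself: the shard data, the `leaf_scan` and `mixer` names, and the fan-in of 2 are assumptions made for the example. Leaves filter and partially aggregate their own shard in parallel, and inner nodes only combine partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list stands in for the data read by one slot.
SHARDS = [
    [{"region": "EU", "amount": 10}, {"region": "US", "amount": 5}],
    [{"region": "EU", "amount": 7}],
    [{"region": "US", "amount": 3}, {"region": "EU", "amount": 1}],
]

def leaf_scan(shard, region):
    """Leaf node: read one shard, filter it, and return a partial sum."""
    return sum(row["amount"] for row in shard if row["region"] == region)

def mixer(partials):
    """Inner node: combine the partial sums produced by its children."""
    return sum(partials)

def run_query(shards, region, fan_in=2):
    """Evaluate SUM(amount) WHERE region = <region> bottom-up through a tree."""
    # Leaves run in parallel, one task per shard.
    with ThreadPoolExecutor() as pool:
        level = list(pool.map(lambda s: leaf_scan(s, region), shards))
    # Each pass through the loop is one layer of inner nodes.
    while len(level) > 1:
        level = [mixer(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0] if level else 0

total_eu = run_query(SHARDS, "EU")
```

Because each leaf returns a single number instead of raw rows, very little data moves up the tree, which is the property the slide is pointing at.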
Colossus
➤ Distributed storage system
➤ Handles replication and recovery
➤ No need to manage storage
https://en.wikipedia.org/wiki/Google_File_System#/media/File:GoogleFileSystemGFS.svg
Jupiter & Borg
➤ Jupiter
➢ Google networking switch
➢ Petabit scale
➢ Storage to compute communication
➢ No need for rack awareness
➤ Borg
➢ Predecessor of Kubernetes
➢ Manages mixers and slots
https://medium.com/@jerub/the-production-environment-at-google-8a1aaece3767
https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf
Capacitor
➤ Columnar storage format
➤ Supports semi-structured data
➢ Nested structures
➢ Repeated fields
➤ No need to read parent column to produce a
nested structure attribute value
➤ Compression
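The slides do not describe Capacitor's actual encoding, so the following is only a toy model of the column-per-leaf idea. It assumes every record has the same shape and ignores repeated fields and compression; the record contents and the `to_columns` and `leaf_paths` helpers are hypothetical. The point is that a nested attribute such as `customer.city` gets its own column and can be read without touching the rest of `customer`.

```python
# Hypothetical records with one nested structure each.
RECORDS = [
    {"order_id": 1, "customer": {"name": "Ann", "city": "Oslo"}},
    {"order_id": 2, "customer": {"name": "Raj", "city": "Pune"}},
]

def leaf_paths(obj, prefix=""):
    """Yield (dotted_path, value) for every leaf attribute of a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from leaf_paths(value, path + ".")
        else:
            yield path, value

def to_columns(records):
    """Turn row-oriented records into one list per leaf attribute."""
    columns = {}
    for record in records:
        for path, value in leaf_paths(record):
            columns.setdefault(path, []).append(value)
    return columns

columns = to_columns(RECORDS)
# Only this one column is needed to answer a query on customer.city.
cities = columns["customer.city"]
```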
What Does this Mean
for Data Modeling?
If you remember anything
from this talk ...
➤ Design for scanning in parallel
➤ Partition to minimize amount of data scanned
➤ Cluster to further reduce the amount of data scanned
➤ Joins may require shuffling data across slots so ...
➤ Denormalize using nested and repeated fields
Partitioning
Partitioned Tables
➤ Table is divided into segments called partitions
➤ Improves query performance
➤ Lowers cost by reducing amount of data scanned
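A minimal simulation of why partitioning reduces the data scanned, written in plain Python rather than against BigQuery. The rows, the `event_date` field, and both helper functions are invented for illustration; the sketch simply counts how many rows a date-filtered query has to read when whole partitions can be skipped.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event rows spread over three days.
ROWS = [
    {"event_date": date(2021, 3, 9), "user": "a"},
    {"event_date": date(2021, 3, 10), "user": "b"},
    {"event_date": date(2021, 3, 10), "user": "c"},
    {"event_date": date(2021, 3, 11), "user": "d"},
]

def partition_by_day(rows):
    """Group rows into one partition per calendar day."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["event_date"]].append(row)
    return partitions

def scan(partitions, first_day, last_day):
    """Return matching rows plus the number of rows actually read."""
    rows_read = 0
    matches = []
    for day, rows in partitions.items():
        if first_day <= day <= last_day:  # partitions outside the range are skipped
            rows_read += len(rows)
            matches.extend(rows)
    return matches, rows_read

partitions = partition_by_day(ROWS)
matches, rows_read = scan(partitions, date(2021, 3, 10), date(2021, 3, 10))
```

Here the query for a single day reads 2 of the 4 rows; without partitions it would have to read all 4 to find them.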
Partition by Ingestion Time
➤ Loads data into daily, date-based partitions
➤ Automatically creates new partitions
➤ Uses ingestion time to determine partition
➤ Creates pseudo-column _PARTITIONTIME
➢ Date-based timestamp
➢ Used in queries to limit the number of partitions scanned
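As a sketch of how `_PARTITIONTIME` is typically used to limit the partitions scanned, the helper below builds a standard SQL query string for a range of ingestion days. The table name `mydataset.events`, the `event_id` column, and the helper itself are assumptions for the example; only the pseudo-column comes from the slide.

```python
from datetime import date, timedelta

def ingestion_window_query(table, first_day, last_day):
    """Build a query that scans only partitions ingested in [first_day, last_day]."""
    upper = last_day + timedelta(days=1)  # exclusive upper bound
    return (
        f"SELECT event_id\n"
        f"FROM `{table}`\n"
        f"WHERE _PARTITIONTIME >= TIMESTAMP('{first_day.isoformat()}')\n"
        f"  AND _PARTITIONTIME < TIMESTAMP('{upper.isoformat()}')"
    )

# Hypothetical table and date range.
sql = ingestion_window_query("mydataset.events", date(2021, 3, 1), date(2021, 3, 7))
print(sql)
```

Filtering on the pseudo-column with constant bounds is what lets whole daily partitions be skipped instead of being read and discarded.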
Date/Timestamp Partitioning
➤ Partition based on date or timestamp column
➤ Each partition holds one day of data by default (hourly, monthly, and yearly granularity are also available)
➤ No need for _PARTITIONTIME
➤ Special partitions
➢ __NULL__ when nulls in partition column
➢ __UNPARTITIONED__ when values in column outside allowed range
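The routing rules on this slide can be sketched as a small function that maps a column value to a daily partition. BigQuery names the special partitions `__NULL__` and `__UNPARTITIONED__`. The date bounds below are an assumption based on my reading of the documentation and should be checked against the current docs; the function name is hypothetical.

```python
from datetime import date, datetime

# Assumed allowed range for daily partitions; verify against current BigQuery docs.
MIN_DAY = date(1960, 1, 1)
MAX_DAY = date(2159, 12, 31)

def daily_partition_id(value, min_day=MIN_DAY, max_day=MAX_DAY):
    """Return the partition a DATE or TIMESTAMP column value would land in."""
    if value is None:
        return "__NULL__"
    # datetime is a subclass of date, so reduce it to its calendar day first.
    day = value.date() if isinstance(value, datetime) else value
    if day < min_day or day > max_day:
        return "__UNPARTITIONED__"
    return day.strftime("%Y%m%d")

talk_day = daily_partition_id(date(2021, 3, 11))
```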
Integer Range Partition
➤ Partition column must be an integer type
➤ Partition column cannot be repeated
➤ Cannot use legacy SQL to query integer-range partitioned tables
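For integer range partitioning, the table is declared with a start, an end, and an interval, for example `PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10))`. The sketch below mimics how a value would be assigned to a bucket under that declaration, assuming an inclusive start and an exclusive end; `customer_id` and the function name are made up for the example.

```python
def range_partition(value, start, end, interval):
    """Return the lower bound of the bucket holding value, or a special partition name."""
    if value is None:
        return "__NULL__"
    if value < start or value >= end:  # end is treated as exclusive
        return "__UNPARTITIONED__"
    return start + ((value - start) // interval) * interval

# With buckets [0, 10), [10, 20), ..., [90, 100):
bucket = range_partition(25, start=0, end=100, interval=10)
```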
Sharding vs. Partitioning
➤ Sharding
➢ Use separate table for each day
➢ [TABLE_NAME_PREFIX]_YYYYMMDD
➢ Use UNION ALL (or a wildcard table) in queries to scan multiple tables
➤ Partitioning is preferred over sharding
➢ Less metadata to maintain
➢ Less permission checking overhead
➢ Better performance
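To make the contrast concrete, the sketch below generates the query text for both approaches over the same three days. It assumes the usual `YYYYMMDD` suffix for sharded tables; the table names, the `event_date` column, and the helper functions are hypothetical.

```python
from datetime import date, timedelta

def shard_names(prefix, first_day, last_day):
    """List one table name per day, e.g. prefix_20210301."""
    names = []
    day = first_day
    while day <= last_day:
        names.append(f"{prefix}_{day.strftime('%Y%m%d')}")
        day += timedelta(days=1)
    return names

def sharded_query(prefix, first_day, last_day):
    """One SELECT per daily table, glued together with UNION ALL."""
    selects = [f"SELECT * FROM `{name}`" for name in shard_names(prefix, first_day, last_day)]
    return "\nUNION ALL\n".join(selects)

def partitioned_query(table, column, first_day, last_day):
    """One table, with a filter on the partition column."""
    return (
        f"SELECT * FROM `{table}`\n"
        f"WHERE {column} BETWEEN DATE '{first_day.isoformat()}' "
        f"AND DATE '{last_day.isoformat()}'"
    )

names = shard_names("mydataset.events", date(2021, 3, 1), date(2021, 3, 3))
sharded = sharded_query("mydataset.events", date(2021, 3, 1), date(2021, 3, 3))
partitioned = partitioned_query("mydataset.events", "event_date", date(2021, 3, 1), date(2021, 3, 3))
```

The sharded query grows by one table reference per day, each with its own metadata and permission check, while the partitioned query stays the same size for any date range.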
Requiring Partition Filter
➤ require_partition_filter option
➤ Specified at table level (formerly at partition level)
➤ Requires a WHERE clause with the partition column
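One way to set this at table level is through the `OPTIONS` clause of the DDL. The helper below only assembles the statement text; the table name, the columns, and the helper itself are assumptions for illustration.

```python
def create_table_ddl(table, columns, partition_column, require_filter=True):
    """Build CREATE TABLE text for a column-partitioned table."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    flag = "TRUE" if require_filter else "FALSE"
    return (
        f"CREATE TABLE `{table}` (\n  {cols}\n)\n"
        f"PARTITION BY {partition_column}\n"
        f"OPTIONS (require_partition_filter = {flag})"
    )

# Hypothetical table definition.
ddl = create_table_ddl(
    "mydataset.events",
    [("event_date", "DATE"), ("event_id", "STRING")],
    "event_date",
)
print(ddl)
```

With the option enabled, a query against this table is expected to include a predicate such as `WHERE event_date = DATE '2021-03-11'`; a query with no filter on `event_date` should be rejected rather than scanning every partition.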
Clustered Tables
Clustered Tables
➤ Data sorted based on values in one or more columns
➤ Can improve performance of aggregate queries
➤ Can reduce scanning when cluster columns used in WHERE clause
➤ Used with partitioned tables
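The effect of clustering on scanning can be imitated in plain Python. This is a simplified model and not how BigQuery stores data internally: rows are sorted by the cluster column, cut into fixed-size blocks, and each block remembers its minimum and maximum key so a lookup can skip blocks that cannot contain the value. The `customer_id` column, the block size, and both functions are assumptions.

```python
def build_blocks(rows, key, block_size):
    """Sort rows by the cluster key and split them into blocks with min/max metadata."""
    ordered = sorted(rows, key=lambda r: r[key])
    blocks = []
    for i in range(0, len(ordered), block_size):
        chunk = ordered[i:i + block_size]
        blocks.append({"min": chunk[0][key], "max": chunk[-1][key], "rows": chunk})
    return blocks

def lookup(blocks, key, value):
    """Return matching rows and how many blocks had to be read."""
    blocks_read = 0
    hits = []
    for block in blocks:
        if block["min"] <= value <= block["max"]:  # other blocks are skipped unread
            blocks_read += 1
            hits.extend(r for r in block["rows"] if r[key] == value)
    return hits, blocks_read

# Hypothetical data: 100 rows spread over 10 customers, arriving interleaved.
ROWS = [{"customer_id": i % 10, "n": i} for i in range(100)]
blocks = build_blocks(ROWS, "customer_id", block_size=10)
hits, blocks_read = lookup(blocks, "customer_id", 3)
```

In this toy data set the filter on `customer_id` reads 1 block out of 10, which is the kind of reduction the slide refers to when cluster columns appear in the WHERE clause.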
Automatic Reclustering
➤ As new data is added to a table, data
may be stored out of order
➤ BigQuery automatically re-clusters in the
background
Nested and Repeated
Fields
Nested and Repeated Fields
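A small sketch of what a denormalized row with nested and repeated fields looks like, using Python dicts and lists as stand-ins for BigQuery STRUCT and ARRAY columns. The order data and field names are invented. Each order carries its customer as a nested structure and its line items as a repeated field, so per-order totals need no join.

```python
# Hypothetical denormalized orders: "customer" is nested, "line_items" is repeated.
ORDERS = [
    {
        "order_id": 1,
        "customer": {"name": "Ann", "country": "NO"},
        "line_items": [
            {"sku": "A", "qty": 2, "price": 5.0},
            {"sku": "B", "qty": 1, "price": 3.5},
        ],
    },
    {
        "order_id": 2,
        "customer": {"name": "Raj", "country": "IN"},
        "line_items": [{"sku": "C", "qty": 3, "price": 2.0}],
    },
]

def order_totals(orders):
    """Sum qty * price over each order's own line items; no second table is consulted."""
    return {
        order["order_id"]: sum(item["qty"] * item["price"] for item in order["line_items"])
        for order in orders
    }

totals = order_totals(ORDERS)

# A roughly equivalent query over such a table (assumed table name `orders`):
#   SELECT order_id, SUM(i.qty * i.price) AS total
#   FROM orders, UNNEST(line_items) AS i
#   GROUP BY order_id
```

In a normalized model the same result would need a join between an orders table and a line items table, which is where shuffling across slots comes in.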
One more time … if you remember
anything from this talk ...
➤ Design for scanning in parallel
➤ Partition to minimize amount of data scanned
➤ Cluster to further reduce the amount of data scanned
➤ Joins may require shuffling data across slots so ...
➤ Denormalize using nested and repeated fields to avoid needing joins