SlideShare a Scribd company logo
1 of 27
Download to read offline
Data Warehouse
Index
● OLAP vs OLTP
● What is data warehouse
● BigQuery
○ Cost
○ Partitions and Clustering
○ Best practices
○ Internals
○ ML in BQ
OLAP vs OLTP
OLTP OLAP
Purpose Control and run essential
business operations in real
time
Plan, solve problems,
support decisions,
discover hidden insights
Data updates Short, fast updates
initiated by user
Data periodically
refreshed with
scheduled, long-running
batch jobs
Database design Normalized databases for
efficiency
Denormalized
databases for analysis
Space requirements Generally small if historical
data is archived
Generally large due to
aggregating large
datasets
OLTP OLAP
Backup and recovery Regular backups required to ensure
business continuity and meet legal
and governance requirements
Lost data can be reloaded
from OLTP database as
needed in lieu of regular
backups
Productivity Increases productivity of end users Increases productivity of
business managers, data
analysts, and executives
Data view Lists day-to-day business
transactions
Multi-dimensional view of
enterprise data
User examples Customer-facing personnel, clerks,
online shoppers
Knowledge workers such as
data analysts, business
analysts, and executives
What is a data warehouse
● OLAP solution
● Used for reporting and data analysis
BigQuery
● Serverless data warehouse
○ There are no servers to manage or database software to install
● Software as well as infrastructure including
○ scalability and high-availability
● Built-in features like
○ machine learning
○ geospatial analysis
○ business intelligence
● BigQuery maximizes flexibility by separating the compute engine that
analyzes your data from your storage
BigQuery Cost
● On demand pricing
○ 1 TB of data processed is $5
● Flat rate pricing
○ Based on number of pre requested slots
○ 100 slots → $2,000/month = 400 TB data processed on demand pricing
Partition in BQ
Clustering in BigQuery
BigQuery partition
● Time-unit column
● Ingestion time (_PARTITIONTIME)
● Integer range partitioning
● When using Time unit or ingestion time
○ Daily (Default)
○ Hourly
○ Monthly or yearly
● Number of partitions limit is 4000
Resource: https://cloud.google.com/bigquery/docs/partitioned-tables
BigQuery Clustering
● Columns you specify are used to colocate related data
● Order of the column is important
● The order of the specified columns determines the sort order of the data.
● Clustering improves
○ Filter queries
○ Aggregate queries
● Table with data size < 1 GB, don’t show significant improvement with
partitioning and clustering
● You can specify up to four clustering columns
BigQuery Clustering
Clustering columns must be top-level, non-repeated columns
● DATE
● BOOL
● GEOGRAPHY
● INT64
● NUMERIC
● BIGNUMERIC
● STRING
● TIMESTAMP
● DATETIME
Partitioning vs Clustering
Clustering Partitoning
Cost benefit unknown Cost known upfront
You need more granularity than partitioning
alone allows
You need partition-level management.
Your queries commonly use filters or
aggregation against multiple particular
columns
Filter or aggregate on single column
The cardinality of the number of values in a
column or group of columns is large
Clustering over paritioning
● Partitioning results in a small amount of data per partition (approximately less
than 1 GB)
● Partitioning results in a large number of partitions beyond the limits on
partitioned tables
● Partitioning results in your mutation operations modifying the majority of
partitions in the table frequently (for example, every few minutes)
Automatic reclustering
As data is added to a clustered table
● the newly inserted data can be written to blocks that contain key ranges that
overlap with the key ranges in previously written blocks
● These overlapping keys weaken the sort property of the table
To maintain the performance characteristics of a clustered table
● BigQuery performs automatic re-clustering in the background to restore the
sort property of the table
● For partitioned tables, clustering is maintained for data within the scope of
each partition.
BigQuery-Best Practice
● Cost reduction
○ Avoid SELECT *
○ Price your queries before running them
○ Use clustered or partitioned tables
○ Use streaming inserts with caution
○ Materialize query results in stages
BigQuery-Best Practice
● Query performance
○ Filter on partitioned columns
○ Denormalizing data
○ Use nested or repeated columns
○ Use external data sources appropriately
○ Don't use it, in case u want a high query performance
○ Reduce data before using a JOIN
○ Do not treat WITH clauses as prepared statements
○ Avoid oversharding tables
BigQuery-Best Practice
● Query performance
○ Avoid JavaScript user-defined functions
○ Use approximate aggregation functions (HyperLogLog++)
○ Order Last, for query operations to maximize performance
○ Optimize your join patterns
■ As a best practice, place the table with the largest number of rows first, followed by the
table with the fewest rows, and then place the remaining tables by decreasing size.
Internals
Internals
Internals
Reference
https://cloud.google.com/bigquery/docs/how-to
https://research.google/pubs/pub36632/
https://panoply.io/data-warehouse-guide/bigquery-architecture/
http://www.goldsborough.me/distributed-systems/2019/05/18/21-09-00-a_look_at_dremel/
ML in BigQuery
● Target audience Data analysts, managers
● No need for Python or Java knowledge
● No need to export data into a different system
ML in BigQuery pricing
● Free
○ 10 GB per month of data storage
○ 1 TB per month of queries processed
○ ML Create model step: First 10 GB per month is free
ML in BigQuery pricing
ML in BigQuery
ML in BigQuery

More Related Content

Similar to Data Enginering from Google Data Warehouse

Geek Sync | Intro to Query Store
Geek Sync | Intro to Query StoreGeek Sync | Intro to Query Store
Geek Sync | Intro to Query StoreIDERA Software
 
Tuning data warehouse
Tuning data warehouseTuning data warehouse
Tuning data warehouseSrinivasan R
 
MariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and OptimizationMariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and OptimizationMariaDB plc
 
Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]Hiral Patel
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...
Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...
Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...Ontico
 
Model selection and tuning at scale
Model selection and tuning at scaleModel selection and tuning at scale
Model selection and tuning at scaleOwen Zhang
 
Best Practices – Extreme Performance with Data Warehousing on Oracle Database
Best Practices – Extreme Performance with Data Warehousing on Oracle DatabaseBest Practices – Extreme Performance with Data Warehousing on Oracle Database
Best Practices – Extreme Performance with Data Warehousing on Oracle DatabaseEdgar Alejandro Villegas
 
How to Cost-Optimize Cloud Data Pipelines_.pptx
How to Cost-Optimize Cloud Data Pipelines_.pptxHow to Cost-Optimize Cloud Data Pipelines_.pptx
How to Cost-Optimize Cloud Data Pipelines_.pptxSadeka Islam
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...Amazon Web Services
 
Best storage engine for MySQL
Best storage engine for MySQLBest storage engine for MySQL
Best storage engine for MySQLtomflemingh2
 
Importance of Data Structures
Importance of Data StructuresImportance of Data Structures
Importance of Data StructuresPradipta Poudel
 
Oracle 12 c new-features
Oracle 12 c new-featuresOracle 12 c new-features
Oracle 12 c new-featuresNavneet Upneja
 
bigquery.pptx
bigquery.pptxbigquery.pptx
bigquery.pptxHarissh16
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query BasicsIdo Green
 
Powering GIS Application with PostgreSQL and Postgres Plus
Powering GIS Application with PostgreSQL and Postgres Plus Powering GIS Application with PostgreSQL and Postgres Plus
Powering GIS Application with PostgreSQL and Postgres Plus Ashnikbiz
 

Similar to Data Enginering from Google Data Warehouse (20)

Geek Sync | Intro to Query Store
Geek Sync | Intro to Query StoreGeek Sync | Intro to Query Store
Geek Sync | Intro to Query Store
 
Tuning data warehouse
Tuning data warehouseTuning data warehouse
Tuning data warehouse
 
MariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and OptimizationMariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and Optimization
 
Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]
 
Cloud dwh
Cloud dwhCloud dwh
Cloud dwh
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousing
 
Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...
Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...
Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...
 
Model selection and tuning at scale
Model selection and tuning at scaleModel selection and tuning at scale
Model selection and tuning at scale
 
Best Practices – Extreme Performance with Data Warehousing on Oracle Database
Best Practices – Extreme Performance with Data Warehousing on Oracle DatabaseBest Practices – Extreme Performance with Data Warehousing on Oracle Database
Best Practices – Extreme Performance with Data Warehousing on Oracle Database
 
How to Cost-Optimize Cloud Data Pipelines_.pptx
How to Cost-Optimize Cloud Data Pipelines_.pptxHow to Cost-Optimize Cloud Data Pipelines_.pptx
How to Cost-Optimize Cloud Data Pipelines_.pptx
 
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...
 
CNCF opa
CNCF opaCNCF opa
CNCF opa
 
Best storage engine for MySQL
Best storage engine for MySQLBest storage engine for MySQL
Best storage engine for MySQL
 
Importance of Data Structures
Importance of Data StructuresImportance of Data Structures
Importance of Data Structures
 
Oracle 12 c new-features
Oracle 12 c new-featuresOracle 12 c new-features
Oracle 12 c new-features
 
bigquery.pptx
bigquery.pptxbigquery.pptx
bigquery.pptx
 
Druid
DruidDruid
Druid
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
 
Powering GIS Application with PostgreSQL and Postgres Plus
Powering GIS Application with PostgreSQL and Postgres Plus Powering GIS Application with PostgreSQL and Postgres Plus
Powering GIS Application with PostgreSQL and Postgres Plus
 

Recently uploaded

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 

Recently uploaded (20)

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

Data Enginering from Google Data Warehouse

  • 2. Index ● OLAP vs OLTP ● What is data warehouse ● BigQuery ○ Cost ○ Partitions and Clustering ○ Best practices ○ Internals ○ ML in BQ
  • 3. OLAP vs OLTP OLTP OLAP Purpose Control and run essential business operations in real time Plan, solve problems, support decisions, discover hidden insights Data updates Short, fast updates initiated by user Data periodically refreshed with scheduled, long-running batch jobs Database design Normalized databases for efficiency Denormalized databases for analysis Space requirements Generally small if historical data is archived Generally large due to aggregating large datasets
  • 4. OLTP OLAP Backup and recovery Regular backups required to ensure business continuity and meet legal and governance requirements Lost data can be reloaded from OLTP database as needed in lieu of regular backups Productivity Increases productivity of end users Increases productivity of business managers, data analysts, and executives Data view Lists day-to-day business transactions Multi-dimensional view of enterprise data User examples Customer-facing personnel, clerks, online shoppers Knowledge workers such as data analysts, business analysts, and executives
  • 5. What is a data warehouse ● OLAP solution ● Used for reporting and data analysis
  • 6. BigQuery ● Serverless data warehouse ○ There are no servers to manage or database software to install ● Software as well as infrastructure including ○ scalability and high-availability ● Built-in features like ○ machine learning ○ geospatial analysis ○ business intelligence ● BigQuery maximizes flexibility by separating the compute engine that analyzes your data from your storage
  • 7. BigQuery Cost ● On demand pricing ○ 1 TB of data processed is $5 ● Flat rate pricing ○ Based on number of pre requested slots ○ 100 slots → $2,000/month = 400 TB data processed on demand pricing
  • 10. BigQuery partition ● Time-unit column ● Ingestion time (_PARTITIONTIME) ● Integer range partitioning ● When using Time unit or ingestion time ○ Daily (Default) ○ Hourly ○ Monthly or yearly ● Number of partitions limit is 4000 Resource: https://cloud.google.com/bigquery/docs/partitioned-tables
  • 11. BigQuery Clustering ● Columns you specify are used to colocate related data ● Order of the column is important ● The order of the specified columns determines the sort order of the data. ● Clustering improves ○ Filter queries ○ Aggregate queries ● Table with data size < 1 GB, don’t show significant improvement with partitioning and clustering ● You can specify up to four clustering columns
  • 12. BigQuery Clustering Clustering columns must be top-level, non-repeated columns ● DATE ● BOOL ● GEOGRAPHY ● INT64 ● NUMERIC ● BIGNUMERIC ● STRING ● TIMESTAMP ● DATETIME
  • 13. Partitioning vs Clustering Clustering Partitoning Cost benefit unknown Cost known upfront You need more granularity than partitioning alone allows You need partition-level management. Your queries commonly use filters or aggregation against multiple particular columns Filter or aggregate on single column The cardinality of the number of values in a column or group of columns is large
  • 14. Clustering over paritioning ● Partitioning results in a small amount of data per partition (approximately less than 1 GB) ● Partitioning results in a large number of partitions beyond the limits on partitioned tables ● Partitioning results in your mutation operations modifying the majority of partitions in the table frequently (for example, every few minutes)
  • 15. Automatic reclustering As data is added to a clustered table ● the newly inserted data can be written to blocks that contain key ranges that overlap with the key ranges in previously written blocks ● These overlapping keys weaken the sort property of the table To maintain the performance characteristics of a clustered table ● BigQuery performs automatic re-clustering in the background to restore the sort property of the table ● For partitioned tables, clustering is maintained for data within the scope of each partition.
  • 16. BigQuery-Best Practice ● Cost reduction ○ Avoid SELECT * ○ Price your queries before running them ○ Use clustered or partitioned tables ○ Use streaming inserts with caution ○ Materialize query results in stages
  • 17. BigQuery-Best Practice ● Query performance ○ Filter on partitioned columns ○ Denormalizing data ○ Use nested or repeated columns ○ Use external data sources appropriately ○ Don't use it, in case u want a high query performance ○ Reduce data before using a JOIN ○ Do not treat WITH clauses as prepared statements ○ Avoid oversharding tables
  • 18. BigQuery-Best Practice ● Query performance ○ Avoid JavaScript user-defined functions ○ Use approximate aggregation functions (HyperLogLog++) ○ Order Last, for query operations to maximize performance ○ Optimize your join patterns ■ As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size.
  • 23. ML in BigQuery ● Target audience Data analysts, managers ● No need for Python or Java knowledge ● No need to export data into a different system
  • 24. ML in BigQuery pricing ● Free ○ 10 GB per month of data storage ○ 1 TB per month of queries processed ○ ML Create model step: First 10 GB per month is free
  • 25. ML in BigQuery pricing