2. 2
We unlock the power of data to reimagine retail
Contents
Clustering 03
When to apply clustering 04
How to apply clustering 05
Scenario 1 07
Scenario 2 10
Limitations 18
3. 3
● Purpose: Limits the amount of data that needs to be read while executing a query.
● Benefits
○ Improved query performance when filtering (WHERE or JOIN) or aggregating (GROUP BY) on clustered columns
○ Reduced cost when filtering or aggregating on clustered columns
● How?
○ Sorts storage blocks based on the values of the clustered columns.
○ Useful when the clustered columns have high cardinality.
○ Scans only the relevant blocks of clustered columns as specified by the filter or group by clause.
○ The query cost cannot be estimated accurately before the query runs; the actual cost is known only once it completes.
Clustering
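Because the cost is only known after execution, one way to see what a query actually consumed is to look it up in the job metadata afterwards. The following is a minimal sketch, assuming the jobs ran in the US multi-region (the `region-us` qualifier is an assumption, not something stated in the slides) and within the last hour:

```sql
-- Actual bytes and slot time for recent queries run by the current user.
-- `region-us` is an assumed location; replace it with the region of your jobs.
SELECT
  job_id,
  total_bytes_processed,
  total_bytes_billed,
  total_slot_ms  -- slot time consumed, in milliseconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY creation_time DESC;
```

Comparing total_slot_ms for the same query against a clustered and a non-clustered copy of a table gives the kind of comparison shown in the scenarios.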
4. 4
● Queries commonly filter on particular columns
● Filtered columns have high cardinality (many distinct values)
When to apply clustering?
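To judge whether a column has enough distinct values to be worth clustering on, its cardinality can be checked directly. This is a sketch only: the public Stack Overflow table named below is an assumed stand-in for the data used in this deck, and the approximate count trades exactness for a cheaper query.

```sql
-- Compare the number of distinct values in candidate clustering columns.
-- The public table below is an assumption; substitute your own source table.
SELECT
  COUNT(*) AS row_count,
  APPROX_COUNT_DISTINCT(creation_date) AS distinct_creation_dates,
  APPROX_COUNT_DISTINCT(view_count) AS distinct_view_counts
FROM `bigquery-public-data.stackoverflow.stackoverflow_posts`;
```

A column whose distinct count is close to the row count has high cardinality; a column with only a handful of values is a weaker clustering candidate.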
5. 5
● Clustered column types
○ STRING, INT64, NUMERIC, BIGNUMERIC, DATE, DATETIME, TIMESTAMP, BOOL, GEOGRAPHY
● Clustered column order is important
○ For best performance, the query should filter on the clustered columns in the clustered column order.
○ The first clustered column should be included in the query filter.
How to apply clustering?
6. 6
CREATE TABLE `gcp-wow-wiq-dclo-test.sql_tuning.stackoverflow_posts_cluster_creation_date_view_count`
(
  id INTEGER,
  title STRING,
  answer_count INTEGER,
  creation_date TIMESTAMP,
  view_count INTEGER
)
CLUSTER BY
  creation_date,
  view_count;
How to create a clustered table?
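The statement above creates an empty table. If the data already lives in another table, the clustered table can be created and populated in a single statement. A sketch, in which both table names are hypothetical placeholders rather than names from this deck:

```sql
-- Create and populate a clustered table in one statement.
-- `my_project.my_dataset.posts_clustered` and `my_project.my_dataset.posts`
-- are placeholder names, not taken from the slides.
CREATE TABLE `my_project.my_dataset.posts_clustered`
CLUSTER BY creation_date, view_count
AS
SELECT id, title, answer_count, creation_date, view_count
FROM `my_project.my_dataset.posts`;
```

Because the rows are written through the clustering specification, the whole table is clustered from the start.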
7. 7
Select the
- ID
- Title
- Answer count
of Stack Overflow posts
where the creation date is between 01 Oct 2008 and 01 Dec 2008.
Scenario 1
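Scenario 1 might be written as the query below. It is a sketch: it runs against the clustered table defined in the CREATE TABLE statement earlier in the deck, and treating both date bounds as inclusive is an assumption, since the slide does not say.

```sql
-- Scenario 1: posts created between 1 Oct 2008 and 1 Dec 2008.
-- Inclusive bounds (BETWEEN) are an assumption about the intended range.
SELECT
  id,
  title,
  answer_count
FROM `gcp-wow-wiq-dclo-test.sql_tuning.stackoverflow_posts_cluster_creation_date_view_count`
WHERE creation_date BETWEEN TIMESTAMP '2008-10-01' AND TIMESTAMP '2008-12-01';
```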
8. 8
56 seconds
Slot time consumed
Query performance without clustering
9. 9
27 seconds
Slot time consumed
Query performance with clustering
(creation date as clustered column)
Clustering improves performance by 2X
10. 10
Select the
- ID
- Title
- Answer count
of Stack Overflow posts
where the creation date is between 01 Oct 2008 and 01 Dec 2008
and the view count is >= 100.
Scenario 2
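Scenario 2 might be written as the query below, with the filters listed in the same order as the clustered columns (creation_date first, then view_count). As before, this is a sketch against the clustered table defined earlier in the deck, and the inclusive date bounds are an assumption.

```sql
-- Scenario 2: same date range, restricted to posts with at least 100 views.
-- Filters appear in clustered column order: creation_date, then view_count.
SELECT
  id,
  title,
  answer_count
FROM `gcp-wow-wiq-dclo-test.sql_tuning.stackoverflow_posts_cluster_creation_date_view_count`
WHERE creation_date BETWEEN TIMESTAMP '2008-10-01' AND TIMESTAMP '2008-12-01'
  AND view_count >= 100;
```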
11. 11
1 minute 9 seconds
Slot time consumed
Query performance without clustering
12. 12
21 seconds
Slot time consumed
Query performance with clustering
(creation date as clustered column)
Clustering by the first filter column
improves performance by 3X
13. 13
574 milliseconds
Slot time consumed
Query performance with clustering
(creation date, view count as clustered columns)
Clustering by both filter columns
improves performance by 120X
14. 14
15 seconds (worse)
Slot time consumed
Query performance with clustering - reverse filtering order
(creation date, view count as clustered columns)
Filtering on the clustered columns
in reverse order does not yield optimal
performance.
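The slides do not show the exact query behind this measurement. A plausible form, given here purely as an assumption, is the Scenario 2 query with the two predicates swapped so that view_count is filtered before creation_date:

```sql
-- Assumed form of the "reverse filtering order" query:
-- view_count is filtered first, although creation_date is the first clustered column.
SELECT
  id,
  title,
  answer_count
FROM `gcp-wow-wiq-dclo-test.sql_tuning.stackoverflow_posts_cluster_creation_date_view_count`
WHERE view_count >= 100
  AND creation_date BETWEEN TIMESTAMP '2008-10-01' AND TIMESTAMP '2008-12-01';
```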
15. 15
2 minutes 45 seconds (not optimal)
Slot time consumed
Query performance with clustering - filter by view count only
(creation date, view count as clustered columns)
Excluding the first clustered column
from the query filter does not yield
optimal performance.
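A query of this shape filters on the second clustered column only. The sketch below assumes the same view_count >= 100 threshold as Scenario 2; the slides do not state the predicate that was used.

```sql
-- Filter on view_count only, skipping creation_date (the first clustered column).
-- The >= 100 threshold is an assumption carried over from Scenario 2.
SELECT
  id,
  title,
  answer_count
FROM `gcp-wow-wiq-dclo-test.sql_tuning.stackoverflow_posts_cluster_creation_date_view_count`
WHERE view_count >= 100;
```

Since the blocks are sorted by creation_date first, rows with a given view_count are spread across many blocks, which is consistent with the weaker result on this slide.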
16. 16
4 minutes 38 seconds
Slot time consumed
Query performance without clustering - filter by view count only
17. 17
2 minutes 29 seconds (optimal)
Slot time consumed
Query performance with clustering - filter by view count only
(view count as clustered column)
Including the view_count clustered
column in the query filter yields
optimal performance.
18. 18
● Only GoogleSQL is supported (not legacy SQL).
● At most 4 clustered columns per table.
● When clustering by a STRING column, only the first 1024 characters are used for clustering.
● When clustering is added to a table with existing data, the existing data is not clustered; only new data is clustered
and is subject to automatic reclustering.
Limitations