2. 2
We unlock the power of data to reimagine retail
Contents
Why cost optimization? 03
Technology stack 04
Cost optimization tips 05
Results 10
References 11
3. 3
Why cost optimization?
● Data/ML pipelines solve business problems and generate financial benefits.
● Data/ML pipelines also incur operational expenses.
Net gain = Financial benefits (constant) - Operational expenses
With financial benefits held constant, high operational expenses mean low net gain.
4. 4
Technology stack
● BigQuery
● Airflow
5. 5
Cost optimization tips
● Incremental transformation
● Efficient query processing
● Efficient storage
● Relevant flow of execution
6. 6
Tip 1: Incremental transformation
● Don’t transform the entire history of data in every ETL run.
● Transform only the latest available or relevant history of data in each ETL run.
7. 7
Tip 2: Efficient query processing
1. Fetch only the relevant rows and columns of data.
2. Filter data with a WHERE clause as early as possible.
3. Aggregate as late and as seldom as possible.
4. Avoid CROSS JOINs and self-joins; use analytic (window) functions instead.
5. Prefer simple string matching (e.g., LIKE) over regular expressions.
6. Prefer REGEXP_CONTAINS() over LOWER() or UPPER() for case-insensitive STRING comparison.
7. Partition and cluster tables to reduce the amount of data scanned.
8. If there are multiple ways to write a query, prefer the one with the lower slot time and/or fewer bytes processed.
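Item 4 can be illustrated with a small runnable sketch. It uses Python's built-in sqlite3 module (SQLite 3.25 or newer is needed for window functions) as a local stand-in for BigQuery; the sales table and its columns are assumptions made for this example. Both queries compute each sale's change from the previous sale of the same store: the first joins the table to itself, the second uses the LAG() analytic function and reads the table once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (store_id INTEGER, sale_date TEXT, amount REAL);
    INSERT INTO sales VALUES
        (1, '2024-01-01', 10.0),
        (1, '2024-01-02', 14.0),
        (1, '2024-01-03', 9.0),
        (2, '2024-01-01', 20.0),
        (2, '2024-01-02', 25.0);
""")

# Self-join: every sale is matched to the previous sale of the same store.
SELF_JOIN = """
    SELECT this_sale.store_id, this_sale.sale_date,
           this_sale.amount - prev_sale.amount AS delta
    FROM sales AS this_sale
    JOIN sales AS prev_sale
      ON prev_sale.store_id = this_sale.store_id
     AND prev_sale.sale_date = (
         SELECT MAX(p.sale_date) FROM sales AS p
         WHERE p.store_id = this_sale.store_id
           AND p.sale_date < this_sale.sale_date)
    ORDER BY this_sale.store_id, this_sale.sale_date
"""

# Analytic function: LAG() looks up the previous row within each store.
WINDOW = """
    SELECT store_id, sale_date, delta
    FROM (
        SELECT store_id, sale_date,
               amount - LAG(amount) OVER (
                   PARTITION BY store_id ORDER BY sale_date) AS delta
        FROM sales
    )
    WHERE delta IS NOT NULL
    ORDER BY store_id, sale_date
"""

self_join_rows = conn.execute(SELF_JOIN).fetchall()
window_rows = conn.execute(WINDOW).fetchall()
print(window_rows)
```

The two queries return the same rows here. Which one is cheaper in BigQuery should still be checked against the slot time and bytes processed reported for each query, as item 8 suggests.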
8. 8
Tip 3: Efficient storage
1. Store only the relevant rows and columns.
2. Avoid duplicates.
3. Denormalize data into individual columns (instead of the STRUCT type).
4. Set table expiration for short-term tables or use temp tables.
5. Use cached results and permanent tables instead of views.
6. Update tables incrementally.
7. Partition and cluster tables based on the (intended) workload.
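Items 1, 4, and 7 can be combined in a single table definition. The sketch below only renders a BigQuery DDL string and does not execute it; the helper name, table name, columns, and the 7-day lifetime are all hypothetical. PARTITION BY, CLUSTER BY, and the expiration_timestamp table option are part of BigQuery's DDL, but the right partition and cluster columns depend on the actual workload.

```python
def short_lived_table_ddl(table, columns, partition_col, cluster_cols, ttl_days):
    """Render BigQuery DDL for a partitioned, clustered table that expires.

    Hypothetical helper for illustration; it builds a string and runs nothing.
    """
    column_lines = ",\n".join(f"  {name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE `{table}` (\n{column_lines}\n)\n"
        f"PARTITION BY {partition_col}\n"
        f"CLUSTER BY {', '.join(cluster_cols)}\n"
        "OPTIONS (\n"
        "  expiration_timestamp = "
        f"TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL {ttl_days} DAY)\n"
        ")"
    )


ddl = short_lived_table_ddl(
    table="my_project.staging.daily_sales",  # hypothetical table name
    columns={"sale_date": "DATE", "store_id": "INT64", "total": "NUMERIC"},
    partition_col="sale_date",
    cluster_cols=["store_id"],
    ttl_days=7,  # assumed lifetime for a short-term table
)
print(ddl)
```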
9. 9
Tip 4: Relevant flow of execution (Build phase)
● Run or skip a task in the pipeline based on configuration.
● If the output of a task doesn’t change in subsequent ETL runs, the task can be skipped and its cached output reused without any adverse consequences.
● This tip reduces cost significantly during the Pipeline Build phase.
Example: during a Redemption Build, run Setup and Redemption; Sales can be skipped.
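The configuration-driven skip can be sketched in plain Python. The task names follow the Redemption Build example; the BUILD_PLAN structure, the "full" build type, and the in-memory cache are assumptions made for illustration. In Airflow, a similar decision could be expressed with, for example, a ShortCircuitOperator or by raising AirflowSkipException inside a task; that wiring is not shown here.

```python
# Order in which the pipeline's tasks would normally run.
PIPELINE = ["setup", "sales", "redemption"]

# Hypothetical configuration: the tasks that must run for each build type.
BUILD_PLAN = {
    "redemption": {"setup", "redemption"},  # Sales is skipped
    "full": {"setup", "sales", "redemption"},  # assumed build type
}


def run_pipeline(build_type, tasks, cache):
    """Run the tasks required by build_type and reuse cached output otherwise.

    A task that is not required still runs if its output is not cached yet,
    so skipping never leaves a downstream task without input.
    """
    required = BUILD_PLAN[build_type]
    executed = []
    for name in PIPELINE:
        if name in required or name not in cache:
            cache[name] = tasks[name]()
            executed.append(name)
    return executed
```

For example, after one "full" run has filled the cache, run_pipeline("redemption", tasks, cache) executes only Setup and Redemption and leaves the cached Sales output in place.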
10. 10
Results: 3X reduction in OpEx
[Chart: operational expenses before optimization vs. after optimization]
3X reduction in operational expenses after applying cost optimization techniques.
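Read against the net-gain relation from the "Why cost optimization?" slide, the result can be written out as follows. This assumes that "3X reduction" means operational expenses fall from E to E/3 while financial benefits B stay constant; the symbols N, B, and E are introduced here for illustration only.

```latex
N_{\text{before}} = B - E, \qquad
N_{\text{after}} = B - \frac{E}{3}, \qquad
N_{\text{after}} - N_{\text{before}} = \frac{2E}{3}
```

Under that reading, net gain rises by two thirds of the original operational expenses.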
11. 11
References
1. 4 simple cost optimization techniques for data pipelines in the Cloud
2. Writing efficient queries
3. Query optimization in BigQuery
4. Advanced query optimization
5. Optimize query computation
6. Optimise string comparisons
12. 12
Acknowledgement
● Feature Store team
○ Tinus Williemse
○ Nikhil Kankarla
○ Jeremy Lin
○ Darren Thehamihardja