In this talk, I will discuss how to implement a robust CI (Continuous Integration) workflow with dbt (Data Build Tool) to optimize data warehouse quality. The implementation of a CI workflow can streamline collaboration and improve the quality of the data warehouse by catching errors early in the development process. By leveraging dbt's modular approach and test-driven development practices, the CI workflow can help data teams ensure the accuracy and reliability of their data. Attendees will learn best practices for implementing a CI workflow with dbt and how to optimize their data warehouse quality.
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt for Optimizing Data Warehouse Quality
1. Implementing a robust CI workflow with dbt
for optimizing data warehouse quality
Matteo Molteni
Data Engineer, Data Enablement
matteo.molteni@kognic.com
2. About Kognic
● Mid-size company (~100 employees)
● Mature operational data
● Average-sized data (terabytes / billions of rows)
● Goal: enable analytics at scale by creating a collaborative data
platform
Goal and constraints
● Creation of a collaborative data platform:
○ Fast to deploy, easy to maintain
○ Limited budget, favour open-source solutions
○ Team of ~2 data engineers
Existing setup
● GCP + Kubernetes
● Well-established platform engineering (plateng) team for facilitating deployments
Some context
3. ● Minimize the number of tools used
● Extract/Load via dockerized python app
● Pipelines only for ingestion (no transformations)
● All the Transformations are managed via dbt
○ dbt tweaked into a catalog and data governance tool
Our ELT architecture
4. ● dbt (data build tool) is the “Transformation” in ELT
● No compute engine, it piggybacks on your warehouse
● Version control everything in the warehouse
● Each transformation/model is a *.sql file. dbt:
○ compiles templates into plain SQL
○ resolves dependencies
○ executes the code sequentially in your warehouse
dbt for orchestrating data transformation
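For illustration, a single dbt model could look like the sketch below (model and column names are made up); the ref() calls are what dbt compiles into concrete relation names and uses to resolve the execution order:

    -- models/marts/fct_orders.sql (hypothetical example)
    {{ config(materialized='table') }}

    select
        o.order_id,
        o.customer_id,
        sum(i.amount) as order_total
    -- ref() compiles to the concrete table name and registers a dependency edge
    from {{ ref('stg_orders') }} as o
    left join {{ ref('stg_order_items') }} as i
        on o.order_id = i.order_id
    group by 1, 2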
5. ● Seamless integration with Airflow for scheduling
○ DAGs are dynamically generated (parse the manifest JSON or use the Astro SDK)
○ Do not use the Bash operator!
● Latest dbt manifest kept in sync with Airflow
● Multiple targets (test/dev/production)
● Modular DAG -> easy to route alerts to model owners (e.g. based on
some metadata)
dbt - how it integrates with Airflow
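A minimal sketch of the dynamic-generation idea, assuming the manifest is parsed directly (the Astro SDK is the alternative mentioned above); the path, DAG id, and schedule are placeholders, not the speaker's actual setup:

    # dags/dbt_models.py - illustrative sketch only
    import json
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from dbt.cli.main import dbtRunner  # programmatic dbt invocation (dbt-core >= 1.5)

    MANIFEST_PATH = "/opt/airflow/dbt/target/manifest.json"  # assumed location

    def run_model(model_name: str) -> None:
        # one dbt model per Airflow task, without the Bash operator
        dbtRunner().invoke(["run", "--select", model_name])

    with open(MANIFEST_PATH) as f:
        manifest = json.load(f)

    with DAG("dbt_models", start_date=datetime(2023, 1, 1), schedule="@daily"):
        tasks = {}
        # one Airflow task per dbt model node in the manifest
        for node_id, node in manifest["nodes"].items():
            if node["resource_type"] == "model":
                tasks[node_id] = PythonOperator(
                    task_id=node["name"],
                    python_callable=run_model,
                    op_args=[node["name"]],
                )
        # mirror dbt's dependency graph between the Airflow tasks
        for node_id, task in tasks.items():
            for parent_id in manifest["nodes"][node_id]["depends_on"]["nodes"]:
                if parent_id in tasks:
                    tasks[parent_id] >> task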
6. ● Base code version controlled in Git
● Same approach as in software development; SQL is treated as code:
○ Version control
○ Linting
○ Testing, testing, testing
○ PR-review
dbt - local development - sql as code
7. ● Minimize boilerplate with dbt codegen
● Automated generation of documentation boilerplate for sources
● Automated generation of staging models
● Automated propagation of documentation to downstream models
dbt - local development - lower the threshold for developers
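As a sketch (source and model names are hypothetical), the dbt-codegen run-operations cover each of these steps:

    # generate source YAML, a staging model, and model YAML with inherited descriptions
    dbt run-operation generate_source \
        --args '{"schema_name": "raw", "generate_columns": true, "include_descriptions": true}'
    dbt run-operation generate_base_model --args '{"source_name": "raw", "table_name": "orders"}'
    dbt run-operation generate_model_yaml \
        --args '{"model_names": ["stg_orders"], "upstream_descriptions": true}'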
8. ● sqlfluff is a Python-based linter for SQL
● New major 2.0 release in March 2023
● It comes with an optional dbt-templater
● Fully customizable:
○ Define a set of rules
○ “sqlfluff fix” will auto-format most of the code
○ Enforced as first step in CI workflow
● Result: a common standard for all your sql files
dbt CI - Lint your SQL
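A minimal .sqlfluff configuration along these lines (dialect and rule choices are assumptions, not the speaker's actual ruleset):

    [sqlfluff]
    dialect = bigquery          # assumed, since the stack runs on GCP
    templater = dbt             # the optional dbt templater
    max_line_length = 100

    [sqlfluff:indentation]
    tab_space_size = 4

    [sqlfluff:rules:capitalisation.keywords]
    capitalisation_policy = lower

CI then runs “sqlfluff lint” as its first step, while “sqlfluff fix” auto-formats locally.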
9. ● dbt provides packages for testing both the data models and the
metadata
● Tests as a one-liner in *.yaml
○ dbt-core: basic tests
○ dbt-expectations (~ assertions in unit testing)
○ dbt-meta-testing (metadata/project)
● Having “some test coverage” is a requirement enforced in the CI
pipeline (dbt run-operation required_tests) for critical data
dbt CI - Test your models
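For example, the one-liner tests might look like this (model and column names are made up), mixing dbt-core generics with a dbt-expectations assertion:

    # models/marts/fct_orders.yml (hypothetical)
    version: 2

    models:
      - name: fct_orders
        columns:
          - name: order_id
            tests:
              - unique        # dbt-core built-in
              - not_null      # dbt-core built-in
          - name: order_total
            tests:
              - dbt_expectations.expect_column_values_to_be_between:
                  min_value: 0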
10. ● Documentation kept together with the *.sql code
● Docs compiled into static HTML content at merge.
○ Hosted on dbt.kognic.io
○ Catalog with metadata and basic profiling
○ Data lineage, columns description, and ownership
○ Compiled/uncompiled code generating the table(s)
○ Data discovery functionality
● Having complete documentation is a requirement enforced in the CI pipeline (dbt
run-operation required_docs)
dbt CI - Documentation, lineage, and data discovery
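The coverage requirements from dbt-meta-testing are declared as model configs, roughly as in this sketch (project and folder names, and the thresholds, are illustrative); the CI job then runs the required_tests and required_docs run-operations and fails if coverage is incomplete:

    # dbt_project.yml (excerpt)
    models:
      my_project:
        marts:
          +required_docs: true
          +required_tests: {"unique.*|not_null": 1}   # at least one such test per model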
11. Typical issues faced by the data platform team when developing
and maintaining the platform:
- Setting up a test/dev environment
- Handling conflicts when developing
- Monolithic architecture
- Cost control
- Handling changes
- Unwanted breaking changes
- Rollout of major changes
CD and the maintainer experience
12. Tables in the dbt production environment can only be materialized
by Airflow jobs; users do not have write access to prod
A staging environment is used for everything else:
- Local runs
- CI workflows
- Central config with dynamic datasets based on current
branch name (CI) or user name (local):
- Compartmentalised dev environments
- Separation of concerns
- Enforced lifecycle rules for keeping staging clean and
avoiding the creation of long term dependencies
dbt CD - production & staging project
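One way to get the dynamic datasets is to override dbt's generate_schema_name macro, roughly as sketched below (the environment variable name is an assumption; CI would set it from the branch name, developers from their user name; on BigQuery the "schema" maps to a dataset):

    -- macros/generate_schema_name.sql (sketch)
    {% macro generate_schema_name(custom_schema_name, node) -%}
        {%- if target.name == 'prod' -%}
            {# production keeps dbt's standard behaviour #}
            {%- if custom_schema_name is none -%}
                {{ target.schema }}
            {%- else -%}
                {{ target.schema }}_{{ custom_schema_name | trim }}
            {%- endif -%}
        {%- else -%}
            {# staging: one dataset per CI branch or per developer #}
            {{ env_var('DBT_DATASET_SUFFIX', target.schema) | trim }}
        {%- endif -%}
    {%- endmacro %}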
13. The main risk of having one environment per branch and per user is
that costs skyrocket, since dbt would, out of the box, materialise the
model and all of its upstream dependencies in their entirety.
Two obvious ways to keep costs under control:
- Macro for reducing the volume of data used in staging (opt-in),
based on partitions
- --defer upstream models to the production environment (opt-out)
dbt CD - defer and filter for reducing data consumed
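For the opt-out path, CI can run something like “dbt build --defer --state <path to prod artifacts>” so unselected upstream models resolve to production. The opt-in path can be a small macro along these lines (macro name, column, and window are illustrative):

    -- macros/limit_in_dev.sql (sketch, BigQuery syntax)
    {% macro limit_in_dev(ts_column, days=3) %}
        {# outside prod, read only the most recent partitions; no-op in prod #}
        {% if target.name != 'prod' %}
            where {{ ts_column }} >= date_sub(current_date(), interval {{ days }} day)
        {% endif %}
    {% endmacro %}

A model then opts in by placing limit_in_dev('event_date') right after its from clause.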
14. Typical issues faced by the data platform team when developing
and maintaining the platform:
- Setting up a test/dev environment
- Handling conflicts when developing
- Monolithic architecture
- Cost control
- Handling changes
- Unwanted breaking changes
- Rollout of major changes
CD and the maintainer experience
15. Tests can capture unwanted behaviour only after the model is
created, which is sometimes too late.
Models exposed to a broader audience, or to an exposure, can have an
extra layer of controls by means of a contract.
Contracts are enforced before the model is materialised, and they
define the “shape” of the model, not the business logic therein.
dbt CD - contracts
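A contract is declared in the model's YAML, roughly like this sketch (columns are made up); dbt then refuses to build the model if the query's shape does not match:

    # models/marts/fct_orders.yml (excerpt, hypothetical)
    models:
      - name: fct_orders
        config:
          contract:
            enforced: true
        columns:
          - name: order_id
            data_type: string
            constraints:
              - type: not_null
          - name: order_total
            data_type: numeric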
16. Breaking changes are sometimes needed, and in a collaborative
environment it's crucial to have a smooth transition.
Model versioning:
- Base model defined in the *.yml file
- One *.sql file per model version; only the diff is declared in the
*.yml file
- A view on top of the versioned models pointing at “the latest”
- Older versions can keep existing until no downstream dependencies
remain
- Challenge: need to ensure we eventually converge back to one version
of each model
dbt CD - versions for smooth rollout of changes
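A sketch of what the versioned YAML can look like (column names are made up); each version keeps its own *.sql file, while the *.yml only declares the diff against the base columns:

    # models/marts/fct_orders.yml (excerpt, hypothetical)
    models:
      - name: fct_orders
        latest_version: 2
        columns:
          - name: order_id
            data_type: string
          - name: order_total
            data_type: numeric
        versions:
          - v: 1
            columns:
              - include: all
                exclude: [order_total]   # v1 predates this column
          - v: 2                         # built from fct_orders_v2.sql

A separate, non-versioned view named after the base model can then point at the latest version, so downstream consumers migrate on their own schedule.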
17. ● “If it is not in git, it does not exist”
● User experience:
○ codegen for reducing boilerplate
○ lint your SQL (e.g. with sqlfluff)
○ test metadata before merging
■ complete catalog for data discovery “for free”
● Maintainer experience:
○ Separate concerns by having distinct environments between
dev and prod, and within dev
○ Reduce costs with --defer and partition pruning in dev
○ Prevent or control changes:
■ model versions for controlling major changes
■ contracts for preventing changes to “exposed” data
Some key learnings