In this talk, I will discuss how to implement a robust CI (Continuous Integration) workflow with dbt (Data Build Tool) to optimize data warehouse quality. The implementation of a CI workflow can streamline collaboration and improve the quality of the data warehouse by catching errors early in the development process. By leveraging dbt's modular approach and test-driven development practices, the CI workflow can help data teams ensure the accuracy and reliability of their data. Attendees will learn best practices for implementing a CI workflow with dbt and how to optimize their data warehouse quality.
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt for Optimizing Data Warehouse Quality
1. Implementing a robust CI workflow with dbt
for optimizing data warehouse quality
Matteo Molteni
Data Engineer, Data Enablement
matteo.molteni@kognic.com
2. About Kognic
● Mid-size company (~100 employees)
● Mature operational data
● Average-sized data (terabytes / billions of rows)
● Goal: enable analytics at scale by creating a collaborative data
platform
Goal and constraints
● Creation of a collaborative data platform:
○ Fast to deploy, easy to maintain
○ Limited budget, favour open-source solutions
○ Team of ~2 data engineers
Existing setup
● GCP + Kubernetes
● Well-established platform engineering (plateng) team for facilitating deployments
Some context
3. ● Minimize the number of tools used
● Extract/Load via dockerized python app
● Pipelines only for ingestion (no transformations)
● All the Transformations are managed via dbt
○ dbt tweaked into a catalog and data governance tool
Our ELT architecture
4. ● dbt (data build tool) is the “Transformation” in ELT
● No compute engine, it piggybacks on your warehouse
● Version control everything in the warehouse
● Each transformation/model is a *.sql file. dbt:
○ compiles templates into plain SQL
○ resolves dependencies
○ executes the code sequentially in your warehouse
dbt for orchestrating data transformation
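For illustration, a single dbt model could look like the sketch below (model and column names are made up); the ref() calls are what dbt compiles into concrete relation names and uses to resolve the execution order:

    -- models/marts/fct_orders.sql (hypothetical example)
    {{ config(materialized='table') }}

    select
        o.order_id,
        o.customer_id,
        sum(i.amount) as order_total
    -- ref() compiles to the concrete table name and registers a dependency edge
    from {{ ref('stg_orders') }} as o
    left join {{ ref('stg_order_items') }} as i
        on o.order_id = i.order_id
    group by 1, 2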
5. ● Seamless integration with Airflow for scheduling
○ DAGs are dynamically generated (parse the manifest JSON or use the Astro SDK)
○ Do not use the Bash operator!
● Latest dbt manifest kept in sync with Airflow
● Multiple targets (test/dev/production)
● Modular DAG -> easy to route alerts to model owners (e.g. based on
some metadata)
dbt - how it integrates with Airflow
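A minimal sketch of the dynamic-generation idea, assuming the manifest is parsed directly (the Astro SDK is the alternative mentioned above); the path, DAG id, and schedule are placeholders, not the speaker's actual setup:

    # dags/dbt_models.py - illustrative sketch only
    import json
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from dbt.cli.main import dbtRunner  # programmatic dbt invocation (dbt-core >= 1.5)

    MANIFEST_PATH = "/opt/airflow/dbt/target/manifest.json"  # assumed location

    def run_model(model_name: str) -> None:
        # one dbt model per Airflow task, without the Bash operator
        dbtRunner().invoke(["run", "--select", model_name])

    with open(MANIFEST_PATH) as f:
        manifest = json.load(f)

    with DAG("dbt_models", start_date=datetime(2023, 1, 1), schedule="@daily"):
        tasks = {}
        # one Airflow task per dbt model node in the manifest
        for node_id, node in manifest["nodes"].items():
            if node["resource_type"] == "model":
                tasks[node_id] = PythonOperator(
                    task_id=node["name"],
                    python_callable=run_model,
                    op_args=[node["name"]],
                )
        # mirror dbt's dependency graph between the Airflow tasks
        for node_id, task in tasks.items():
            for parent_id in manifest["nodes"][node_id]["depends_on"]["nodes"]:
                if parent_id in tasks:
                    tasks[parent_id] >> task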
6. ● Base code version controlled in Git
● Same approach as in software development; SQL is treated as code:
○ Version control
○ Linting
○ Testing, testing, testing
○ PR-review
dbt - local development - sql as code
7. ● Minimize boilerplate with dbt codegen
● Automated generation of documentation boilerplate for sources
● Automated generation of staging models
● Automated propagation of documentation to downstream models
dbt - local development - lower the threshold for developers
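As a sketch (source and model names are hypothetical), the dbt-codegen run-operations cover each of these steps:

    # generate source YAML, a staging model, and model YAML with inherited descriptions
    dbt run-operation generate_source \
        --args '{"schema_name": "raw", "generate_columns": true, "include_descriptions": true}'
    dbt run-operation generate_base_model --args '{"source_name": "raw", "table_name": "orders"}'
    dbt run-operation generate_model_yaml \
        --args '{"model_names": ["stg_orders"], "upstream_descriptions": true}'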
8. ● sqlfluff is a Python-based linter for SQL
● New major 2.0 release in March 2023
● It comes with an optional dbt-templater
● Fully customizable:
○ Define a set of rules
○ “sqlfluff fix” will auto-format most of the code
○ Enforced as first step in CI workflow
● Result: a common standard for all your sql files
dbt CI - Lint your SQL
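A minimal .sqlfluff configuration along these lines (dialect and rule choices are assumptions, not the speaker's actual ruleset):

    [sqlfluff]
    dialect = bigquery          # assumed, since the stack runs on GCP
    templater = dbt             # the optional dbt templater
    max_line_length = 100

    [sqlfluff:indentation]
    tab_space_size = 4

    [sqlfluff:rules:capitalisation.keywords]
    capitalisation_policy = lower

CI then runs “sqlfluff lint” as its first step, while “sqlfluff fix” auto-formats locally.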
9. ● dbt provides packages for testing both the data models and the
metadata
● Tests as a one-liner in *.yaml
○ dbt-core: basic tests
○ dbt-expectations (~ assertions in unit testing)
○ dbt-meta-testing (metadata/project)
● Having “some test coverage” is a requirement enforced in the CI
pipeline (dbt run-operation required_tests) for critical data
dbt CI - Test your models
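For example, the one-liner tests might look like this (model and column names are made up), mixing dbt-core generics with a dbt-expectations assertion:

    # models/marts/fct_orders.yml (hypothetical)
    version: 2

    models:
      - name: fct_orders
        columns:
          - name: order_id
            tests:
              - unique        # dbt-core built-in
              - not_null      # dbt-core built-in
          - name: order_total
            tests:
              - dbt_expectations.expect_column_values_to_be_between:
                  min_value: 0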
10. ● Documentation kept together with the *.sql code
● Docs compiled into static HTML content at merge.
○ Hosted on dbt.kognic.io
○ Catalog with metadata and basic profiling
○ Data lineage, columns description, and ownership
○ Compiled/uncompiled code generating the table(s)
○ Data discovery functionality
● Having complete documentation is a requirement enforced in the CI pipeline (dbt
run-operation required_docs)
dbt CI - Documentation, lineage, and data discovery
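The coverage requirements from dbt-meta-testing are declared as model configs, roughly as in this sketch (project and folder names, and the thresholds, are illustrative); the CI job then runs the required_tests and required_docs run-operations and fails if coverage is incomplete:

    # dbt_project.yml (excerpt)
    models:
      my_project:
        marts:
          +required_docs: true
          +required_tests: {"unique.*|not_null": 1}   # at least one such test per model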
11. Typical issues faced by the data platform team when developing
and maintaining the platform:
- Setting up a test/dev environment
- Handling conflicts when developing
- Monolithic architecture
- Cost control
- Handling changes
- Unwanted breaking changes
- Rollout of major changes
CD and the maintainer experience
12. Tables in the dbt production environment can only be materialized
by Airflow jobs; users do not have write access to prod
A staging environment is used for everything else:
- Local runs
- CI workflows
- Central config with dynamic datasets based on current
branch name (CI) or user name (local):
- Compartmentalised dev environments
- Separation of concerns
- Enforced lifecycle rules for keeping staging clean and
avoiding the creation of long term dependencies
dbt CD - production & staging project
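One way to get the dynamic datasets is to override dbt's generate_schema_name macro, roughly as sketched below (the environment variable name is an assumption; CI would set it from the branch name, developers from their user name; on BigQuery the "schema" maps to a dataset):

    -- macros/generate_schema_name.sql (sketch)
    {% macro generate_schema_name(custom_schema_name, node) -%}
        {%- if target.name == 'prod' -%}
            {# production keeps dbt's standard behaviour #}
            {%- if custom_schema_name is none -%}
                {{ target.schema }}
            {%- else -%}
                {{ target.schema }}_{{ custom_schema_name | trim }}
            {%- endif -%}
        {%- else -%}
            {# staging: one dataset per CI branch or per developer #}
            {{ env_var('DBT_DATASET_SUFFIX', target.schema) | trim }}
        {%- endif -%}
    {%- endmacro %}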
13. The main risk of having one environment per branch and per user is
that costs skyrocket, since dbt would, out of the box, materialise the
model and all of its upstream dependencies in their entirety.
Two obvious ways to keep costs under control:
- Macro for reducing the volume of data used in staging (opt-in),
based on partitions
- --defer upstream models to the production environment (opt-out)
dbt CD - defer and filter for reducing data consumed
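For the opt-out path, CI can run something like “dbt build --defer --state <path to prod artifacts>” so unselected upstream models resolve to production. The opt-in path can be a small macro along these lines (macro name, column, and window are illustrative):

    -- macros/limit_in_dev.sql (sketch, BigQuery syntax)
    {% macro limit_in_dev(ts_column, days=3) %}
        {# outside prod, read only the most recent partitions; no-op in prod #}
        {% if target.name != 'prod' %}
            where {{ ts_column }} >= date_sub(current_date(), interval {{ days }} day)
        {% endif %}
    {% endmacro %}

A model then opts in by placing limit_in_dev('event_date') right after its from clause.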
14. Typical issues faced by the data platform team when developing
and maintaining the platform:
- Setting up a test/dev environment
- Handling conflicts when developing
- Monolithic architecture
- Cost control
- Handling changes
- Unwanted breaking changes
- Rollout of major changes
CD and the maintainer experience
15. Tests can capture unwanted behaviour only after the model is
created, which is sometimes too late.
Models exposed to a broader audience, or to an exposure, can have an
extra layer of controls by means of a contract.
Contracts are enforced before the model is materialised, and they
define the “shape” of the model, not the business logic therein.
dbt CD - contracts
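A contract is declared in the model's YAML, roughly like this sketch (columns are made up); dbt then refuses to build the model if the query's shape does not match:

    # models/marts/fct_orders.yml (excerpt, hypothetical)
    models:
      - name: fct_orders
        config:
          contract:
            enforced: true
        columns:
          - name: order_id
            data_type: string
            constraints:
              - type: not_null
          - name: order_total
            data_type: numeric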
16. Breaking changes are sometimes needed, and in a collaborative
environment it's crucial to have a smooth transition.
Model versioning:
- Base model defined in the *.yml file
- One *.sql file per model version; only the diff is declared in the
*.yml file
- A view on top of the versioned models pointing at “the latest”
- Older versions can keep existing until no downstream dependencies
remain
- Challenge: need to ensure we eventually converge back to one version
of each model
dbt CD - versions for smooth rollout of changes
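A sketch of what the versioned YAML can look like (column names are made up); each version keeps its own *.sql file, while the *.yml only declares the diff against the base columns:

    # models/marts/fct_orders.yml (excerpt, hypothetical)
    models:
      - name: fct_orders
        latest_version: 2
        columns:
          - name: order_id
            data_type: string
          - name: order_total
            data_type: numeric
        versions:
          - v: 1
            columns:
              - include: all
                exclude: [order_total]   # v1 predates this column
          - v: 2                         # built from fct_orders_v2.sql

A separate, non-versioned view named after the base model can then point at the latest version, so downstream consumers migrate on their own schedule.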
17. ● “If it is not in git, it does not exist”
● User experience:
○ codegen for reducing boilerplate
○ lint your SQL (e.g. with sqlfluff)
○ test metadata before merging
■ complete catalog for data discovery “for free”
● Maintainer experience:
○ Separate concerns by having distinct environments between
dev and prod, and within dev
○ Reduce costs with --defer and partition pruning in dev
○ Prevent or control changes:
■ model versions for controlling major changes
■ contracts for preventing changes to “exposed” data
Some key learnings