2. Create a unified analytical model & activate data across the whole company
Create tables in a database
Challenges:
Consistency & correctness
Changeability
Complexity
Transparency
In a Kimball-style star / snowflake schema
Single analytical truth = a set of tables
application databases, events, CSV files, APIs → central data store / data lake → single analytical truth → (marketing) automation, reporting, machine learning, …
6. Data pipelines as code
SQL files, Python & shell scripts
Structure & content of the data warehouse are the result of running code
Easy to debug & inspect
Develop locally, test on staging system, then deploy to production
Functional data engineering
Reproducibility, idempotency, (immutability).
Running a pipeline on the same inputs will always produce the same results.
No side effects.
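A minimal sketch of an idempotent step in SQL (the demo_tmp / demo_data schema and table names are made up for illustration): the step replaces its output completely, so re-running it on the same inputs always produces the same table and leaves nothing else behind.

-- Idempotent step: drop & rebuild the output, no side effects
DROP TABLE IF EXISTS demo_tmp.daily_revenue;

CREATE TABLE demo_tmp.daily_revenue AS
SELECT
  order_date,
  sum(total_price) AS revenue
FROM demo_data.orders
GROUP BY order_date;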
Apply standard software engineering best practices
Make changing and testing things easy
https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
8. Target of computation
CREATE TABLE m_dim_next.region (
region_id SMALLINT PRIMARY KEY,
region_name TEXT NOT NULL UNIQUE,
country_id SMALLINT NOT NULL,
country_name TEXT NOT NULL,
_region_name TEXT NOT NULL
);
Do the computation and store the result in a table
WITH raw_region AS (
    SELECT DISTINCT
      country,
      region
    FROM m_data.ga_session
    ORDER BY country, region)
INSERT INTO m_dim_next.region
SELECT
  row_number() OVER (ORDER BY country, region) AS region_id,
  CASE WHEN (SELECT count(DISTINCT country)
             FROM raw_region r2
             WHERE r2.region = r1.region) > 1
    THEN region || ' / ' || country
    ELSE region END AS region_name,
  dense_rank() OVER (ORDER BY country) AS country_id,
  country AS country_name,
  region AS _region_name
FROM raw_region r1;
INSERT INTO m_dim_next.region
VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');
Speed up subsequent transformations
SELECT util.add_index(
    'm_dim_next', 'region',
    column_names := ARRAY ['_region_name', 'country_name',
                           'region_id']);

SELECT util.add_index(
    'm_dim_next', 'region',
    column_names := ARRAY ['country_id', 'region_id']);
ANALYZE m_dim_next.region;
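util.add_index is a project-specific helper, not a PostgreSQL built-in; a minimal sketch of what such a function could look like (the real implementation may differ):

-- Hypothetical sketch: build a CREATE INDEX statement dynamically
CREATE FUNCTION util.add_index(schemaname TEXT, tablename TEXT,
                               column_names TEXT[])
  RETURNS VOID AS $$
BEGIN
  EXECUTE 'CREATE INDEX ON ' || schemaname || '.' || tablename
          || ' (' || array_to_string(column_names, ', ') || ')';
END;
$$ LANGUAGE plpgsql;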
Tables as (intermediate) results of processing steps
Use SQL as a data processing language
9. At the beginning of a transformation pipeline
DROP SCHEMA IF EXISTS m_dim_next CASCADE;
CREATE SCHEMA m_dim_next;
DROP SCHEMA IF EXISTS m_tmp CASCADE;
CREATE SCHEMA m_tmp;
Do all transformations
CREATE TABLE m_tmp.foo
AS SELECT n
FROM generate_series(0, 10000) n;
CREATE TABLE m_dim_next.foo
AS SELECT
n,
n + 1
FROM m_tmp.foo;
-- ..
At the end of the pipeline, after checks
DROP SCHEMA IF EXISTS m_dim CASCADE;
ALTER SCHEMA m_dim_next RENAME TO m_dim;
Build the m_dim_next schema while users are querying m_dim
Data shown in the frontend is always consistent
When the ETL breaks, the data just gets old
dbt users:
Copy final reporting tables in a last step, after checks
Or abuse hooks / macros
Explicit operations determine structure and content of DWH
Very developer / git friendly
Requirement: Must be possible to rebuild DWH from sources
Add incremental loading / processing only when there is real pain
Don’t bother with updates and/or migrations
Always drop & rebuild, keep the last working version
10. Renaming / dropping schemas requires exclusive locks
CREATE FUNCTION util.cancel_queries_on_schema(schema TEXT)
RETURNS BOOLEAN AS $$
SELECT pg_cancel_backend(pid)
FROM
(SELECT DISTINCT pid
FROM pg_locks
JOIN pg_database ON database = pg_database.oid
JOIN pg_class ON pg_class.oid = relation
JOIN pg_namespace ON relnamespace = pg_namespace.oid
WHERE datname = current_database() AND nspname = schema
AND pid != pg_backend_pid()) t;
$$ LANGUAGE SQL;
First rename, then drop
CREATE FUNCTION util.replace_schema(schemaname TEXT,
replace_with TEXT)
RETURNS VOID AS $$
BEGIN
PERFORM util.cancel_queries_on_schema(schemaname);
IF EXISTS(SELECT *
FROM information_schema.schemata s
WHERE s.schema_name = schemaname)
THEN
EXECUTE 'ALTER SCHEMA ' || schemaname
|| ' RENAME TO ' || schemaname || '_old;';
END IF;
EXECUTE 'ALTER SCHEMA ' || replace_with
|| ' RENAME TO ' || schemaname || ';';
-- again, for good measure
PERFORM util.cancel_queries_on_schema(schemaname);
EXECUTE 'DROP SCHEMA IF EXISTS ' || schemaname
|| '_old CASCADE;';
END;
$$ LANGUAGE plpgsql;
Nice one-liner
SELECT util.replace_schema('m_dim', 'm_dim_next');
Transactional DDL FTW
Atomic & robust schema switches in PostgreSQL
11. (Re)-creating data sets (schemas)
# bq_db_alias and the default next_bq_dataset_id come from the
# surrounding pipeline module
def re_create_data_set(next_bq_dataset_id=next_bq_dataset_id):
    from mara_db.bigquery import bigquery_client

    client = bigquery_client(bq_db_alias)

    print(f'deleting dataset {next_bq_dataset_id}')
    client.delete_dataset(dataset=next_bq_dataset_id,
                          delete_contents=True, not_found_ok=True)

    print(f'creating dataset {next_bq_dataset_id}')
    client.create_dataset(dataset=next_bq_dataset_id, exists_ok=True)

    return True
Replacing data sets
def replace_dataset(db_alias, dataset_id, next_dataset_id):
    import sys
    import time

    from google.api_core.exceptions import BadRequest
    from mara_db.bigquery import bigquery_client

    client = bigquery_client(db_alias)
    client.create_dataset(dataset=dataset_id, exists_ok=True)

    next_tables = set([table.table_id for table
                       in client.list_tables(next_dataset_id)])

    ddl = '\n'
    for table in client.list_tables(dataset_id):
        if table.table_id not in next_tables:
            ddl += f'DROP TABLE `{dataset_id}`.`{table.table_id}`;\n'

    # hopefully atomic operation
    for table_id in next_tables:
        ddl += f'CREATE OR REPLACE TABLE `{dataset_id}`.`{table_id}` ' \
               f'AS SELECT * FROM `{next_dataset_id}`.`{table_id}`;\n'
        ddl += f'DROP TABLE `{next_dataset_id}`.`{table_id}`;\n'

    client.query(ddl)

    retries = 1
    while True:
        try:
            client.delete_dataset(next_dataset_id)
            return
        except BadRequest as e:
            if retries <= 10:
                print(e, file=sys.stderr)
                seconds_to_sleep = retries * 4
                print(f'Waiting {seconds_to_sleep} seconds')
                time.sleep(seconds_to_sleep)
                retries += 1
            else:
                raise e
https://github.com/mara/mara-db/blob/master/mara_db/bigquery.py#L124
Schema switches are a bit less nice in BigQuery
12. Very nice in Snowflake
ALTER SCHEMA m_dim_next SWAP WITH m_dim;
DROP SCHEMA m_dim CASCADE;
They built a command for atomic schema switches
13. It’s easy to make mistakes during ETL
DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;
CREATE TABLE s.city (
city_id SMALLINT,
city_name TEXT,
country_name TEXT
);
INSERT INTO s.city VALUES
(1, 'Berlin', 'Germany'),
(2, 'Budapest', 'Hungary');
CREATE TABLE s.customer (
customer_id BIGINT,
city_fk SMALLINT
);
INSERT INTO s.customer VALUES
(1, 1),
(1, 2),
(2, 3);
Customers per country?
SELECT
  country_name,
  count(*) AS number_of_customers
FROM s.customer JOIN s.city
  ON customer.city_fk = s.city.city_id
GROUP BY country_name;

This runs without an error but silently returns wrong numbers: customer 1 is counted twice (it has two city rows) and customer 2 is dropped (its city_fk 3 matches no city).
Back up all assumptions about data by constraints
ALTER TABLE s.city ADD PRIMARY KEY (city_id);
ALTER TABLE s.city ADD UNIQUE (city_name);
ALTER TABLE s.city ADD UNIQUE (city_name, country_name);
ALTER TABLE s.customer ADD PRIMARY KEY (customer_id);
[23505] ERROR: could not create unique index "customer_pkey"
Detail: Key (customer_id)=(1) is duplicated.
ALTER TABLE s.customer ADD FOREIGN KEY (city_fk)
REFERENCES s.city (city_id);
[23503] ERROR: insert or update on table "customer" violates foreign
key constraint "customer_city_fk_fkey"
Detail: Key (city_fk)=(3) is not present in table "city"
Very little overhead, and it will save your ass
Avoid happy path programming
14. In all tables across the pipeline
DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;
CREATE TABLE s.city (
city_id SMALLINT NOT NULL UNIQUE,
city_name TEXT NOT NULL,
country_name TEXT NOT NULL
);
INSERT INTO s.city VALUES
(1, 'Berlin', 'Germany'),
(2, 'Budapest', 'Hungary'),
(-1, 'Unknown', 'Unknown');
CREATE TABLE s.customer (
customer_id BIGINT NOT NULL UNIQUE,
city_fk SMALLINT NOT NULL
);
INSERT INTO s.customer VALUES
(1, 1),
(2, -1);
NOT NULL constraints:
There needs to be a justification for making a column nullable
Helps you detect problems very early on
Unique constraints:
When you think a column should be unique, add a constraint
Will save run time and cost
Unknown values:
Adding another breakdown should not change the sum of a metric
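A minimal sketch of how that looks in a transformation (the src.customer staging table and its city_name column are made up for illustration): missing foreign keys are mapped to the -1 / 'Unknown' row instead of NULL, so every customer still shows up in per-city breakdowns.

-- Map customers without a known city to the 'Unknown' row (-1)
SELECT
  customer.customer_id,
  coalesce(city.city_id, -1) AS city_fk
FROM src.customer customer
LEFT JOIN s.city city ON customer.city_name = city.city_name;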
And sometimes add “Unknown” values
Make columns not nullable by default
16. “Processed order”?
SELECT sum(total_price) AS revenue
FROM os_data.order
WHERE status IN ('pending', 'accepted', 'completed',
'proposal_for_change');
SELECT CASE WHEN (status <> 'started'
AND payment_status = 'authorised'
AND order_type <> 'backend')
THEN o.order_id END AS processed_order_fk
FROM os_data.order;
SELECT (last_status = 'pending') :: INTEGER AS is_unprocessed
FROM os_data.order;
Solution: refactor the pipeline
Create a separate table that computes everything we know about an order
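A minimal sketch of that idea (the os_dim_next.order_flags table name is made up; the status logic is taken from the queries above): compute the flag once, in one place, and have everything downstream reference it.

-- One canonical definition of a processed order, computed once
CREATE TABLE os_dim_next.order_flags AS
SELECT
  order_id,
  (status <> 'started'
   AND payment_status = 'authorised'
   AND order_type <> 'backend') AS is_processed
FROM os_data."order";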
Never ever!
Don’t repeat business logic
17. Load → clean → preprocess → transform → flatten
Never query data from more than one layer above
If you need one more column, add it to each of these steps
Will save cost & runtime and make data lineage easier
Compute things as early as possible
It’s always easiest to get something done based on finished fact tables
Don’t do that!
Requires discipline
Organise processing steps in layers
load-product, load-order, load-customer
clean-product, clean-order, clean-customer
preprocess-product, preprocess-order, preprocess-customer
transform-product, transform-order, transform-customer
flatten-product-fact, flatten-order-fact, flatten-customer-fact

Load: unchanged raw data as ingested
Clean: remove test orders, parse timestamps, correct garbage
Preprocess: business logic (everything you know about an entity without looking at other entities)
Transform: dimensional model (establish links between entities, transfer metrics)
Flatten: prepare data sets for reporting and other data activation use cases
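A minimal sketch of the convention (the os_clean / os_preprocess schema names and columns are assumed for illustration): each step writes to its own layer schema and reads only from the layer directly before it.

-- clean: read only from the load layer (os_data)
CREATE TABLE os_clean."order" AS
SELECT
  order_id,
  created_at :: TIMESTAMP AS order_date,
  status
FROM os_data."order"
WHERE status <> 'test';  -- remove test orders

-- preprocess: read only from the clean layer
CREATE TABLE os_preprocess."order" AS
SELECT
  order_id,
  order_date,
  status,
  status <> 'started' AS is_processed  -- business logic, no other entities
FROM os_clean."order";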
18. Complexity handled well I
Booking data pipeline @ Trendtours
[Pipeline dependency graph: dozens of clean, preprocess, compute and transform steps for bookings, booking items, travel services, promotions, advertising performance, email contacts, zip codes and more; the node labels are not recoverable from this export]
19. Complexity handled well II
Order data pipeline @ Lampenwelt
[Pipeline dependency graph: prepare, preprocess, compute and transform steps for orders, invoices, credits, sales, cancelled orders, customers, email contacts and B2B accounts, ending in fact tables and aggregates; the node labels are not recoverable from this export]
20. Split pipelines to separate concerns
Top level pipelines @ Helloprint
[Pipeline overview: separate top-level pipelines for loading operational systems, marketing data sources (Google Ads, Facebook, Bing Ads, Google Search Console), exchange rates, consistency checks, customer segmentation and prediction data sets; the node labels are largely not recoverable from this export]
Create separate pipelines for things that you:
Want to develop independently
Want to test independently
Want to schedule independently

Ensure pipeline immutability by:
Creating separate schemas for each pipeline
Not modifying the schemas of other pipelines
Avoiding circular dependencies
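The schema prefixes used elsewhere in this deck (m_* for marketing, os_* for the order system, aw_* for Adwords) illustrate the convention; a minimal sketch, with assumed table and column names:

-- The Adwords pipeline owns and rebuilds only its own schemas
DROP SCHEMA IF EXISTS aw_dim_next CASCADE;
CREATE SCHEMA aw_dim_next;

-- Reading from another pipeline's published schema is fine,
-- writing to it is not
CREATE TABLE aw_dim_next.campaign_performance AS
SELECT ad_fk, sum(impressions) AS impressions
FROM aw_data.ad_performance
GROUP BY ad_fk;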
21. Check for “lost” rows
SELECT util.assert_equal(
'The order items fact table should contain all order items',
'SELECT count(*) FROM os_dim.order_item',
'SELECT count(*) FROM os_dim.order_items_fact');
Check consistency across cubes / domains
SELECT util.assert_almost_equal(
'The number of first orders should be the same in '
|| 'orders and marketing touchpoints cube',
'SELECT count(net_order_id)
FROM os_dim.order
WHERE _net_order_rank = 1;',
'SELECT (SELECT sum(number_of_first_net_orders)
FROM m_dim.acquisition_performance)
/ (SELECT count(*)
FROM m_dim.performance_attribution_model)',
0.00001
);
Check completeness of source data
SELECT util.assert_not_found(
'Each adwords campaign must have the attribute "Channel"',
'SELECT DISTINCT campaign_name, account_name
FROM aw_tmp.ad
JOIN aw_dim.ad_performance ON ad_fk = ad_id
WHERE attributes->>''Channel'' IS NULL
AND impressions > 0
AND _date > now() - INTERVAL ''30 days''');
Check correctness of redistribution transformations
SELECT util.assert_almost_equal(
'The cost of non-converting touchpoints must match the '
|| 'redistributed customer acquisition and reactivation cost',
'SELECT sum(cost)
FROM m_tmp.cost_of_non_converting_touchpoints;',
'SELECT
(SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_acquisition_cost)
+ (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_reactivation_cost);',
0.00001);
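The util.assert_* functions are project-specific helpers, not PostgreSQL built-ins; a minimal sketch of what util.assert_equal could look like (the real implementation may differ):

-- Hypothetical sketch: run both queries and fail loudly on mismatch
CREATE FUNCTION util.assert_equal(description TEXT,
                                  query_a TEXT, query_b TEXT)
  RETURNS BOOLEAN AS $$
DECLARE
  result_a NUMERIC;
  result_b NUMERIC;
BEGIN
  EXECUTE query_a INTO result_a;
  EXECUTE query_b INTO result_b;
  IF result_a IS DISTINCT FROM result_b THEN
    RAISE EXCEPTION 'Assertion failed: % (% <> %)',
      description, result_a, result_b;
  END IF;
  RETURN TRUE;
END;
$$ LANGUAGE plpgsql;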
Makes changing things easy
Write tests
22. Kimball-style star schemas are not very user friendly
Flatten all connected entities into a wide fact table
Bring data in the shape that is right for consumers
Prepare dimensional model for data use cases
SELECT
order_item.order_item_id AS "Order item ID",
"order".order_id AS "Order ID",
"order".order_date AS "Order date",
order_customer.customer_id AS "Customer ID",
order_customer_favourite_product_category.main_category AS "Customer favourite product category level 1",
order_customer_favourite_product_category.sub_category_1 AS "Customer favourite product category level 2",
order_customer_first_order.order_date AS "Customer first order date",
product.sku AS "Product SKU",
product_product_category.main_category AS "Product category level 1",
product_product_category.sub_category_1 AS "Product category level 2",
order_item.order_item_id AS "# Order items",
order_item.order_fk AS "# Orders",
order_item.product_revenue AS "Product revenue",
order_item.revenue AS "Shipping revenue"
FROM dim.order_item order_item
LEFT JOIN dim."order" "order" ON order_item.order_fk = "order".order_id
LEFT JOIN dim.customer order_customer ON "order".customer_fk = order_customer.customer_id
LEFT JOIN dim.product_category order_customer_favourite_product_category
  ON order_customer.favourite_product_category_fk =
     order_customer_favourite_product_category.product_category_id
LEFT JOIN dim."order" order_customer_first_order ON order_customer.first_order_fk = order_customer_first_order.order_id
LEFT JOIN dim.product product ON order_item.product_fk = product.product_id
LEFT JOIN dim.product_category product_product_category ON product.product_category_fk = product_product_category.product_category_id;
24. Never:
Look at the backend and create models that people might need
Very unlikely to create business value
Instead:
Collect business questions and derive analytical entities from them
Then build them, incrementally
How to build my first pipeline?
How to start?
25. Build things incrementally
Nobody is good at waterfall planning
Add entities, attributes & metrics step by step
Entity model evolution @ Amann Girrbach