Dr. Martin Loetzsch
Project A Data Modelling
Best Practices Part II:
How to Build a Data Warehouse?
Create a unified analytical model & activate data across the whole company
Create tables in a database
Challenges
Consistency & correctness
Changeability
Complexity
Transparency
| @martin_loetzsch
In a Kimball-style star / snowflake schema
Single analytical truth = a set of tables
2
application
databases
events
csv files
apis
central data
store/
data lake
single
analytical
truth
(marketing)
automation
reporting
machine
learning
…
| @martin_loetzsch
Two data engineers at work
Can be difficult at times
3
Today:
Best practices for coping with
complexity
4 | @martin_loetzsch
Not today:
Databases & orchestration tools
5 | @martin_loetzsch
Data pipelines as code
SQL files, python & shell scripts
Structure & content of the data warehouse are the result of running code
Easy to debug & inspect
Develop locally, test on staging system, then deploy to production
Functional data engineering
Reproducibility, idempotency, (immutability).
Running a pipeline on the same inputs will always produce the same results.
No side effects.
| @martin_loetzsch
Apply standard software engineering best practices
Make changing and testing things easy
6
https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
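To make the idempotency point concrete, here is an illustrative sketch (not from the deck; table names are hypothetical): a step that always drops and recreates its output produces the same result no matter how often, or from which state, it runs.

DROP TABLE IF EXISTS m_tmp.daily_revenue;

-- recompute from scratch; re-running has no side effects beyond this table
CREATE TABLE m_tmp.daily_revenue AS
SELECT order_date,
       sum(total_price) AS revenue
FROM m_tmp.processed_order
GROUP BY order_date;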
https://hakibenita.com/sql-for-data-analysis#sql-vs-pandas-performance
Simple analysis
SELECT activated,
count(*) AS cnt
FROM users
GROUP BY activated
Processing data in Pandas adds considerable memory overhead
Original table: 65 MB, Pandas data frame: ~300 MB
| @martin_loetzsch
Processing data in Python / Java etc. is very expensive
Leave data in database
7
Target of computation
CREATE TABLE m_dim_next.region (
region_id SMALLINT PRIMARY KEY,
region_name TEXT NOT NULL UNIQUE,
country_id SMALLINT NOT NULL,
country_name TEXT NOT NULL,
_region_name TEXT NOT NULL
);
Do computation and store result in table
WITH raw_region
AS (SELECT DISTINCT
country,
region
FROM m_data.ga_session
ORDER BY country, region)
INSERT INTO m_dim_next.region
SELECT
row_number()
OVER (ORDER BY country, region ) AS region_id,
CASE WHEN (SELECT count(DISTINCT country)
FROM raw_region r2
WHERE r2.region = r1.region) > 1
THEN region || ' / ' || country
ELSE region END AS region_name,
dense_rank() OVER (ORDER BY country) AS country_id,
country AS country_name,
region AS _region_name
FROM raw_region r1;
INSERT INTO m_dim_next.region
VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');
Speedup subsequent transformations
SELECT util.add_index(
'm_dim_next', 'region',
column_names := ARRAY ['_region_name', 'country_name',
'region_id']);
SELECT util.add_index(
'm_dim_next', 'region',
column_names := ARRAY ['country_id', 'region_id']);
ANALYZE m_dim_next.region;
| @martin_loetzsch
Tables as (intermediate) results of processing steps
Use SQL as a data processing language
8
At the beginning of a transformation pipeline
DROP SCHEMA IF EXISTS m_dim_next CASCADE;
CREATE SCHEMA m_dim_next;
DROP SCHEMA IF EXISTS m_tmp CASCADE;
CREATE SCHEMA m_tmp;
Do all transformations
CREATE TABLE m_tmp.foo
AS SELECT n
FROM generate_series(0, 10000) n;
CREATE TABLE m_dim_next.foo
AS SELECT
n,
n + 1
FROM m_tmp.foo;
-- ..
At the end of pipeline, after checks
DROP SCHEMA IF EXISTS m_dim CASCADE;
ALTER SCHEMA m_dim_next RENAME TO m_dim;
Build m_dim_next schema while users are seeing m_dim
Data shown in frontend is always consistent
When ETL breaks, data just gets old
DBT users
Copy final reporting tables in a last step, after checks
Or abuse hooks / macros
Explicit operations determine structure and content of DWH
Very developer / git friendly
Requirement: Must be possible to rebuild DWH from sources
Add incremental loading / processing when there is a pain
| @martin_loetzsch
Don’t bother with updates and/or migrations
Always drop & rebuild, keep last working version
9
Renaming / dropping schemas requires exclusive locks
CREATE FUNCTION util.cancel_queries_on_schema(schema TEXT)
RETURNS BOOLEAN AS $$
SELECT pg_cancel_backend(pid)
FROM
(SELECT DISTINCT pid
FROM pg_locks
JOIN pg_database ON database = pg_database.oid
JOIN pg_class ON pg_class.oid = relation
JOIN pg_namespace ON relnamespace = pg_namespace.oid
WHERE datname = current_database() AND nspname = schema
AND pid != pg_backend_pid()) t;
$$ LANGUAGE SQL;
First rename, then drop
CREATE FUNCTION util.replace_schema(schemaname TEXT,
replace_with TEXT)
RETURNS VOID AS $$
BEGIN
PERFORM util.cancel_queries_on_schema(schemaname);
IF EXISTS(SELECT *
FROM information_schema.schemata s
WHERE s.schema_name = schemaname)
THEN
EXECUTE 'ALTER SCHEMA ' || schemaname
|| ' RENAME TO ' || schemaname || '_old;';
END IF;
EXECUTE 'ALTER SCHEMA ' || replace_with
|| ' RENAME TO ' || schemaname || ';';
-- again, for good measure
PERFORM util.cancel_queries_on_schema(schemaname);
EXECUTE 'DROP SCHEMA IF EXISTS ' || schemaname
|| '_old CASCADE;';
END;
$$ LANGUAGE plpgsql;
Nice one-liner
SELECT util.replace_schema('m_dim', 'm_dim_next');
| @martin_loetzsch
Transactional DDL FTW
Atomic & robust schema switches in PostgreSQL
10
(Re)-creating data sets (schemas)
def re_create_data_set(next_bq_dataset_id=next_bq_dataset_id):
from mara_db.bigquery import bigquery_client
client = bigquery_client(bq_db_alias)
print(f'deleting dataset {next_bq_dataset_id}')
client.delete_dataset(dataset=next_bq_dataset_id,
delete_contents=True, not_found_ok=True)
print(f'creating dataset {next_bq_dataset_id}')
client.create_dataset(dataset=next_bq_dataset_id, exists_ok=True)
return True
Replacing data sets
def replace_dataset(db_alias, dataset_id, next_dataset_id):
from mara_db.bigquery import bigquery_client
client = bigquery_client(db_alias)
client.create_dataset(dataset=dataset_id, exists_ok=True)
next_tables = set([table.table_id for table
in client.list_tables(next_dataset_id)])
ddl = '\n'
for table in client.list_tables(dataset_id):
if table.table_id not in next_tables:
ddl += f'DROP TABLE `{dataset_id}`.`{table.table_id}`;'
# hopefully atomic operation
for table_id in next_tables:
ddl += f'CREATE OR REPLACE TABLE `{dataset_id}`.`{table_id}`
AS SELECT * FROM `{next_dataset_id}`.`{table_id}`;\n'
ddl += f'DROP TABLE `{next_dataset_id}`.`{table_id}`;\n'
client.query(ddl)
retries = 1
while True:
try:
client.delete_dataset(next_dataset_id)
return
except BadRequest as e:
if retries <= 10:
print(e, file=sys.stderr)
seconds_to_sleep = retries * 4
print(f'Waiting {seconds_to_sleep} seconds')
time.sleep(seconds_to_sleep)
retries += 1
else:
raise e
| @martin_loetzsch
https://github.com/mara/mara-db/blob/master/mara_db/bigquery.py#L124
Schema switches are a bit less nice in BigQuery
11
ALTER SCHEMA m_dim_next SWAP WITH m_dim;
DROP SCHEMA m_dim CASCADE;
| @martin_loetzsch
They built a command for atomic schema switches
Very nice in Snowflake
12
It’s easy to make mistakes during ETL
DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;
CREATE TABLE s.city (
city_id SMALLINT,
city_name TEXT,
country_name TEXT
);
INSERT INTO s.city VALUES
(1, 'Berlin', 'Germany'),
(2, 'Budapest', 'Hungary');
CREATE TABLE s.customer (
customer_id BIGINT,
city_fk SMALLINT
);
INSERT INTO s.customer VALUES
(1, 1),
(1, 2),
(2, 3);
Customers per country?
SELECT
country_name,
count(*) AS number_of_customers
FROM s.customer JOIN s.city
ON customer.city_fk = s.city.city_id
GROUP BY country_name;
Back up all assumptions about data by constraints
ALTER TABLE s.city ADD PRIMARY KEY (city_id);
ALTER TABLE s.city ADD UNIQUE (city_name);
ALTER TABLE s.city ADD UNIQUE (city_name, country_name);
ALTER TABLE s.customer ADD PRIMARY KEY (customer_id);
[23505] ERROR: could not create unique index "customer_pkey"
Detail: Key (customer_id)=(1) is duplicated.
ALTER TABLE s.customer ADD FOREIGN KEY (city_fk)
REFERENCES s.city (city_id);
[23503] ERROR: insert or update on table "customer" violates foreign
key constraint "customer_city_fk_fkey"
Detail: Key (city_fk)=(3) is not present in table "city"
| @martin_loetzsch
Very little overhead, will save your ass
Avoid happy path programming
13
In all tables across the pipeline
DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;
CREATE TABLE s.city (
city_id SMALLINT NOT NULL UNIQUE,
city_name TEXT NOT NULL,
country_name TEXT NOT NULL
);
INSERT INTO s.city VALUES
(1, 'Berlin', 'Germany'),
(2, 'Budapest', 'Hungary'),
(-1, 'Unknown', 'Unknown');
CREATE TABLE s.customer (
customer_id BIGINT NOT NULL UNIQUE,
city_fk SMALLINT NOT NULL
);
INSERT INTO s.customer VALUES
(1, 1),
(2, -1);
NOT NULL constraints
There needs to be a justification for making a column nullable
Helps you to detect problems very early on
Unique constraints
When you think it should be unique, add a constraint
Will save run times and cost
Unknown values
Adding another breakdown should not change the sum of a metric
| @martin_loetzsch
And sometimes add “Unknown” values
Make columns not nullable by default
14
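To illustrate the last point with the tables above (an illustrative query, not from the deck): because customer 2 references the -1 'Unknown' city instead of a NULL, a breakdown by country still accounts for every customer, so the total does not change.

-- overall: 2 customers
SELECT count(*) AS number_of_customers FROM s.customer;

-- broken down by country: 1 in Germany + 1 Unknown = still 2
SELECT country_name, count(*) AS number_of_customers
FROM s.customer
JOIN s.city ON city_fk = city_id
GROUP BY country_name;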
| @martin_loetzsch
Never ever!
Don’t repeat business logic
15
“Processed order”?
SELECT sum(total_price) AS revenue
FROM os_data.order
WHERE status IN ('pending', 'accepted', 'completed',
'proposal_for_change');
SELECT CASE WHEN (status <> 'started'
AND payment_status = 'authorised'
AND order_type <> 'backend')
THEN o.order_id END AS processed_order_fk
FROM os_data.order;
SELECT (last_status = 'pending') :: INTEGER AS is_unprocessed
FROM os_data.order;
Solution: refactor pipeline
Create separate table that computes everything we know about an order
| @martin_loetzsch
Never ever!
Don’t repeat business logic
16
Never ever!
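A minimal sketch of that refactoring (the flag definitions are taken from the queries above; table and column names are otherwise hypothetical): each ad-hoc definition becomes one clearly named column, computed in exactly one place, and all downstream queries read from this table.

CREATE TABLE os_dim_next.order_flags AS
SELECT
  order_id,
  -- one name per business rule, defined exactly once
  status IN ('pending', 'accepted', 'completed',
             'proposal_for_change')      AS is_processed,
  (status <> 'started'
    AND payment_status = 'authorised'
    AND order_type <> 'backend')         AS is_frontend_order,
  (last_status = 'pending')              AS is_unprocessed
FROM os_data.order;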
Load → clean → preprocess → transform → flatten
Never query data from more than one layer above
If you need one more column, add it to each of these steps
Will save cost & runtime and make data lineage easier
Compute things as early as possible
It’s always easiest to get something done based on finished fact tables
Don’t do that!
| @martin_loetzsch
Requires discipline
Organise processing steps in layers
17
load-product load-order load-customer
transform-product transform-order transform-customer
flatten-product-fact flatten-order-fact flatten-customer-fact
clean-product clean-order clean-customer
preprocess-product preprocess-order preprocess-customer
Unchanged raw data as ingested
Remove test orders, parse timestamps, correct garbage
Business logic: everything you know about an entity without looking at others
Dimensional model: establish links between entities, transfer metrics
Prepare data sets for reporting and other data activation use cases
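A sketch of how the layers can be materialised as schemas (illustrative; schema, table and column names are hypothetical), where each step reads only from the layer directly above it:

-- load layer: unchanged raw data as ingested
CREATE TABLE order_load."order" AS
SELECT * FROM raw_files.order_export;

-- clean layer: remove test orders, parse timestamps
CREATE TABLE order_clean."order" AS
SELECT order_id,
       customer_id,
       created_at :: TIMESTAMPTZ AS order_date
FROM order_load."order"
WHERE NOT is_test_order;

-- preprocess layer: business logic that only needs the order entity itself
CREATE TABLE order_preprocess."order" AS
SELECT order_id,
       customer_id,
       order_date,
       rank() OVER (PARTITION BY customer_id
                    ORDER BY order_date) = 1 AS is_first_order
FROM order_clean."order";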
[Pipeline graph: dozens of clean / preprocess / transform steps across booking, travel, mailing and advertising entities; node labels are not legible in the export]
| @martin_loetzsch
Booking data pipeline @ Trendtours
Complexity handled well I
18
[Pipeline graph: order, invoice, credit and sales transformation steps; node labels are not legible in the export]
| @martin_loetzsch
Order data pipeline @ Lampenwelt
Complexity handled well II
19
[Pipeline graph: top-level load, marketing, customer and consistency-check pipelines; node labels are not legible in the export]
Create separate pipelines for things that you
Want to develop independently
Want to test independently
Want to schedule independently
Ensure pipeline immutability by
Creating separate schemas for each pipeline
Avoiding modifying the schemas of other pipelines
Avoid circular dependencies
| @martin_loetzsch
Top level pipelines @ Helloprint
Split pipelines to separate concerns
20
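A sketch of the schema conventions this implies (prefixes borrowed from earlier slides; the join itself is hypothetical): each pipeline owns a schema prefix and writes only there, while other pipelines may read its published schema but never modify it.

-- the 'shop orders' pipeline owns all os_* schemas
CREATE SCHEMA os_tmp;      -- intermediate results
CREATE SCHEMA os_dim_next; -- version being built, swapped to os_dim at the end

-- the 'marketing' pipeline (m_*) reads os_dim but writes only to m_* schemas
CREATE TABLE m_tmp.order_touchpoint AS
SELECT o.order_id, t.touchpoint_id
FROM os_dim."order" o
JOIN m_dim.touchpoint t ON t.customer_fk = o.customer_fk;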
Check for “lost” rows
SELECT util.assert_equal(
'The order items fact table should contain all order items',
'SELECT count(*) FROM os_dim.order_item',
'SELECT count(*) FROM os_dim.order_items_fact');
Check consistency across cubes / domains
SELECT util.assert_almost_equal(
'The number of first orders should be the same in '
|| 'orders and marketing touchpoints cube',
'SELECT count(net_order_id)
FROM os_dim.order
WHERE _net_order_rank = 1;',
'SELECT (SELECT sum(number_of_first_net_orders)
FROM m_dim.acquisition_performance)
/ (SELECT count(*)
FROM m_dim.performance_attribution_model)',
0.00001
);
Check completeness of source data
SELECT util.assert_not_found(
'Each adwords campaign must have the attribute "Channel"',
'SELECT DISTINCT campaign_name, account_name
FROM aw_tmp.ad
JOIN aw_dim.ad_performance ON ad_fk = ad_id
WHERE attributes->>''Channel'' IS NULL
AND impressions > 0
AND _date > now() - INTERVAL ''30 days''');
Check correctness of redistribution transformations
SELECT util.assert_almost_equal(
'The cost of non-converting touchpoints must match the'
|| 'redistributed customer acquisition and reactivation cost',
'SELECT sum(cost)
FROM m_tmp.cost_of_non_converting_touchpoints;',
'SELECT
(SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_acquisition_cost)
+ (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_reactivation_cost);',
0.00001);
| @martin_loetzsch
Makes changing things easy
Write tests
21
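The util.assert_* helpers come from the author’s utility schema; a minimal sketch of how such a function could look in PostgreSQL (not the original implementation):

CREATE FUNCTION util.assert_equal(description TEXT, query1 TEXT, query2 TEXT)
RETURNS BOOLEAN AS $$
DECLARE
  result1 BIGINT;
  result2 BIGINT;
BEGIN
  EXECUTE query1 INTO result1;
  EXECUTE query2 INTO result2;
  -- fail loudly so the pipeline run stops here
  IF result1 IS DISTINCT FROM result2 THEN
    RAISE EXCEPTION 'Assertion failed: % (% vs. %)',
      description, result1, result2;
  END IF;
  RETURN TRUE;
END;
$$ LANGUAGE plpgsql;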
Kimball style star schemas are not very user friendly
Flatten all connected entities into a wide fact table
| @martin_loetzsch
Bring data in the shape that is right for consumers
Prepare dimensional model for data use cases
22
SELECT
order_item.order_item_id AS "Order item ID",
"order".order_id AS "Order ID",
"order".order_date AS "Order date",
order_customer.customer_id AS "Customer ID",
order_customer_favourite_product_category.main_category AS "Customer favo
order_customer_favourite_product_category.sub_category_1 AS "Customer fav
order_customer_first_order.order_date AS "Customer first order date",
product.sku AS "Product SKU",
product_product_category.main_category AS "Product category level 1",
product_product_category.sub_category_1 AS "Product category level 2",
order_item.order_item_id AS "# Order items",
order_item.order_fk AS "# Orders",
order_item.product_revenue AS "Product revenue",
order_item.revenue AS "Shipping revenue"
FROM dim.order_item order_item
LEFT JOIN dim."order" "order" ON order_item.order_fk = "order".order
LEFT JOIN dim.customer order_customer ON "order".customer_fk = order
LEFT JOIN dim.product_category order_customer_favourite_product_cate
order_customer_favourite_product_category.product_category_id
LEFT JOIN dim."order" order_customer_first_order ON order_customer.f
LEFT JOIN dim.product product ON order_item.product_fk = product.pro
LEFT JOIN dim.product_category product_product_category ON product.p
| @martin_loetzsch
Can be stressful
Don’t develop / test in production
23
Never:
Look at the backend, create models that people might need
Very unlikely to create business value
Instead:
Collect business questions and derive analytical entities
And then build them, incrementally
| @martin_loetzsch
How to build my first pipeline?
How to start?
24
| @martin_loetzsch
Nobody is good at waterfall planning
Build things incrementally
25
Add entities, attributes & metrics step by step
Entity model evolution @ Amann Girrbach
| @martin_loetzsch
https://github.com/mara/mara-example-project-1
How to actually start? Use a template
26
Runnable example data warehouse
Based on a public e-commerce data set
Follows Project A modelling best practices
Copy and adapt
| @martin_loetzsch
The Project A data & analytics team is always happy to help!
Need help with any of this?
27
| @martin_loetzsch
Happy to answer your questions
Thank you!
28