Dr. Martin Loetzsch
Project A Data Modelling
Best Practices Part II:
How to Build a Data Warehouse?
Create a unified analytical model & activate data across the whole company
Create tables in a database
Challenges
Consistency & correctness
Changeability
Complexity
Transparency
| @martin_loetzsch
In a Kimball-style star / snowflake schema
Single analytical truth = a set of tables
2
application databases / events / csv files / apis
→ central data store / data lake
→ single analytical truth
→ (marketing) automation / reporting / machine learning / …
| @martin_loetzsch
Two data engineers at work
Can be difficult at times
3
Today:
Best practices for coping with
complexity
4 | @martin_loetzsch
Not today:
Databases & orchestration tools
5 | @martin_loetzsch
Data pipelines as code
SQL files, Python & shell scripts
Structure & content of data warehouse are the result of running code
Easy to debug & inspect
Develop locally, test on staging system, then deploy to production
Functional data engineering
Reproducibility, idempotency, (immutability).
Running a pipeline on the same inputs will always produce the same results.
No side effects.
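A minimal sketch of what such an idempotent step can look like in SQL (the m_data source table and its columns are illustrative): re-running it on the same inputs always recreates the same result, with no side effects
-- Recreate the result from scratch on every run
DROP TABLE IF EXISTS m_tmp.order_item CASCADE;
CREATE TABLE m_tmp.order_item AS
SELECT order_id,
       product_id,
       quantity
FROM m_data.order_item -- hypothetical source table
WHERE quantity > 0;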
| @martin_loetzsch
Apply standard software engineering best practices
Make changing and testing things easy
6
https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
https://hakibenita.com/sql-for-data-analysis#sql-vs-pandas-performance
Simple analysis
SELECT activated,
count(*) AS cnt
FROM users
GROUP BY activated
Processing data in Pandas creates considerable overhead
Original table 65MB, Pandas data frame ~300MB
| @martin_loetzsch
Processing data in Python / Java etc. is very expensive
Leave data in database
7
Target of computation
CREATE TABLE m_dim_next.region (
region_id SMALLINT PRIMARY KEY,
region_name TEXT NOT NULL UNIQUE,
country_id SMALLINT NOT NULL,
country_name TEXT NOT NULL,
_region_name TEXT NOT NULL
);
Do computation and store result in table
WITH raw_region
AS (SELECT DISTINCT
country,
region
FROM m_data.ga_session
ORDER BY country, region)
INSERT INTO m_dim_next.region
SELECT
row_number()
OVER (ORDER BY country, region) AS region_id,
CASE WHEN (SELECT count(DISTINCT country)
FROM raw_region r2
WHERE r2.region = r1.region) > 1
THEN region || ' / ' || country
ELSE region END AS region_name,
dense_rank() OVER (ORDER BY country) AS country_id,
country AS country_name,
region AS _region_name
FROM raw_region r1;
INSERT INTO m_dim_next.region
VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');
Speedup subsequent transformations
SELECT util.add_index(
'm_dim_next', 'region',
column_names := ARRAY ['_region_name', 'country_name',
'region_id']);
SELECT util.add_index(
'm_dim_next', 'region',
column_names := ARRAY ['country_id', 'region_id']);
ANALYZE m_dim_next.region;
| @martin_loetzsch
Tables as (intermediate) results of processing steps
Use SQL as a data processing language
8
At the beginning of a transformation pipeline
DROP SCHEMA IF EXISTS m_dim_next CASCADE;
CREATE SCHEMA m_dim_next;
DROP SCHEMA IF EXISTS m_tmp CASCADE;
CREATE SCHEMA m_tmp;
Do all transformations
CREATE TABLE m_tmp.foo
AS SELECT n
FROM generate_series(0, 10000) n;
CREATE TABLE m_dim_next.foo
AS SELECT
n,
n + 1
FROM m_tmp.foo;
-- ..
At the end of pipeline, after checks
DROP SCHEMA IF EXISTS m_dim CASCADE;
ALTER SCHEMA m_dim_next RENAME TO m_dim;
Build m_dim_next schema while users are seeing m_dim
Data shown in frontend is always consistent
When ETL breaks, data just gets old
DBT users
Copy final reporting tables in a last step, after checks
Or abuse hooks/macros
Explicit operations determine structure and content of DWH
Very developer / git friendly
Requirement: Must be possible to rebuild DWH from sources
Add incremental loading / processing when there is a pain point
| @martin_loetzsch
Don’t bother with updates and/or migrations
Always drop & rebuild, keep last working version
9
Renaming / dropping schemas requires exclusive locks
CREATE FUNCTION util.cancel_queries_on_schema(schema TEXT)
RETURNS BOOLEAN AS $$
SELECT pg_cancel_backend(pid)
FROM
(SELECT DISTINCT pid
FROM pg_locks
JOIN pg_database ON database = pg_database.oid
JOIN pg_class ON pg_class.oid = relation
JOIN pg_namespace ON relnamespace = pg_namespace.oid
WHERE datname = current_database() AND nspname = schema
AND pid != pg_backend_pid()) t;
$$ LANGUAGE SQL;
First rename, then drop
CREATE FUNCTION util.replace_schema(schemaname TEXT,
replace_with TEXT)
RETURNS VOID AS $$
BEGIN
PERFORM util.cancel_queries_on_schema(schemaname);
IF EXISTS(SELECT *
FROM information_schema.schemata s
WHERE s.schema_name = schemaname)
THEN
EXECUTE 'ALTER SCHEMA ' || schemaname
|| ' RENAME TO ' || schemaname || '_old;';
END IF;
EXECUTE 'ALTER SCHEMA ' || replace_with
|| ' RENAME TO ' || schemaname || ';';
-- again, for good measure
PERFORM util.cancel_queries_on_schema(schemaname);
EXECUTE 'DROP SCHEMA IF EXISTS ' || schemaname
|| '_old CASCADE;';
END;
$$ LANGUAGE plpgsql;
Nice one-liner
SELECT util.replace_schema('m_dim', 'm_dim_next');
| @martin_loetzsch
Transactional DDL FTW
Atomic & robust schema switches in PostgreSQL
10
(Re)-creating data sets (schemas)
def re_create_data_set(next_bq_dataset_id=next_bq_dataset_id):
    from mara_db.bigquery import bigquery_client
    client = bigquery_client(bq_db_alias)
    print(f'deleting dataset {next_bq_dataset_id}')
    client.delete_dataset(dataset=next_bq_dataset_id,
                          delete_contents=True, not_found_ok=True)
    print(f'creating dataset {next_bq_dataset_id}')
    client.create_dataset(dataset=next_bq_dataset_id, exists_ok=True)
    return True
Replacing data sets
def replace_dataset(db_alias, dataset_id, next_dataset_id):
    import sys
    import time
    from google.api_core.exceptions import BadRequest
    from mara_db.bigquery import bigquery_client
    client = bigquery_client(db_alias)
    client.create_dataset(dataset=dataset_id, exists_ok=True)
    next_tables = set([table.table_id for table
                       in client.list_tables(next_dataset_id)])
    ddl = '\n'
    for table in client.list_tables(dataset_id):
        if table.table_id not in next_tables:
            ddl += f'DROP TABLE `{dataset_id}`.`{table.table_id}`;\n'
    # hopefully atomic operation
    for table_id in next_tables:
        ddl += (f'CREATE OR REPLACE TABLE `{dataset_id}`.`{table_id}` '
                f'AS SELECT * FROM `{next_dataset_id}`.`{table_id}`;\n')
        ddl += f'DROP TABLE `{next_dataset_id}`.`{table_id}`;\n'
    client.query(ddl)
    retries = 1
    while True:
        try:
            client.delete_dataset(next_dataset_id)
            return
        except BadRequest as e:
            if retries <= 10:
                print(e, file=sys.stderr)
                seconds_to_sleep = retries * 4
                print(f'Waiting {seconds_to_sleep} seconds')
                time.sleep(seconds_to_sleep)
                retries += 1
            else:
                raise e
| @martin_loetzsch
https://github.com/mara/mara-db/blob/master/mara_db/bigquery.py#L124
Schema switches are a bit less nice in BigQuery
11
ALTER SCHEMA m_dim_next SWAP WITH m_dim;
DROP SCHEMA m_dim_next CASCADE;
| @martin_loetzsch
They built a command for atomic schema switches
Very nice in Snowflake
12
It’s easy to make mistakes during ETL
DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;
CREATE TABLE s.city (
city_id SMALLINT,
city_name TEXT,
country_name TEXT
);
INSERT INTO s.city VALUES
(1, 'Berlin', 'Germany'),
(2, 'Budapest', 'Hungary');
CREATE TABLE s.customer (
customer_id BIGINT,
city_fk SMALLINT
);
INSERT INTO s.customer VALUES
(1, 1),
(1, 2),
(2, 3);
Customers per country?
SELECT
country_name,
count(*) AS number_of_customers
FROM s.customer
JOIN s.city ON customer.city_fk = city.city_id
GROUP BY country_name;
Without constraints, this query silently returns wrong numbers: the duplicated customer 1 is counted twice, and customer 2 (whose city_fk = 3 points nowhere) is dropped by the join.
Back up all assumptions about data by constraints
ALTER TABLE s.city ADD PRIMARY KEY (city_id);
ALTER TABLE s.city ADD UNIQUE (city_name);
ALTER TABLE s.city ADD UNIQUE (city_name, country_name);
ALTER TABLE s.customer ADD PRIMARY KEY (customer_id);
[23505] ERROR: could not create unique index "customer_pkey"
Detail: Key (customer_id)=(1) is duplicated.
ALTER TABLE s.customer ADD FOREIGN KEY (city_fk)
REFERENCES s.city (city_id);
[23503] ERROR: insert or update on table "customer" violates foreign
key constraint "customer_city_fk_fkey"
Detail: Key (city_fk)=(3) is not present in table "city"
| @martin_loetzsch
Only very little overhead, will save your ass
Avoid happy path programming
13
In all tables across the pipeline
DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;
CREATE TABLE s.city (
city_id SMALLINT NOT NULL UNIQUE,
city_name TEXT NOT NULL,
country_name TEXT NOT NULL
);
INSERT INTO s.city VALUES
(1, 'Berlin', 'Germany'),
(2, 'Budapest', 'Hungary'),
(-1, 'Unknown', 'Unknown');
CREATE TABLE s.customer (
customer_id BIGINT NOT NULL UNIQUE,
city_fk SMALLINT NOT NULL
);
INSERT INTO s.customer VALUES
(1, 1),
(2, -1);
NOT NULL constraints
There needs to be a justification for making a column nullable
Helps you detect problems very early on
Unique constraints
When you think it should be unique, add a constraint
Will save runtime and cost
Unknown values
Adding another breakdown should not change the sum of a metric (see the example below)
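With the “Unknown” city in place, the customers-per-country query from before no longer loses rows and the metric still adds up:
SELECT country_name,
count(*) AS number_of_customers
FROM s.customer
JOIN s.city ON customer.city_fk = city.city_id
GROUP BY country_name;
-- Germany: 1, Unknown: 1 (the total stays at 2 customers)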
| @martin_loetzsch
And sometimes add “Unknown” values
Make columns not nullable by default
14
| @martin_loetzsch
Never ever!
Don’t repeat business logic
15
“Processed order”?
SELECT sum(total_price) AS revenue
FROM os_data.order
WHERE status IN ('pending', 'accepted', 'completed',
'proposal_for_change');
SELECT CASE WHEN (status <> 'started'
AND payment_status = 'authorised'
AND order_type <> 'backend')
THEN o.order_id END AS processed_order_fk
FROM os_data.order o;
SELECT (last_status = 'pending') :: INTEGER AS is_unprocessed
FROM os_data.order;
Solution: refactor pipeline
Create separate table that computes everything we know about an order
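A hedged sketch of that refactoring (the table name and the exact status logic are illustrative): compute the flags once and let all downstream queries reuse them
CREATE TABLE os_dim_next.order_flags AS
SELECT order_id,
       status IN ('pending', 'accepted', 'completed',
                  'proposal_for_change') AS is_processed,
       (last_status = 'pending') AS is_unprocessed
FROM os_data.order;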
| @martin_loetzsch
Never ever!
Don’t repeat business logic
16
Load → clean → preprocess → transform → flatten
Never query data from more than one layer above
If you need one more column, add it to each of these steps (see the sketch after the layer descriptions)
Will save cost & runtime and make data lineage easier
Compute things as early as possible
It’s always easiest to get something done based on finished fact tables
Don’t do that!
| @martin_loetzsch
Requires discipline
Organise processing steps in layers
17
load-product load-order load-customer
clean-product clean-order clean-customer
preprocess-product preprocess-order preprocess-customer
transform-product transform-order transform-customer
flatten-product-fact flatten-order-fact flatten-customer-fact
Unchanged raw data as ingested
Remove test orders, parse timestamps, correct garbage
Business logic: everything you know about an entity without looking at others
Dimensional model: establish links between entities, transfer metrics
Prepare data sets for reporting and other data activation use cases
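A minimal sketch of propagating one new column through the layers (schema and column names are illustrative; each step reads only from the layer directly before it):
-- clean: remove test orders, normalise the raw value
CREATE TABLE os_clean.order AS
SELECT order_id,
       nullif(trim(coupon_code), '') AS coupon_code
FROM os_load.order
WHERE NOT is_test_order;
-- preprocess: derive business logic from the cleaned value
CREATE TABLE os_preprocess.order AS
SELECT order_id,
       coupon_code,
       coupon_code IS NOT NULL AS has_coupon
FROM os_clean.order;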
[Dependency graph of the booking data pipeline: dozens of clean, preprocess, and transform tasks for bookings, booking items, travel services, advertising performance, email contacts, locations, etc.; node labels garbled in the export]
| @martin_loetzsch
Booking data pipeline @ Trendtours
Complexity handled well I
18
[Dependency graph of the order data pipeline: preprocess, prepare, compute, and transform tasks for orders, invoices, sales, credits, returns, customers, and B2B account history; node labels garbled in the export]
| @martin_loetzsch
Order data pipeline @ Lampenwelt
Complexity handled well II
19
[Top-level pipeline graph: load steps for operational systems, platform, marketing, and exchange rates; merged customer and marketing data sets; consistency checks, customer segmentation, and target predictions; node labels garbled in the export]
Create separate pipelines for things that you
Want to develop independently
Want to test independently
Want to schedule independently
Ensure pipeline immutability by
Creating separate schemas for each pipeline (see the sketch below)
Avoiding modifying the schemas of other pipelines
Avoiding circular dependencies
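A small sketch of the schema-per-pipeline convention (schema names are illustrative), reusing the replace_schema helper from above: each pipeline creates and replaces only its own schemas and merely reads from the others
-- order pipeline: owns os_tmp, os_dim_next and os_dim
SELECT util.replace_schema('os_dim', 'os_dim_next');
-- marketing pipeline: owns m_tmp, m_dim_next and m_dim;
-- it reads from os_dim but never writes to it
SELECT util.replace_schema('m_dim', 'm_dim_next');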
| @martin_loetzsch
Top level pipelines @ Helloprint
Split pipelines to separate concerns
20
Check for “lost” rows
SELECT util.assert_equal(
'The order items fact table should contain all order items',
'SELECT count(*) FROM os_dim.order_item',
'SELECT count(*) FROM os_dim.order_items_fact');
Check consistency across cubes / domains
SELECT util.assert_almost_equal(
'The number of first orders should be the same in '
|| 'orders and marketing touchpoints cube',
'SELECT count(net_order_id)
FROM os_dim.order
WHERE _net_order_rank = 1;',
'SELECT (SELECT sum(number_of_first_net_orders)
FROM m_dim.acquisition_performance)
/ (SELECT count(*)
FROM m_dim.performance_attribution_model)',
0.00001
);
Check completeness of source data
SELECT util.assert_not_found(
'Each adwords campaign must have the attribute "Channel"',
'SELECT DISTINCT campaign_name, account_name
FROM aw_tmp.ad
JOIN aw_dim.ad_performance ON ad_fk = ad_id
WHERE attributes->>''Channel'' IS NULL
AND impressions > 0
AND _date > now() - INTERVAL ''30 days''');
Check correctness of redistribution transformations
SELECT util.assert_almost_equal(
'The cost of non-converting touchpoints must match the '
|| 'redistributed customer acquisition and reactivation cost',
'SELECT sum(cost)
FROM m_tmp.cost_of_non_converting_touchpoints;',
'SELECT
(SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_acquisition_cost)
+ (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_reactivation_cost);',
0.00001);
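A minimal sketch of what such an assert helper could look like in PostgreSQL (the actual util functions may be implemented differently): run both queries and raise when the results differ, so a failing check aborts the pipeline
CREATE FUNCTION util.assert_equal(description TEXT,
                                  query_1 TEXT, query_2 TEXT)
RETURNS BOOLEAN AS $$
DECLARE
  result_1 TEXT;
  result_2 TEXT;
BEGIN
  EXECUTE query_1 INTO result_1;
  EXECUTE query_2 INTO result_2;
  IF result_1 IS DISTINCT FROM result_2 THEN
    RAISE EXCEPTION 'Assertion failed: % (% vs. %)',
      description, result_1, result_2;
  END IF;
  RETURN TRUE;
END;
$$ LANGUAGE plpgsql;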
| @martin_loetzsch
Makes changing things easy
Write tests
21
Kimball style star schemas are not very user friendly
Flatten all connected entities into a wide fact table
| @martin_loetzsch
Bring data in the shape that is right for consumers
Prepare dimensional model for data use cases
22
SELECT
order_item.order_item_id AS "Order item ID",
"order".order_id AS "Order ID",
"order".order_date AS "Order date",
order_customer.customer_id AS "Customer ID",
order_customer_favourite_product_category.main_category AS "Customer favo
order_customer_favourite_product_category.sub_category_1 AS "Customer fav
order_customer_first_order.order_date AS "Customer first order date",
product.sku AS "Product SKU",
product_product_category.main_category AS "Product category level 1",
product_product_category.sub_category_1 AS "Product category level 2",
order_item.order_item_id AS "# Order items",
order_item.order_fk AS "# Orders",
order_item.product_revenue AS "Product revenue",
order_item.revenue AS "Shipping revenue"
FROM dim.order_item order_item
LEFT JOIN dim."order" "order" ON order_item.order_fk = "order".order
LEFT JOIN dim.customer order_customer ON "order".customer_fk = order
LEFT JOIN dim.product_category order_customer_favourite_product_cate
order_customer_favourite_product_category.product_category_id
LEFT JOIN dim."order" order_customer_first_order ON order_customer.f
LEFT JOIN dim.product product ON order_item.product_fk = product.pro
LEFT JOIN dim.product_category product_product_category ON product.p
| @martin_loetzsch
Can be stressful
Don’t develop / test in production
23
Never:
Look at the backend and create models that people might need
Very unlikely to create business value
Instead:
Collect business questions and derive analytical entities
And then build them, incrementally
| @martin_loetzsch
How to build my first pipeline?
How to start?
24
| @martin_loetzsch
Nobody is good at waterfall planning
Build things incrementally
25
Add entities, attributes & metrics step by step
Entity model evolution @ Amann Girrbach
| @martin_loetzsch
https://github.com/mara/mara-example-project-1
How to actually start? Use a template
26
Runnable example data warehouse
Based on a public e-commerce data set
Follows Project A modelling best practices
Copy and adapt
| @martin_loetzsch
The Project A data & analytics team is always happy to help!
Need help with any of this?
27
| @martin_loetzsch
Happy to answer your questions
Thank you!
28