Designing a schema
for a Data Warehouse
Why a Data
Warehouse?
DWH
A company's data is scattered across:
● Different databases
● Internal applications
● SaaS applications
The latter may be accessible through APIs or downloadable files
Why a DWH
This means:
● Different wire protocols, query languages
● Different schemas, methodically UNdocumented
● Designed to retrieve a single row, not aggregations
● On a technology designed for OLTP
● Conflicting / redundant / incorrect / missing data
● Business metrics are mixed with PII
Instead, you want data analysts to…
● Connect to a single SQL database
● With a well-known, standard schema
● Designed for analytical queries
● On a technology designed to run analytical queries
This standard schema is designed for analytics queries:
● JOIN
● WHERE
● GROUP BY
It's called a Star Schema. Its most basic concepts are:
● A star represents an event: customer buys product
● A dimension is any event characteristic we might use for
filtering, grouping and ordering: purchase date, delivery date,
product name, product category, customer city…
● The grain defines how specific dimensions are: date or
month? city or postcode?
● Facts are the measurements we take: cost, discount,
number of products bought, etc.
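The concepts above can be sketched as a tiny star schema. This is only an illustration run on an in-memory SQLite database: the table names, column names and rows are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: characteristics of the event (hypothetical names)
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- Fact table: one row per event "customer buys product"
CREATE TABLE fact_purchase (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,  -- fact
    cost       REAL      -- fact
);
-- Made-up sample rows
INSERT INTO dim_date VALUES (20240115, 2024, 1), (20240216, 2024, 2);
INSERT INTO dim_product VALUES (1, 'Espresso cup', 'Kitchen'), (2, 'Desk lamp', 'Office');
INSERT INTO fact_purchase VALUES
    (20240115, 1, 2, 10.0),
    (20240115, 2, 1, 40.0),
    (20240216, 1, 3, 15.0);
""")

# A typical analytical query: JOIN the dimensions, filter, then aggregate
rows = conn.execute("""
    SELECT p.category, SUM(f.quantity), SUM(f.cost)
    FROM fact_purchase f
    JOIN dim_date d ON d.date_id = f.date_id
    JOIN dim_product p ON p.product_id = f.product_id
    WHERE d.year = 2024
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)
```

The fact table sits in the middle and every dimension is one join away, which is what gives the schema its star shape.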
DWH design
Designing a DWH is a technological activity
that requires some business knowledge
How to design
FALSE!!!
Designing a DWH is a business activity
that requires technical skills
Start by identifying the business processes you want
more information about
Example processes:
● A customer buys a product
● A Google Ads campaign runs
● A courier delivers a pizza
● While doing so, write a dictionary of business terms
● Everyone should understand the terms
● In many companies different teams use some terms with
different meanings
Discuss each event with all the people who need information
about it
Typically people from multiple departments
Find out:
● Facts - numerical measurements to take (cost, discount,
number of products bought, etc.)
● Dimensions - event characteristics that can be used for
filtering and grouping: purchase date, delivery date, store,
product name, product category, customer city…
● Grain - how specific the dimensions are: date or
month? city or postcode?
Modify the event statement by adding the time and the
dimensions that affect its grain
● A customer buys a product
● Customers buy products in a day, in a city
● Customers buy products in a month, in a country
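To illustrate what the grain changes in practice, here is a small sketch that aggregates the same purchase events at the two grains named above. The events and the `aggregate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical purchase events: (date, country, city, amount)
events = [
    ("2024-01-15", "FR", "Paris", 150.0),
    ("2024-01-15", "FR", "Paris", 50.0),
    ("2024-01-16", "FR", "Nice", 200.0),
    ("2024-02-01", "FR", "Paris", 25.0),
]

def aggregate(events, key):
    """Sum the amounts of all events that share the same grain key."""
    totals = defaultdict(float)
    for event in events:
        totals[key(event)] += event[3]
    return dict(totals)

# Grain: one row per day and city
daily_by_city = aggregate(events, lambda e: (e[0], e[2]))
# Grain: one row per month and country ("2024-01-15"[:7] == "2024-01")
monthly_by_country = aggregate(events, lambda e: (e[0][:7], e[1]))

print(daily_by_city)
print(monthly_by_country)
```

A coarser grain gives fewer rows, but questions about single days or cities can no longer be answered from it.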
Dimensions
Dimensions are the criteria that will be used to
● Aggregate
● Filter
● Order
the numbers.
Example:
● Average amount spent
● By customers over 40, in 2024, in France
● Aggregated by store_city, date
Table: fact_in_store_purchase
date        country  city     customer_dob  prod_count  total_price
2024/01/15  FR       Paris    1950/02/02    2           150.00
2024/01/15  FR       Paris    1952/02/02    1           999.99
2024/01/15  FR       Avignon  1962/10/04    1           22.50
2024/01/16  FR       Paris    1978/12/02    2           10.00
2024/01/16  FR       Nice     1977/11/09    1           199.50
With this simplistic design:
● Adding dimensional columns is a pain
● Loading data into the table is harder
● We can't query a dimension alone
● We can't get a list of things that didn't happen
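As a sketch, the example question (average amount spent by customers over 40, in 2024, in France, by city and date) can be run against the single flat table in SQLite. Two assumptions are made here: dates are rewritten in ISO format, and "over 40" is approximated as "born before 1984-01-01".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_in_store_purchase (
        date TEXT, country TEXT, city TEXT, customer_dob TEXT,
        prod_count INTEGER, total_price REAL
    )
""")
# Rows from the flat table, with dates rewritten in ISO format
conn.executemany(
    "INSERT INTO fact_in_store_purchase VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2024-01-15", "FR", "Paris", "1950-02-02", 2, 150.00),
        ("2024-01-15", "FR", "Paris", "1952-02-02", 1, 999.99),
        ("2024-01-15", "FR", "Avignon", "1962-10-04", 1, 22.50),
        ("2024-01-16", "FR", "Paris", "1978-12-02", 2, 10.00),
        ("2024-01-16", "FR", "Nice", "1977-11-09", 1, 199.50),
    ],
)

# "Over 40" is approximated as "born before 1984-01-01" (an assumption)
averages = conn.execute("""
    SELECT city, date, AVG(total_price)
    FROM fact_in_store_purchase
    WHERE customer_dob < '1984-01-01'
      AND date BETWEEN '2024-01-01' AND '2024-12-31'
      AND country = 'FR'
    GROUP BY city, date
    ORDER BY city, date
""").fetchall()

# Only cities with at least one purchase can be listed from this table
cities = [row[0] for row in conn.execute(
    "SELECT DISTINCT city FROM fact_in_store_purchase ORDER BY city")]
print(averages)
print(cities)
```

The query works, but a city where nothing was sold never appears in `cities`: the flat table has no row to carry it.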
Dimensions usually look like this:
● Stored in a separate table
● Denormalised: hierarchies are represented by repeating data
● They have an ID that is unique to the DWH and has no
business meaning (a surrogate key)
● Human-readable information is stored in other columns,
which are indexed
Table: dim_city
continent  country   city       local_name  language  population
Europe     Italy     Rome       Roma        it        10000
Europe     Italy     Milan      Milano      it        20000
Europe     Italy     Alghero    Alghero     it        30000
Europe     Italy     Alghero    Alghero     ca        30000
Europe     Scotland  Edinburgh  Edinburgh   en        12345
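A possible way to build such a dimension, sketched in SQLite. The `city_id` surrogate key, its values, the index name and the small fact table are assumptions added for the example; the dimension rows come from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_city (
    city_id    INTEGER PRIMARY KEY,  -- surrogate key (assumed name and values)
    continent  TEXT,
    country    TEXT,
    city       TEXT,
    local_name TEXT,
    language   TEXT,
    population INTEGER
);
-- Human-readable columns are indexed because queries filter on them
CREATE INDEX idx_dim_city_country_city ON dim_city (country, city);
INSERT INTO dim_city VALUES
    (1, 'Europe', 'Italy', 'Rome', 'Roma', 'it', 10000),
    (2, 'Europe', 'Italy', 'Milan', 'Milano', 'it', 20000),
    (3, 'Europe', 'Italy', 'Alghero', 'Alghero', 'it', 30000),
    (4, 'Europe', 'Italy', 'Alghero', 'Alghero', 'ca', 30000),
    (5, 'Europe', 'Scotland', 'Edinburgh', 'Edinburgh', 'en', 12345);
-- Hypothetical fact table that references the dimension
CREATE TABLE fact_in_store_purchase (
    city_id     INTEGER REFERENCES dim_city(city_id),
    total_price REAL
);
INSERT INTO fact_in_store_purchase VALUES (1, 150.0), (2, 22.5);
""")

# The dimension can be queried alone
italian_cities = [row[0] for row in conn.execute(
    "SELECT DISTINCT city FROM dim_city WHERE country = 'Italy' ORDER BY city")]

# Things that did not happen: cities without any purchase
cities_without_purchases = [row[0] for row in conn.execute("""
    SELECT DISTINCT c.city
    FROM dim_city c
    LEFT JOIN fact_in_store_purchase f ON f.city_id = c.city_id
    WHERE f.city_id IS NULL
    ORDER BY c.city
""")]
print(italian_cities)
print(cities_without_purchases)
```

The continent and country are repeated on every row instead of living in their own tables, so a filter on country needs no extra join.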
Facts
A fact table usually contains:
● References / foreign keys to Dimension tables
● One or more numeric columns (facts)
Table: fact_in_store_purchase
date      country  city  customer_dob  prod_count  total_price
20240115  15       32    19500202      2           150.00
20240115  15       32    19520202      1           999.99
20240115  15       44    19621004      1           22.50
20240116  15       32    19781202      2           10.00
20240116  15       71    19771109      1           199.50
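To join the fact table to its dimensions (the id key columns are assumed names):
SELECT dt.date, ci.city,
       SUM(f.prod_count) AS prod_count,
       SUM(f.total_price) AS total_price
FROM fact_in_store_purchase f
-- Explicit join conditions instead of NATURAL JOIN
JOIN dim_date dt ON dt.id = f.date
JOIN dim_country co ON co.id = f.country
JOIN dim_city ci ON ci.id = f.city
WHERE dt.id >= 20240101
  AND dt.week_day BETWEEN 1 AND 5
  AND co.country = 'France'
  -- customer_dob is stored on the fact row as a yyyymmdd key
  AND f.customer_dob BETWEEN 19800101 AND 19991231
GROUP BY dt.date, ci.city
ORDER BY dt.date, ci.city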
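A runnable sketch of this kind of fact-to-dimension join, using SQLite. The `id` key columns, the `dim_city` table and the mapping of ids to names (15 = France, 32 = Paris, 71 = Nice) are assumptions inferred from the two versions of the sample table. Two rows are hypothetical: none of the sample customers was born between 1980 and 1999, so the filters would otherwise match nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (id INTEGER PRIMARY KEY, date TEXT, week_day INTEGER);
CREATE TABLE dim_country (id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_city (id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_in_store_purchase (
    date         INTEGER REFERENCES dim_date(id),
    country      INTEGER REFERENCES dim_country(id),
    city         INTEGER REFERENCES dim_city(id),
    customer_dob INTEGER,
    prod_count   INTEGER,
    total_price  REAL
);
-- week_day: 1 = Monday ... 7 = Sunday
INSERT INTO dim_date VALUES
    (20240115, '2024-01-15', 1),
    (20240116, '2024-01-16', 2),
    (20240120, '2024-01-20', 6);
INSERT INTO dim_country VALUES (15, 'France');
INSERT INTO dim_city VALUES (32, 'Paris'), (44, 'Avignon'), (71, 'Nice');
INSERT INTO fact_in_store_purchase VALUES
    (20240115, 15, 32, 19500202, 2, 150.00),
    (20240116, 15, 32, 19781202, 2, 10.00),
    -- Hypothetical rows, added so that the filters match something
    (20240116, 15, 32, 19850607, 3, 60.00),
    (20240120, 15, 71, 19900101, 1, 30.00);
""")

result = conn.execute("""
    SELECT dt.date, ci.city, SUM(f.prod_count), SUM(f.total_price)
    FROM fact_in_store_purchase f
    JOIN dim_date dt ON dt.id = f.date
    JOIN dim_country co ON co.id = f.country
    JOIN dim_city ci ON ci.id = f.city
    WHERE dt.id >= 20240101
      AND dt.week_day BETWEEN 1 AND 5
      AND co.country = 'France'
      AND f.customer_dob BETWEEN 19800101 AND 19991231
    GROUP BY dt.date, ci.city
    ORDER BY dt.date, ci.city
""").fetchall()
print(result)
```

Only the weekday purchase by the customer born in 1985 survives the filters; the Saturday purchase is excluded by the `week_day` condition.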
There are 3 types of fact tables:
● Transaction fact tables
○ The company buys products
● Periodic snapshot fact tables
○ Monthly inventory
● Accumulating snapshot fact tables
○ Multi-step process: a courier delivers a pizza
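Of the three types, the accumulating snapshot is the least intuitive, so here is a minimal sketch of one in SQLite: one row per pizza order, updated as each step completes. The table name, columns, times and the `minutes_between` helper are hypothetical.

```python
import sqlite3

def minutes_between(start_hhmm, end_hhmm):
    """Minutes between two same-day hhmm keys, e.g. 1905 and 1948."""
    def to_minutes(hhmm):
        return (hhmm // 100) * 60 + hhmm % 100
    return to_minutes(end_hhmm) - to_minutes(start_hhmm)

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_pizza_delivery (
        order_id           INTEGER PRIMARY KEY,
        ordered_time       INTEGER NOT NULL,  -- hhmm key
        picked_up_time     INTEGER,           -- NULL until the step happens
        delivered_time     INTEGER,           -- NULL until the step happens
        minutes_to_deliver INTEGER            -- fact, known at the last step
    )
""")

# Step 1: the row is created when the order is placed
conn.execute("INSERT INTO fact_pizza_delivery (order_id, ordered_time) VALUES (1, 1905)")
conn.execute("INSERT INTO fact_pizza_delivery (order_id, ordered_time) VALUES (2, 1930)")
# Step 2: the same row is updated when the courier picks up the pizza
conn.execute("UPDATE fact_pizza_delivery SET picked_up_time = 1925 WHERE order_id = 1")
# Step 3: the row is completed at delivery
conn.execute(
    "UPDATE fact_pizza_delivery SET delivered_time = ?, minutes_to_deliver = ? "
    "WHERE order_id = 1",
    (1948, minutes_between(1905, 1948)),
)

delivered = conn.execute(
    "SELECT * FROM fact_pizza_delivery WHERE order_id = 1").fetchone()
still_open = [row[0] for row in conn.execute(
    "SELECT order_id FROM fact_pizza_delivery WHERE delivered_time IS NULL")]
print(delivered, still_open)
```

Unlike a transaction fact table, rows here are updated after insertion, which also makes it easy to list the processes that are still in progress.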
Factless fact tables are a special type of fact table.
They don't have any fact columns.
They record boolean facts: an existing row means TRUE, a missing
row means FALSE.
Table: fact_customer_care_call
date      customer_id  operator_id
20240201  87612        927
20240201  999111       2250
20240201  8825         822
20240202  19166        1002
20240202  38410        948
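Even without a fact column, such a table can be counted and tested for existence. A small sketch in SQLite, loading the rows of the table above; the `customer_called` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_customer_care_call (
        date INTEGER, customer_id INTEGER, operator_id INTEGER
    )
""")
conn.executemany(
    "INSERT INTO fact_customer_care_call VALUES (?, ?, ?)",
    [
        (20240201, 87612, 927),
        (20240201, 999111, 2250),
        (20240201, 8825, 822),
        (20240202, 19166, 1002),
        (20240202, 38410, 948),
    ],
)

# Counting rows is the only aggregation a factless fact table needs
calls_per_day = conn.execute("""
    SELECT date, COUNT(*)
    FROM fact_customer_care_call
    GROUP BY date
    ORDER BY date
""").fetchall()

def customer_called(customer_id, date):
    """The boolean fact: TRUE if a row exists, FALSE otherwise."""
    row = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM fact_customer_care_call "
        "WHERE customer_id = ? AND date = ?)",
        (customer_id, date),
    ).fetchone()
    return bool(row[0])

print(calls_per_day)
print(customer_called(8825, 20240201), customer_called(8825, 20240202))
```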
Time Dimensions
General rules for time dimensions:
● One dimension for date only, without time
● Primary key: an integer id in the form yyyymmdd
● Add columns for any significant information: year, month,
month day, week day, workday, leap year…
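Date dimensions are usually generated rather than typed in. A minimal sketch of a generator that follows the rules above; the column names are assumptions, and the workday flag ignores public holidays.

```python
import calendar
from datetime import date, timedelta

def date_dimension_rows(start, end):
    """Yield one dimension row per calendar day, from start to end inclusive."""
    day = start
    while day <= end:
        yield {
            "date_id": day.year * 10000 + day.month * 100 + day.day,  # yyyymmdd
            "date": day.isoformat(),
            "year": day.year,
            "month": day.month,
            "month_day": day.day,
            "week_day": day.isoweekday(),         # 1 = Monday ... 7 = Sunday
            "is_workday": day.isoweekday() <= 5,  # ignores public holidays
            "is_leap_year": calendar.isleap(day.year),
        }
        day += timedelta(days=1)

rows = list(date_dimension_rows(date(2024, 1, 1), date(2024, 12, 31)))
print(len(rows), rows[0]["date_id"], rows[-1]["date_id"])
```

Because every attribute is precomputed, a filter such as "workdays only" becomes a plain column comparison instead of a date function call in each query.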
Store the time of day in a separate dimension, if needed:
● Primary key: integer id in the format hhmm
● Add separate columns for hours, minutes and any other
information you might need
● Depending on your needs, add a row for every minute, every
half hour, or every hour, possibly limited to working hours
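The time-of-day dimension can be generated in the same spirit. A sketch with a configurable step and range; the function name, parameters and column names are assumptions.

```python
def time_dimension_rows(step_minutes=1, start_hour=0, end_hour=24):
    """Build one row per step between start_hour (included) and end_hour (excluded)."""
    rows = []
    for minute_of_day in range(start_hour * 60, end_hour * 60, step_minutes):
        hour, minute = divmod(minute_of_day, 60)
        rows.append({
            "time_id": hour * 100 + minute,  # hhmm
            "hour": hour,
            "minute": minute,
            "is_morning": hour < 12,
        })
    return rows

every_minute = time_dimension_rows()
# One row per half hour, limited to working hours assumed to be 09:00-18:00
working_half_hours = time_dimension_rows(step_minutes=30, start_hour=9, end_hour=18)
print(len(every_minute), len(working_half_hours))
```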
Constellation
schemas
● You typically have multiple star schemas linked together
(Constellation Schema)
● Most dimensions should be shared across multiple stars
(Conformed Dimensions)
● Two stars might represent the same data with different
granularity, so some facts are present in multiple tables
● Make sure that facts are named consistently across stars
(Conformed Facts)
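A sketch of two stars at different grains that share their dimensions and use the same fact name, here in SQLite. All table names, column names and rows are hypothetical; the monthly star is loaded by rolling up the daily one, so the two can never disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions shared by both stars (hypothetical names)
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year_month INTEGER);
CREATE TABLE dim_city (city_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Same event at two grains; the fact is called total_price in both
CREATE TABLE fact_purchase_daily (date_id INTEGER, city_id INTEGER, total_price REAL);
CREATE TABLE fact_purchase_monthly (year_month INTEGER, country TEXT, total_price REAL);

INSERT INTO dim_date VALUES (20240115, 202401), (20240116, 202401);
INSERT INTO dim_city VALUES (32, 'Paris', 'France'), (71, 'Nice', 'France');
INSERT INTO fact_purchase_daily VALUES
    (20240115, 32, 150.0),
    (20240116, 32, 10.0),
    (20240116, 71, 200.0);

-- The coarser star is derived from the finer one through the dimensions
INSERT INTO fact_purchase_monthly
SELECT d.year_month, c.country, SUM(f.total_price)
FROM fact_purchase_daily f
JOIN dim_date d ON d.date_id = f.date_id
JOIN dim_city c ON c.city_id = f.city_id
GROUP BY d.year_month, c.country;
""")

monthly = conn.execute(
    "SELECT year_month, country, total_price FROM fact_purchase_monthly").fetchall()
daily_total = conn.execute(
    "SELECT SUM(total_price) FROM fact_purchase_daily").fetchone()[0]
print(monthly, daily_total)
```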
But DWH is a
complex matter…
We left out many topics, for example…
● How to represent invoice or bill of lading dimensions
(1 invoice contains multiple items)
● How to represent dimensions that change over time
● Role-playing dimensions and other dimension types
● Data marts, data lakes
● DWH to feed Machine Learning
● …and more
Interested? Contact us for a training!
What we left out

Webinar: Designing a schema for a Data Warehouse
