Designing a schema
for a Data Warehouse
Why a Data
Warehouse?
DWH
A company's data is scattered across:
● Different databases
● Internal applications
● SaaS applications
The latter may be accessible via APIs or downloadable files
Why a DWH
This means:
● Different wire protocols, query languages
● Different schemas, methodically UNdocumented
● Designed to retrieve a single row, not aggregations
● On a technology designed for OLTP
● Conflicting / redundant / incorrect / missing data
● Business metrics are mixed with PII
Instead, you want data analysts to…
● Connect to a single SQL database
● With a well-known, standard schema
● Designed for analytical queries
● On a technology designed to run analytical queries
This standard schema is designed for analytical queries built on:
● JOIN
● WHERE
● GROUP BY
It's called a Star Schema. Its most basic concepts are:
● A star represents an event: customer buys product
● A dimension is any event characteristic we might use for
filtering and ordering: purchase date, delivery date, product
name, product category, customer city…
● The grain defines how specific dimensions are: date or
month? city or postcode?
● Facts are the measurements we take: cost, discount,
number of products bought, etc.
DWH design
Designing a DWH is a technological activity
that requires some business knowledge
How to design
FALSE!!!
Designing a DWH is a business activity
that requires technical skills
Start by identifying the business processes you want
more information about
Example processes:
● A customer buys a product
● A Google Ads campaign runs
● A courier delivers a pizza
● While doing so, write a dictionary of business terms
● Everyone should understand the terms
● In many companies different teams use some terms with
different meanings
Discuss each event with all the people who need information
about it
Typically people from multiple departments
Find out:
● Facts - numerical measurements to take (cost, discount,
number of products bought, etc.)
● Dimensions - Event characteristics that can be used for
filtering and grouping: purchase date, delivery date, store,
product name, product category, customer city…
● The grain defines how specific dimensions are: date or
month? city or postcode?
Modify the event statement by adding the time and the
dimensions that affect its grain
● A customer buys a product
● Customers buy products in a day, in a city
● Customers buy products in a month, in a country
Dimensions
Dimensions are the criteria that will be used to
● Aggregate
● Filter
● Order
the numbers.
Example:
● Average amount spent
● By customers over 40, in 2024, in France
● Aggregated by store_city, date
Table: fact_in_store_purchase
date        country  city     customer_dob  prod_count  total_price
2024/01/15  FR       Paris    1950/02/02    2           150.00
2024/01/15  FR       Paris    1952/02/02    1           999.99
2024/01/15  FR       Avignon  1962/10/04    1           22.50
2024/01/16  FR       Paris    1978/12/02    2           10.00
2024/01/16  FR       Nice     1977/11/09    1           199.50
With this simplistic design:
● Adding dimensional columns is a pain
● Loading data into the table is harder
● We can't query a dimension alone
● We can't get a list of things that didn't happen
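As a minimal sketch, the example question (average amount spent by customers over 40, in 2024, in France, aggregated by city and date) can be run against the single-table design with Python's built-in sqlite3 module. Two assumptions are made here: dates are stored in ISO format so that SQLite's date functions work, and "over 40" is read as "born at least 40 years before the purchase date".

```python
import sqlite3

# In-memory database holding the flat table shown above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_in_store_purchase (
        date TEXT, country TEXT, city TEXT,
        customer_dob TEXT, prod_count INTEGER, total_price REAL
    )
""")
rows = [
    ("2024-01-15", "FR", "Paris",   "1950-02-02", 2, 150.00),
    ("2024-01-15", "FR", "Paris",   "1952-02-02", 1, 999.99),
    ("2024-01-15", "FR", "Avignon", "1962-10-04", 1, 22.50),
    ("2024-01-16", "FR", "Paris",   "1978-12-02", 2, 10.00),
    ("2024-01-16", "FR", "Nice",    "1977-11-09", 1, 199.50),
]
conn.executemany(
    "INSERT INTO fact_in_store_purchase VALUES (?, ?, ?, ?, ?, ?)", rows
)

# Filter (country, year, age), then aggregate by city and date.
query = """
    SELECT city, date, AVG(total_price) AS avg_spent
    FROM fact_in_store_purchase
    WHERE country = 'FR'
      AND date BETWEEN '2024-01-01' AND '2024-12-31'
      AND customer_dob <= date(date, '-40 years')
    GROUP BY city, date
    ORDER BY city, date
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The query works, but every new filter (continent, store, customer segment) would need another column on this one table, which is the pain the list above describes.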
Dimensions usually look like this:
● Stored in a separate table
● Denormalised: hierarchies are represented by repeating data
● They have an ID that is unique to the DWH and has no
meaning
● Human-readable information is stored in other columns,
which are indexed
Table: dim_city
continent  country   city       local_name  language  population
Europe     Italy     Rome       Roma        it        10000
Europe     Italy     Milan      Milano      it        20000
Europe     Italy     Alghero    Alghero     it        30000
Europe     Italy     Alghero    Alghero     ca        30000
Europe     Scotland  Edinburgh  Edinburgh   en        12345
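The dimension above could be created roughly as follows. This is a sketch using Python's sqlite3 module; the surrogate key name `city_id` and the index names are hypothetical, since the slide does not show them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_city (
        city_id    INTEGER PRIMARY KEY,  -- meaningless surrogate key (hypothetical name)
        continent  TEXT NOT NULL,        -- hierarchy levels are simply repeated per row
        country    TEXT NOT NULL,
        city       TEXT NOT NULL,
        local_name TEXT NOT NULL,
        language   TEXT NOT NULL,
        population INTEGER
    );
    -- Human-readable columns are indexed because analysts filter on them.
    CREATE INDEX idx_dim_city_country ON dim_city (country);
    CREATE INDEX idx_dim_city_city ON dim_city (city);
""")
conn.executemany(
    "INSERT INTO dim_city"
    " (continent, country, city, local_name, language, population)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Europe", "Italy", "Rome", "Roma", "it", 10000),
        ("Europe", "Italy", "Milan", "Milano", "it", 20000),
        ("Europe", "Italy", "Alghero", "Alghero", "it", 30000),
        ("Europe", "Italy", "Alghero", "Alghero", "ca", 30000),
        ("Europe", "Scotland", "Edinburgh", "Edinburgh", "en", 12345),
    ],
)

# The dimension can now be queried alone, e.g. to walk the hierarchy.
cities_per_country = conn.execute("""
    SELECT country, COUNT(DISTINCT city)
    FROM dim_city
    GROUP BY country
    ORDER BY country
""").fetchall()
print(cities_per_country)
```

Because the table is denormalised, rolling up from city to country needs no extra join.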
Facts
A fact table usually contains:
● References / foreign keys to Dimension tables
● One or more numeric columns (facts)
Table: fact_in_store_purchase
date      country  city  customer_dob  prod_count  total_price
20240115  15       32    19500202      2           150.00
20240115  15       32    19520202      1           999.99
20240115  15       44    19621004      1           22.50
20240116  15       32    19781202      2           10.00
20240116  15       71    19771109      1           199.50
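To join the fact table to its dimensions:
-- Key column names (dt.date, ci.id) are assumed; adapt them to your schema
SELECT dt.date, ci.city,
       SUM(f.prod_count) AS prod_count,
       SUM(f.total_price) AS total_price
FROM fact_in_store_purchase f
JOIN dim_date dt ON dt.date = f.date
JOIN dim_city ci ON ci.id = f.city
WHERE dt.date > 20240000
AND dt.week_day BETWEEN 1 AND 5
AND ci.country = 'France'
-- customer_dob is stored on the fact row as a yyyymmdd integer
AND f.customer_dob BETWEEN 19800000 AND 20000000
GROUP BY dt.date, ci.city
ORDER BY dt.date, ci.city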
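A runnable sketch of a fact-to-dimension join like the one above, using Python's sqlite3 module. The dimension layouts, the key column names and the mapping of city ids 32, 44 and 71 to Paris, Avignon and Nice are assumptions made for illustration. The date-of-birth filter is left out so that the sample rows produce output.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date INTEGER PRIMARY KEY, week_day INTEGER);
    CREATE TABLE dim_city (id INTEGER PRIMARY KEY, country TEXT, city TEXT);
    CREATE TABLE fact_in_store_purchase (
        date         INTEGER REFERENCES dim_date (date),
        city         INTEGER REFERENCES dim_city (id),
        customer_dob INTEGER,
        prod_count   INTEGER,
        total_price  REAL
    );
""")

# Date dimension rows: yyyymmdd key plus ISO week day (1 = Monday).
for d in (date(2024, 1, 15), date(2024, 1, 16)):
    conn.execute(
        "INSERT INTO dim_date VALUES (?, ?)",
        (int(d.strftime("%Y%m%d")), d.isoweekday()),
    )
conn.executemany(
    "INSERT INTO dim_city VALUES (?, ?, ?)",
    [(32, "France", "Paris"), (44, "France", "Avignon"), (71, "France", "Nice")],
)
conn.executemany(
    "INSERT INTO fact_in_store_purchase VALUES (?, ?, ?, ?, ?)",
    [
        (20240115, 32, 19500202, 2, 150.00),
        (20240115, 32, 19520202, 1, 999.99),
        (20240115, 44, 19621004, 1, 22.50),
        (20240116, 32, 19781202, 2, 10.00),
        (20240116, 71, 19771109, 1, 199.50),
    ],
)

# Filter on dimension attributes, aggregate the facts.
report = conn.execute("""
    SELECT dt.date, ci.city,
           SUM(f.prod_count)  AS prod_count,
           SUM(f.total_price) AS total_price
    FROM fact_in_store_purchase f
    JOIN dim_date dt ON dt.date = f.date
    JOIN dim_city ci ON ci.id = f.city
    WHERE dt.date > 20240000
      AND dt.week_day BETWEEN 1 AND 5
      AND ci.country = 'France'
    GROUP BY dt.date, ci.city
    ORDER BY dt.date, ci.city
""").fetchall()
for row in report:
    print(row)
```

Explicit `JOIN ... ON` conditions are used here because `NATURAL JOIN` silently depends on column names matching across tables.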
There are 3 types of fact tables:
● Transaction fact tables
○ The company buys products
● Periodic snapshot fact tables
○ Monthly inventory
● Accumulating snapshot fact tables
○ Multi-step: courier delivers pizza
Factless fact tables are a special type of fact table.
They don't have any fact columns.
They are boolean facts: an existing row is TRUE, a non-existing
row is FALSE.
Table: fact_customer_care_call
date      customer_id  operator_id
20240201  87612        927
20240201  999111       2250
20240201  8825         822
20240202  19166        1002
20240202  38410        948
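A short sketch of how a factless fact table is typically queried, again with Python's sqlite3 module. The `dim_customer` table and customer 555 (who never called) are made up for the example. Counting rows answers "how many times did the event happen", and an anti-join against the dimension lists the things that did not happen.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE fact_customer_care_call (
        date INTEGER, customer_id INTEGER, operator_id INTEGER
    );
""")
# dim_customer holds every known customer; 555 is a hypothetical one who never called.
conn.executemany(
    "INSERT INTO dim_customer VALUES (?)",
    [(87612,), (999111,), (8825,), (19166,), (38410,), (555,)],
)
conn.executemany(
    "INSERT INTO fact_customer_care_call VALUES (?, ?, ?)",
    [
        (20240201, 87612, 927),
        (20240201, 999111, 2250),
        (20240201, 8825, 822),
        (20240202, 19166, 1002),
        (20240202, 38410, 948),
    ],
)

# With no fact column, the measurement is the row count itself.
calls_per_day = conn.execute("""
    SELECT date, COUNT(*)
    FROM fact_customer_care_call
    GROUP BY date
    ORDER BY date
""").fetchall()

# Customers with no row in the fact table: the event is FALSE for them.
never_called = conn.execute("""
    SELECT c.customer_id
    FROM dim_customer c
    WHERE NOT EXISTS (
        SELECT 1 FROM fact_customer_care_call f
        WHERE f.customer_id = c.customer_id
    )
""").fetchall()
print(calls_per_day, never_called)
```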
Time Dimensions
General rules for time dimensions:
● One dimension for date only, without time
● Primary key: an integer id in the form yyyymmdd
● Add columns for any significant information: year, month,
month day, week day, workday, leap year…
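The rules above could be turned into a small generator for the date dimension. This is a sketch in Python; the column names are illustrative, and the workday flag only looks at the day of the week (public holidays are ignored).

```python
import calendar
from datetime import date, timedelta


def build_dim_date(start: date, end: date) -> list[dict]:
    """Return one row per calendar day between start and end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_id": int(day.strftime("%Y%m%d")),  # integer key, yyyymmdd
            "year": day.year,
            "month": day.month,
            "month_day": day.day,
            "week_day": day.isoweekday(),            # 1 = Monday ... 7 = Sunday
            "is_workday": day.isoweekday() <= 5,     # ignores public holidays
            "is_leap_year": calendar.isleap(day.year),
        })
        day += timedelta(days=1)
    return rows


dim_date = build_dim_date(date(2024, 1, 1), date(2024, 12, 31))
print(len(dim_date), dim_date[0])
```

Pre-computing these attributes once means analysts can filter on `week_day` or `is_workday` without calling date functions in every query.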
Store the time of day in a separate dimension, if needed
● Primary key: an integer id in the form hhmm
● Add separate columns for hours, minutes and any other
information you might need
● Depending on your needs, add a row for every minute, or
every hour in the working hours, or every half hour, etc.
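A sketch of a time-of-day dimension generator in Python. The function name, its parameters and the column names are hypothetical. The step and the hour range control how many rows are produced.

```python
def build_dim_time(step_minutes: int = 1,
                   start_hour: int = 0,
                   end_hour: int = 24) -> list[dict]:
    """Return one row per step between start_hour (inclusive) and end_hour (exclusive)."""
    rows = []
    for minute_of_day in range(start_hour * 60, end_hour * 60, step_minutes):
        hour, minute = divmod(minute_of_day, 60)
        rows.append({
            "time_id": hour * 100 + minute,  # integer key, hhmm
            "hour": hour,
            "minute": minute,
        })
    return rows


# One row per minute of the day.
every_minute = build_dim_time()

# One row per half hour, working hours only (09:00 to 17:30).
working_half_hours = build_dim_time(step_minutes=30, start_hour=9, end_hour=18)
print(len(every_minute), len(working_half_hours))
```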
Constellation
schemas
● You typically have multiple star schemas linked together
(Constellation Schema)
● Most dimensions should be shared across multiple stars
(Conformed Dimensions)
● Two stars might represent the same data with different
granularity, so some facts are present in multiple tables
● Make sure that facts are named consistently across stars
(Conformed Facts)
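As an illustration of two stars holding the same data at different granularity, the sketch below derives a monthly, per-country fact table from a daily, per-city one, with Python's sqlite3 module. The table names are hypothetical. Both tables use the same fact names (`prod_count`, `total_price`) and share the `dim_city` dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_city (id INTEGER PRIMARY KEY, country TEXT, city TEXT);
    -- Grain: one row per purchase, per day, per city.
    CREATE TABLE fact_purchase_daily (
        date INTEGER, city INTEGER, prod_count INTEGER, total_price REAL
    );
    -- Grain: one row per month, per country. Same fact names as above.
    CREATE TABLE fact_purchase_monthly (
        month INTEGER, country TEXT, prod_count INTEGER, total_price REAL
    );
""")
conn.executemany(
    "INSERT INTO dim_city VALUES (?, ?, ?)",
    [(32, "France", "Paris"), (71, "France", "Nice")],
)
conn.executemany(
    "INSERT INTO fact_purchase_daily VALUES (?, ?, ?, ?)",
    [
        (20240115, 32, 2, 150.0),
        (20240116, 71, 1, 199.5),
        (20240203, 32, 1, 20.0),
    ],
)

# Integer division turns a yyyymmdd key into a yyyymm key.
conn.execute("""
    INSERT INTO fact_purchase_monthly (month, country, prod_count, total_price)
    SELECT f.date / 100, ci.country, SUM(f.prod_count), SUM(f.total_price)
    FROM fact_purchase_daily f
    JOIN dim_city ci ON ci.id = f.city
    GROUP BY f.date / 100, ci.country
""")
monthly = conn.execute(
    "SELECT * FROM fact_purchase_monthly ORDER BY month, country"
).fetchall()
print(monthly)
```

Because the names and definitions match, a total computed from the monthly table agrees with the same total computed from the daily one.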
But DWH is a
complex matter…
We left out many topics, for example…
● How to represent invoice or bill of lading dimensions
(1 invoice contains multiple items)
● How to represent dimensions that change over time
● Role playing dimensions and other dimension types
● Data marts, data lakes
● DWH to feed Machine Learning
● …and more
Interested? Contact us for a training!
What we left out

Webinar: Designing a schema for a Data Warehouse

  • 1. Designing a schema for a Data Warehouse
  • 3. DWH
  • 4. A company data is scattered over: ● Different databases ● Internal applications ● SaaS applications The latter can be accessible via APIs or downloadable files Why a DWH
  • 5. This means: ● Different wire protocols, query languages ● Different schemas, methodically UNdocumented ● Designed to retrieve a single row, not aggregations ● On a technology designed for OLTP ● Conflicting / redundant / incorrect / missing data ● Business metrics are mixed with PII Why a DWH
  • 6. Instead, you want data analysts to… ● Connect to a signle SQL database ● With a well-known, standard schema ● Designed for analytical queries ● On a technology designed to run analytical queries Why a DWH
  • 7. This standard schema is designed for analytics queries: ● JOIN ● WHERE ● GROUP BY Why a DWH
  • 9. It's called a Star Schema. Its most basic concepts are: ● A star represents an event: customer buys product ● A dimension is any event characteristic we might use for filtering and ordering: purchase date, delivery date, product name, product category, customer city… ● The grain defines how specific dimensions are: date or month? city or postcode? ● Facts are the measurements we take: cost, discount, number of products bought, etc. Why a DWH
  • 11. FALSE: Designing a DWH is a technological activity that requires some business knowledge How to design
  • 15. Designing a DWH is a business activity that requires technical skills How to design
  • 16. It starts by identifying business processes you want to have more information about Example processes: ● A customer buys a product ● A Google Ads campaign runs ● A courier delivers a pizza How to design
  • 17. ● While doing so, write a dictionary of business terms ● Everyone should understand the terms ● In many companies different teams use some terms with different meanings How to design
  • 18. Discuss each event with all the people who need information about it. Typically, these are people from multiple departments How to design
  • 19. Find out: ● Facts - numerical measurements to take (cost, discount, number of products bought, etc.) ● Dimensions - event characteristics that can be used for filtering and grouping: purchase date, delivery date, store, product name, product category, customer city… ● The grain defines how specific dimensions are: date or month? city or postcode? How to design
  • 20. Modify the event statement by adding the time and the dimensions that affect its grain ● A customer buys a product ● Customers buy products in a day, in a city ● Customers buy products in a month, in a country How to design
  • 22. Dimensions are the criteria that will be used to ● Aggregate ● Filter ● Order the numbers. Dimensions
  • 23. Example: ● Average amount spent ● By customers over 40, in 2024, in France ● Aggregated by store_city, date Dimensions
  • 24. Table: fact_in_store_purchase
       date        country  city     customer_dob  prod_count  total_price
       2024/01/15  FR       Paris    1950/02/02    2           150.00
       2024/01/15  FR       Paris    1952/02/02    1           999.99
       2024/01/15  FR       Avignon  1962/10/04    1           22.50
       2024/01/16  FR       Paris    1978/12/02    2           10.00
       2024/01/16  FR       Nice     1977/11/09    1           199.50
       Dimensions
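As a rough illustration, the example question from slide 23 can be run against the flat table from slide 24. This is only a sketch: it assumes SQLite (through Python's standard `sqlite3` module), stores dates as ISO strings so that SQLite's `date()` function can be used, and reads "over 40" as "born at least 40 years before the purchase date". None of these choices come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_in_store_purchase (
        date TEXT, country TEXT, city TEXT, customer_dob TEXT,
        prod_count INTEGER, total_price REAL
    )
""")
# Rows copied from slide 24, with dates rewritten as ISO strings (assumption).
conn.executemany(
    "INSERT INTO fact_in_store_purchase VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2024-01-15", "FR", "Paris", "1950-02-02", 2, 150.00),
        ("2024-01-15", "FR", "Paris", "1952-02-02", 1, 999.99),
        ("2024-01-15", "FR", "Avignon", "1962-10-04", 1, 22.50),
        ("2024-01-16", "FR", "Paris", "1978-12-02", 2, 10.00),
        ("2024-01-16", "FR", "Nice", "1977-11-09", 1, 199.50),
    ],
)

# Average amount spent, by customers over 40, in 2024, in France,
# aggregated by city and date.
flat_rows = conn.execute("""
    SELECT f.date, f.city, AVG(f.total_price) AS avg_spent
    FROM fact_in_store_purchase f
    WHERE f.country = 'FR'
      AND f.date BETWEEN '2024-01-01' AND '2024-12-31'
      AND f.customer_dob <= date(f.date, '-40 years')
    GROUP BY f.city, f.date
    ORDER BY f.date, f.city
""").fetchall()

for row in flat_rows:
    print(row)
```

The query works, but every filter reads raw text columns in the fact table, which is the limitation the next slides address.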
  • 25. With this simplistic design: ● Adding dimensional columns is a pain ● Loading data into the table is harder ● We can't query a dimension alone ● We can't get a list of things that didn't happen Dimensions
  • 27. Dimensions usually look like this: ● Stored in a separate table ● Denormalised. Hierarchies are represented by repeating data ● They have an ID that is unique to the DWH and has no meaning ● Human-readable information is stored in other columns ● Which are indexed Dimensions
  • 28. Table: dim_city
       continent  country   city       local_name  language  population
       Europe     Italy     Rome       Roma        it        10000
       Europe     Italy     Milan      Milano      it        20000
       Europe     Italy     Alghero    Alghero     it        30000
       Europe     Italy     Alghero    Alghero     ca        30000
       Europe     Scotland  Edinburgh  Edinburgh   en        12345
       Dimensions
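A minimal sketch of what a separate dimension table makes possible, again assuming SQLite via Python's `sqlite3`. The `city_id` surrogate key, the index name and the small `fact_purchase` table (with its three rows) are hypothetical and exist only for illustration; the `dim_city` rows are the ones from slide 28.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_city (
        city_id    INTEGER PRIMARY KEY,  -- hypothetical meaningless surrogate key
        continent  TEXT,
        country    TEXT,
        city       TEXT,
        local_name TEXT,
        language   TEXT,
        population INTEGER
    );
    -- Human-readable columns are indexed because they are used for filtering.
    CREATE INDEX idx_dim_city_country_city ON dim_city (country, city);

    INSERT INTO dim_city VALUES
        (1, 'Europe', 'Italy',    'Rome',      'Roma',      'it', 10000),
        (2, 'Europe', 'Italy',    'Milan',     'Milano',    'it', 20000),
        (3, 'Europe', 'Italy',    'Alghero',   'Alghero',   'it', 30000),
        (4, 'Europe', 'Italy',    'Alghero',   'Alghero',   'ca', 30000),
        (5, 'Europe', 'Scotland', 'Edinburgh', 'Edinburgh', 'en', 12345);

    -- Hypothetical fact table, reduced to what the example needs.
    CREATE TABLE fact_purchase (
        city_id     INTEGER REFERENCES dim_city (city_id),
        total_price REAL
    );
    INSERT INTO fact_purchase VALUES (1, 150.00), (1, 10.00), (5, 22.50);
""")

# Query a dimension alone: which Italian cities do we know about?
italian_cities = [r[0] for r in conn.execute(
    "SELECT DISTINCT city FROM dim_city WHERE country = 'Italy' ORDER BY city"
)]

# Things that didn't happen: cities without a single purchase.
cities_without_sales = [r[0] for r in conn.execute("""
    SELECT DISTINCT d.city
    FROM dim_city d
    LEFT JOIN fact_purchase f ON f.city_id = d.city_id
    WHERE f.city_id IS NULL
    ORDER BY d.city
""")]

print(italian_cities)
print(cities_without_sales)
```

Neither query is possible with the single flat table, because a city only exists there once a purchase has happened in it.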
  • 29. Facts
  • 30. A fact table usually contains: ● References / foreign keys to Dimension tables ● One or more numeric columns (facts) Facts
  • 31. Table: fact_in_store_purchase
       date      country  city  customer_dob  prod_count  total_price
       20240115  15       32    19500202      2           150.00
       20240115  15       32    19520202      1           999.99
       20240115  15       44    19621004      1           22.50
       20240116  15       32    19781202      2           10.00
       20240116  15       71    19771109      1           199.50
       Facts
  • 32. To join fact to dimensions:
       SELECT dt.date, ct.city, SUM(f.prod_count) AS prod_count, SUM(f.total_price) AS total_price
       FROM fact_in_store_purchase f
       NATURAL JOIN dim_date dt
       NATURAL JOIN dim_country ct
       NATURAL JOIN dim_customer cu
       WHERE dt.date > 20240000
         AND dt.week_day BETWEEN 1 AND 5
         AND ct.country = 'France'
         AND cu.dob BETWEEN 19800000 AND 20000000
       GROUP BY dt.date, ct.city
       ORDER BY dt.date, ct.city
       Facts
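The join on slide 32 can also be written with explicit join keys, which avoids depending on matching column names. The sketch below assumes SQLite via Python's `sqlite3`. The `*_id` column names and the reduced dimension tables are hypothetical, and the mapping of ids to names (15 = France, 32 = Paris, 44 = Avignon, 71 = Nice) is inferred by lining up the tables on slides 24 and 31, not stated in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, week_day INTEGER);
    CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE dim_city    (city_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_in_store_purchase (
        date_id      INTEGER REFERENCES dim_date (date_id),
        country_id   INTEGER REFERENCES dim_country (country_id),
        city_id      INTEGER REFERENCES dim_city (city_id),
        customer_dob INTEGER,
        prod_count   INTEGER,
        total_price  REAL
    );

    -- week_day: 1 = Monday ... 7 = Sunday (2024-01-15 was a Monday).
    INSERT INTO dim_date VALUES (20240115, 1), (20240116, 2);
    INSERT INTO dim_country VALUES (15, 'France');
    INSERT INTO dim_city VALUES (32, 'Paris'), (44, 'Avignon'), (71, 'Nice');

    -- Fact rows from slide 31.
    INSERT INTO fact_in_store_purchase VALUES
        (20240115, 15, 32, 19500202, 2, 150.00),
        (20240115, 15, 32, 19520202, 1, 999.99),
        (20240115, 15, 44, 19621004, 1, 22.50),
        (20240116, 15, 32, 19781202, 2, 10.00),
        (20240116, 15, 71, 19771109, 1, 199.50);
""")

# Filter on dimension attributes, aggregate the facts.
star_rows = conn.execute("""
    SELECT dt.date_id, ci.city,
           SUM(f.prod_count)  AS prod_count,
           SUM(f.total_price) AS total_price
    FROM fact_in_store_purchase f
    JOIN dim_date    dt ON dt.date_id    = f.date_id
    JOIN dim_country co ON co.country_id = f.country_id
    JOIN dim_city    ci ON ci.city_id    = f.city_id
    WHERE dt.date_id BETWEEN 20240101 AND 20241231
      AND dt.week_day BETWEEN 1 AND 5
      AND co.country = 'France'
    GROUP BY dt.date_id, ci.city
    ORDER BY dt.date_id, ci.city
""").fetchall()

for row in star_rows:
    print(row)
```

The fact table only stores integers and measurements, while every readable attribute used in `WHERE` and `GROUP BY` comes from a dimension.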
  • 33. There are 3 types of fact tables: ● Transaction fact tables ○ The company buys products ● Periodic snapshot fact tables ○ Monthly inventory ● Accumulating snapshot fact tables ○ Multi-step: courier delivers pizza Facts
  • 34. Factless fact tables are a special type of fact table. They don't have any fact column. They are boolean facts: an existing row is TRUE, a non-existing row is FALSE. Facts
  • 35. Table: fact_customer_care_call
       date      customer_id  operator_id
       20240201  87612        927
       20240201  999111       2250
       20240201  8825         822
       20240202  19166        1002
       20240202  38410        948
       Facts
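Even without a fact column, a factless fact table can be counted and tested for existence. A small sketch using the rows from slide 35, assuming SQLite via Python's `sqlite3`; the helper name `customer_called` is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_customer_care_call (
        date INTEGER, customer_id INTEGER, operator_id INTEGER
    )
""")
conn.executemany(
    "INSERT INTO fact_customer_care_call VALUES (?, ?, ?)",
    [
        (20240201, 87612, 927),
        (20240201, 999111, 2250),
        (20240201, 8825, 822),
        (20240202, 19166, 1002),
        (20240202, 38410, 948),
    ],
)

# The "measurement" is simply the number of rows.
calls_per_day = conn.execute("""
    SELECT date, COUNT(*) AS calls
    FROM fact_customer_care_call
    GROUP BY date
    ORDER BY date
""").fetchall()


def customer_called(customer_id: int, day: int) -> bool:
    """Boolean fact: TRUE if a row exists, FALSE otherwise (hypothetical helper)."""
    row = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM fact_customer_care_call"
        " WHERE customer_id = ? AND date = ?)",
        (customer_id, day),
    ).fetchone()
    return bool(row[0])


print(calls_per_day)
print(customer_called(8825, 20240201), customer_called(8825, 20240202))
```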
  • 37. General rules for time dimensions: ● One dimension for date only, without time ● Primary key: an integer id in the form yyyymmdd ● Add columns for any significant information: year, month, month day, week day, workday, leap year… Facts
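Date dimensions are usually generated rather than loaded from a source system. The following is one possible sketch in plain Python; the function name `build_dim_date`, the column names and the Monday-to-Friday definition of a workday (which ignores public holidays) are assumptions, not part of the slides.

```python
import calendar
from datetime import date, timedelta


def build_dim_date(start: date, end: date) -> list[dict]:
    """Return one dimension row per calendar day, from start to end inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            # Integer primary key in the form yyyymmdd.
            "date_id": day.year * 10000 + day.month * 100 + day.day,
            "year": day.year,
            "month": day.month,
            "month_day": day.day,
            "week_day": day.isoweekday(),         # 1 = Monday ... 7 = Sunday
            "is_workday": day.isoweekday() <= 5,  # assumption: no holiday calendar
            "is_leap_year": calendar.isleap(day.year),
        })
        day += timedelta(days=1)
    return rows


dim_date = build_dim_date(date(2024, 1, 1), date(2024, 12, 31))
print(len(dim_date), dim_date[0])
```

Because every attribute is precomputed, analysts can filter on `week_day` or `is_workday` without date arithmetic in their queries.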
  • 38. Store the time of day in a separate column, if needed ● Primary key: integer id in the format hhmm ● Add separate columns for hours, minutes and any other information you might need ● Depending on your needs, add a row for every minute, or hour in the working hours, or half an hour, etc. Facts
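A time-of-day dimension can be generated in the same way. In this sketch the function name `build_dim_time` and its parameters are hypothetical; the step and the hour range control whether there is one row per minute, per half hour, or per hour within working hours.

```python
def build_dim_time(step_minutes: int = 1,
                   start_hour: int = 0,
                   end_hour: int = 24) -> list[dict]:
    """Return one row per step between start_hour (inclusive) and end_hour (exclusive)."""
    rows = []
    for total in range(start_hour * 60, end_hour * 60, step_minutes):
        hour, minute = divmod(total, 60)
        rows.append({
            "time_id": hour * 100 + minute,  # integer primary key in the form hhmm
            "hour": hour,
            "minute": minute,
        })
    return rows


every_minute = build_dim_time()            # 00:00 ... 23:59
half_hours = build_dim_time(30)            # 00:00, 00:30, ... 23:30
working_hours = build_dim_time(60, 9, 18)  # 09:00 ... 17:00, one row per hour
print(len(every_minute), len(half_hours), len(working_hours))
```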
  • 40. ● You typically have multiple star schemas linked together (Constellation Schema) ● Most dimensions should be shared across multiple stars (Conformed Dimensions) ● Two stars might represent the same data with different granularity, so some facts are present in multiple tables ● Make sure that facts are named consistently across stars (Conformed Facts) Constellation schemas
  • 41. But DWH is a complex matter…
  • 42. We left out many topics, for example… ● How to represent invoice or bill of lading dimensions (1 invoice contains multiple items) ● How to represent dimensions that change over time ● Role playing dimensions and other dimension types ● Data marts, data lakes ● DWH to feed Machine Learning ● …and more Interested? Contact us for a training! What we left out