This document defines key concepts in data warehousing including dimensions, facts, star schemas, and snowflake schemas. Dimensions categorize data and provide structure through labeling and tagging. Facts contain measures and aggregates. A star schema consists of dimension tables modeled around a fact table, optimized for querying large datasets. A snowflake schema is similar to a star schema but with normalized dimension tables to separate null data for faster lookups. Conformed dimensions have a common structure that can be applied across multiple fact tables.
Data Warehousing – Dimensions | Star and Snowflake Schemas
Eric Matthews - DataWithUs
2.
Defining Some Key Terms
Dimension
• Data Element
• Categorizes each item in a data set
• Provides Structured Labeling/Tagging
• Dimensions can consist of hierarchies. For example: Date |
Month, Quarter, Year
• Dimension tables contain the keys (typically primary keys)
that fact tables reference through foreign keys in order to join.
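The Date hierarchy above can be illustrated with a minimal sketch that derives the hierarchy levels for one calendar date. The function name `date_dimension_row` and the attribute names are assumptions made for illustration; they do not come from the original deck.

```python
from datetime import date

def date_dimension_row(d: date) -> dict:
    """Derive hierarchy attributes for one calendar date (hypothetical layout)."""
    return {
        "date": d.isoformat(),
        "week": d.isocalendar()[1],          # ISO week number
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,   # months 1-3 -> Q1, 4-6 -> Q2, ...
        "year": d.year,
    }

row = date_dimension_row(date(2024, 5, 17))
```

Each level of the hierarchy becomes a column, so a query can group by month, quarter, or year without recomputing them from the raw date.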
Dimension – Primary Role
• Data Filtering
• Data Grouping
• Data Labeling
Fact
• A measured, counted, or aggregated event. For example:
Sales, Admissions, Blood Pressure, and Inventory can all be
construed as “facts”
• Fact Tables contain appropriate joining keys
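The three roles of a dimension (filtering, grouping, labeling) can be sketched with a small in-memory SQLite example. The table names `dim_region` and `fact_sales`, their columns, and the sample rows are all hypothetical and chosen only to illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_region (
        region_key INTEGER PRIMARY KEY, region_name TEXT, country TEXT);
    CREATE TABLE fact_sales (
        region_key INTEGER REFERENCES dim_region(region_key),
        sale_amount REAL);
    INSERT INTO dim_region VALUES (1, 'North', 'US'), (2, 'South', 'US'), (3, 'Ontario', 'CA');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 75.0), (3, 20.0);
""")

rows = conn.execute("""
    SELECT d.region_name, SUM(f.sale_amount)   -- labeling + aggregated measure
    FROM fact_sales f
    JOIN dim_region d ON d.region_key = f.region_key
    WHERE d.country = 'US'                     -- filtering
    GROUP BY d.region_name                     -- grouping
    ORDER BY d.region_name
""").fetchall()
```

The fact table carries only the key and the measure; everything descriptive comes from the dimension table through the join.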
3.
Defining Some Key Terms (continued)
Conformed Dimension
• Common set of data structures/attributes
• Can cut across many facts, but…
• The row headers in an answer must be able to exactly
match, or…
• Can be an exact subset
These definitions will come into brighter light as we look at some
examples.
4.
Star Schema
• Simplest form of dimensional modeling
• Consists of dimension table(s) modeled around a fact table
• Optimized for querying large data sets
Star Schema – Talking Points for Next Diagram
Note: Have original table schema as point of reference.
• Discuss aggregation from source table to fact table rolling
up totals (How this needed to be done).
• Discuss the notion of rolling up fact tables to create other
fact tables (use account type, financial class, and service
code columns in the fact table for basis of discussion)
• Discuss some of the pitfalls of dimension tables by using
the physician dimension as an example (example:
Physicians can change jobs)
• Discuss the Date Dimension from the perspective of the
data in the table… which transitions us to a key point…
…which is similar to how one needs to resolve foreign keys in
reporting: the dimension table is a table form of the same
concept.
Additionally, if one has well-defined master data, then populating
the dimension tables can be done using a columnar subset of the
source master data table.
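Populating a dimension table from a columnar subset of master data might look like the following sketch. The `physician_master` table, its columns, and the sample rows are hypothetical; the point is only that the dimension keeps the columns reporting needs and leaves the rest behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical master data table with more columns than reporting needs
    CREATE TABLE physician_master (
        physician_id TEXT PRIMARY KEY, physician_name TEXT,
        affiliation TEXT, phone TEXT, license_num TEXT);
    INSERT INTO physician_master VALUES
        ('MD01', 'Dr. A', 'General Hospital', '555-0100', 'L-1'),
        ('MD02', 'Dr. B', 'City Clinic', '555-0101', 'L-2');

    -- Dimension table built from a columnar subset of the master data
    CREATE TABLE dim_physician AS
        SELECT physician_id, physician_name, affiliation
        FROM physician_master;
""")

# Column names of the new dimension table (index 1 of each PRAGMA row is the name)
dim_columns = [c[1] for c in conn.execute("PRAGMA table_info(dim_physician)")]
dim_rows = conn.execute(
    "SELECT * FROM dim_physician ORDER BY physician_id").fetchall()
```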
7.
Fact Table: AcctFin Rollup
• ACCT_NUM, ACCT_PTPTR, ACCT_GUARANTOR_ID, ACCT_REFERRING_MD,
ACCT_START_DATE, ACCT_END_DATE, PLAN_SEQ1, ACCT_TYPE, FC,
HOSPITAL_SERVICE_CODE, TOT_TOTAL_CHARGES, TOT_TOTAL_PAYMENTS,
TOT_TOTAL_ADJUSTMENTS, TOT_BALANCE
Date Dimension Table
• WEEK, YEAR, QUARTER, MONTH
Dimension Table: Patient
• ACCT_PTPTR, PATIENT_NAME, CITY, STATE, ZIP
Dimension Table: Insurance Plan/Carrier
• PLAN_SEQ1, PLAN_NAME, CARRIER, CITY, STATE, ZIP
Dimension Table: Referring Physician
• ACCT_REFERRING_MD, PHYSICIAN_NAME, AFFILIATION,
AFFILIATION_CITY, AFFILIATION_STATE, AFFILIATION_ZIP
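The two roll-ups named in the talking points (source table to fact table, then fact table to a coarser fact table) can be sketched as follows. The source table `acct_transactions`, its `CHARGE` and `PAYMENT` columns, the table names `fact_acctfin` and `fact_by_acct_type`, and the sample rows are assumptions for illustration; only `ACCT_NUM`, `ACCT_TYPE`, `TOT_TOTAL_CHARGES`, and `TOT_TOTAL_PAYMENTS` come from the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table: one row per posted transaction
    CREATE TABLE acct_transactions (
        ACCT_NUM TEXT, ACCT_TYPE TEXT, CHARGE REAL, PAYMENT REAL);
    INSERT INTO acct_transactions VALUES
        ('A1', 'INPATIENT', 500.0, 0.0),
        ('A1', 'INPATIENT', 250.0, 300.0),
        ('A2', 'OUTPATIENT', 120.0, 120.0),
        ('A3', 'INPATIENT', 80.0, 0.0);

    -- Roll transactions up to one fact row per account
    CREATE TABLE fact_acctfin AS
        SELECT ACCT_NUM, ACCT_TYPE,
               SUM(CHARGE)  AS TOT_TOTAL_CHARGES,
               SUM(PAYMENT) AS TOT_TOTAL_PAYMENTS
        FROM acct_transactions
        GROUP BY ACCT_NUM, ACCT_TYPE;

    -- Roll the fact table up again into a coarser fact table by account type
    CREATE TABLE fact_by_acct_type AS
        SELECT ACCT_TYPE,
               SUM(TOT_TOTAL_CHARGES)  AS TOT_TOTAL_CHARGES,
               SUM(TOT_TOTAL_PAYMENTS) AS TOT_TOTAL_PAYMENTS
        FROM fact_acctfin
        GROUP BY ACCT_TYPE;
""")

per_account = conn.execute(
    "SELECT * FROM fact_acctfin ORDER BY ACCT_NUM").fetchall()
per_type = conn.execute(
    "SELECT * FROM fact_by_acct_type ORDER BY ACCT_TYPE").fetchall()
```

Because sums are additive, the coarser fact table can be built from the account-level fact table without going back to the source rows.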
8.
Snowflake Schema
• Think Star Schema where the dimension tables are
normalized
• Can be used to segregate rows in dimension tables that
have a high percentage of null data (for faster lookups; some
database engines do not index null values)
9.
Snowflake Schema
Fact Table
• product_key
• Units
• Cost Per Unit
Dimension Table
• product_key
• supplier_key
• Product Info
Dimension Table
• supplier_key
• Supplier Info
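The snowflake layout above can be sketched in SQLite as a fact table joined to a product dimension, which in turn joins to a normalized supplier dimension. The table names `fact_sales`, `dim_product`, and `dim_supplier` and the sample rows are hypothetical; the key columns follow the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_supplier (
        supplier_key INTEGER PRIMARY KEY, supplier_info TEXT);
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        supplier_key INTEGER REFERENCES dim_supplier(supplier_key),
        product_info TEXT);
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        units INTEGER, cost_per_unit REAL);
    INSERT INTO dim_supplier VALUES (10, 'Acme'), (20, 'Globex');
    INSERT INTO dim_product VALUES (1, 10, 'Widget'), (2, 10, 'Gadget'), (3, 20, 'Gizmo');
    INSERT INTO fact_sales VALUES (1, 4, 2.5), (2, 1, 10.0), (3, 2, 3.0);
""")

# Reaching supplier attributes takes two joins: fact -> product -> supplier
rows = conn.execute("""
    SELECT s.supplier_info, SUM(f.units * f.cost_per_unit)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_supplier s ON s.supplier_key = p.supplier_key
    GROUP BY s.supplier_info
    ORDER BY s.supplier_info
""").fetchall()
```

In a star schema the supplier attributes would sit directly on the product dimension, trading the extra join for some repeated data.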
10.
Conformed Dimension
A conformed dimension is a set of data attributes that have been
physically implemented in multiple tables using the same structure. A
conformed dimension can be applied to different fact tables. For
example:
Dimension Table
• Patient Demographics (Gender, Age)
Fact Tables
• Hypertension Studies
• Lab Results
• Diabetes Assessment
Note: The classic example for a conformed dimension is date. I
wanted to offer a different example.
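A conformed patient dimension shared by two of the fact tables named above might be sketched like this. The table names, column names, the helper `avg_by_gender`, and the sample values are hypothetical; the sketch only shows that both facts return the same row headers (gender), so their answers can be lined up side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_patient (
        patient_key INTEGER PRIMARY KEY, gender TEXT, age INTEGER);
    CREATE TABLE fact_hypertension (patient_key INTEGER, systolic INTEGER);
    CREATE TABLE fact_diabetes (patient_key INTEGER, a1c REAL);
    INSERT INTO dim_patient VALUES (1, 'F', 54), (2, 'M', 61), (3, 'F', 47);
    INSERT INTO fact_hypertension VALUES (1, 140), (2, 150), (3, 120);
    INSERT INTO fact_diabetes VALUES (1, 6.0), (2, 7.5), (3, 7.0);
""")

def avg_by_gender(fact_table: str, measure: str) -> dict:
    """Average one measure per gender using the shared patient dimension."""
    # Table and column names come from this script only, never from user input.
    sql = f"""SELECT d.gender, AVG(f.{measure})
              FROM {fact_table} f
              JOIN dim_patient d ON d.patient_key = f.patient_key
              GROUP BY d.gender"""
    return dict(conn.execute(sql).fetchall())

bp = avg_by_gender("fact_hypertension", "systolic")
a1c = avg_by_gender("fact_diabetes", "a1c")

# Row headers (gender values) match exactly, so the two answers can be merged
combined = {g: (bp[g], a1c[g]) for g in bp}
```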
11.
Transition to Next Point of Discussion
Star and Snowflake schemas are optimized for
querying large data sets.
They should support:
• OLAP cubes
• Business Intelligence and Analytic Applications
• Ad hoc queries