Definition• “A database model is a specification describing how a database is structured and used” – Wikipedia• “A data model describes how the data entities are related to each other in the real world” - Terry
Data Model Characteristics• Organize/Structure like Data Elements• Define relationships between Data Entities• Highly Cohesive• Loosely Coupled
Data Modeling - Chemistry• I like to think about the similarities between Data Modeling and Chemistry
Data Modeling - Chemistry• Organize items that share the same characteristics• Create standard abstractions to represent characteristics– Solid– Liquid– Gas
Data Modeling - Chemistry• Molecules– Define the relationships between and within the standard abstractions– Those relationships form patterns that can be re-used and describe the behaviour of the data in real life
Data Modeling - Chemistry• Ultimately these abstractions, structures, and patterns allow for the creation of a model that:– Allows for predictability– Maximizes re-use and leverage– Allows for flexibility and adaptability– Describes reality
Two design methods• Relational– “Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.”
Two design methods• Dimensional– “Dimensional modeling always uses the concepts of facts (measures), and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.”
Relational• Relational Analysis– Database design is usually in Third Normal Form– Database is optimized for transaction processing (OLTP)– Normalized tables are optimized for modification rather than retrieval
Normal forms• 1st - Under first normal form, all occurrences of a record type must contain the same number of fields.• 2nd - Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite.• 3rd - Third normal form is violated when a non-key field is a fact about another non-key field. Source: William Kent - 1982
Dimensional• Dimensional Analysis– Star Schema/Snowflake– Database is optimized for analytical processing (OLAP)– Facts and Dimensions optimized for retrieval• Facts – Business events – Transactions• Dimensions – context for Transactions– People– Accounts– Products– Date
Relational• 3 Dimensions• Spatial Model– No historical components except for transactional tables• Relational – Models the one truth of the data– One account ‘11’– One person ‘Terry Bunio’– One transaction of ‘$100.00’ on April 10th
Dimensional• 4 Dimensions• Temporal Model– All tables have a time component• Dimensional – Models the data over time– Multiple versions of Accounts over time– Multiple versions of people over time– One transaction• Transactions are already temporal
Kimball-lytes• Bottom-up - incremental– Operational systems feed the Data Warehouse– Data Warehouse is a corporate dimensional model that Data Marts are sourced from– Data Warehouse is the consolidation of Data Marts– Sometimes the Data Warehouse is generated from Subject area Data Marts
Inmon-ians• Top-down– Corporate Information Factory– Operational systems feed the Data Warehouse– Enterprise Data Warehouse is a corporate relational model that Data Marts are sourced from– Enterprise Data Warehouse is the source of Data Marts
The gist…• Kimball’s approach is easier to implement as you are dealing with separate subject areas, but can be a nightmare to integrate• Inmon’s approach has more upfront effort to avoid these consistency problems, but takes longer to implement.
Fact Tables• Contain the measurements or facts about a business process• Are thin and deep• Usually are:– Business transaction– Business Event• The grain of a Fact table is the level of detail of the data recorded.
Fact Tables• Contains the following elements– Primary Key - Surrogate– Timestamp– Measure or Metrics• Transaction Amounts– Foreign Keys to Dimensions– Degenerate Dimensions• Transaction indicators or Flags
Fact Tables• Types of Measures– Additive - Measures that can be added across any dimensions.• Amounts– Non Additive - Measures that cannot be added across any dimension.• Rates– Semi Additive - Measures that can be added across some dimensions.
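The three measure types can be illustrated with a small sketch. All data below is made up for illustration; the account balances (a common example of a semi-additive measure) and the loan figures are assumptions, not examples from the talk.

```python
# Additive: transaction amounts can be summed across every dimension.
transactions = [
    ("2024-01-01", "A", 20.0),
    ("2024-01-02", "A", 30.0),
    ("2024-01-02", "B", 5.0),
]
total_amount = sum(amount for _, _, amount in transactions)

# Semi-additive: balances sum across accounts on one day, but not across days.
balance_snapshots = [
    ("2024-01-01", "A", 100.0),
    ("2024-01-01", "B", 50.0),
    ("2024-01-02", "A", 120.0),
    ("2024-01-02", "B", 50.0),
]
balance_by_day = {}
for day, _, balance in balance_snapshots:
    balance_by_day[day] = balance_by_day.get(day, 0.0) + balance
# Across time, take the latest day instead of summing the days together.
closing_balance = balance_by_day[max(balance_by_day)]

# Non-additive: rates cannot be summed; recompute them from additive parts.
loans = [(1000.0, 50.0), (3000.0, 90.0)]  # (principal, interest), hypothetical
combined_rate = sum(i for _, i in loans) / sum(p for p, _ in loans)
```

Storing the additive components (interest and principal) on the Fact table and deriving the rate at query time is one way to keep a rate reportable at any level of aggregation.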
Fact Tables• Types of Fact tables– Transactional - A transactional table is the most basic and fundamental. The grain associated with a transactional fact table is usually specified as "one row per line in a transaction".– Periodic snapshots - The periodic snapshot, as the name implies, takes a "picture of the moment", where the moment could be any defined period of time.– Accumulating snapshots - This type of fact table is used to show the activity of a process that has a well-defined beginning and end, e.g., the processing of an order. An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated.
Special Fact Tables• Degenerate Dimensions– Degenerate Dimensions are Dimensions that can typically provide additional context about a Fact• For example, flags that describe a transaction• Degenerate Dimensions can either be a separate Dimension table or be collapsed onto the Fact table– My preference is the latter
Special Fact Tables• If Degenerate Dimensions are not collapsed on a Fact table, they are called Junk Dimensions and remain a Dimension table• Junk Dimensions can also have attributes from different dimensions– Not recommended
Dimension Tables• Unlike fact tables, dimension tables contain descriptive attributes that are typically textual fields• These attributes are designed to serve two critical purposes:– query constraining and/or filtering– query result set labeling. Source: Wikipedia
Dimension Tables• Shallow and Wide• Usually corresponds to entities that the business interacts with– People– Locations– Products– Accounts
Time Dimension• All Dimensional Models need a time component• This is either a:– Separate Time Dimension (recommended)– Time attributes on each Fact Table
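A separate Time Dimension is usually prepopulated by a script. The following is a minimal sketch of one; the function name `build_date_dimension`, the column names, and the `YYYYMMDD` integer key are illustrative assumptions rather than a design from the talk.

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),  # assumed key format
            "full_date": day,
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day_of_week": day.isoweekday(),          # 1 = Monday ... 7 = Sunday
            "is_weekend": day.isoweekday() >= 6,
        })
        day += timedelta(days=1)
    return rows


date_dim = build_date_dimension(date(2024, 1, 1), date(2024, 12, 31))
```

Real Time Dimensions often carry many more attributes (fiscal periods, holidays), which is why copying an existing design is attractive.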
Dimension Tables• Contains the following elements– Primary Key – Surrogate– Business Natural Key• Person ID– Effective and Expiry Dates– Descriptive Attributes• Includes de-normalized reference tables
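The elements listed for Fact and Dimension tables can be sketched as a tiny star schema. This uses Python's built-in `sqlite3` module only to keep the sketch runnable; the table names, column names, and the region values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_person (
    person_key     INTEGER PRIMARY KEY,  -- surrogate primary key
    person_id      TEXT NOT NULL,        -- business natural key
    effective_date TEXT NOT NULL,
    expiry_date    TEXT NOT NULL,
    full_name      TEXT NOT NULL,        -- descriptive attribute
    region_code    TEXT NOT NULL,        -- de-normalized reference table
    region_name    TEXT NOT NULL
);
CREATE TABLE fact_transaction (
    transaction_key INTEGER PRIMARY KEY, -- surrogate primary key
    transaction_ts  TEXT NOT NULL,       -- timestamp
    person_key      INTEGER NOT NULL REFERENCES dim_person(person_key),
    amount          REAL NOT NULL,       -- additive measure
    reversal_flag   TEXT NOT NULL        -- indicator collapsed onto the Fact
);
""")
conn.execute(
    "INSERT INTO dim_person VALUES (1, 'P-100', '2024-01-01', '9999-12-31', "
    "'Terry Bunio', 'R1', 'Region One')"
)
conn.executemany(
    "INSERT INTO fact_transaction VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2024-04-10T09:00:00", 1, 100.0, "N"),
        (2, "2024-04-11T10:30:00", 1, 25.0, "N"),
    ],
)
# Dimensions label and filter; Facts supply the numbers to aggregate.
totals = conn.execute("""
    SELECT d.region_name, SUM(f.amount)
    FROM fact_transaction f
    JOIN dim_person d ON d.person_key = f.person_key
    GROUP BY d.region_name
""").fetchall()
```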
Behavioural Dimensions• A Dimension that is computed based on Facts is termed a behavioural dimension
Junk Dimensions• A Junk Dimension can be a collection of attributes associated to a Fact – discussed earlier• It can also be a common location to store information for convenience– I wouldn’t recommend this
Mini-Dimensions• Splitting a Dimension up based on how frequently a set of attributes changes• Helps to reduce the growth of the Dimension table
Slowly Changing Dimensions• Type 1 – Overwrite the row with the new values and update the effective date– Pre-existing Facts now refer to the updated Dimension– May cause inconsistent reports
Slowly Changing Dimensions• Type 2 – Insert a new Dimension row with the new data and new effective date– Update the expiry date on the prior row• Don’t update old Facts that refer to the old row– Only new Facts will refer to this new Dimension row• Type 2 Slowly Changing Dimension maintains the historical context of the data
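The Type 2 steps can be sketched in a few lines. This is a minimal sketch under stated assumptions: the dimension is an in-memory list of dicts, the open-ended row carries a high expiry date of 9999-12-31, and the prior row's expiry date is set equal to the new row's effective date (a half-open interval). The function name, column names, and the second sales representative number are hypothetical.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed marker for "still current"


def apply_type2_change(dimension, natural_key, new_attributes, change_date):
    """Expire the current row for natural_key and append a new version."""
    for row in dimension:
        if row["natural_key"] == natural_key and row["expiry_date"] == HIGH_DATE:
            row["expiry_date"] = change_date  # update expiry on the prior row
    new_row = {
        "surrogate_key": max((r["surrogate_key"] for r in dimension), default=0) + 1,
        "natural_key": natural_key,
        "effective_date": change_date,
        "expiry_date": HIGH_DATE,
        **new_attributes,
    }
    dimension.append(new_row)  # old Facts keep pointing at the old surrogate key
    return new_row


account_dim = []
apply_type2_change(account_dim, "10123", {"sales_rep": "11092"}, date(2024, 1, 1))
apply_type2_change(account_dim, "10123", {"sales_rep": "20456"}, date(2024, 6, 1))
```

After the second call the dimension holds two rows for account 10123: the first expired on 2024-06-01 and the second open-ended.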
Slowly Changing Dimensions• A type 2 change results in multiple dimension rows for a given natural key
Slowly Changing Dimensions• No longer do I have one row to represent:– Account 10123– Terry Bunio– Sales Representative 11092• This changes the mindset and query syntax to retrieve data
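The change in query syntax comes down to always stating which version of a row is wanted. Below is a sketch of an "as of" lookup, assuming each version carries an effective date (inclusive) and an expiry date (exclusive); the row layout, the helper name `version_as_of`, and the second sales representative number are hypothetical.

```python
from datetime import date

account_dim = [
    {"surrogate_key": 1, "natural_key": "10123", "sales_rep": "11092",
     "effective_date": date(2024, 1, 1), "expiry_date": date(2024, 6, 1)},
    {"surrogate_key": 2, "natural_key": "10123", "sales_rep": "20456",
     "effective_date": date(2024, 6, 1), "expiry_date": date(9999, 12, 31)},
]


def version_as_of(dimension, natural_key, as_of):
    """Return the row that was in effect for natural_key on as_of, or None."""
    for row in dimension:
        if (row["natural_key"] == natural_key
                and row["effective_date"] <= as_of < row["expiry_date"]):
            return row
    return None


march_version = version_as_of(account_dim, "10123", date(2024, 3, 15))
june_version = version_as_of(account_dim, "10123", date(2024, 6, 1))
```

A lookup by natural key alone would now return two rows, which is why every query needs either the date range or a current-row filter.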
Slowly Changing Dimensions• Type 3 – The Dimension stores multiple versions for the attribute in question• This usually involves a current and previous value for the attribute• When a change occurs, no rows are added but both the current and previous attributes are updated• Like Type 1, Type 3 does not retain full historical context
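A Type 3 change can be sketched as shifting the current value into a "previous" column on the same row. The `previous_` column prefix, the function name, and the sample names are assumptions for illustration.

```python
def apply_type3_change(row, attribute, new_value):
    """Move the current value into previous_<attribute>, then overwrite it."""
    row["previous_" + attribute] = row[attribute]
    row[attribute] = new_value
    return row


person = {"person_id": "P-100", "last_name": "Smith", "previous_last_name": None}
apply_type3_change(person, "last_name", "Jones")
apply_type3_change(person, "last_name", "Lee")
```

After the second change the row holds "Lee" and "Jones"; "Smith" is gone, which shows why Type 3 does not retain full history.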
Slowly Changing Dimensions• You can also create hybrid versions of Type 1, Type 2, and Type 3 based on your business requirements
Type 1/Type 2 Hybrid• Most common hybrid• Used when you need history AND the current name for some types of statutory reporting
Frozen Attributes• Sometimes it is required to freeze some attributes so that they are not Type 1, Type 2, or Type 3• Usually for audit or regulatory requirements
Recall - Kimball-lytes• Bottom-up - incremental– Operational systems feed the Data Warehouse– Data Warehouse is a corporate dimensional model that Data Marts are sourced from– Data Warehouse is the consolidation of Data Marts– Sometimes the Data Warehouse is generated from Subject area Data Marts
The problem• Kimball’s approach can lead to Dimensions that are not conforming• This is due to the fact that separate departments define what a client or product is– Sometimes their definitions do not agree
Conforming Dimension• A Dimension is said to be conforming if:– A conformed dimension is a set of data attributes that have been physically referenced in multiple database tables using the same key value to refer to the same structure, attributes, domain values, definitions and concepts. A conformed dimension cuts across many facts.• Dimensions are conformed when they are either exactly the same (including keys) or one is a perfect subset of the other.
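The "exactly the same or a perfect subset" rule can be turned into a simple check. This is only a sketch of one possible interpretation: each dimension is modelled as a dict from surrogate key to attribute values, and "subset" means every row and attribute of the smaller dimension appears unchanged in the larger one. The function names and product data are hypothetical.

```python
def is_subset_dimension(smaller, larger):
    """True if every row and attribute value of smaller also exists in larger."""
    return all(
        key in larger
        and all(name in larger[key] and larger[key][name] == value
                for name, value in attributes.items())
        for key, attributes in smaller.items()
    )


def is_conformed(dim_a, dim_b):
    """Identical dimensions pass too, since each is a subset of the other."""
    return is_subset_dimension(dim_a, dim_b) or is_subset_dimension(dim_b, dim_a)


warehouse_product = {
    1: {"product_id": "SKU-1", "category": "Tools"},
    2: {"product_id": "SKU-2", "category": "Paint"},
}
sales_mart_product = {1: {"product_id": "SKU-1", "category": "Tools"}}
finance_mart_product = {1: {"product_id": "SKU-1", "category": "Hardware"}}

sales_ok = is_conformed(warehouse_product, sales_mart_product)
finance_ok = is_conformed(warehouse_product, finance_mart_product)
```

Here the finance mart fails because it uses the same key for a product it classifies differently, the kind of disagreement between departments described earlier.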
If you take one thing away• Ensure that your Dimensions are conformed
Snowflake vs Star Schema• These extra tables are termed outriggers• They are used to address real world complexities with the data– Excessive row length– Repeating groups of data within the Dimension• I will use outriggers in a limited way for repeating data
Multi-Valued Dimensions• Multi-Valued Dimensions are when a Fact needs to connect more than once to a Dimension– Primary Sales Representative– Secondary Sales Representative
Multi-Valued Dimensions• Two possible solutions– Create copies of the Dimensions for each role– Create a Bridge table to resolve the many-to-many relationship
Bridge Tables• Bridge Tables can be used to resolve any many-to-many relationships• This is frequently required with more complex data areas• These bridge tables need to be considered a Dimension and they need to use the same Slowly Changing Dimension Design as the base Dimension– My Recommendation
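A bridge table for the Primary/Secondary Sales Representative case might look like the sketch below. It uses Python's built-in `sqlite3` module; the group-key design, table names, and data are assumptions, and the effective and expiry dates that a Slowly Changing bridge would need are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_sales_rep (
    rep_key  INTEGER PRIMARY KEY,
    rep_name TEXT NOT NULL
);
CREATE TABLE bridge_rep_group (              -- resolves the many-to-many
    rep_group_key INTEGER NOT NULL,
    rep_key       INTEGER NOT NULL REFERENCES dim_sales_rep(rep_key),
    rep_role      TEXT NOT NULL,
    PRIMARY KEY (rep_group_key, rep_key)
);
CREATE TABLE fact_sale (
    sale_key      INTEGER PRIMARY KEY,
    rep_group_key INTEGER NOT NULL,
    amount        REAL NOT NULL
);
INSERT INTO dim_sales_rep VALUES (1, 'Rep One'), (2, 'Rep Two');
INSERT INTO bridge_rep_group VALUES
    (10, 1, 'Primary'), (10, 2, 'Secondary'), (11, 2, 'Primary');
INSERT INTO fact_sale VALUES (100, 10, 500.0), (101, 11, 200.0);
""")
sales_by_rep = conn.execute("""
    SELECT r.rep_name, b.rep_role, SUM(f.amount)
    FROM fact_sale f
    JOIN bridge_rep_group b ON b.rep_group_key = f.rep_group_key
    JOIN dim_sales_rep r ON r.rep_key = b.rep_key
    GROUP BY r.rep_name, b.rep_role
    ORDER BY r.rep_name, b.rep_role
""").fetchall()
```

Sale 100 appears under both representatives, so totals taken across all reps through the bridge will double count unless the query filters on a role or applies a weighting.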
Multi-Valued Attributes• In some cases, you will need to keep multiple values for an attribute or sets of attributes• Three solutions– Outriggers or Snowflake (1:M)– Bridge Table (M:M)– Repeat attributes on the Dimension• Simplest solution but can be hard to query and causes long record length
Factless Facts• Fact table with no metrics or measures• Used for two purposes:– Records the occurrence of activities. Although no facts are stored explicitly, these events can be counted, producing meaningful process measurements.– Records significant information that is not part of a business activity. Examples of conditions include eligibility of people for programs and the assignment of Sales Representatives to Clients
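Because a factless Fact table holds only keys, the measurement comes from counting rows. A small sketch using the eligibility example; the keys and program names are made up.

```python
from collections import Counter

# Hypothetical factless fact: one row per (date_key, person_key, program_key).
eligibility_fact = [
    (20240101, 1, "PROGRAM-A"),
    (20240101, 2, "PROGRAM-A"),
    (20240101, 2, "PROGRAM-B"),
]

# No measure column exists, so the row count itself is the measurement.
eligible_per_program = Counter(program for _, _, program in eligibility_fact)
```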
Hierarchies and Recursive Hierarchies• We would need a separate session to cover this topic• Solution involves defining Dimension tables to record the Hierarchy with a special solution to address the Slowly Changing Dimension Hierarchy• Any change in the Hierarchy can result in needing to duplicate the Hierarchy downstream
Why?• Why Dimensional Model?• Allows for a concise representation of data for reporting. This is especially important for Self-Service Reporting– We reduced from 300+ tables in our Operational Data Store to 40+ tables in our Data Warehouse– Aligns with real world business concepts
Why?• The most important reason:– Requires detailed understanding of the data– Validates the solution– Uncovers inconsistencies and errors in the Normalized Model• Easy for inconsistencies and errors to hide in 300+ tables• No place to hide when those tables are reduced down
Why?• Ultimately there must be a business requirement for a temporal data model and not just a spatial one.• Although you could go through the exercise to validate your understanding and not implement the Dimensional Data Model
How?• Start with your simplest Dimension and Fact tables and define the Natural Keys for them– e.g. People, Product, Transaction, Time• De-Normalize Reference tables to Dimensions (and possibly Facts, based on how large the Fact tables will be)– I place both codes and descriptions on the Dimension and Fact tables• Look to De-normalize other tables with the same Cardinality into one Dimension– Validate the Natural Keys still define one row
How?• Don’t force entities on the same Dimension– Tempting but you will find it doesn’t represent the data and will cause issues for loading or retrieval– Bridge table or mini-snowflakes are not bad• I don’t like a deep snowflake, but shallow snowflakes can be appropriate• Don’t fall into the Star-Schema/Snowflake Holy War – Let your data define the solution
How?• Iterate, Iterate, Iterate– Your initial solution will be wrong– Create it and start to define the load process and reports– You will learn more by using the data than from months of analysis trying to get the model right• Come to SDEC 13 if you want to hear how our project technically did that– Star Trek Theme
Top 10
1. Copy the design for the Time Dimension from the Web. Lots of good solutions with scripts to prepopulate the dimension
2. Make all your attributes Not-Null. This makes Self-Service Report writing easy
3. Create a single Surrogate Primary Key for Dimensions – This will help to simplify the design and table width– These FKs get created on Fact tables!
Top 10
4. Never reject a record– Create a Dummy Invalid record on each Dimension. Allows you to store a Fact record when the relationship is missing
5. Choose a Type 2 Slowly Changing Dimension as your default
6. Use Effective and Expiry dates on your Dimensions to allow for maximum historical information– If they are Type 2!
Top 10
7. SSIS 2012 has some built-in functionality for processing Slowly Changing Dimensions – Check it out!
8. Add “Current_ind” and “Dummy_ind” attributes to each Dimension to assist in Report writing
9. Iterate, Iterate, Iterate
10. Read this book
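Tips 4 and 8 can be sketched together: a lookup that falls back to the dummy row so that no Fact record is rejected. This is a minimal sketch; the surrogate key 0 for the dummy row, the lowercase column names, and the product data are assumptions.

```python
DUMMY_KEY = 0  # assumed surrogate key of the dummy "invalid" row

product_dim = [
    {"product_key": 0, "product_id": "UNKNOWN", "current_ind": "Y", "dummy_ind": "Y"},
    {"product_key": 1, "product_id": "SKU-1", "current_ind": "N", "dummy_ind": "N"},
    {"product_key": 2, "product_id": "SKU-1", "current_ind": "Y", "dummy_ind": "N"},
]

# Index only the current, real rows by their natural key.
current_key_by_id = {
    row["product_id"]: row["product_key"]
    for row in product_dim
    if row["current_ind"] == "Y" and row["dummy_ind"] == "N"
}


def resolve_product_key(product_id):
    """Return the current surrogate key, or the dummy key when no match exists."""
    return current_key_by_id.get(product_id, DUMMY_KEY)


incoming = [("SKU-1", 10.0), ("SKU-404", 5.0), (None, 2.5)]
fact_rows = [
    {"product_key": resolve_product_key(product_id), "amount": amount}
    for product_id, amount in incoming
]
```

Every incoming record lands on the Fact table with a Not-Null foreign key, and the rows pointing at the dummy key can be reported on and corrected later.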