Dimensional modeling primer


Introduction to Dimne

Published in: Technology

  1. 1. Dimensional Data Modeling – A Primer • Terry Bunio
  2. 2. Dimensional Data Modeling: A Primer
  3. 3. @tbunio • tbunio@protegra.com • bornagainagilist.wordpress.com • www.protegra.com
  4. 4. Agenda • Data Modeling • Relational vs Dimensional • Dimensional concepts – Facts – Dimensions • Complex Concept Introduction • Why and How? • My Top 10 Dimensional Modeling Recommendations
  5. 5. What is Data Modeling?
  6. 6. Definition • “A database model is a specification describing how a database is structured and used” – Wikipedia
  7. 7. Definition • “A database model is a specification describing how a database is structured and used” – Wikipedia • “A data model describes how the data entities are related to each other in the real world” – Terry
  8. 8. Data Model Characteristics • Organize/Structure like Data Elements • Define relationships between Data Entities • Highly Cohesive • Loosely Coupled
  9. 9. Data Modeling - Chemistry • I like to think about the similarities between Data Modeling and Chemistry
  10. 10. Data Modeling - Chemistry • Organize items that share the same characteristics • Create standard abstractions to represent characteristics – Solid – Liquid – Gas
  11. 11. Data Modeling - Chemistry • Molecules – Define the relationships between and within the standard abstractions – Those relationships form patterns that can be re-used and describe the behaviour of the data in real life
  12. 12. Data Modeling - Chemistry • Ultimately this abstraction, structure, and patterns allow for the creation of a model that: – Allows for predictability – Maximizes re-use and leverage – Allows for flexibility and adaptability – Describes reality
  13. 13. Database Design
  14. 14. Two design methods • Relational – “Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.”
  15. 15. Two design methods • Dimensional – “Dimensional modeling always uses the concepts of facts (measures), and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.”
  16. 16. Relational
  17. 17. Relational • Relational Analysis – Database design is usually in Third Normal Form – Database is optimized for transaction processing (OLTP) – Normalized tables are optimized for modification rather than retrieval
  18. 18. Normal forms • 1st - Under first normal form, all occurrences of a record type must contain the same number of fields. • 2nd - Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite. • 3rd - Third normal form is violated when a non-key field is a fact about another non-key field. Source: William Kent - 1982
  19. 19. Dimensional
  20. 20. Dimensional • Dimensional Analysis – Star Schema/Snowflake – Database is optimized for analytical processing (OLAP) – Facts and Dimensions optimized for retrieval • Facts – Business events – Transactions • Dimensions – context for Transactions – People – Accounts – Products – Date
  21. 21. Relational • 3 Dimensions • Spatial Model – No historical components except for transactional tables • Relational – Models the one truth of the data – One account '11' – One person 'Terry Bunio' – One transaction of '$100.00' on April 10th
  22. 22. Dimensional • 4 Dimensions • Temporal Model – All tables have a time component • Dimensional – Models the data over time – Multiple versions of Accounts over time – Multiple versions of people over time – One transaction • Transactions are already temporal
  23. 23. Kimball-lytes • Bottom-up - incremental – Operational systems feed the Data Warehouse – Data Warehouse is a corporate dimensional model that Data Marts are sourced from – Data Warehouse is the consolidation of Data Marts – Sometimes the Data Warehouse is generated from Subject area Data Marts
  24. 24. Inmon-ians • Top-down – Corporate Information Factory – Operational systems feed the Data Warehouse – Enterprise Data Warehouse is a corporate relational model that Data Marts are sourced from – Enterprise Data Warehouse is the source of Data Marts
  25. 25. The gist… • Kimball's approach is easier to implement as you are dealing with separate subject areas, but can be a nightmare to integrate • Inmon's approach has more upfront effort to avoid these consistency problems, but takes longer to implement.
  26. 26. Facts
  27. 27. Fact Tables • Contain the measurements or facts about a business process • Are thin and deep • Usually represent a: – Business transaction – Business Event • The grain of a Fact table is the level of detail at which the data is recorded.
  28. 28. Fact Tables • Contain the following elements – Primary Key - Surrogate – Timestamp – Measures or Metrics • Transaction Amounts – Foreign Keys to Dimensions – Degenerate Dimensions • Transaction indicators or Flags
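The fact table elements listed above can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the table names, column names, and sample rows (`fact_transaction`, `dim_account`, `is_reversal`, the dates and amounts) are hypothetical and do not come from the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_account (
    account_key INTEGER PRIMARY KEY,   -- surrogate key
    account_id  TEXT NOT NULL          -- business natural key
);
CREATE TABLE fact_transaction (
    transaction_key INTEGER PRIMARY KEY,   -- surrogate primary key
    transaction_ts  TEXT NOT NULL,         -- timestamp
    amount          REAL NOT NULL,         -- measure
    account_key     INTEGER NOT NULL
                    REFERENCES dim_account(account_key),  -- FK to a dimension
    is_reversal     INTEGER NOT NULL DEFAULT 0             -- flag kept on the fact row
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO dim_account VALUES (1, '11')")
conn.executemany(
    "INSERT INTO fact_transaction "
    "(transaction_ts, amount, account_key, is_reversal) VALUES (?, ?, ?, ?)",
    [("2013-04-10T09:00:00", 100.00, 1, 0),
     ("2013-04-11T10:30:00", 25.50, 1, 0)],
)

# A typical retrieval: aggregate the measure, label it with a dimension attribute.
total_by_account = conn.execute("""
    SELECT d.account_id, SUM(f.amount)
    FROM fact_transaction f JOIN dim_account d USING (account_key)
    GROUP BY d.account_id
""").fetchall()
print(total_by_account)
```

The fact row carries only keys, a timestamp, a measure, and a flag; everything descriptive lives on the dimension.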
  29. 29. Fact Tables • Types of Measures – Additive - Measures that can be added across all dimensions. • Amounts – Non Additive - Measures that cannot be added across any dimension. • Rates – Semi Additive - Measures that can be added across some dimensions but not others.
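A small sketch of the three measure types above. The rows and column meanings are made up for illustration, and the account balance is an assumed example of a semi-additive measure (the slide itself does not name one).

```python
# Hypothetical daily facts: (day, store, sales_amount, account_balance, margin_rate)
rows = [
    ("2013-04-10", "A", 100.0, 500.0, 0.20),
    ("2013-04-10", "B", 300.0, 700.0, 0.40),
    ("2013-04-11", "A", 200.0, 450.0, 0.30),
    ("2013-04-11", "B", 100.0, 800.0, 0.10),
]

# Additive: sales amounts can be summed across every dimension.
total_sales = sum(r[2] for r in rows)

# Semi-additive: balances can be summed across stores for a single day,
# but not across days (that would count the same money twice).
balance_on_11th = sum(r[3] for r in rows if r[0] == "2013-04-11")

# Non-additive: rates cannot be summed; recompute them from additive parts.
total_margin = sum(r[2] * r[4] for r in rows)
overall_margin_rate = total_margin / total_sales

print(total_sales, balance_on_11th, round(overall_margin_rate, 4))
```

Storing the additive components (here, sales and margin amounts) and deriving the rate at query time is one common way to keep non-additive measures reportable at any grain.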
  30. 30. Fact Tables • Types of Fact tables – Transactional - A transactional table is the most basic and fundamental. The grain associated with a transactional fact table is usually specified as "one row per line in a transaction". – Periodic snapshots - The periodic snapshot, as the name implies, takes a "picture of the moment", where the moment could be any defined period of time. – Accumulating snapshots - This type of fact table is used to show the activity of a process that has a well-defined beginning and end, e.g., the processing of an order. An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated.
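The accumulating snapshot is the one fact table type above whose rows are updated after insert. A minimal sketch, assuming a hypothetical order process with three milestones (`ordered`, `shipped`, `delivered`) and invented dates:

```python
# One row per order; milestone dates start empty and are filled in over time.
order_fact = {
    1001: {"ordered": "2013-04-10", "shipped": None, "delivered": None},
}

def record_milestone(order_id, milestone, date):
    """Update the existing row for the order instead of inserting a new one."""
    order_fact[order_id][milestone] = date

record_milestone(1001, "shipped", "2013-04-12")
record_milestone(1001, "delivered", "2013-04-15")

print(order_fact[1001])
```

A transactional fact table would instead append one row per event and never revisit it.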
  31. 31. Special Fact Tables • Degenerate Dimensions – Degenerate Dimensions are Dimensions that can typically provide additional context about a Fact • For example, flags that describe a transaction • Degenerate Dimensions can either be a separate Dimension table or be collapsed onto the Fact table – My preference is the latter
  32. 32. Special Fact Tables • If Degenerate Dimensions are not collapsed onto a Fact table, they are called Junk Dimensions and remain a Dimension table • Junk Dimensions can also have attributes from different dimensions – Not recommended
  33. 33. Dimensions
  34. 34. Dimension Tables • Unlike fact tables, dimension tables contain descriptive attributes that are typically textual fields • These attributes are designed to serve two critical purposes: – query constraining and/or filtering – query result set labeling. Source: Wikipedia
  35. 35. Dimension Tables • Shallow and Wide • Usually correspond to entities that the business interacts with – People – Locations – Products – Accounts
  36. 36. Time Dimension
  37. 37. Time Dimension • All Dimensional Models need a time component • This is either a: – Separate Time Dimension (recommended) – Time attributes on each Fact Table
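A separate Time Dimension is usually prepopulated by a script. The sketch below generates one row per calendar day; the attribute names (`date_key`, `quarter`, `is_weekend`, and so on) and the choice of a `YYYYMMDD` integer key are assumptions for illustration, not the presenter's design.

```python
from datetime import date, timedelta

def build_date_dimension(start, end):
    """Return one dimension row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),  # e.g. 20130410
            "full_date": day.isoformat(),
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day_of_week": day.isoweekday(),          # 1 = Monday ... 7 = Sunday
            "is_weekend": day.weekday() >= 5,
        })
        day += timedelta(days=1)
    return rows

date_dim = build_date_dimension(date(2013, 1, 1), date(2013, 12, 31))
print(len(date_dim), date_dim[0]["date_key"], date_dim[-1]["quarter"])
```

Fact rows then store only `date_key`, and reports filter or group on the precomputed calendar attributes.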
  38. 38. Dimension Tables • Contain the following elements – Primary Key – Surrogate – Business Natural Key • Person ID – Effective and Expiry Dates – Descriptive Attributes • Includes de-normalized reference tables
  39. 39. Behavioural Dimensions • A Dimension that is computed based on Facts is termed a behavioural dimension
  40. 40. Junk Dimensions • A Junk Dimension can be a collection of attributes associated to a Fact – discussed earlier • It can also be a common location to store information for convenience – I wouldn't recommend this
  41. 41. Mini-Dimensions
  42. 42. Mini-Dimensions • Splitting a Dimension up because a set of its attributes changes frequently • Helps to reduce the growth of the Dimension table
  43. 43. Slowly Changing Dimensions • Type 1 – Overwrite the row with the new values and update the effective date – Pre-existing Facts now refer to the updated Dimension – May cause inconsistent reports
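A Type 1 change can be sketched as an in-place overwrite. The dimension layout, the `region` attribute, and the sample values are hypothetical.

```python
dim_person = [
    {"person_key": 1, "person_id": "P100", "name": "Sample Person",
     "region": "West", "effective_date": "2012-01-01"},
]

def apply_type1(dimension, person_id, changes, effective_date):
    """Overwrite the existing row in place; the old values are lost."""
    for row in dimension:
        if row["person_id"] == person_id:
            row.update(changes)
            row["effective_date"] = effective_date
            return row
    raise KeyError(person_id)

apply_type1(dim_person, "P100", {"region": "East"}, "2013-04-10")
print(dim_person)
```

Because the surrogate key does not change, every fact recorded while the region was "West" now reports under "East", which is the source of the inconsistent reports mentioned above.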
  44. 44. Slowly Changing Dimensions • Type 2 – Insert a new Dimension row with the new data and new effective date – Update the expiry date on the prior row • Don't update old Facts that refer to the old row – Only new Facts will refer to this new Dimension row • Type 2 Slowly Changing Dimension maintains the historical context of the data
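The Type 2 steps above can be sketched as follows. The `status` attribute, the sample dates, and the use of `9999-12-31` as an open-ended expiry date are assumptions made for the example.

```python
HIGH_DATE = "9999-12-31"  # assumed convention for "not yet expired"

dim_account = [
    {"account_key": 1, "account_id": "10123", "status": "Open",
     "effective_date": "2012-01-01", "expiry_date": HIGH_DATE},
]

def apply_type2(dimension, account_id, changes, change_date):
    """Expire the current row and append a new version with a new surrogate key."""
    current = next(r for r in dimension
                   if r["account_id"] == account_id
                   and r["expiry_date"] == HIGH_DATE)
    current["expiry_date"] = change_date          # close the prior version
    new_row = {**current, **changes,
               "account_key": max(r["account_key"] for r in dimension) + 1,
               "effective_date": change_date,
               "expiry_date": HIGH_DATE}
    dimension.append(new_row)
    return new_row

fact_rows = [{"account_key": 1, "amount": 100.0}]  # old fact keeps pointing at key 1
new_version = apply_type2(dim_account, "10123", {"status": "Frozen"}, "2013-04-10")
fact_rows.append({"account_key": new_version["account_key"], "amount": 50.0})

print(dim_account)
print(fact_rows)
```

The old fact still joins to the "Open" version of the account, so reports over historical periods keep the context that was true at the time.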
  45. 45. Slowly Changing Dimensions • A type 2 change results in multiple dimension rows for a given natural key
  46. 46. Slowly Changing Dimensions • No longer do I have one row to represent: – Account 10123 – Terry Bunio – Sales Representative 11092 • This changes the mindset and query syntax to retrieve data
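One way the query changes: a lookup by natural key must now also say which point in time is meant. A minimal sketch, assuming half-open `[effective_date, expiry_date)` ranges stored as ISO date strings; the rows and the `status` attribute are invented.

```python
dim_account = [
    {"account_key": 1, "account_id": "10123", "status": "Open",
     "effective_date": "2012-01-01", "expiry_date": "2013-04-10"},
    {"account_key": 2, "account_id": "10123", "status": "Frozen",
     "effective_date": "2013-04-10", "expiry_date": "9999-12-31"},
]

def version_as_of(dimension, account_id, as_of):
    """Return the row whose [effective_date, expiry_date) range contains as_of."""
    for row in dimension:
        if (row["account_id"] == account_id
                and row["effective_date"] <= as_of < row["expiry_date"]):
            return row
    return None

current = version_as_of(dim_account, "10123", "2013-06-01")
historic = version_as_of(dim_account, "10123", "2012-06-01")
print(current["status"], historic["status"])
```

ISO-formatted date strings compare correctly as plain text, which keeps the sketch free of date parsing.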
  47. 47. Slowly Changing Dimensions • Type 3 – The Dimension stores multiple versions for the attribute in question • This usually involves a current and previous value for the attribute • When a change occurs, no rows are added but both the current and previous attributes are updated • Like Type 1, Type 3 does not retain full historical context
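A Type 3 change keeps the row count fixed and shifts values between columns. The column names (`current_region`, `previous_region`) and values are hypothetical.

```python
dim_rep = {"rep_key": 1, "rep_id": "11092",
           "current_region": "West", "previous_region": None}

def apply_type3(row, new_region):
    """Move the current value into the 'previous' column, then overwrite it."""
    row["previous_region"] = row["current_region"]
    row["current_region"] = new_region

apply_type3(dim_rep, "East")
apply_type3(dim_rep, "North")  # "West" is now lost: only one prior value is kept

print(dim_rep)
```

After the second change only "North" and "East" survive, which is why Type 3 does not retain full history.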
  48. 48. Slowly Changing Dimensions • You can also create hybrid versions of Type 1, Type 2, and Type 3 based on your business requirements
  49. 49. Type 1/Type 2 Hybrid • Most common hybrid • Used when you need history AND the current name for some types of statutory reporting
  50. 50. Frozen Attributes • Sometimes it is required to freeze some attributes so that they are not Type 1, Type 2, or Type 3 • Usually for audit or regulatory requirements
  51. 51. Conformity
  52. 52. Recall - Kimball-lytes • Bottom-up - incremental – Operational systems feed the Data Warehouse – Data Warehouse is a corporate dimensional model that Data Marts are sourced from – Data Warehouse is the consolidation of Data Marts – Sometimes the Data Warehouse is generated from Subject area Data Marts
  53. 53. The problem • Kimball's approach can lead to Dimensions that are not conforming • This is due to the fact that separate departments define what a client or product is – Sometimes their definitions do not agree
  54. 54. Conforming Dimension • A Dimension is said to be conforming if: – A conformed dimension is a set of data attributes that have been physically referenced in multiple database tables using the same key value to refer to the same structure, attributes, domain values, definitions and concepts. A conformed dimension cuts across many facts. • Dimensions are conformed when they are either exactly the same (including keys) or one is a perfect subset of the other.
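The "exactly the same or a perfect subset" rule above can be checked mechanically. This sketch is a simplification: it only compares whole rows (key plus attribute values) between two copies of a dimension, and the product data is invented.

```python
def is_conformed(dim_a, dim_b):
    """True if the dimensions are identical or one is a row-level subset of the other.

    Each dimension is a dict mapping surrogate key -> tuple of attribute values.
    """
    a, b = set(dim_a.items()), set(dim_b.items())
    return a <= b or b <= a

warehouse_product = {1: ("P-1", "Widget"), 2: ("P-2", "Gadget"), 3: ("P-3", "Gizmo")}
sales_mart_product = {1: ("P-1", "Widget"), 2: ("P-2", "Gadget")}
finance_mart_product = {1: ("P-1", "Widget"), 2: ("P-2", "Gadget (discontinued)")}

print(is_conformed(warehouse_product, sales_mart_product))    # subset: conformed
print(is_conformed(warehouse_product, finance_mart_product))  # definitions disagree
```

The finance copy fails because key 2 describes the product differently, which is the kind of departmental disagreement described on the previous slide.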
  55. 55. If you take one thing away • Ensure that your Dimensions are conformed
  56. 56. Complexity
  57. 57. Complexity • Most textbooks stop here and show only the simplest Dimensional Models • Unfortunately, I've never run into a Dimensional Model like that
  58. 58. Simple
  59. 59. More Complex
  60. 60. Real World
  61. 61. Complex Concept Introduction • Snowflake vs Star Schema • Multi-Valued Dimensions and Bridges • Multi-Valued Attributes • Factless Facts • Recursive Hierarchies
  62. 62. Snowflake vs Star Schema
  63. 63. Snowflake vs Star Schema
  64. 64. Snowflake vs Star Schema • These extra tables are termed outriggers • They are used to address real world complexities with the data – Excessive row length – Repeating groups of data within the Dimension • I will use outriggers in a limited way for repeating data
  65. 65. Multi-Valued Dimensions • Multi-Valued Dimensions are when a Fact needs to connect more than once to a Dimension – Primary Sales Representative – Secondary Sales Representative
  66. 66. Multi-Valued Dimensions • Two possible solutions – Create copies of the Dimensions for each role – Create a Bridge table to resolve the many-to-many relationship
  67. 67. Multi-Valued Dimensions
  68. 68. Bridge Tables
  69. 69. Bridge Tables • Bridge Tables can be used to resolve any many-to-many relationships • This is frequently required with more complex data areas • These bridge tables need to be considered a Dimension and they need to use the same Slowly Changing Dimension Design as the base Dimension – My Recommendation
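A bridge table sits between the fact and the dimension, with one row per pairing. In the sketch below the group key, the `role` column, and especially the `weight` column (an allocation factor so that amounts are not double counted) are assumptions added for illustration; the slides do not prescribe them.

```python
fact_sales = [
    {"sale_key": 1, "rep_group_key": 10, "amount": 1000.0},
    {"sale_key": 2, "rep_group_key": 20, "amount": 400.0},
]

# Bridge: one row per (group, representative); weights within a group sum to 1.
bridge_rep_group = [
    {"rep_group_key": 10, "rep_key": 1, "role": "Primary", "weight": 0.75},
    {"rep_group_key": 10, "rep_key": 2, "role": "Secondary", "weight": 0.25},
    {"rep_group_key": 20, "rep_key": 2, "role": "Primary", "weight": 1.0},
]

def sales_by_rep(facts, bridge):
    """Allocate each fact amount to its representatives through the bridge."""
    totals = {}
    for fact in facts:
        for link in bridge:
            if link["rep_group_key"] == fact["rep_group_key"]:
                totals[link["rep_key"]] = (totals.get(link["rep_key"], 0.0)
                                           + fact["amount"] * link["weight"])
    return totals

rep_totals = sales_by_rep(fact_sales, bridge_rep_group)
print(rep_totals)
```

The per-representative totals still add up to the total of the fact table, which would not hold if each sale were simply joined to every representative without weights.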
  70. 70. Multi-Valued Attributes • In some cases, you will need to keep multiple values for an attribute or sets of attributes • Three solutions – Outriggers or Snowflake (1:M) – Bridge Table (M:M) – Repeat attributes on the Dimension • Simplest solution but can be hard to query and causes long record length
  71. 71. Factless Facts • Fact table with no metrics or measures • Used for two purposes: – Records the occurrence of activities. Although no facts are stored explicitly, these events can be counted, producing meaningful process measurements. – Records significant information that is not part of a business activity. Examples of conditions include eligibility of people for programs and the assignment of Sales Representatives to Clients
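A factless fact table holds only dimension keys, and the "measure" is the row count. A minimal sketch with a hypothetical visit event table:

```python
from collections import Counter

# Hypothetical factless fact rows: (date_key, rep_key, client_key). No measure column.
fact_client_visit = [
    (20130410, 1, 501),
    (20130410, 2, 502),
    (20130411, 1, 502),
    (20130411, 1, 503),
]

# Counting rows per dimension key yields the process measurement.
visits_per_rep = Counter(rep_key for _, rep_key, _ in fact_client_visit)
visits_per_day = Counter(date_key for date_key, _, _ in fact_client_visit)

print(visits_per_rep, visits_per_day)
```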
  72. 72. Hierarchies and Recursive Hierarchies
  73. 73. Hierarchies and Recursive Hierarchies • We would need a separate session to cover this topic • Solution involves defining Dimension tables to record the Hierarchy with a special solution to address the Slowly Changing Dimension Hierarchy • Any change in the Hierarchy can result in needing to duplicate the Hierarchy downstream
  74. 74. Why? • Why Dimensional Model? • Allows for a concise representation of data for reporting. This is especially important for Self-Service Reporting – We reduced from 300+ tables in our Operational Data Store to 40+ tables in our Data Warehouse – Aligns with real world business concepts
  75. 75. Why? • The most important reason: – Requires detailed understanding of the data – Validates the solution – Uncovers inconsistencies and errors in the Normalized Model • Easy for inconsistencies and errors to hide in 300+ tables • No place to hide when those tables are reduced down
  76. 76. Why? • Ultimately there must be a business requirement for a temporal data model and not just a spatial one. • Although you could go through the exercise to validate your understanding and not implement the Dimensional Data Model
  77. 77. How?
  78. 78. How? • Start with your simplest Dimension and Fact tables and define the Natural Keys for them – i.e. People, Product, Transaction, Time • De-Normalize Reference tables to Dimensions (and possibly Facts, based on how large the Fact tables will be) – I place both codes and descriptions on the Dimension and Fact tables • Look to De-normalize other tables with the same Cardinality into one Dimension – Validate the Natural Keys still define one row
  79. 79. How? • Don't force entities on the same Dimension – Tempting but you will find it doesn't represent the data and will cause issues for loading or retrieval – Bridge table or mini-snowflakes are not bad • I don't like a deep snowflake, but shallow snowflakes can be appropriate • Don't fall into the Star-Schema/Snowflake Holy War – Let your data define the solution
  80. 80. How? • Iterate, Iterate, Iterate – Your initial solution will be wrong – Create it and start to define the load process and reports – You will learn more by using the data than months of analysis to try and get the model right • Come to SDEC 13 if you want to hear how our project technically did that – Star Trek Theme
  81. 81. Top 10
  82. 82. Top 10 • 1. Copy the design for the Time Dimension from the Web. Lots of good solutions with scripts to prepopulate the dimension • 2. Make all your attributes Not-Null. This makes Self-Service Report writing easy • 3. Create a single Surrogate Primary Key for Dimensions – This will help to simplify the design and table width – These FKs get created on Fact tables!
  83. 83. Top 10 • 4. Never reject a record – Create a Dummy Invalid record on each Dimension. Allows you to store a Fact record when the relationship is missing • 5. Choose a Type 2 Slowly Changing Dimension as your default • 6. Use Effective and Expiry dates on your Dimensions to allow for maximum historical information – If they are Type 2!
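Recommendation 4 can be sketched as a key lookup with a fallback. The surrogate key value `-1`, the `dummy_ind` attribute values, and the product rows are assumptions for the example.

```python
DUMMY_KEY = -1  # assumed surrogate key for the dummy "Invalid" row

dim_product = {
    DUMMY_KEY: {"product_id": "UNKNOWN", "name": "Invalid", "dummy_ind": "Y"},
    1: {"product_id": "P-1", "name": "Widget", "dummy_ind": "N"},
}
natural_to_surrogate = {row["product_id"]: key for key, row in dim_product.items()}

def load_fact(source_row):
    """Map the natural key to a surrogate key, falling back to the dummy row."""
    key = natural_to_surrogate.get(source_row["product_id"], DUMMY_KEY)
    return {"product_key": key, "amount": source_row["amount"]}

loaded = [load_fact(r) for r in [
    {"product_id": "P-1", "amount": 10.0},
    {"product_id": "P-404", "amount": 5.0},   # no matching dimension row
]]
print(loaded)
```

Both source rows are loaded, so fact totals stay complete, and the unmatched row can be found later by filtering on the dummy key.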
  84. 84. Top 10 • 7. SSIS 2012 has some built-in functionality for processing Slowly Changing Dimensions – Check it out! • 8. Add “Current_ind” and “Dummy_ind” attributes to each Dimension to assist in Report writing • 9. Iterate, Iterate, Iterate • 10. Read this book
  85. 85. Want More?
  86. 86. Whew! Questions?