@tbunio
tbunio@protegra.com
agilevoyageur.wordpress.com
www.protegra.com
Agenda
• Data Modeling
• The Project
• Hot DB topics
• Relational vs Dimensional
• Dimensional concepts
– Facts
– Dimensio...
What is Data Modeling?
Definition
• “A database model is a specification
describing how a database is
structured and used” – Wikipedia
Definition
• “A database model is a specification
describing how a database is
structured and used” – Wikipedia
• “A data ...
Data Model Characteristics
• Organize/Structure like Data Elements
• Define relationships between Data
Entities
• Highly C...
The Project
• Major Health Service provider is
switching their claims system to SAP
• As part of this, they are totally
re...
The Project
• 3+ years duration
• 100+ Integration Projects
• 200+ Systems Personnel
Integration Project Streams
• Client Administration – Policy Systems
• Data Warehouse
• Legacy – Conversion from Legacy
• ...
Data Warehouse Team
• Terry – Data Architect/Modeler and
PM
• Hanaa – Lead Analyst and Data
Analyst
• Kevin – Lead Data Mi...
Current State
• Sybase Data Warehouse
– Combination of Normalized and
Dimensional design
• Data Migration
– Series of SQL ...
Target State
• SQL Server 2012
• SQL Server Integration Services for
Data Migration
• SQL Server Reporting Services for
Re...
Target Solution
• Initial load moves 2.5 Terabytes of data
• Initial load runs once
• Incremental load runs every hour
Target Solution
• Operational Data Store
– Normalized
– 400+ tables
• Data Warehouse
– Dimensional
– 60+ tables
• Why both...
Our #1 Challenge
• We needed to be Agile like the other
projects!
– We are now on revision 3500
• We spent weeks planning ...
Beef?
• Where are the hot topics like
– Big Data
– NoSQL
– MySQL
– Data Warehouse Appliances
– Cloud
– Open Source Databas...
Big Data
• “Commercial Databases” have come
a long way to handle large data
volume
• Big Data is still important but proba...
Big Data
• For example, many of the Big Data
solutions featured ColumnStore
Indexing
• Now almost all commercial databases...
NoSQL
• NoSQL was heralded a few years ago
as the death of structured databases
• Mainly promoted from the developer
commu...
MySQL
• MySQL was also promoted as a great
lightweight, high performance option’
• We actually investigated it as an
optio...
MySQL
• All of the great MySQL benchmarks use
the simplest database engine with no
ACID compliance
– MySQL has the option ...
Data Warehouse Appliances
• “marketing term for an integrated set
of servers, storage, operating
system(s), DBMS and softw...
Data Warehouse Appliances
• Recently in the Data Warehouse
Industry, there has been the rise of a
the Data Warehouse appli...
Data Warehouse Appliances
• Cool Names like:
– Teradata
– GreenPlum
– Netezza
– InfoSphere
– EMC
• Like Big Data these sol...
Cloud
• Great to store pictures and music – the
concept still makes businesses nervous
– Also regulatory requirements some...
Open Source Databases
• We investigated Open Sources
databases for our solution. We looked
at:
– MySQL
– PostgreSQL
– othe...
Open Sources Databases
• We were surprised to learn that once
you factor in all the things you get from
SQL Server, it act...
Foundational DB Practices
Two design methods
• Relational
– “Database normalization is the process of organizing
the fields and tables of a relation...
Two design methods
• Dimensional
– “Dimensional modeling always uses the concepts of facts
(measures), and dimensions (con...
Relational
Relational
• Relational Analysis
– Database design is usually in Third Normal
Form
– Database is optimized for transaction...
Normal forms
• 1st - Under first normal form, all occurrences of a
record type must contain the same number of
fields.
• 2...
Dimensional
Dimensional
• Dimensional Analysis
– Star Schema/Snowflake
– Database is optimized for analytical
processing. (OLAP)
– Fac...
Relational
• 3 Dimensions
• Spatial Model
– No historical components except for
transactional tables
• Relational – Models...
Dimensional
• 4 Dimensions
• Temporal Model
– All tables have a time component
• Dimensional – Models the one truth of
the...
Fact Tables
• Contains the measurements or facts
about a business process
• Are thin and deep
• Usually is:
– Business tra...
Special Fact Tables
• Degenerate Dimensions
– Degenerate Dimensions are Dimensions
that can typically provide additional
c...
Dimension Tables
• Unlike fact tables, dimension tables
contain descriptive attributes that are
typically textual fields
•...
Dimension Tables
• Shallow and Wide
• Usually corresponds to entities that the
business interacts with
– People
– Location...
Time Dimension
• All Dimensional Models need a time
component
• This is either a:
– Separate Time Dimension
(recommended)
...
Mini-Dimensions
Mini-Dimensions
• Splitting a Dimension up due to the
activity of change for a set of
attributes
• Helps to reduce the gro...
Slowly Changing Dimensions
• Type 1 – Overwrite the row with the
new values and update the effective
date
– Pre-existing F...
Slowly Changing Dimensions
• Type 2 – Insert a new Dimension row with
the new data and new effective date
– Update the exp...
Slowly Changing Dimensions
• No longer to I have one row to
represent:
– Account 10123
– Terry Bunio
– Sales Representativ...
Slowly Changing Dimensions
• Type 3 – The Dimension stores multiple
versions for the attribute in question
• This usually ...
Complexity
• Most textbooks stop here only show
the simplest Dimensional Models
• Unfortunately, I’ve never run into a
Dim...
Complex Concept Introduction
• Snowflake vs Star Schema
• Multi-Valued Dimensions and Bridges
• Recursive Hierarchies
Snowflake vs Star Schema
Snowflake vs Star Schema
Dimensional
Snowflake vs Star Schema
• These extra tables are termed
outriggers
• They are used to address real world
complexities wit...
Multi-Valued Dimensions
• Multi-Valued Dimensions are when a
Fact needs to connect more than
once to a Dimension
– Primary...
Multi-Valued Dimensions
• Two possible solutions
– Create copies of the Dimensions for each
role
– Create a Bridge table t...
Multi-Valued Dimensions
Bridge Tables
Bridge Tables
• Bridge Tables can be used to resolve any
many to many relationships
• This is frequently required with mor...
Hierarchies and Recursive
Hierarchies
Why?
• Why Dimensional Model?
• Allows for a concise representation of
data for reporting. This is especially
important fo...
Why?
• The most important reason –
– Requires detailed understanding of the
data
– Validates the solution
– Uncovers incon...
Why?
• Ultimately there must be a business
requirement for a temporal data
model and not just a spatial one.
• Although yo...
How?
• Start with your simplest Dimension and Fact
tables and define the Natural Keys for them
– i.e. People, Product, Tra...
How?
• Don’t force entities on the same
Dimension
– Tempting but you will find it doesn’t
represent the data and will caus...
How?
• Iterate, Iterate, Iterate
– Your initial solution will be wrong
– Create it and start to define the load
process an...
Two things to Ponder
Two things to Ponder
• In the Information Age ahead,
databases will be used more for
analysis than operational
– More Dime...
Two things to Ponder
• Critical skills going forward will be:
– Data Modeling/Data Architecture
– Data Migration
• There i...
Whew! Questions?
Asper database presentation - Data Modeling Topics
Upcoming SlideShare
Loading in...5
×

Asper database presentation - Data Modeling Topics

226

Published on

Published in: Data & Analytics
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
226
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
13
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "Asper database presentation - Data Modeling Topics"

  1. 1. @tbunio tbunio@protegra.com agilevoyageur.wordpress.com www.protegra.com
  2. 2. Agenda • Data Modeling • The Project • Hot DB topics • Relational vs Dimensional • Dimensional concepts – Facts – Dimensions • Complex Concept Introduction
  3. 3. What is Data Modeling?
  4. 4. Definition • “A database model is a specification describing how a database is structured and used” – Wikipedia
  5. 5. Definition • “A database model is a specification describing how a database is structured and used” – Wikipedia • “A data model describes how the data entities are related to each other in the real world” - Terry
  6. 6. Data Model Characteristics • Organize/Structure like Data Elements • Define relationships between Data Entities • Highly Cohesive • Loosely Coupled
  7. 7. The Project • Major Health Service provider is switching their claims system to SAP • As part of this, they are totally redeveloping their entire Data Warehouse solution
  8. 8. The Project • 3+ years duration • 100+ Integration Projects • 200+ Systems Personnel
  9. 9. Integration Project Streams • Client Administration – Policy Systems • Data Warehouse • Legacy – Conversion from Legacy • Queries – Queries for internal and external • Web – External Web Applications
  10. 10. Data Warehouse Team • Terry – Data Architect/Modeler and PM • Hanaa – Lead Analyst and Data Analyst • Kevin – Lead Data Migration Developer • Lisa – Lead Report Analyst • Les – Lead Report Developer
  11. 11. Current State • Sybase Data Warehouse – Combination of Normalized and Dimensional design • Data Migration – Series of SQL Scripts that move data from Legacy (Cobol) and Java Applications • Impromptu – 1000+ Reports
  12. 12. Target State • SQL Server 2012 • SQL Server Integration Services for Data Migration • SQL Server Reporting Services for Report Development • Sharepoint for Report Portal
  13. 13. Target Solution • Initial load moves 2.5 Terabytes of data • Initial load runs once • Incremental load runs every hour
  14. 14. Target Solution • Operational Data Store – Normalized – 400+ tables • Data Warehouse – Dimensional – 60+ tables • Why both? – ODS does not have history (Just Transactions)
  15. 15. Our #1 Challenge • We needed to be Agile like the other projects! – We are now on revision 3500 • We spent weeks planning of how to be flexible • Instead of spending time planning, we spent time planning how we could quickly change and adapt • This also meant we created a new automated test framework
  16. 16. Beef? • Where are the hot topics like – Big Data – NoSQL – MySQL – Data Warehouse Appliances – Cloud – Open Source Databases
  17. 17. Big Data • “Commercial Databases” have come a long way to handle large data volume • Big Data is still important but probably is not required for the vast majority of databases – But it applicable for the Facebooks and Amazons out there
  18. 18. Big Data • For example, many of the Big Data solutions featured ColumnStore Indexing • Now almost all commercial databases offer ColumnStore Indexes
  19. 19. NoSQL • NoSQL was heralded a few years ago as the death of structured databases • Mainly promoted from the developer community • Seems to have found a niche for supporting more mainly unstructured and dynamic data • Traditional databases still the most efficient for structured data
  20. 20. MySQL • MySQL was also promoted as a great lightweight, high performance option’ • We actually investigated it as an option for the project • Great example of never trusting what you hear
  21. 21. MySQL • All of the great MySQL benchmarks use the simplest database engine with no ACID compliance – MySQL has the option to use different engines with different features • Once you use the ACID compliant engine, the performance is equivalent(or worse) to SQL Server and PostgreSQL
  22. 22. Data Warehouse Appliances • “marketing term for an integrated set of servers, storage, operating system(s), DBMS and software specifically pre-installed and pre- optimized for data warehousing”
  23. 23. Data Warehouse Appliances • Recently in the Data Warehouse Industry, there has been the rise of a the Data Warehouse appliances • These appliances are a one-stop solution that builds in Big Data capabilities
  24. 24. Data Warehouse Appliances • Cool Names like: – Teradata – GreenPlum – Netezza – InfoSphere – EMC • Like Big Data these solution are valuable if you need to play in the Big Data/Big Analysis arena • Most solutions don’t require them
  25. 25. Cloud • Great to store pictures and music – the concept still makes businesses nervous – Also regulatory requirements sometime prevent it • Business is starting to become more comfortable – Still a ways to go • Very few business go to the Cloud unless they have to – Amazon/Microsoft is changing this with their services
  26. 26. Open Source Databases • We investigated Open Sources databases for our solution. We looked at: – MySQL – PostgreSQL – others
  27. 27. Open Sources Databases • We were surprised to learn that once you factor in all the things you get from SQL Server, it actually is cheaper over 10 years than Open Source!! • So we select SQL Server
  28. 28. Foundational DB Practices
  29. 29. Two design methods • Relational – “Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.”.”
  30. 30. Two design methods • Dimensional – “Dimensional modeling always uses the concepts of facts (measures), and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts
  31. 31. Relational
  32. 32. Relational • Relational Analysis – Database design is usually in Third Normal Form – Database is optimized for transaction processing. (OLTP) – Normalized tables are optimized for modification rather than retrieval
  33. 33. Normal forms • 1st - Under first normal form, all occurrences of a record type must contain the same number of fields. • 2nd - Second normal form is violated when a non- key field is a fact about a subset of a key. It is only relevant when the key is composite • 3rd - Third normal form is violated when a non-key field is a fact about another non-key field Source: William Kent - 1982
  34. 34. Dimensional
  35. 35. Dimensional • Dimensional Analysis – Star Schema/Snowflake – Database is optimized for analytical processing. (OLAP) – Facts and Dimensions optimized for retrieval • Facts – Business events – Transactions • Dimensions – context for Transactions – People – Accounts – Products – Date
  36. 36. Relational • 3 Dimensions • Spatial Model – No historical components except for transactional tables • Relational – Models the one truth of the data – One account ‘11’ – One person ‘Terry Bunio’ – One transaction of ‘$100.00’ on April 10th
  37. 37. Dimensional • 4 Dimensions • Temporal Model – All tables have a time component • Dimensional – Models the one truth of the data at a point in time – Multiple versions of Accounts over time – Multiple versions of people over time – One transaction • Transactions are already temporal
  38. 38. Fact Tables • Contains the measurements or facts about a business process • Are thin and deep • Usually is: – Business transaction – Business Event • The grain of a Fact table is the level of the data recorded – Order, Invoice, Invoice Item
  39. 39. Special Fact Tables • Degenerate Dimensions – Degenerate Dimensions are Dimensions that can typically provide additional context about a Fact • For example, flags that describe a transaction • Degenerate Dimensions can either be a separate Dimension table or be collapsed onto the Fact table – My preference is the latter
  40. 40. Dimension Tables • Unlike fact tables, dimension tables contain descriptive attributes that are typically textual fields • These attributes are designed to serve two critical purposes: – query constraining and/or filtering – query result set labeling. Source: Wikipedia
  41. 41. Dimension Tables • Shallow and Wide • Usually corresponds to entities that the business interacts with – People – Locations – Products – Accounts
  42. 42. Time Dimension • All Dimensional Models need a time component • This is either a: – Separate Time Dimension (recommended) – Time attributes on each Fact Table
  43. 43. Mini-Dimensions
  44. 44. Mini-Dimensions • Splitting a Dimension up due to the activity of change for a set of attributes • Helps to reduce the growth of the Dimension table
  45. 45. Slowly Changing Dimensions • Type 1 – Overwrite the row with the new values and update the effective date – Pre-existing Facts now refer to the updated Dimension – May cause inconsistent reports
  46. 46. Slowly Changing Dimensions • Type 2 – Insert a new Dimension row with the new data and new effective date – Update the expiry date on the prior row • Don’t update old Facts that refer to the old row – Only new Facts will refer to this new Dimension row • Type 2 Slowly Changing Dimension maintains the historical context of the data
  47. 47. Slowly Changing Dimensions • No longer to I have one row to represent: – Account 10123 – Terry Bunio – Sales Representative 11092 • This changes the mindset and query syntax to retrieve data
  48. 48. Slowly Changing Dimensions • Type 3 – The Dimension stores multiple versions for the attribute in question • This usually involves a current and previous value for the attribute • When a change occurs, no rows are added but both the current and previous attributes are updated • Like Type 1, Type 3 does not retain full historical context
  49. 49. Complexity • Most textbooks stop here only show the simplest Dimensional Models • Unfortunately, I’ve never run into a Dimensional Model like that
  50. 50. Complex Concept Introduction • Snowflake vs Star Schema • Multi-Valued Dimensions and Bridges • Recursive Hierarchies
  51. 51. Snowflake vs Star Schema
  52. 52. Snowflake vs Star Schema
  53. 53. Dimensional
  54. 54. Snowflake vs Star Schema • These extra tables are termed outriggers • They are used to address real world complexities with the data – Excessive row length – Repeating groups of data within the Dimension • I will use outriggers in a limited way for repeating data
  55. 55. Multi-Valued Dimensions • Multi-Valued Dimensions are when a Fact needs to connect more than once to a Dimension – Primary Sales Representative – Secondary Sales Representative
  56. 56. Multi-Valued Dimensions • Two possible solutions – Create copies of the Dimensions for each role – Create a Bridge table to resolve the many to many relationship
  57. 57. Multi-Valued Dimensions
  58. 58. Bridge Tables
  59. 59. Bridge Tables • Bridge Tables can be used to resolve any many to many relationships • This is frequently required with more complex data areas • These bridge tables need to be considered a Dimension and they need to use the same Slowly Changing Dimension Design as the base Dimension – My Recommendation
  60. 60. Hierarchies and Recursive Hierarchies
  61. 61. Why? • Why Dimensional Model? • Allows for a concise representation of data for reporting. This is especially important for Self-Service Reporting – We reduced from 400+ tables in our Operational Data Store to 60+ tables in our Data Warehouse – Aligns with real world business concepts
  62. 62. Why? • The most important reason – – Requires detailed understanding of the data – Validates the solution – Uncovers inconsistencies and errors in the Normalized Model • Easy for inconsistencies and errors to hide in 400+ tables • No place to hide when those tables are reduced down
  63. 63. Why? • Ultimately there must be a business requirement for a temporal data model and not just a spatial one. • Although you could go through the exercise to validate your understanding and not implement the Dimensional Data Model
  64. 64. How? • Start with your simplest Dimension and Fact tables and define the Natural Keys for them – i.e. People, Product, Transaction, Time • De-Normalize Reference tables to Dimensions (And possibly Facts based on how large the Fact tables will be) – I place both codes and descriptions on the Dimension and Fact tables • Look to De-normalize other tables with the same Cardinality into one Dimension – Validate the Natural Keys still define one row
  65. 65. How? • Don’t force entities on the same Dimension – Tempting but you will find it doesn’t represent the data and will cause issues for loading or retrieval – Bridge table or mini-snowflakes are not bad • I don’t like a deep snowflake, but shallow snowflakes can be appropriate • Don’t fall into the Star-Schema/Snowflake Holy War – Let your data define the solution
  66. 66. How? • Iterate, Iterate, Iterate – Your initial solution will be wrong – Create it and start to define the load process and reports – You will learn more by using the data than months of analysis to try and get the model right
  67. 67. Two things to Ponder
  68. 68. Two things to Ponder • In the Information Age ahead, databases will be used more for analysis than operational – More Dimensional Models and analytical processes
  69. 69. Two things to Ponder • Critical skills going forward will be: – Data Modeling/Data Architecture – Data Migration • There is a whole subject area here for a subsequent presentation. More of an art than science – Data Verbalization • Again a real art form to take a huge amount of data and present it in a readable form
  70. 70. Whew! Questions?
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×