Terry Bunio
Data Modeling – Tales from the trenches
Thank you to our Sponsors
@tbunio
tbunio@protegra.com
agilevoyageur.com
www.protegra.com
Who Am I?
• Terry Bunio
• Database Administrator
– Oracle
– SQL Server 6, 6.5, 7, 2000, 2005, 2008, 2012
– Informix
– ADABAS
• Data Modeler/Architect
– Investors Group, LPL Financial, Manitoba
Blue Cross, Assante Financial, CI Funds,
Mackenzie Financial
– Normalized and Dimensional
• Agilist
– Innovation Gamer, Team Member, SQL
Developer, Test writer, Sticky Sticker, Project
Manager, PMO on SAP Implementation
Agenda
• Data Modeling Hubris
– Multi-language reference tables
– “All Claims”
– Recursion
Once upon a time
• Worked on a project for a
client in Luxembourg
• Interesting point
– Luxembourg has four official
languages
• English
• French
• German
• Flemish (I think)
Once upon a time
• Need to create multi-lingual
descriptions for reference table
• Currently only required English
and French
• Convinced team that we would
soft model the language
Once upon a time
• These tables also had
independent surrogate keys for
all reference table values
Once upon a time
• It wasn’t fun
• Queries performed terribly and
were overly complex
• Never used the extra flexibility
and we eventually replaced the
functionality with an English
and French description field
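The replacement the slide describes can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 (the project itself was on other platforms, and every table and column name here is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Soft-modeled language: one description row per (code, language) pair.
CREATE TABLE status_code (status_id INTEGER PRIMARY KEY, code TEXT);
CREATE TABLE status_desc (
    status_id INTEGER REFERENCES status_code(status_id),
    lang      TEXT,
    descr     TEXT,
    PRIMARY KEY (status_id, lang)
);
INSERT INTO status_code VALUES (1, 'A');
INSERT INTO status_desc VALUES (1, 'EN', 'Active'), (1, 'FR', 'Actif');

-- What it was replaced with: one description column per language.
CREATE TABLE status_fixed (code TEXT PRIMARY KEY, descr_en TEXT, descr_fr TEXT);
INSERT INTO status_fixed VALUES ('A', 'Active', 'Actif');
""")

# Soft model: every description read is a join plus a language filter.
soft = cur.execute("""
    SELECT d.descr FROM status_code c
    JOIN status_desc d ON d.status_id = c.status_id AND d.lang = 'FR'
    WHERE c.code = 'A'
""").fetchone()[0]

# Fixed model: a single direct column read.
fixed = cur.execute("SELECT descr_fr FROM status_fixed WHERE code = 'A'").fetchone()[0]

print(soft, fixed)  # Actif Actif
```

The soft model pays a join and a filter on every read for flexibility that was never used; the fixed columns are one read each.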
Once upon a time
• Not my design
• Once saw a database that
actually stored all text fields on
one table
– You joined to the table with the
Primary Key from the description
table
• Some queries joined to the
name table over 10 times.
All Claims
All Claims
• Anyone work with SAP?
• Their tables are not tables as
much as large flat files
• Record type and other
extremely codified fields
• Really hard to make sense of
All Claims
• To make it easier on
developers we created an
All_claims table that would join
all the related data together
and also do some filtering
All Claims
• This became quite the beast of
an object
• Became a focal point for
performance tuning
• No one could access the data
until it was loaded
All Claims
• We eventually had to develop
a net change process as we
couldn’t reload all the records
every day
• Ended up being very
successful
– Lot of heartache
– Extremely talented developer
Recursion
Recursion
• Usually used to model multiple
levels of an object
– Office structure
– Organization Hierarchy
– Etc…
Recursion
• Looking back…
– Seemed to be an intellectual
exercise
– Can I figure out a way to
dynamically model this?
Recursion
• Question is:
– Does the data need a dynamic
model?
– Looking back
• The models were 99% stable
• Dynamic model was being done
for the future
• Definitely over engineering
Recursion
• So what?
– Complexity in retrieving data
• Especially for reports
– The data would need to have
multiple levels and the ability to
move between different multiple
levels frequently for me to model
the data recursively like this
again
Recursion
• Why not just model the data in
a fixed way and deal with
changes as needed
– Region
– Division
– Department
• Whoops! Just add Sub-
Division when required and
convert
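The trade-off on this slide can be made concrete with a small sketch (Python's sqlite3, hypothetical table names): flattening a recursive self-join for a report already requires a recursive CTE, while the fixed columns need nothing special.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Recursive model: one self-referencing table for the whole hierarchy.
cur.executescript("""
CREATE TABLE org_unit (
    unit_id   INTEGER PRIMARY KEY,
    name      TEXT,
    parent_id INTEGER REFERENCES org_unit(unit_id)
);
INSERT INTO org_unit VALUES
    (1, 'West Region', NULL),
    (2, 'Sales Division', 1),
    (3, 'Inside Sales Dept', 2);
""")

# Just flattening it for a report takes a recursive CTE.
path = cur.execute("""
    WITH RECURSIVE tree(unit_id, path) AS (
        SELECT unit_id, name FROM org_unit WHERE parent_id IS NULL
        UNION ALL
        SELECT o.unit_id, t.path || ' > ' || o.name
        FROM org_unit o JOIN tree t ON o.parent_id = t.unit_id
    )
    SELECT path FROM tree WHERE unit_id = 3
""").fetchone()[0]

# Fixed model: three plain columns, trivial to query and to report on.
cur.execute("CREATE TABLE org_fixed (region TEXT, division TEXT, department TEXT)")
cur.execute("INSERT INTO org_fixed VALUES "
            "('West Region', 'Sales Division', 'Inside Sales Dept')")

print(path)  # West Region > Sales Division > Inside Sales Dept
```

If a Sub-Division level appears later, converting `org_fixed` is a one-time migration; the recursive model makes every report pay the CTE cost forever.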
Agenda
• Data Modeling Mistakes
– Anthropomorphism
– Over-Engineering
– Keys
• GUIDs
• Surrogate/Real Keys
• Composite Keys
– Deleted Records
– Nulls
– History
– Recursion
Definition
• “A database model is a
specification describing
how a database is
structured and used” –
Wikipedia
Definition
• “A data model describes how
the data entities are related
to each other in the real
world” – Terry (5 years ago)
• “A data model describes how
the data entities are related
to each other in the
application” – Terry (today)
Data Model
Characteristics
• Organize/Structure
like Data Elements
• Define relationships
between Data Entities
• Highly Cohesive
• Loosely Coupled
Relational
• Relational Analysis
– Database design is usually in
Third Normal Form
– Database is optimized for
transaction processing. (OLTP)
– Normalized tables are optimized
for modification rather than
retrieval
Normal forms
• 1st - Under first normal form, all
occurrences of a record type must contain
the same number of fields.
• 2nd - Second normal form is violated
when a non-key field is a fact about a
subset of a key. It is only relevant when
the key is composite
• 3rd - Third normal form is violated when
a non-key field is a fact about another
non-key field
Source: William Kent - 1982
Normal Forms for the
Layman
• 1st – Table only represents
one type of data
– No row types
• 2nd – Field does not depend
on only a part of the Primary
Key
• 3rd – Field depends only on
the Primary Key
Remember
• Remember to ask ourselves
when we are modeling
• Do either of the options
contradict the normal forms?
• Usually we model past 3rd
normal form based on other
biases
Anthropomorphism
#1 Mistake in
Data Modeling
• Modeling something
to take on human
characteristics or
characteristics of
our world
Amazon
Amazon
• Warehouse is organized
totally randomly
• Although humans think the
items should be ordered in
some way, it does not help
storage or retrieval in any way
– In fact it hurts it by creating ‘hot
spots’ for in-demand items
Data Model
Anthropomorphism
• We sometimes
create objects in
our Data Models as
they exist in the
real world, not in
the applications
Data Model
Anthropomorphism
• This is usually the case for
physical objects in the real
world
– Companies/Organizations
– People
– Addresses
– Phone Numbers
– Emails
Data Model
Anthropomorphism
• Why?
– Do we ever need to consolidate all
people, addresses, or emails?
• Rarely
– We usually report based on other
filter criteria
– So why do we try to place like real
world items on one table when
applications treat them differently?
Over Engineering
Over Engineering
• Additional flexibility that is
not required does not
simplify the solution, it overly
complicates the solution
Over Engineering
• These are usually tables that
have multiple mutually
exclusive foreign keys
– Only one is filled at any one time
• Why not just create separate
join tables?
– Doesn’t violate any normal forms
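The two designs this slide contrasts can be sketched side by side (sqlite3, hypothetical note/person/company names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Over-engineered: two mutually exclusive foreign keys, only one
-- ever populated, so the other column is always NULL.
CREATE TABLE note_mixed (
    note_id    INTEGER PRIMARY KEY,
    person_id  INTEGER,   -- NULL when the note belongs to a company
    company_id INTEGER,   -- NULL when the note belongs to a person
    body       TEXT
);

-- Simpler: one join table per relationship, no NULLs, and nothing
-- here violates any normal form.
CREATE TABLE note (note_id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE person_note (person_id INTEGER, note_id INTEGER,
                          PRIMARY KEY (person_id, note_id));
CREATE TABLE company_note (company_id INTEGER, note_id INTEGER,
                           PRIMARY KEY (company_id, note_id));
""")

cur.execute("INSERT INTO note VALUES (1, 'Called about renewal')")
cur.execute("INSERT INTO person_note VALUES (42, 1)")

body = cur.execute("""
    SELECT n.body FROM note n
    JOIN person_note pn ON pn.note_id = n.note_id
    WHERE pn.person_id = 42
""").fetchone()[0]
print(body)  # Called about renewal
```

The join tables add one table per relationship, but every query against them is straightforward and every column is always populated.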
Keys
GUIDs
• Oscar winner for worst choice
for a Primary Key ever
• Selected based on over
engineering because they
would never be duplicates
GUIDs
• In the meantime they caused
excessive index length, user
frustration, and complex query
execution plans
• Just say no.
GUIDs
• Especially don’t use them on
tables with fewer records
• Who says all the Primary Keys
in a database need to be of
the same type?
Surrogate Keys
• Surrogate Keys are a huge
benefit
• Straight Integer keys are
probably the most common
– Users are most used to
integer keys as well
• Same as bank account, credit
cards, other account information
Surrogate Keys
• The exception
– Don’t, don’t, don’t use Surrogate
keys for Reference or Support
tables
– Causes needless lookups for
clients, SQL queries, and for
reports
Surrogate Keys
• Do we really need to assign a
numeric Primary Key for
Gender and Province codes?
– Especially since these values
very rarely change
– Might make sense for reference
tables that change more
frequently.
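The needless lookup the previous slide warns about looks like this in a sketch (sqlite3, hypothetical client/province tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Surrogate key on the reference table: every query that wants to
-- show 'MB' needs a lookup join first.
CREATE TABLE province (province_id INTEGER PRIMARY KEY, code TEXT);
CREATE TABLE client_s (client_id INTEGER PRIMARY KEY, province_id INTEGER);
INSERT INTO province VALUES (1, 'MB');
INSERT INTO client_s VALUES (100, 1);

-- Natural code as the key: the meaningful value sits right on the row.
CREATE TABLE client_n (client_id INTEGER PRIMARY KEY, province_code TEXT);
INSERT INTO client_n VALUES (100, 'MB');
""")

with_join = cur.execute("""
    SELECT p.code FROM client_s c
    JOIN province p ON p.province_id = c.province_id
    WHERE c.client_id = 100
""").fetchone()[0]

no_join = cur.execute(
    "SELECT province_code FROM client_n WHERE client_id = 100").fetchone()[0]
print(with_join, no_join)  # MB MB
```

Same answer either way, but the surrogate version forces that extra join on every client, SQL query, and report that touches the column.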
Composite Keys
Composite Keys
• Composite Keys are what make it
possible to violate 2nd normal form
– Remove Composite Keys and you
remove the ability to have that
violation
• Also a bad idea because a Primary
Key with inherent meaning can
change
Deleted Records
• Are we soft deleting or hard
deleting records?
• Used to like soft deleting as
you never lost data
• But this can make queries a
nightmare with needing to filter
on deleted records for every
table in a query
Deleted Records
• Soft deleted records also
perform quite poorly when
included in an index due to the
indicator having only two values
– Or else you need to add the
deleted indicator to many
indexes
– Both are inefficient
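The query nightmare from the previous slide, in a minimal sketch (sqlite3, hypothetical account/transaction tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE account (account_id INTEGER PRIMARY KEY, name TEXT,
                      is_deleted INTEGER NOT NULL DEFAULT 0);
CREATE TABLE txn (txn_id INTEGER PRIMARY KEY, account_id INTEGER,
                  amount REAL, is_deleted INTEGER NOT NULL DEFAULT 0);
INSERT INTO account VALUES (1, 'Chequing', 0), (2, 'Old savings', 1);
INSERT INTO txn VALUES (10, 1, 25.00, 0), (11, 1, 99.00, 1);
""")

# Every table in the join needs its own is_deleted filter -- forget
# one anywhere and soft-deleted rows silently leak into the result.
rows = cur.execute("""
    SELECT a.name, t.amount FROM account a
    JOIN txn t ON t.account_id = a.account_id
    WHERE a.is_deleted = 0 AND t.is_deleted = 0
""").fetchall()
print(rows)  # [('Chequing', 25.0)]
```

A two-table join already carries two filters; a ten-table report query carries ten, which is the maintenance burden the slide is describing.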
Nulls
Nulls
• Nulls are evil
• Do whatever you can to avoid
nulls
– Column Defaults
– Domain Defaults
– Did I mention defaults?
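Column defaults as a null-avoidance tactic can be sketched like this (sqlite3, hypothetical client table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# NOT NULL plus DEFAULT: a row inserted without the column still gets
# a real, comparable value instead of NULL.
cur.execute("""
    CREATE TABLE client (
        client_id   INTEGER PRIMARY KEY,
        middle_name TEXT NOT NULL DEFAULT '',
        status      TEXT NOT NULL DEFAULT 'UNKNOWN'
    )
""")
cur.execute("INSERT INTO client (client_id) VALUES (1)")

# No three-valued logic: plain equality works, with no IS NULL or
# COALESCE wrappers anywhere in the query.
row = cur.execute(
    "SELECT middle_name, status FROM client WHERE status = 'UNKNOWN'"
).fetchone()
print(row)  # ('', 'UNKNOWN')
```

The sentinel values ('' and 'UNKNOWN' here) are a design choice per domain, but they keep every downstream query on ordinary two-valued logic.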
Nulls
• Nulls can complicate queries
just like deleted indicators
• Probably also are the number
one cause of devious, mind-
bending defects
– Think of the time you will save!
Nulls
• For this reason, Nulls are the
first thing that goes when
creating a Self Service Data
Warehouse
History
History
• Where and how should we
store history?
• Transaction tables are easy
– They have always been
historical tables by nature
• But what about tables like
person and address?
History
• Few options
– Create history record on same
table
– Create history record on history
table for each table
– Create history record on one
audit table
– Don’t store it and let the Data
Warehouse worry about it
History on same table
• Keeps the number of tables in
your database to a minimum
• Keeps queries cleaner
• Complicates queries as you
now need to include/exclude
history rows
– And you will need to add
additional date information
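The same-table option with the extra date information can be sketched as follows (sqlite3, hypothetical address table, using an open-ended sentinel end date):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# History on the same table: each row carries effective dates, and the
# sentinel end date marks the current version.
cur.executescript("""
CREATE TABLE address (
    person_id      INTEGER,
    street         TEXT,
    effective_from TEXT NOT NULL,
    effective_to   TEXT NOT NULL DEFAULT '9999-12-31',
    PRIMARY KEY (person_id, effective_from)
);
INSERT INTO address VALUES (1, '12 Old St',  '2019-01-01', '2021-06-30');
INSERT INTO address VALUES (1, '99 New Ave', '2021-07-01', '9999-12-31');
""")

# Every "current" query now needs the date filter -- the extra
# include/exclude clause the slide is warning about.
current = cur.execute("""
    SELECT street FROM address
    WHERE person_id = 1 AND effective_to = '9999-12-31'
""").fetchone()[0]
print(current)  # 99 New Ave
```

The table count stays minimal, but note that the date predicate has to appear in every query that only wants the current row.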
History on separate table
• Dirties up the database as you
create a history copy of every
table in the database
• Some Queries are cleaner
• Some Queries now need to join
twice as many tables though!
History on Audit table
• Queries are cleaner
• Database is cleaner
• But depending on the solution,
you may end up having one
absolutely huge table to parse
through
History in Data
Warehouse
• Perhaps the cleanest option
• Requires a commitment to
infrastructure
• Latency may also become an
issue
Let’s play a game
Questions?
