Terry Bunio
Data Modeling – Tales from the trenches
Thank you to our Sponsors
@tbunio
tbunio@protegra.com
agilevoyageur.com
www.protegra.com
Who Am I?
• Terry Bunio
• Database Administrator
– Oracle
– SQL Server 6, 6.5, 7, 2000, 2005, 2008, 2012
– Informix
– ADABAS
• Data Modeler/Architect
– Investors Group, LPL Financial, Manitoba
Blue Cross, Assante Financial, CI Funds,
Mackenzie Financial
– Normalized and Dimensional
• Agilist
– Innovation Gamer, Team Member, SQL
Developer, Test writer, Sticky Sticker, Project
Manager, PMO on SAP Implementation
Agenda
• Data Modeling Hubris
– Multi-language reference tables
– “All Claims”
– Recursion
Once upon a time
• Worked on a project for a
client in Luxembourg
• Interesting point
– Luxembourg has four official
languages
• English
• French
• German
• Flemish (I think)
Once upon a time
• Need to create multi-lingual
descriptions for reference table
• Currently only required English
and French
• Convinced team that we would
soft model the language
Once upon a time
• These tables also had
independent surrogate keys for
all reference table values
Once upon a time
• It wasn’t fun
• Queries performed terribly and
were overly complex
• Never used the extra flexibility
and we eventually replaced the
functionality with an English
and French description field
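The replacement the slide describes can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 (the project itself was on other platforms, and every table and column name here is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Soft-modeled language: one description row per (code, language) pair.
CREATE TABLE status_code (status_id INTEGER PRIMARY KEY, code TEXT);
CREATE TABLE status_desc (
    status_id INTEGER REFERENCES status_code(status_id),
    lang      TEXT,
    descr     TEXT,
    PRIMARY KEY (status_id, lang)
);
INSERT INTO status_code VALUES (1, 'A');
INSERT INTO status_desc VALUES (1, 'EN', 'Active'), (1, 'FR', 'Actif');

-- What it was replaced with: one description column per language.
CREATE TABLE status_fixed (code TEXT PRIMARY KEY, descr_en TEXT, descr_fr TEXT);
INSERT INTO status_fixed VALUES ('A', 'Active', 'Actif');
""")

# Soft model: every description read is a join plus a language filter.
soft = cur.execute("""
    SELECT d.descr FROM status_code c
    JOIN status_desc d ON d.status_id = c.status_id AND d.lang = 'FR'
    WHERE c.code = 'A'
""").fetchone()[0]

# Fixed model: a single direct column read.
fixed = cur.execute("SELECT descr_fr FROM status_fixed WHERE code = 'A'").fetchone()[0]

print(soft, fixed)  # Actif Actif
```

The soft model pays a join and a filter on every read for flexibility that was never used; the fixed columns are one read each.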
Once upon a time
• Not my design
• Once saw a database that
actually stored all text fields on
one table
– You joined to the table with the
Primary Key from the description
table
• Some queries joined to the
name table over 10 times.
All Claims
All Claims
• Anyone work with SAP?
• Their tables are not tables as
much as large flat files
• Record type and other
extremely codified fields
• Really hard to make sense of
All Claims
• To make it easier on
developers we created an
All_claims table that would join
all the related data together
and also do some filtering
All Claims
• This became quite the beast of
an object
• Became a focal point for
performance tuning
• No one could access the data
until it was loaded
All Claims
• We eventually had to develop
a net change process as we
couldn’t reload all the records
every day
• Ended up being very
successful
– Lot of heartache
– Extremely talented developer
Recursion
Recursion
• Usually used to model multiple
levels of an object
– Office structure
– Organization Hierarchy
– Etc…
Recursion
• Looking back…
– Seemed to be an intellectual
exercise
– Can I figure out a way to
dynamically model this?
Recursion
• Question is:
– Does the data need a dynamic
model?
– Looking back
• The models were 99% stable
• Dynamic model was being done
for the future
• Definitely over engineering
Recursion
• So what?
– Complexity in retrieving data
• Especially for reports
– The data would need to have
multiple levels and the ability to
move between different multiple
levels frequently for me to model
the data recursively like this
again
Recursion
• Why not just model the data in
a fixed way and deal with
changes as needed
– Region
– Division
– Department
• Whoops! Just add Sub-
Division when required and
convert
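The trade-off on this slide can be made concrete with a small sketch (Python's sqlite3, hypothetical table names): flattening a recursive self-join for a report already requires a recursive CTE, while the fixed columns need nothing special.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Recursive model: one self-referencing table for the whole hierarchy.
cur.executescript("""
CREATE TABLE org_unit (
    unit_id   INTEGER PRIMARY KEY,
    name      TEXT,
    parent_id INTEGER REFERENCES org_unit(unit_id)
);
INSERT INTO org_unit VALUES
    (1, 'West Region', NULL),
    (2, 'Sales Division', 1),
    (3, 'Inside Sales Dept', 2);
""")

# Just flattening it for a report takes a recursive CTE.
path = cur.execute("""
    WITH RECURSIVE tree(unit_id, path) AS (
        SELECT unit_id, name FROM org_unit WHERE parent_id IS NULL
        UNION ALL
        SELECT o.unit_id, t.path || ' > ' || o.name
        FROM org_unit o JOIN tree t ON o.parent_id = t.unit_id
    )
    SELECT path FROM tree WHERE unit_id = 3
""").fetchone()[0]

# Fixed model: three plain columns, trivial to query and to report on.
cur.execute("CREATE TABLE org_fixed (region TEXT, division TEXT, department TEXT)")
cur.execute("INSERT INTO org_fixed VALUES "
            "('West Region', 'Sales Division', 'Inside Sales Dept')")

print(path)  # West Region > Sales Division > Inside Sales Dept
```

If a Sub-Division level appears later, converting `org_fixed` is a one-time migration; the recursive model makes every report pay the CTE cost forever.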
Agenda
• Data Modeling Mistakes
– Anthropomorphism
– Over-Engineering
– Keys
• GUIDs
• Surrogate/Real Keys
• Composite Keys
– Deleted Records
– Nulls
– History
– Recursion
Definition
• “A database model is a
specification describing
how a database is
structured and used” –
Wikipedia
Definition
• “A data model describes how
the data entities are related
to each other in the real
world” – Terry (5 years ago)
• “A data model describes how
the data entities are related
to each other in the
application” – Terry (today)
Data Model
Characteristics
• Organize/Structure
like Data Elements
• Define relationships
between Data Entities
• Highly Cohesive
• Loosely Coupled
Relational
• Relational Analysis
– Database design is usually in
Third Normal Form
– Database is optimized for
transaction processing. (OLTP)
– Normalized tables are optimized
for modification rather than
retrieval
Normal forms
• 1st - Under first normal form, all
occurrences of a record type must contain
the same number of fields.
• 2nd - Second normal form is violated
when a non-key field is a fact about a
subset of a key. It is only relevant when
the key is composite
• 3rd - Third normal form is violated when
a non-key field is a fact about another
non-key field
Source: William Kent - 1982
Normal Forms for the
Layman
• 1st – Table only represents
one type of data
– No row types
• 2nd – Field does not depend
on only a part of the Primary
Key
• 3rd – Field depends only on
the Primary Key
Remember
• Remember to ask ourselves
when we are modeling
• Do either of the options
contradict the normal forms?
• Usually we model past 3rd
normal form based on other
biases
Anthropomorphism
#1 Mistake in
Data Modeling
• Modeling something
to take on human
characteristics or
characteristics of
our world
Amazon
Amazon
• Warehouse is organized
totally randomly
• Although humans think the
items should be ordered in
some way, it does not help
storage or retrieval in any way
– In fact it hurts it by creating ‘hot
spots’ for in-demand items
Data Model
Anthropomorphism
• We sometimes
create objects in
our Data Models as
they exist in the
real world, not in
the applications
Data Model
Anthropomorphism
• This is usually the case for
physical objects in the real
world
– Companies/Organizations
– People
– Addresses
– Phone Numbers
– Emails
Data Model
Anthropomorphism
• Why?
– Do we ever need to consolidate all
people, addresses, or emails?
• Rarely
– We usually report based on other
filter criteria
– So why do we try to place like real
world items on one table when
applications treat them differently?
Over Engineering
Over Engineering
• Additional flexibility that is
not required does not
simplify the solution, it overly
complicates the solution
Over Engineering
• These are usually tables that
have multiple mutually
exclusive foreign keys
– Only one is filled at any one time
• Why not just create separate
join tables?
– Doesn’t violate any normal forms
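The two designs this slide contrasts can be sketched side by side (sqlite3, hypothetical note/person/company names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Over-engineered: two mutually exclusive foreign keys, only one
-- ever populated, so the other column is always NULL.
CREATE TABLE note_mixed (
    note_id    INTEGER PRIMARY KEY,
    person_id  INTEGER,   -- NULL when the note belongs to a company
    company_id INTEGER,   -- NULL when the note belongs to a person
    body       TEXT
);

-- Simpler: one join table per relationship, no NULLs, and nothing
-- here violates any normal form.
CREATE TABLE note (note_id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE person_note (person_id INTEGER, note_id INTEGER,
                          PRIMARY KEY (person_id, note_id));
CREATE TABLE company_note (company_id INTEGER, note_id INTEGER,
                           PRIMARY KEY (company_id, note_id));
""")

cur.execute("INSERT INTO note VALUES (1, 'Called about renewal')")
cur.execute("INSERT INTO person_note VALUES (42, 1)")

body = cur.execute("""
    SELECT n.body FROM note n
    JOIN person_note pn ON pn.note_id = n.note_id
    WHERE pn.person_id = 42
""").fetchone()[0]
print(body)  # Called about renewal
```

The join tables add one table per relationship, but every query against them is straightforward and every column is always populated.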
Keys
GUIDs
• Oscar winner for worst choice
for a Primary Key ever
• Selected based on over
engineering because they
would never be duplicates
GUIDs
• In the meantime they caused
excessive index length, user
frustration, and complex query
execution plans
• Just say no.
GUIDs
• Especially don’t use them on
tables with fewer records
• Who says all the Primary Keys
in a database need to be of
the same type?
Surrogate Keys
• Surrogate Keys are a huge
benefit
• Straight Integer keys are
probably the most common
– Users are most used to
integer keys as well
• Same as bank account, credit
cards, other account information
Surrogate Keys
• The exception
– Don’t, don’t, don’t use Surrogate
keys for Reference or Support
tables
– Causes needless lookups for
clients, SQL queries, and for
reports
Surrogate Keys
• Do we really need to assign a
numeric Primary Key for
Gender and Province codes?
– Especially since these values
very rarely change
– Might make sense for reference
tables that change more
frequently.
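The needless lookup the previous slide warns about looks like this in a sketch (sqlite3, hypothetical client/province tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Surrogate key on the reference table: every query that wants to
-- show 'MB' needs a lookup join first.
CREATE TABLE province (province_id INTEGER PRIMARY KEY, code TEXT);
CREATE TABLE client_s (client_id INTEGER PRIMARY KEY, province_id INTEGER);
INSERT INTO province VALUES (1, 'MB');
INSERT INTO client_s VALUES (100, 1);

-- Natural code as the key: the meaningful value sits right on the row.
CREATE TABLE client_n (client_id INTEGER PRIMARY KEY, province_code TEXT);
INSERT INTO client_n VALUES (100, 'MB');
""")

with_join = cur.execute("""
    SELECT p.code FROM client_s c
    JOIN province p ON p.province_id = c.province_id
    WHERE c.client_id = 100
""").fetchone()[0]

no_join = cur.execute(
    "SELECT province_code FROM client_n WHERE client_id = 100").fetchone()[0]
print(with_join, no_join)  # MB MB
```

Same answer either way, but the surrogate version forces that extra join on every client, SQL query, and report that touches the column.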
Composite Keys
Composite Keys
• Composite Keys are what make it
possible to violate 2nd normal form
– Remove Composite Keys and you
remove the ability to have that
violation
• Also a bad idea because a Primary
Key with inherent meaning can
change
Deleted Records
• Are we soft deleting or hard
deleting records?
• Used to like soft deleting as
you never lost data
• But this can make queries a
nightmare with needing to filter
on deleted records for every
table in a query
Deleted Records
• Soft deleted records also
perform quite poorly when
included in an index due to the
indicator having only two values
– Or else you need to add the
deleted indicator to many
indexes
– Both are inefficient
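The query nightmare from the previous slide, in a minimal sketch (sqlite3, hypothetical account/transaction tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE account (account_id INTEGER PRIMARY KEY, name TEXT,
                      is_deleted INTEGER NOT NULL DEFAULT 0);
CREATE TABLE txn (txn_id INTEGER PRIMARY KEY, account_id INTEGER,
                  amount REAL, is_deleted INTEGER NOT NULL DEFAULT 0);
INSERT INTO account VALUES (1, 'Chequing', 0), (2, 'Old savings', 1);
INSERT INTO txn VALUES (10, 1, 25.00, 0), (11, 1, 99.00, 1);
""")

# Every table in the join needs its own is_deleted filter -- forget
# one anywhere and soft-deleted rows silently leak into the result.
rows = cur.execute("""
    SELECT a.name, t.amount FROM account a
    JOIN txn t ON t.account_id = a.account_id
    WHERE a.is_deleted = 0 AND t.is_deleted = 0
""").fetchall()
print(rows)  # [('Chequing', 25.0)]
```

A two-table join already carries two filters; a ten-table report query carries ten, which is the maintenance burden the slide is describing.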
Nulls
Nulls
• Nulls are evil
• Do whatever you can to avoid
nulls
– Column Defaults
– Domain Defaults
– Did I mention defaults?
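Column defaults as a null-avoidance tactic can be sketched like this (sqlite3, hypothetical client table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# NOT NULL plus DEFAULT: a row inserted without the column still gets
# a real, comparable value instead of NULL.
cur.execute("""
    CREATE TABLE client (
        client_id   INTEGER PRIMARY KEY,
        middle_name TEXT NOT NULL DEFAULT '',
        status      TEXT NOT NULL DEFAULT 'UNKNOWN'
    )
""")
cur.execute("INSERT INTO client (client_id) VALUES (1)")

# No three-valued logic: plain equality works, with no IS NULL or
# COALESCE wrappers anywhere in the query.
row = cur.execute(
    "SELECT middle_name, status FROM client WHERE status = 'UNKNOWN'"
).fetchone()
print(row)  # ('', 'UNKNOWN')
```

The sentinel values ('' and 'UNKNOWN' here) are a design choice per domain, but they keep every downstream query on ordinary two-valued logic.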
Nulls
• Nulls can complicate queries
just like deleted indicators
• Probably also are the number
one cause of devious, mind-
bending defects
– Think of the time you will save!
Nulls
• For this reason, Nulls are the
first thing that goes when
creating a Self Service Data
Warehouse
History
History
• Where and how should we
store history?
• Transaction tables are easy
– They have always been
historical tables by nature
• But what about tables like
person and address?
History
• Few options
– Create history record on same
table
– Create history record on history
table for each table
– Create history record on one
audit table
– Don’t store it and let the Data
Warehouse worry about it
History on same table
• Keeps the number of tables in
your database to a minimum
• Keeps queries cleaner
• Complicates queries as you
now need to include/exclude
history rows
– And you will need to add
additional date information
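The same-table option with the extra date information can be sketched as follows (sqlite3, hypothetical address table, using an open-ended sentinel end date):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# History on the same table: each row carries effective dates, and the
# sentinel end date marks the current version.
cur.executescript("""
CREATE TABLE address (
    person_id      INTEGER,
    street         TEXT,
    effective_from TEXT NOT NULL,
    effective_to   TEXT NOT NULL DEFAULT '9999-12-31',
    PRIMARY KEY (person_id, effective_from)
);
INSERT INTO address VALUES (1, '12 Old St',  '2019-01-01', '2021-06-30');
INSERT INTO address VALUES (1, '99 New Ave', '2021-07-01', '9999-12-31');
""")

# Every "current" query now needs the date filter -- the extra
# include/exclude clause the slide is warning about.
current = cur.execute("""
    SELECT street FROM address
    WHERE person_id = 1 AND effective_to = '9999-12-31'
""").fetchone()[0]
print(current)  # 99 New Ave
```

The table count stays minimal, but note that the date predicate has to appear in every query that only wants the current row.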
History on separate table
• Dirties up the database as you
create a history copy of every
table in the database
• Some Queries are cleaner
• Some Queries now need to join
twice as many tables though!
History on Audit table
• Queries are cleaner
• Database is cleaner
• But depending on the solution,
you may end up having one
absolutely huge table to parse
through
History in Data
Warehouse
• Perhaps the cleanest option
• Requires a commitment to
infrastructure
• Latency may also become an
issue
Let’s play a game
Questions?
