2. About Me
Roger Carlson started Roger's Access Library as a place to store
knowledge in all forms related to Access. It has grown into one of the
most popular sites on the web with an estimated 2 million downloads.
Roger's website (www.rogersaccesslibrary.com) and blog
(http://rogersaccessblog.blogspot.com) have been visited by nearly a
million visitors from 170 countries.
Roger graduated from Western Michigan University with a BS in
Computer Science and taught database design and implementation
at Muskegon Community College for 12 years.
Roger currently works at Spectrum Health, the largest hospital
system in out-state Michigan, as a Senior BI Analyst.
3. What’s the Big Deal about
Normalization?
What is normalization? Normalization is a
methodology for removing redundant data
from a database WITHOUT losing
information.
So who cares? Why is redundant data bad?
4. Flat Files and Spreadsheets and
Databases. Oh My!
In a spreadsheet, it's acceptable to represent the data like
this
One way to correct this would be to fill in the missing
information.
5. Repeated Columns
One way to solve the redundant data problem is with Repeated
Columns. This is a common solution in spreadsheets. With
repeated columns, the redundant information is stored as
columns.
The Problem of Repeated Columns
How many repeated columns should I create?
Structure becomes untenable to maintain (job desc, pay
grade, pay range, status, etc.).
Adding new fields requires changes to all queries,
forms, reports, etc.
Difficult to query information.
6. Normalization
The solution is to break the table into multiple
tables that preserve data integrity without using
multiple columns.
And then relate the tables on one or more fields.
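A minimal sketch of this idea, using Python's built-in sqlite3 module as a stand-in for Access. The table names, column names, and sample values are assumptions made for illustration; they are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Employee details are stored once.
CREATE TABLE Employee (
    SSN       TEXT PRIMARY KEY,
    FirstName TEXT,
    LastName  TEXT
);
-- Each salary change is a row, not another set of columns.
CREATE TABLE SalaryHistory (
    SSN        TEXT REFERENCES Employee (SSN),
    ChangeDate TEXT,
    Salary     INTEGER,
    PRIMARY KEY (SSN, ChangeDate)
);
""")

# Placeholder data: the SSN and salaries are invented.
conn.execute("INSERT INTO Employee VALUES ('111-11-1111', 'Gina', 'Fawn')")
conn.executemany(
    "INSERT INTO SalaryHistory VALUES (?, ?, ?)",
    [("111-11-1111", "2001-01-01", 40000),
     ("111-11-1111", "2003-06-01", 45000)],
)

# Relate the two tables on the shared SSN field.
joined = conn.execute("""
    SELECT e.FirstName, e.LastName, s.ChangeDate, s.Salary
    FROM Employee AS e
    JOIN SalaryHistory AS s ON s.SSN = e.SSN
    ORDER BY s.ChangeDate
""").fetchall()

employee_count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
```

The name is stored in exactly one row, yet the join can still reproduce the full flat view whenever it is needed.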
7. Decomposition Method vs.
12-Step Method
Decomposition:
Using the formal rules of normalization (Normal
Forms) to break non-normalized tables into
smaller normalized tables.
12-Step Method:
Starts with the business rules and builds the
database into properly normalized tables
8. The 12-Step Program
Many developers are addicted to tables
designed as spreadsheets
We call this "committing spreadsheet"
The following is the 12-Step Program to
Better Databases
9. Additional Reading
Database Design for Mere Mortals: A
Hands-On Guide to Relational
Database Design
by Michael J Hernandez (Addison-
Wesley)
CASE*Method Entity Relationship
Modelling
by Richard Barker (Addison-Wesley)
10. Step 1: Create a Narrative
Create a narrative that accurately and in
some detail describes the business
Collect input screens or paper forms
Collect reports and other output
Talk to managers
Talk to end users
Make the narrative as complete as possible.
11. Employee Database
Narrative
ZYX Laboratories requires an employee tracking database.
They want to track information about employees, the employee's job
history, and their certifications. Employee information includes first
name, middle initial, last name, social security number, address, city,
state, zip, home phone, cell phone, email address.
Job history would include job title, job description, pay grade, pay
range, salary, and date of promotion.
For certifications, they want certification type and date achieved. An
employee can have multiple jobs over time (e.g., Analyst, Sr. Analyst,
QA Administrator). Employees can also earn certifications necessary
for their job.
12. Step 2: Underline the Nouns
ZYX Laboratories requires an employee tracking database.
They want to track information about employees, the
employee's job history, and their certifications. Employee
information includes first name, middle initial, last name, social
security number, address, city, state, zip, home phone, cell
phone, email address.
Job history would include job title, job description, pay grade,
pay range, salary, and date of promotion.
For certifications, they want certification type and date achieved.
An employee can have multiple jobs over time (e.g., Analyst, Sr.
Analyst, QA Administrator). Employees can also earn
certifications necessary for their job.
13. Entities and Attributes
All of these nouns must be represented in the
database -- some as Entities and some as
Attributes.
An Entity is a "thing" about which we store
information. (Table)
An Attribute is the information that is being
stored. (Field)
14. Step 3: Create Noun List
Make a list of all the nouns.
Try to determine which are duplicates or are not
pertinent.
This will be your Preliminary Noun List
Employee First Name Middle
Last Name Address City
State Zip SS#
Phone Cell Email
Job History Job Title Job Description
Promotion Date Pay Range Pay Grade
Salary Certifications Certification Type
Certification Date
15. Step 4: Flag the Entities
Flag the nouns that are "subjects".
This will be your Entity List
Employee * First Name Middle
Last Name Address City
State Zip SS#
Phone Cell Email
Job History * Job Title Job Description
Promotion Date Pay Range Pay Grade
Salary Certifications * Certification Type
Certification Date
16. Step 5: Group Attributes with
Entities
Place all the Entities across the top of a sheet
of paper and write the unflagged nouns in the
Preliminary Noun List below the appropriate
Entity. Check them off the list as you do.
Do all of the nouns belong to an Entity in the
list?
If not, you missed a subject so you should add
it or assign it to "Unassigned" for later
consideration.
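The paper exercise can also be sketched in a few lines of Python. The noun list is the Preliminary Noun List from Step 3; the particular grouping below is one reading of it, and the helper name unassigned_nouns is hypothetical.

```python
# Preliminary Noun List from Step 3.
noun_list = [
    "Employee", "First Name", "Middle", "Last Name", "Address", "City",
    "State", "Zip", "SS#", "Phone", "Cell", "Email",
    "Job History", "Job Title", "Job Description", "Promotion Date",
    "Pay Range", "Pay Grade", "Salary",
    "Certifications", "Certification Type", "Certification Date",
]

# Entities across the top, attributes written underneath (an assumed grouping).
grid = {
    "Employee": ["First Name", "Middle", "Last Name", "Address", "City",
                 "State", "Zip", "SS#", "Phone", "Cell", "Email"],
    "Job History": ["Job Title", "Job Description", "Promotion Date",
                    "Pay Range", "Pay Grade", "Salary"],
    "Certifications": ["Certification Type", "Certification Date"],
}

def unassigned_nouns(nouns, grid):
    """Return the nouns that are neither an entity nor placed under one."""
    placed = set(grid)  # the entity names themselves
    for attributes in grid.values():
        placed.update(attributes)
    return [noun for noun in nouns if noun not in placed]

leftovers = unassigned_nouns(noun_list, grid)
```

An empty result means every noun has been checked off the list; anything returned either needs a new Entity or goes under "Unassigned".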
18. Step 6: Revise Entity List
Go through the Entity list with the customer if
possible
to see if there is any data that you should be
storing about that entity that you are not.
If so, add it to the attribute list.
19. Step 7: Add Primary Keys
A primary key is a field or fields which uniquely identify
a record. At this point, natural keys only.
20. Step 8: Evaluate Entities
Each Entity:
represents a single subject
has a primary key
DOES NOT contain unnecessary duplicate
attributes. (repeated columns)
23. Step 9: Evaluate Attributes
Each Attribute:
is a characteristic of the Entity
contains only a single value
CANNOT be deconstructed into smaller
components.
DOES NOT contain a calculated or
concatenated value.
is unique within the entire database structure.
DOES NOT have attributes of its own.
24. Step 10: Determine Relationships
Relationship Types
Many-to-Many: Common in real life,
but cannot be represented in a
database.
One-to-Many: The most common
relationship in a database.
One-to-One: Seldom used.
25. Employee-JobHistory
Each Employee can have One or More Job
History instances
And
Each Job History instance can be for One and
Only One Employee
26. Job-Job History
Each Job can have One or More Job History
instances
And
Each Job History instance can be for One and
Only One Job.
32. Step 11: Resolve Many-Many
Relationships
To rationalize a many-to-many relationship between
two tables, you create a third entity -- an "intersection"
or "linking" entity. Then you create one-to-many
relationships between the linking entity and each of the
main entities, with the "many-side" of both relationships
on the linking entity.
The Employee/Certification entity represents a
certification for a particular employee and that can be
given at only one time. Now I can see where to put my
unassigned CertificationDate attribute.
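A minimal sketch of the linking entity, using Python's built-in sqlite3 module as a stand-in for Access. Natural keys are used because surrogate keys have not been added at this step. The table name EmployeeCertification, the certification names, and all sample values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE Employee (SSN TEXT PRIMARY KEY, LastName TEXT);
CREATE TABLE Certification (CertificationType TEXT PRIMARY KEY);

-- The linking entity sits on the "many" side of both relationships.
-- CertificationDate lives here: it depends on the employee AND the type.
CREATE TABLE EmployeeCertification (
    SSN               TEXT NOT NULL REFERENCES Employee (SSN),
    CertificationType TEXT NOT NULL
        REFERENCES Certification (CertificationType),
    CertificationDate TEXT,
    PRIMARY KEY (SSN, CertificationType)
);

INSERT INTO Employee VALUES ('111-11-1111', 'Fawn'), ('215-87-7854', 'Smith');
INSERT INTO Certification VALUES ('Cert A'), ('Cert B');
INSERT INTO EmployeeCertification VALUES
    ('111-11-1111', 'Cert A', '2005-01-01'),
    ('111-11-1111', 'Cert B', '2006-01-01'),
    ('215-87-7854', 'Cert A', '2007-01-01');
""")

def certifications_for(ssn):
    """One employee, many certifications."""
    return conn.execute(
        "SELECT CertificationType, CertificationDate "
        "FROM EmployeeCertification WHERE SSN = ? "
        "ORDER BY CertificationType", (ssn,)
    ).fetchall()

def holders_of(certification_type):
    """One certification, many employees."""
    rows = conn.execute(
        "SELECT SSN FROM EmployeeCertification "
        "WHERE CertificationType = ? ORDER BY SSN", (certification_type,)
    ).fetchall()
    return [ssn for (ssn,) in rows]
```

The compound primary key on the linking table reflects the rule that a given certification can be awarded to a given employee only once.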
33. Real Entities vs. Pseudo Entities
Real Entity to Resolve M:M
Pseudo Entity to Resolve M:M
36. Step 12: Implementing the E-R
Diagram
So far, I've talked about Entities and Attributes to keep
myself from thinking about implementation issues
during the modeling phase.
But at the implementation phase, entities become
tables and attributes become fields.
37. Add Surrogate Keys:
Add an Autonumber, Primary Key field (Surrogate Key)
(tablename+"id")
EmployeeID
JobID
Create a Unique Index on the Natural Key
SS#
Job Title
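A minimal sketch of the same pattern, using Python's built-in sqlite3 module. SQLite's AUTOINCREMENT stands in for the Access Autonumber here; the table name tblEmployee, the index name, and the sample values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblEmployee (
    EmployeeID INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    SSN        TEXT NOT NULL,                      -- natural key
    LastName   TEXT
);
-- The natural key is no longer the primary key, but it must stay unique.
CREATE UNIQUE INDEX idxEmployeeSSN ON tblEmployee (SSN);
""")

first_id = conn.execute(
    "INSERT INTO tblEmployee (SSN, LastName) VALUES (?, ?)",
    ("111-11-1111", "Fawn"),
).lastrowid

# A second row with the same SSN should be rejected by the unique index.
try:
    conn.execute(
        "INSERT INTO tblEmployee (SSN, LastName) VALUES (?, ?)",
        ("111-11-1111", "Smith"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM tblEmployee").fetchone()[0]
```

Without the unique index, the surrogate key alone would happily accept the same person twice under two different EmployeeID values.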
39. Add Foreign Keys
Now it's time to look at my relationships.
Relationships are created on fields holding
related information, Primary Key to Foreign
Key.
In a One-to-Many (1:M) relationship, the
primary key of the table on the "One" side is
added to the table on the "Many" side
and becomes the foreign key.
EmployeeID tblJobHistory
JobID tblJobHistory
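A minimal sketch of these two foreign keys, using Python's built-in sqlite3 module as a stand-in for Access. tblJobHistory, EmployeeID, and JobID come from the slide; tblEmployee, tblJob, JobHistoryID, the helper insert_history, and the sample values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE tblEmployee (
    EmployeeID INTEGER PRIMARY KEY,
    SSN        TEXT NOT NULL UNIQUE
);
CREATE TABLE tblJob (
    JobID    INTEGER PRIMARY KEY,
    JobTitle TEXT NOT NULL UNIQUE
);
-- The "many" side carries the primary keys of both "one" sides.
CREATE TABLE tblJobHistory (
    JobHistoryID  INTEGER PRIMARY KEY,
    EmployeeID    INTEGER NOT NULL REFERENCES tblEmployee (EmployeeID),
    JobID         INTEGER NOT NULL REFERENCES tblJob (JobID),
    PromotionDate TEXT,
    Salary        INTEGER
);
INSERT INTO tblEmployee (EmployeeID, SSN) VALUES (1, '111-11-1111');
INSERT INTO tblJob (JobID, JobTitle) VALUES (1, 'Analyst');
""")

def insert_history(employee_id, job_id):
    """Try to add a job history row; return False if a foreign key fails."""
    try:
        conn.execute(
            "INSERT INTO tblJobHistory (EmployeeID, JobID) VALUES (?, ?)",
            (employee_id, job_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With the relationships declared, a job history row cannot point at an employee or a job that does not exist.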
There are many ways to represent data. Some of the most common are spreadsheets, flat files, and relational databases. Each has its own advantages and disadvantages.
Unfortunately, this requires storing a lot of redundant data. What's the big deal? It's only a couple of fields, right? But that's only in the example shown. What if we were storing all of the demographic data (name, address, phone, city, state, etc.) for a lot of people? This would waste a lot of storage capacity.
But wasted storage is not the worst problem. What if the SSN of Gina Fawn's first record was changed to 215-87-7854? Perhaps this was through operator error or maybe a programmatic update. It doesn't matter; the data has been changed. Now, which SSN is really Gina's? The database has no way of knowing. Worse still, the SSN now matches Steve Smith's. So, does that SSN represent Gina or Steve? Again, no way to know.
This same problem holds true for all the fields which hold redundant data. This is called a Data Anomaly error. Once you start having data anomalies, you cannot trust the integrity of your database.
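Here is a minimal Python sketch of that anomaly. The names and the changed SSN come from the example above; Gina's original SSN and the salary figures are made-up placeholders, and the helper find_anomalies is hypothetical.

```python
from collections import defaultdict

# Flat-file rows: one row per salary change, so the SSN is repeated.
rows = [
    {"name": "Gina Fawn", "ssn": "111-11-1111", "salary": 40000},
    {"name": "Gina Fawn", "ssn": "111-11-1111", "salary": 45000},
    {"name": "Steve Smith", "ssn": "215-87-7854", "salary": 50000},
]

# The bad update: only Gina's first record is changed.
rows[0]["ssn"] = "215-87-7854"

def find_anomalies(rows):
    """Return names that have several SSNs and SSNs that have several names."""
    ssns_by_name = defaultdict(set)
    names_by_ssn = defaultdict(set)
    for row in rows:
        ssns_by_name[row["name"]].add(row["ssn"])
        names_by_ssn[row["ssn"]].add(row["name"])
    conflicting_names = {n: s for n, s in ssns_by_name.items() if len(s) > 1}
    shared_ssns = {s: n for s, n in names_by_ssn.items() if len(n) > 1}
    return conflicting_names, shared_ssns

conflicting_names, shared_ssns = find_anomalies(rows)
```

The check can report that something is wrong, but nothing in the data says which value is the right one. That is the real cost of the redundancy.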
Now we don't have problems with redundancy, but we have additional problems. First of all, we have to decide how many repeated columns to create. In Figure 3, I only show one salary increase for Gina and Tony, but is that reasonable? What if Gina had five wage increases and Tony had seven? Are seven sets of columns enough? Do I cap it at the largest record? Or do I add more columns to accommodate growth? If so, how many?
Secondly, such a table structure requires a lot of manual modification and becomes untenable when you have a lot of data. Perhaps instead of just the date and salary, we are also storing the job description, pay grade, status, and so forth? The structure would become so large and unruly that it would be impossible to maintain.
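To make the querying problem concrete, here is a minimal Python sketch that reads a salary history from a repeated-columns layout and from a one-row-per-change layout. The column names (date1, salary1, and so on), the cap of two repeats, and all values are assumptions made for illustration.

```python
# Repeated columns: every reader must know how many column sets exist.
MAX_REPEATS = 2
wide_rows = [
    {"name": "Gina", "date1": "2001-01-01", "salary1": 40000,
     "date2": "2003-06-01", "salary2": 45000},
    {"name": "Tony", "date1": "2002-03-15", "salary1": 38000,
     "date2": None, "salary2": None},
]

def salary_history_wide(row):
    """Walk every numbered column pair and skip the empty ones."""
    history = []
    for i in range(1, MAX_REPEATS + 1):
        date, salary = row[f"date{i}"], row[f"salary{i}"]
        if date is not None:
            history.append((date, salary))
    return history

# One row per salary change: no cap, no empty cells, no numbered columns.
tall_rows = [
    ("Gina", "2001-01-01", 40000),
    ("Gina", "2003-06-01", 45000),
    ("Tony", "2002-03-15", 38000),
]

def salary_history_tall(name):
    """Filter rows; a third or tenth raise needs no structural change."""
    return [(date, salary) for n, date, salary in tall_rows if n == name]
```

A third raise for Gina forces a new pair of columns and a change to MAX_REPEATS in the first layout, and every query, form, and report that reads it has to follow. In the second layout it is just one more row.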
So far, I’ve approached normalization from a particular point of view. I put all the information into a single table, then removed redundancies into separate tables. This method is called decomposition. Decomposition is fine for understanding the theory of normalization and for creating small databases. However, it is less useful for large databases. At least, I've found it so.
So I'm going to talk about another way to approach normalization that starts with the individual pieces (or business rules) and builds them up into properly normalized tables. This method is called the Entity-Relationship method and the final result is an Entity-Relationship Diagram. An E-R diagram is useful not just for creating the data model, but for documenting it as well.
Since we've been working with the Employee Database in our other examples, let's stick with it. But since I claimed that the E-R method works for more complicated designs, let's make it a little more complex. I like to start with a short narrative of the requirements.
It is useful at this point to put the entities in a grid and assign the rest of the attributes to them.
Next, I need to assign primary keys to each entity. A primary key is a field or fields which uniquely identify a record. At this point, I'm dealing only with natural keys. Surrogate keys will come later in the process.
So, for the Employee table, a person (as represented by the SS#) can have only one first name, last name, address, home phone, and so forth. That satisfies requirement #1. Secondly, if the value of the SS# changes, then so will all of those values. By that, I mean if we move to a different entity with a different SS#, that entity will have a different first name, last name, etc. (For our purposes here, we will assume that no two employees share any of these attributes.)
Now, what about the Job History table? Any time an entity has a compound primary key, you should look at it very closely to make sure all the fields depend on the entire primary key. Any particular job can have only one description, pay grade, and pay range. However, none of those depend on the Promotion Date.
I've got a problem here and I need to take another look. What I really have is information about two different "things".
Job Title, Description, Pay Range and Pay Grade pertain to the Job as a category. Everyone who holds that position will have the same values. On the other hand, Salary and Promotion Date will be different for each person. So I really have two entities: 1) Job (information about the job itself), and 2) Job History (information about a particular employee's employment history).
I need to take Job Title, Description, Pay Range and Pay Grade out of the Job History table and put them in the Job table.
Lastly, in the Certification table, Certification Date is also not fully dependent on the Certification Type. Different individuals achieve the certification at different dates. I don't have an entity to put the date in, so I'll put that to the side and come back to it later.
These last two problems are really a result of a poor narrative. If I had been more explicit, these would be obvious.
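When sample data is available, this kind of dependency can be checked with a small Python sketch. The helper is_determined_by and the sample rows are hypothetical. A sample can only disprove a dependency, never prove it, so treat the result as a prompt to go back to the business rules.

```python
def is_determined_by(rows, determinant, attribute):
    """True if each combination of determinant values maps to one attribute value."""
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in determinant)
        if seen.setdefault(key, row[attribute]) != row[attribute]:
            return False
    return True

# Invented Job History rows, before Job is split out as its own entity.
job_history = [
    {"ssn": "A", "job_title": "Analyst", "promotion_date": "2001-01-01",
     "pay_grade": 5, "salary": 40000},
    {"ssn": "B", "job_title": "Analyst", "promotion_date": "2002-01-01",
     "pay_grade": 5, "salary": 42000},
    {"ssn": "A", "job_title": "Sr. Analyst", "promotion_date": "2003-06-01",
     "pay_grade": 7, "salary": 50000},
]

# Pay grade follows the job title alone, so it belongs with the Job.
grade_follows_title = is_determined_by(job_history, ["job_title"], "pay_grade")

# Salary does not: two Analysts earn different amounts.
salary_follows_title = is_determined_by(job_history, ["job_title"], "salary")
```

An attribute that is determined by only part of a compound key is the signal that two different "things" have been packed into one entity.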
Many-to-Many: Common in real life, but cannot be represented in a database.
One-to-Many: The most common relationship in a database.
One-to-One: Seldom used.
At this point, I should say that this is not a strictly linear process. That is, you can't always move smoothly from one step to the next. Sometimes you have to move back and forth between them as you discover more things about your system.
That's what I'm going to do next. Because I have an unassigned attribute, I'm going to look at the relationships between my existing entities and see if something doesn't present itself.
To look at the relationships, I'm going to ignore the attributes for a while. Attributes do not have relationships, only entities do. If you discover that an attribute does have a relationship with some other entity or attribute, that's an indication that it is really an entity and your grid must change.
So how do I know what the relationships are for my Employee Database? For that I need to go back to the narrative. The second paragraph describes "business rules", that is, how the business actually works. I'll repeat the paragraph here.
An employee can have multiple jobs over time (e.g., Analyst, Sr. Analyst, QA Administrator). Employees can also earn certifications necessary for their job.
From this I can write out the relationships in full sentences, and I find it useful to write them in both directions. For instance, from the narrative, I can say:
Unfortunately, I'm not done yet, for two reasons: 1) many-to-many relationships cannot be directly implemented in a relational database, and 2) I still have an unassigned attribute. So first I'll rationalize the many-to-many relationship and then take another look.
To rationalize a many-to-many relationship between two tables, you create a third table -- an "intersection" or "linking" table. Then you create one-to-many relationships between the linking table and each of the main tables, with the "many-side" of both relationships on the linking table.
As you can see above, Employee and Certifications have a many-to-many relationship, so I need to create a new entity (Employee/Certifications). Sometimes linking tables have logical names. Other times, they don't. In that case, I simply combine the names of the base tables.
Now that I've got all the relationships between my entities identified and assigned all the attributes, I can put it all into one diagram.