1. Normalization 1IntroductionIn this exercise we are looking at theoptimisation of data structure. The examplesystem we are going to use as a model is adatabase to keep track of employees of anorganisation working on different projects.ObjectivesBy the end of the exercise you should be able to: Show understanding of why we normalize data Give formal definitions of 1NF, 2NF & 3NF Apply the process of normalization to your own work
2. Normalization 2

The data we would want to store could be expressed as:

Project No  Project Name            Employee No  Employee Name    Rate category  Rate
1203        Madagascar travel site  11           Jessica Brookes  A              £90
                                    12           Andy Evans       B              £80
                                    16           Max Fat          C              £70
1506        Online estate agency    11           Jessica Brookes  A              £90
                                    17           Alex Branton     B              £80
3. Normalization 3

Three problems become apparent with our current model:

Tables in an RDBMS use a simple grid structure
Each project has a set of employees, so we can't even use this format to enter data into a table. How would you construct a query to find the employees working on each project?

All tables in an RDBMS need a key
Each record in an RDBMS must have a unique identity. Which field should be the primary key?

Data entry should be kept to a minimum
Our main problem is that each project contains repeating groups, which lead to redundancy and inconsistency.
4. Normalization 4

We could place the data into a table called tblProjects_Employees:

Project No  Project Name            Employee No  Employee Name    Rate category  Rate
1203        Madagascar travel site  11           Jessica Brookes  A              £90
1203        Madagascar travel site  12           Andy Evans       B              £80
1203        Madagascat travel site  16           Max Fat          C              £70
1506        Online estate agency    11           Jessica Brookes  A              £90
1506        Online estate agency    17           Alex Branton     B              £80
5. Normalization 5

Addressing our three problems:

Tables in an RDBMS use a simple grid structure
We can find the members of each project using a simple SQL or QBE search on either Project Number or Project Name.

All tables in an RDBMS need a key
We CAN uniquely identify each record. Although no single field can serve as the primary key, we can use two or more fields to create a composite key.

Data entry should be kept to a minimum
Our main problem, that each project contains repeating groups, still remains. To create an RDBMS we have to eliminate these groups or sets.
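As a minimal sketch of the first two points, the flat table can be built in an in-memory SQLite database using Python's standard sqlite3 module. The snake_case column names and the helper employees_on are assumptions made for illustration; the rows follow the example data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tblProjects_Employees (
        project_no    INTEGER,
        project_name  TEXT,
        employee_no   INTEGER,
        employee_name TEXT,
        rate_category TEXT,
        rate          INTEGER,
        PRIMARY KEY (project_no, employee_no)  -- composite key
    )
""")
conn.executemany(
    "INSERT INTO tblProjects_Employees VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1203, "Madagascar travel site", 11, "Jessica Brookes", "A", 90),
        (1203, "Madagascar travel site", 12, "Andy Evans", "B", 80),
        (1203, "Madagascat travel site", 16, "Max Fat", "C", 70),  # typo kept on purpose
        (1506, "Online estate agency", 11, "Jessica Brookes", "A", 90),
        (1506, "Online estate agency", 17, "Alex Branton", "B", 80),
    ],
)


def employees_on(project_no):
    """Return the names of the employees working on one project."""
    cursor = conn.execute(
        "SELECT employee_name FROM tblProjects_Employees "
        "WHERE project_no = ? ORDER BY employee_no",
        (project_no,),
    )
    return [name for (name,) in cursor]


print(employees_on(1203))
```

Searching on Project No finds all three employees of project 1203. A search on Project Name would miss Max Fat because of the misspelling, which is the problem the next slide picks up.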
6. Normalization 6

Did you notice that Madagascar was misspelled in the 3rd record? Imagine trying to spot this error in thousands of records. By using this structure (flat filing) we create:

Redundant data
Duplicate copies of data: we would have to key in "Madagascar travel site" 3 times. Not only do we waste storage space, we also risk creating:

Inconsistent data
The more often we have to key in data, the more likely we are to make mistakes (see IT01 notes on the importance of accurate data).
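One way to spot this kind of inconsistency in a flat file is to group the names by project number and report any number that has more than one spelling. This is only a sketch; the function name inconsistent_names and the tuple layout are assumptions for illustration.

```python
from collections import defaultdict

# (project_no, project_name, employee_no) taken from the flat table
flat_rows = [
    (1203, "Madagascar travel site", 11),
    (1203, "Madagascar travel site", 12),
    (1203, "Madagascat travel site", 16),
    (1506, "Online estate agency", 11),
    (1506, "Online estate agency", 17),
]


def inconsistent_names(rows):
    """Map each project number to its spellings, if it has more than one."""
    names = defaultdict(set)
    for project_no, project_name, _employee_no in rows:
        names[project_no].add(project_name)
    return {no: sorted(found) for no, found in names.items() if len(found) > 1}


print(inconsistent_names(flat_rows))
```

Only project 1203 is reported, with both spellings of its name. The check tells us that something is wrong, but not which spelling is the right one.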
7. Normalization 7

The solution is simply to take out the duplication. We do this by:

Identifying a key
In this case we can use the project no and employee no to uniquely identify each row.

Project No  Employee No  Unique Identifier
1203        11           120311
1203        12           120312
1203        16           120316

Note: Project 1506 is not shown for reasons of space.
8. Normalization 8We look for partial dependenciesWe look for fields that depend on only part of thekey and not the entire key. Field Project No Employee No Project Name  Employee  Rate Category  Rate We remove partial dependenciesThe fields listed are only dependent on part ofthe key so we remove them from the table.
9. Normalization 9We create new tablesClearly we can’t take the data out and leave it outof our database. We put it into a new tableconsisting of the field that has the partialdependency and the field it is dependent on.Looking at our example we will need to createtwo new tables:Dependent Partially Dependent PartiallyOn Dependent On DependentProject No Project Name Employee Employee Name No Rate category Rate
10. Normalization 10

We now have 3 tables:

tblProjects_Employees
Project No  Employee No
1203        11
1203        12
1203        16
1506        11
1506        17

tblProjects
Project No  Project Name
1203        Madagascar travel site
1506        Online estate agency

tblEmployees
Employee No  Employee Name    Rate Category  Rate
11           Jessica Brookes  A              £90
12           Andy Evans       B              £80
16           Max Fat          C              £70
17           Alex Branton     A              £80
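A minimal sketch of these three tables in SQLite (via Python's sqlite3 module) shows that nothing has been lost: a join through the keys rebuilds the original rows. The column names are assumptions for illustration, and the project numbers follow the first table of the exercise (1203 and 1506).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblProjects (
        project_no   INTEGER PRIMARY KEY,
        project_name TEXT NOT NULL
    );
    CREATE TABLE tblEmployees (
        employee_no   INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL,
        rate_category TEXT NOT NULL,
        rate          INTEGER NOT NULL
    );
    CREATE TABLE tblProjects_Employees (
        project_no  INTEGER REFERENCES tblProjects (project_no),
        employee_no INTEGER REFERENCES tblEmployees (employee_no),
        PRIMARY KEY (project_no, employee_no)
    );
""")
conn.executemany("INSERT INTO tblProjects VALUES (?, ?)", [
    (1203, "Madagascar travel site"),  # the name is keyed in once only
    (1506, "Online estate agency"),
])
conn.executemany("INSERT INTO tblEmployees VALUES (?, ?, ?, ?)", [
    (11, "Jessica Brookes", "A", 90),
    (12, "Andy Evans", "B", 80),
    (16, "Max Fat", "C", 70),
    (17, "Alex Branton", "A", 80),
])
conn.executemany("INSERT INTO tblProjects_Employees VALUES (?, ?)", [
    (1203, 11), (1203, 12), (1203, 16), (1506, 11), (1506, 17),
])

# Join the link table to both lookup tables to rebuild the flat view
flat_view = conn.execute("""
    SELECT pe.project_no, p.project_name, e.employee_name
    FROM tblProjects_Employees AS pe
    JOIN tblProjects  AS p ON p.project_no  = pe.project_no
    JOIN tblEmployees AS e ON e.employee_no = pe.employee_no
    ORDER BY pe.project_no, pe.employee_no
""").fetchall()

for row in flat_view:
    print(row)
```

The join returns five rows, one per employee per project, while each project name is stored only once.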
11. Normalization 11

Looking at the project table, note the reduction in:

Redundant data
The text "Madagascar travel site" is stored once only, not for each occurrence of an employee working on the project.

Inconsistent data
Because we only store the project name once, we are less likely to enter "Madagascat".

The link is made through the key, Project No. Obviously there is no way to remove this duplication without losing the relation altogether, but it is far more efficient to store a short number repeatedly than a large chunk of text.
12. Normalization 12

Our model has improved but is still far from perfect. There is still room for inconsistency.

Employee No  Employee Name    Rate Category  Rate
11           Jessica Brookes  A              £90
12           Andy Evans       B              £80
16           Max Fat          C              £70
17           Alex Branton     A              £80

Alex Branton is being paid £80 while Jessica Brookes gets £90, but they're in the same rate category!

Again, we have stored redundant data: the hourly rate / rate category relationship is being stored in its entirety, i.e. we have to key in both the rate category AND the hourly rate.
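The inconsistency above can be found mechanically by collecting the rates keyed in for each category. This is a sketch only; the function name conflicting_categories is an assumption for illustration.

```python
# (employee_no, employee_name, rate_category, rate) from the employees table
employees = [
    (11, "Jessica Brookes", "A", 90),
    (12, "Andy Evans", "B", 80),
    (16, "Max Fat", "C", 70),
    (17, "Alex Branton", "A", 80),
]


def conflicting_categories(rows):
    """Return the categories that have been keyed in with more than one rate."""
    rates = {}
    for _employee_no, _name, category, rate in rows:
        rates.setdefault(category, set()).add(rate)
    return {cat: sorted(found) for cat, found in rates.items() if len(found) > 1}


print(conflicting_categories(employees))
```

Category A is reported with two different rates, £80 and £90. Categories B and C each appear with a single rate and are not reported.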
13. Normalization 13

The solution, as before, is to remove this excess data to another table. We do this by:

Looking for Transitive Relationships
Relationships where a non-key attribute is dependent on another non-key attribute. Hourly rate should depend on rate category BUT rate category is not a key.

Removing Transitive Relationships
As before we remove the redundant data and place it in a separate table. In this case we create a new table tblRates and add the fields rate category and hourly rate. We then delete hourly rate from the employees table.
14. Normalization 14

We now have 4 tables:

tblProjects_Employees
Project No  Employee No
1203        11
1203        12
1203        16
1506        11
1506        17

tblProjects
Project No  Project Name
1203        Madagascar travel site
1506        Online estate agency

tblEmployees
Employee No  Employee Name    Rate Category
11           Jessica Brookes  A
12           Andy Evans       B
16           Max Fat          C
17           Alex Branton     A

tblRates
Rate Category  Rate
A              £90
B              £80
C              £70
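A sketch of the two tables affected by this step, again using SQLite through Python's sqlite3 module. The two project tables are unchanged and left out for brevity; the column names and the helper hourly_rate are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE tblRates (
        rate_category TEXT PRIMARY KEY,
        hourly_rate   INTEGER NOT NULL
    );
    CREATE TABLE tblEmployees (
        employee_no   INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL,
        rate_category TEXT NOT NULL REFERENCES tblRates (rate_category)
    );
""")
conn.executemany("INSERT INTO tblRates VALUES (?, ?)",
                 [("A", 90), ("B", 80), ("C", 70)])
conn.executemany("INSERT INTO tblEmployees VALUES (?, ?, ?)", [
    (11, "Jessica Brookes", "A"),
    (12, "Andy Evans", "B"),
    (16, "Max Fat", "C"),
    (17, "Alex Branton", "A"),
])


def hourly_rate(employee_no):
    """Look up an employee's rate through the rate category."""
    row = conn.execute(
        "SELECT r.hourly_rate FROM tblEmployees AS e "
        "JOIN tblRates AS r ON r.rate_category = e.rate_category "
        "WHERE e.employee_no = ?",
        (employee_no,),
    ).fetchone()
    return row[0] if row else None


print(hourly_rate(11), hourly_rate(17))
```

Both employees in category A now get the same rate, because the rate is stored in one place only.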
15. Normalization 15

Again, we have cut down on redundancy and it is now impossible to assume Rate category A is associated with anything but £90.

Our model is now in its most efficient format with:
- Minimum REDUNDANCY
- Minimum INCONSISTENCY
16. Normalization 16

What we have formally done is NORMALIZE the database.

At the beginning we had a data structure:

Project No
Project Name
Employee No (1n)
Employee name (1n)
Rate Category (1n)
Hourly Rate (1n)

(1n indicates there are many occurrences of the field - it is a repeating group.)

To begin the normalization process we start by moving from zero normal form to 1st normal form.
17. Normalization 17

The definition of 1st normal form
- There are no repeating groups
- All the key attributes are defined
- All attributes are dependent on the primary key

So far, we have no keys, and there are repeating groups. So we remove the repeating groups and define the keys, and are left with:

Employee Project table
Project number - part of key
Project name
Employee number - part of key
Employee name
Rate category
Hourly rate

This table is in first normal form (1NF).
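Removing the repeating group can be sketched as flattening a nested structure: each project carries a list of employees, and the result has one row per project and employee pair. The dictionary layout and the function name to_1nf are assumptions for illustration.

```python
# Zero normal form: each project holds a repeating group of employees
projects = [
    {"project_no": 1203, "project_name": "Madagascar travel site",
     "employees": [(11, "Jessica Brookes", "A", 90),
                   (12, "Andy Evans", "B", 80),
                   (16, "Max Fat", "C", 70)]},
    {"project_no": 1506, "project_name": "Online estate agency",
     "employees": [(11, "Jessica Brookes", "A", 90),
                   (17, "Alex Branton", "B", 80)]},
]


def to_1nf(projects):
    """Write out one flat row per (project, employee) pair."""
    rows = []
    for project in projects:
        for employee_no, name, category, rate in project["employees"]:
            rows.append((project["project_no"], project["project_name"],
                         employee_no, name, category, rate))
    return rows


rows = to_1nf(projects)

# The composite key (project number, employee number) must be unique
keys = [(row[0], row[2]) for row in rows]
print(len(rows), len(set(keys)) == len(keys))
```

The five resulting rows each have a distinct composite key, which is what the 1NF table requires.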
18. Normalization 18

A table is in 2nd normal form if
- It's already in first normal form
- It includes no partial dependencies (where an attribute is dependent on only part of the key)

We look through the fields:
- Project name is dependent only on project number
- Employee name, rate category and hourly rate are dependent only on employee number

So we remove them, and place these fields in a separate table, with the key being that part of the original key they are dependent on. We are left with the following three tables:
19. Normalization 19

Employee Project table
Project number - part of key
Employee number - part of key

Employee table
Employee number - primary key
Employee name
Rate category
Hourly rate

Project table
Project number - primary key
Project name

The tables are now in 2nd normal form (2NF). Are they in 3rd normal form?
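The split into three tables can be sketched as taking the distinct values of selected columns from the 1NF rows. The helper distinct_columns and the column names are assumptions for illustration.

```python
COLUMNS = ("project_no", "project_name", "employee_no",
           "employee_name", "rate_category", "rate")
DATA = [
    (1203, "Madagascar travel site", 11, "Jessica Brookes", "A", 90),
    (1203, "Madagascar travel site", 12, "Andy Evans", "B", 80),
    (1203, "Madagascar travel site", 16, "Max Fat", "C", 70),
    (1506, "Online estate agency", 11, "Jessica Brookes", "A", 90),
    (1506, "Online estate agency", 17, "Alex Branton", "B", 80),
]
rows = [dict(zip(COLUMNS, values)) for values in DATA]


def distinct_columns(rows, columns):
    """Keep only the named columns and drop duplicate rows (first-seen order)."""
    result = []
    for row in rows:
        item = tuple(row[c] for c in columns)
        if item not in result:
            result.append(item)
    return result


employee_project = distinct_columns(rows, ["project_no", "employee_no"])
employees = distinct_columns(
    rows, ["employee_no", "employee_name", "rate_category", "rate"])
projects = distinct_columns(rows, ["project_no", "project_name"])

print(len(employee_project), len(employees), len(projects))
```

The link table keeps all five rows, while the employee table shrinks to four rows and the project table to two, because Jessica Brookes and each project name are now stored once.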
20. Normalization 20

A table is in 3rd normal form if
- It's already in second normal form
- It includes no transitive dependencies (where a non-key attribute is dependent on another non-key attribute)

We can narrow our search down to the Employee table, which is the only one with more than one non-key attribute. Employee name is not dependent on either Rate category or Hourly rate, and the same applies to Rate category, but Hourly rate is dependent on Rate category. So, as before, we remove it, placing it in its own table, with the attribute it was dependent on as key, as follows:
21. Normalization 21

Employee project table
Project number - part of key
Employee number - part of key

Employee table
Employee number - primary key
Employee name
Rate category

Rate table
Rate category - primary key
Hourly rate

Project table
Project number - primary key
Project name

These tables are all now in 3rd normal form, and ready to be implemented.
22. Normalization 22

There are other normal forms - Boyce-Codd normal form, and 4th normal form - but these are very rarely used for business applications. In most cases, tables in 3rd normal form are already in these normal forms anyway.

Before you start normalizing everything, a word of warning. No process is better than common sense. Take a look at this example.

Customer table
Customer Number - primary key
Name
Address
Postcode
Town
23. Normalization 23

What normal form is this table in? Giving it a quick glance, we see:

- No repeating groups, and a primary key defined, so it's at least in 1st normal form.
- There's only one key, so we needn't even look for partial dependencies, so it's at least in 2nd normal form.

How about transitive dependencies? Well, it looks like Town might be determined by Postcode. And in most parts of the world that's usually the case.

So we should remove Town, and place it in a separate table, with Postcode as the key?
24. Normalization 24

No! Although this table is not technically in 3rd normal form, removing this information is not worth it. Creating more tables increases the load slightly, slowing processing down. This is often counteracted by the reduction in table sizes and redundant data. But in this case, where the town would almost always be referenced as part of the address, it isn't worth it. Perhaps a company that uses the data to produce regular mailing lists of thousands of customers should normalize fully.

It always comes down to how the data is going to be used. Normalization is just a helpful process that usually results in the most efficient table structure, and not a rule for database design.