Normalization 1

Introduction
In this exercise we are looking at the optimisation of data structure. The example system we are going to use as a model is a database to keep track of employees of an organisation working on different projects.

Objectives
By the end of the exercise you should be able to:
- Show understanding of why we normalize data
- Give formal definitions of 1NF, 2NF & 3NF
- Apply the process of normalization to your own work
Normalization 2

The data we would want to store could be expressed as:

Project No  Project Name            Employee No  Employee Name    Rate category  Rate
1203        Madagascar travel site  11           Jessica Brookes  A              £90
                                    12           Andy Evans       B              £80
                                    16           Max Fat          C              £70
1506        Online estate agency    11           Jessica Brookes  A              £90
                                    17           Alex Branton     B              £80
Normalization 3

Three problems become apparent with our current model:

Tables in an RDBMS use a simple grid structure
Each project has a set of employees, so we can’t even use this format to enter data into a table. How would you construct a query to find the employees working on each project?

All tables in an RDBMS need a key
Each record in an RDBMS must have a unique identity. Which field should be the primary key?

Data entry should be kept to a minimum
Our main problem is that each project contains repeating groups, which lead to redundancy and inconsistency.
Normalization 4

We could place the data into a table called tblProjects_Employees:

Project No.  Project Name            Employee No.  Employee Name    Rate category  Rate
1203         Madagascar travel site  11            Jessica Brookes  A              £90
1203         Madagascar travel site  12            Andy Evans       B              £80
1203         Madagascat travel site  16            Max Fat          C              £70
1506         Online estate agency    11            Jessica Brookes  A              £90
1506         Online estate agency    17            Alex Branton     B              £80
Normalization 5

Addressing our three problems:

Tables in an RDBMS use a simple grid structure
We can find members of each project using a simple SQL or QBE search on either Project Number or Project Name.

All tables in an RDBMS need a key
We CAN uniquely identify each record. Although no single field can act as a primary key, we can use two or more fields to create a composite key.

Data entry should be kept to a minimum
Our main problem, that each project contains repeating groups, still remains. To create an RDBMS we have to eliminate these groups or sets.
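The first two points can be sketched with Python's built-in sqlite3 module. This is only an illustration: SQLite stands in for whatever RDBMS is actually used, the column names and the helper functions employees_on and is_duplicate are assumptions, and the rows are the five from the flat table (the third keeps its misspelling).

```python
import sqlite3

# Column names are illustrative assumptions, not taken from the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tblProjects_Employees (
        project_no    INTEGER NOT NULL,
        project_name  TEXT    NOT NULL,
        employee_no   INTEGER NOT NULL,
        employee_name TEXT    NOT NULL,
        rate_category TEXT    NOT NULL,
        hourly_rate   INTEGER NOT NULL,
        PRIMARY KEY (project_no, employee_no)  -- composite key
    )
    """
)

ROWS = [
    (1203, "Madagascar travel site", 11, "Jessica Brookes", "A", 90),
    (1203, "Madagascar travel site", 12, "Andy Evans", "B", 80),
    (1203, "Madagascat travel site", 16, "Max Fat", "C", 70),
    (1506, "Online estate agency", 11, "Jessica Brookes", "A", 90),
    (1506, "Online estate agency", 17, "Alex Branton", "B", 80),
]
conn.executemany(
    "INSERT INTO tblProjects_Employees VALUES (?, ?, ?, ?, ?, ?)", ROWS
)


def employees_on(project_no):
    """Names of the employees on one project, ordered by employee number."""
    cur = conn.execute(
        "SELECT employee_name FROM tblProjects_Employees "
        "WHERE project_no = ? ORDER BY employee_no",
        (project_no,),
    )
    return [name for (name,) in cur]


def is_duplicate(row):
    """True if the composite key rejects the row as a duplicate."""
    try:
        conn.execute(
            "INSERT INTO tblProjects_Employees VALUES (?, ?, ?, ?, ?, ?)", row
        )
    except sqlite3.IntegrityError:
        return True
    return False


print(employees_on(1203))     # ['Jessica Brookes', 'Andy Evans', 'Max Fat']
print(is_duplicate(ROWS[0]))  # True: project 1203 / employee 11 already exists
```

Neither Project No nor Employee No is unique on its own, but the pair is, which is why the second insert of the same pair is refused.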
Normalization 6

Did you notice that Madagascar was misspelled in the 3rd record? Imagine trying to spot this error in thousands of records. By using this structure (flat filing) we create:

Redundant data
Duplicate copies of data: we would have to key in "Madagascar travel site" 3 times. Not only do we waste storage space, we also risk creating:

Inconsistent data
The more often we have to key in data, the more likely we are to make mistakes (see IT01 notes on the importance of accurate data).
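Spotting this kind of error by eye does not scale, but it can be checked mechanically. The sketch below groups the project names by project number and reports any number that has more than one spelling. The function name conflicting_project_names and the tuple layout are assumptions made for the illustration.

```python
from collections import defaultdict

# (project_no, project_name, employee_no) taken from the flat table.
FLAT_ROWS = [
    (1203, "Madagascar travel site", 11),
    (1203, "Madagascar travel site", 12),
    (1203, "Madagascat travel site", 16),
    (1506, "Online estate agency", 11),
    (1506, "Online estate agency", 17),
]


def conflicting_project_names(rows):
    """Project numbers that appear with more than one project name."""
    names = defaultdict(set)
    for project_no, project_name, _employee_no in rows:
        names[project_no].add(project_name)
    return {no: sorted(found) for no, found in names.items() if len(found) > 1}


print(conflicting_project_names(FLAT_ROWS))
# {1203: ['Madagascar travel site', 'Madagascat travel site']}
```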
Normalization 7

The solution is simply to take out the duplication. We do this by:

Identifying a key
In this case we can use the project no and employee no to uniquely identify each row.

Project No  Employee No  Unique Identifier
1203        11           120311
1203        12           120312
1203        16           120316

Note: Project 1506 is not shown for reasons of space
Normalization 8

We look for partial dependencies
We look for fields that depend on only part of the key and not the entire key.

Field          Project No  Employee No
Project Name   yes
Employee Name              yes
Rate Category              yes
Rate                       yes

We remove partial dependencies
The fields listed are only dependent on part of the key, so we remove them from the table.
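The search for partial dependencies can be sketched as a check over the sample rows: a field is a candidate if a single part of the key always maps to one value of it. This is an illustration only. The names determines and partial_dependencies are assumptions, the misspelled project name is corrected in the sample data, and a check on sample data can only show that a dependency is not contradicted; the real dependencies come from the rules of the organisation.

```python
COLUMNS = (
    "project_no", "project_name", "employee_no",
    "employee_name", "rate_category", "hourly_rate",
)
ROWS = [
    dict(zip(COLUMNS, values))
    for values in [
        (1203, "Madagascar travel site", 11, "Jessica Brookes", "A", 90),
        (1203, "Madagascar travel site", 12, "Andy Evans", "B", 80),
        (1203, "Madagascar travel site", 16, "Max Fat", "C", 70),
        (1506, "Online estate agency", 11, "Jessica Brookes", "A", 90),
        (1506, "Online estate agency", 17, "Alex Branton", "B", 80),
    ]
]
KEY = ("project_no", "employee_no")
NON_KEY = ("project_name", "employee_name", "rate_category", "hourly_rate")


def determines(rows, lhs, rhs):
    """True if every combination of `lhs` values maps to a single `rhs` value."""
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def partial_dependencies(rows, key, non_key):
    """Map each non-key field to the single key part that determines it, if any."""
    found = {}
    for column in non_key:
        for part in key:
            if determines(rows, [part], column):
                found[column] = part
                break
    return found


for column, part in partial_dependencies(ROWS, KEY, NON_KEY).items():
    print(f"{column} depends only on {part}")
```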
Normalization 9

We create new tables
Clearly we can’t take the data out and leave it out of our database. We put it into a new table consisting of the field that has the partial dependency and the field it is dependent on. Looking at our example we will need to create two new tables:

Dependent On  Partially Dependent
Project No    Project Name

Dependent On  Partially Dependent
Employee No   Employee Name
              Rate category
              Rate
Normalization 10

We now have 3 tables:

tblProjects_Employees
Project No  Employee No
1203        11
1203        12
1203        16
1506        11
1506        17

tblProjects
Project No  Project Name
1203        Madagascar travel site
1506        Online estate agency

tblEmployees
Employee No  Employee Name    Rate Category  Rate
11           Jessica Brookes  A              £90
12           Andy Evans       B              £80
16           Max Fat          C              £70
17           Alex Branton     A              £80
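As a rough sketch, the three tables could be declared and queried as follows, again using Python's sqlite3 module as a stand-in for the RDBMS. The column names and the helper project_team are assumptions, and the project numbers follow the earlier slides (1203 and 1506).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces REFERENCES when asked
conn.executescript(
    """
    CREATE TABLE tblProjects (
        project_no   INTEGER PRIMARY KEY,
        project_name TEXT NOT NULL
    );
    CREATE TABLE tblEmployees (
        employee_no   INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL,
        rate_category TEXT NOT NULL,
        hourly_rate   INTEGER NOT NULL
    );
    CREATE TABLE tblProjects_Employees (
        project_no  INTEGER NOT NULL REFERENCES tblProjects (project_no),
        employee_no INTEGER NOT NULL REFERENCES tblEmployees (employee_no),
        PRIMARY KEY (project_no, employee_no)
    );
    """
)
conn.executemany(
    "INSERT INTO tblProjects VALUES (?, ?)",
    [(1203, "Madagascar travel site"), (1506, "Online estate agency")],
)
conn.executemany(
    "INSERT INTO tblEmployees VALUES (?, ?, ?, ?)",
    [
        (11, "Jessica Brookes", "A", 90),
        (12, "Andy Evans", "B", 80),
        (16, "Max Fat", "C", 70),
        (17, "Alex Branton", "A", 80),
    ],
)
conn.executemany(
    "INSERT INTO tblProjects_Employees VALUES (?, ?)",
    [(1203, 11), (1203, 12), (1203, 16), (1506, 11), (1506, 17)],
)


def project_team(project_no):
    """(project name, employee name) pairs for one project, rebuilt with joins."""
    cur = conn.execute(
        """
        SELECT p.project_name, e.employee_name
        FROM tblProjects_Employees AS pe
        JOIN tblProjects  AS p ON p.project_no  = pe.project_no
        JOIN tblEmployees AS e ON e.employee_no = pe.employee_no
        WHERE pe.project_no = ?
        ORDER BY e.employee_no
        """,
        (project_no,),
    )
    return cur.fetchall()


print(project_team(1506))
# [('Online estate agency', 'Jessica Brookes'), ('Online estate agency', 'Alex Branton')]
```

The project name is keyed in once in tblProjects, and the join puts it back next to each employee whenever it is needed.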
Normalization 11

Looking at the project table, note the reduction in:

Redundant data
The text “Madagascar travel site” is stored once only, not for each occurrence of an employee working on the project.

Inconsistent data
Because we only store the project name once, we are less likely to enter “Madagascat”.

The link is made through the key, Project No. Obviously there is no way to remove this duplication without losing the relation altogether, but it is far more efficient to store a short number repeatedly than a large chunk of text.
Normalization 12

Our model has improved but is still far from perfect. There is still room for inconsistency.

Employee No  Employee Name    Rate Category  Rate
11           Jessica Brookes  A              £90
12           Andy Evans       B              £80
16           Max Fat          C              £70
17           Alex Branton     A              £80

Alex Branton is being paid £80 while Jessica Brookes gets £90, but they’re in the same rate category!

Again, we have stored redundant data: the relationship between rate category and hourly rate is being stored in its entirety, i.e. we have to key in both the rate category AND the hourly rate for each employee.
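A query along the following lines would flag the problem. It is a sketch using Python's sqlite3 module, with assumed column names and an assumed helper conflicting_categories, and it lists every rate category that has been keyed in with more than one hourly rate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tblEmployees (
        employee_no   INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL,
        rate_category TEXT NOT NULL,
        hourly_rate   INTEGER NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO tblEmployees VALUES (?, ?, ?, ?)",
    [
        (11, "Jessica Brookes", "A", 90),
        (12, "Andy Evans", "B", 80),
        (16, "Max Fat", "C", 70),
        (17, "Alex Branton", "A", 80),
    ],
)


def conflicting_categories():
    """(category, lowest rate, highest rate) for categories with more than one rate."""
    cur = conn.execute(
        """
        SELECT rate_category, MIN(hourly_rate), MAX(hourly_rate)
        FROM tblEmployees
        GROUP BY rate_category
        HAVING COUNT(DISTINCT hourly_rate) > 1
        ORDER BY rate_category
        """
    )
    return cur.fetchall()


print(conflicting_categories())  # [('A', 80, 90)]
```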
Normalization 13

The solution, as before, is to remove this excess data to another table. We do this by:

Looking for Transitive Relationships
Relationships where a non-key attribute is dependent on another non-key attribute. Hourly rate should depend on rate category BUT rate category is not a key.

Removing Transitive Relationships
As before we remove the redundant data and place it in a separate table. In this case we create a new table tblRates and add the fields rate category and hourly rate. We then delete hourly rate from the employees table.
Normalization 14

We now have 4 tables:

tblProjects_Employees
Project No  Employee No
1203        11
1203        12
1203        16
1506        11
1506        17

tblProjects
Project No  Project Name
1203        Madagascar travel site
1506        Online estate agency

tblEmployees
Employee No  Employee Name    Rate Category
11           Jessica Brookes  A
12           Andy Evans       B
16           Max Fat          C
17           Alex Branton     A

tblRates
Rate Category  Rate
A              £90
B              £80
C              £70
Normalization 15

Again, we have cut down on redundancy and it is now impossible to assume Rate category A is associated with anything but £90.

Our model is now in its most efficient format with:
- Minimum REDUNDANCY
- Minimum INCONSISTENCY
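A small sketch of why the inconsistency can no longer occur: once the hourly rate lives only in tblRates, every employee's rate is looked up through the rate category. The example uses Python's sqlite3 module; the column names and the helper hourly_rates are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces REFERENCES when asked
conn.executescript(
    """
    CREATE TABLE tblRates (
        rate_category TEXT PRIMARY KEY,
        hourly_rate   INTEGER NOT NULL
    );
    CREATE TABLE tblEmployees (
        employee_no   INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL,
        rate_category TEXT NOT NULL REFERENCES tblRates (rate_category)
    );
    """
)
conn.executemany(
    "INSERT INTO tblRates VALUES (?, ?)", [("A", 90), ("B", 80), ("C", 70)]
)
conn.executemany(
    "INSERT INTO tblEmployees VALUES (?, ?, ?)",
    [
        (11, "Jessica Brookes", "A"),
        (12, "Andy Evans", "B"),
        (16, "Max Fat", "C"),
        (17, "Alex Branton", "A"),
    ],
)


def hourly_rates():
    """Each employee's hourly rate, looked up through the rate category."""
    cur = conn.execute(
        """
        SELECT e.employee_name, r.hourly_rate
        FROM tblEmployees AS e
        JOIN tblRates AS r ON r.rate_category = e.rate_category
        ORDER BY e.employee_no
        """
    )
    return dict(cur.fetchall())


rates = hourly_rates()
print(rates["Jessica Brookes"], rates["Alex Branton"])  # 90 90
```

Because category A has exactly one row in tblRates, both category A employees necessarily come out at the same rate.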
Normalization 16

What we have formally done is NORMALIZE the database.

At the beginning we had a data structure:

Project No
Project Name
Employee No (1n)
Employee name (1n)
Rate Category (1n)
Hourly Rate (1n)

(1n indicates there are many occurrences of the field; it is a repeating group.)

To begin the normalization process we start by moving from zero normal form to 1st normal form.
Normalization 17

The definition of 1st normal form
- There are no repeating groups
- All the key attributes are defined
- All attributes are dependent on the primary key

So far, we have no keys, and there are repeating groups. So we remove the repeating groups and define the keys, and are left with:

Employee Project table
Project number - part of key
Project name
Employee number - part of key
Employee name
Rate category
Hourly rate

This table is in first normal form (1NF).
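The move to 1NF can be pictured as flattening: each entry of the repeating group becomes its own row, with the project fields copied alongside it. The sketch below assumes a nested Python structure for the unnormalised data and a hypothetical helper to_first_normal_form; neither comes from the slides.

```python
# Unnormalised form: each project carries a repeating group of employees.
UNNORMALISED = [
    {
        "project_no": 1203,
        "project_name": "Madagascar travel site",
        "employees": [  # the repeating group
            (11, "Jessica Brookes", "A", 90),
            (12, "Andy Evans", "B", 80),
            (16, "Max Fat", "C", 70),
        ],
    },
    {
        "project_no": 1506,
        "project_name": "Online estate agency",
        "employees": [
            (11, "Jessica Brookes", "A", 90),
            (17, "Alex Branton", "B", 80),
        ],
    },
]


def to_first_normal_form(projects):
    """One flat row per (project, employee) pair, with no nested group left."""
    rows = []
    for project in projects:
        for employee_no, employee_name, rate_category, hourly_rate in project["employees"]:
            rows.append(
                (
                    project["project_no"],
                    project["project_name"],
                    employee_no,
                    employee_name,
                    rate_category,
                    hourly_rate,
                )
            )
    return rows


FIRST_NF = to_first_normal_form(UNNORMALISED)
# Project number plus employee number identifies every row uniquely.
KEYS = [(row[0], row[2]) for row in FIRST_NF]
print(len(FIRST_NF), len(set(KEYS)))  # 5 5
```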
Normalization 18

A table is in 2nd normal form if
- It’s already in first normal form
- It includes no partial dependencies (where an attribute is dependent on only part of the key)

We look through the fields:
- Project name is dependent only on project number
- Employee name, rate category and hourly rate are dependent only on employee number

So we remove them, and place these fields in a separate table, with the key being that part of the original key they are dependent on. We are left with the following three tables:
Normalization 19

Employee Project table
Project number - part of key
Employee number - part of key

Employee table
Employee number - primary key
Employee name
Rate category
Hourly rate

Project table
Project number - primary key
Project name

The tables are now in 2nd normal form (2NF). Are they in 3rd normal form?
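In relational terms each of these three tables is a projection of the 1NF table: keep some of the columns and drop the duplicate rows. A minimal sketch, assuming dictionary rows and a hypothetical helper project_onto:

```python
COLUMNS = (
    "project_no", "project_name", "employee_no",
    "employee_name", "rate_category", "hourly_rate",
)
FIRST_NF = [
    dict(zip(COLUMNS, values))
    for values in [
        (1203, "Madagascar travel site", 11, "Jessica Brookes", "A", 90),
        (1203, "Madagascar travel site", 12, "Andy Evans", "B", 80),
        (1203, "Madagascar travel site", 16, "Max Fat", "C", 70),
        (1506, "Online estate agency", 11, "Jessica Brookes", "A", 90),
        (1506, "Online estate agency", 17, "Alex Branton", "B", 80),
    ]
]


def project_onto(rows, columns):
    """Distinct values of `columns`, in first-seen order (a relational projection)."""
    seen = []
    for row in rows:
        item = tuple(row[column] for column in columns)
        if item not in seen:
            seen.append(item)
    return seen


employee_project_table = project_onto(FIRST_NF, ("project_no", "employee_no"))
employee_table = project_onto(
    FIRST_NF, ("employee_no", "employee_name", "rate_category", "hourly_rate")
)
project_table = project_onto(FIRST_NF, ("project_no", "project_name"))

print(project_table)
# [(1203, 'Madagascar travel site'), (1506, 'Online estate agency')]
print(len(employee_project_table), len(employee_table))  # 5 4
```

Jessica Brookes works on both projects but appears only once in the employee table, which is where the saving in redundant data comes from.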
Normalization 20

A table is in 3rd normal form if
- It’s already in second normal form
- It includes no transitive dependencies (where a non-key attribute is dependent on another non-key attribute)

We can narrow our search down to the Employee table, which is the only one with more than one non-key attribute. Employee name is not dependent on either Rate category or Hourly rate, and Rate category is not dependent on either of the other two, but Hourly rate is dependent on Rate category. So, as before, we remove it, placing it in its own table, with the attribute it was dependent on as key, as follows:
Normalization 21

Employee project table
Project number - part of key
Employee number - part of key

Employee table
Employee number - primary key
Employee name
Rate category

Rate table
Rate category - primary key
Hourly rate

Project table
Project number - primary key
Project name

These tables are all now in 3rd normal form, and ready to be implemented.
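The last split can be sketched in a few lines, together with a check that nothing is lost: joining the slimmed-down employee table back to the rate table should reproduce the 2NF employee table exactly. The helpers split_rates and rejoin are hypothetical names, and the sample rates are those of the final rate table (A is £90).

```python
# 2NF employee table: (employee_no, employee_name, rate_category, hourly_rate)
EMPLOYEES_2NF = [
    (11, "Jessica Brookes", "A", 90),
    (12, "Andy Evans", "B", 80),
    (16, "Max Fat", "C", 70),
    (17, "Alex Branton", "A", 90),
]


def split_rates(employees):
    """Split off the rate table; fail if one category carries two different rates."""
    rates = {}
    slim = []
    for employee_no, name, category, rate in employees:
        if rates.setdefault(category, rate) != rate:
            raise ValueError(f"rate category {category} has conflicting rates")
        slim.append((employee_no, name, category))
    return slim, rates


def rejoin(slim, rates):
    """Rebuild the 2NF rows by looking each category up in the rate table."""
    return [(no, name, category, rates[category]) for no, name, category in slim]


employee_table, rate_table = split_rates(EMPLOYEES_2NF)
print(rate_table)                                        # {'A': 90, 'B': 80, 'C': 70}
print(rejoin(employee_table, rate_table) == EMPLOYEES_2NF)  # True
```

If the input still contained a category A employee at £80, split_rates would raise an error rather than silently pick one of the two rates.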
Normalization 22

There are other normal forms, such as Boyce-Codd normal form and 4th normal form, but these are very rarely used for business applications. In most cases, tables in 3rd normal form are already in these normal forms anyway.

Before you start normalizing everything, a word of warning. No process is better than common sense. Take a look at this example.

Customer table
Customer Number - primary key
Name
Address
Postcode
Town
Normalization 23

What normal form is this table in? Giving it a quick glance, we see:

- There are no repeating groups, and a primary key is defined, so it’s at least in 1st normal form.
- There’s only one key, so we needn’t even look for partial dependencies, so it’s at least in 2nd normal form.

How about transitive dependencies? Well, it looks like Town might be determined by Postcode. And in most parts of the world that’s usually the case.

So we should remove Town, and place it in a separate table, with Postcode as the key?
Normalization 24

No! Although this table is not technically in 3rd normal form, removing this information is not worth it. Creating more tables increases the load slightly, slowing processing down. This is often counteracted by the reduction in table sizes and in redundant data. But in this case, where the town would almost always be referenced as part of the address, it isn’t worth it. Perhaps a company that uses the data to produce regular mailing lists of thousands of customers should normalize fully.

It always comes down to how the data is going to be used. Normalization is just a helpful process that usually results in the most efficient table structure, and not a rule for database design.