Database normalisation is perhaps the most difficult concept to understand for newcomers to databases. It is the most critical process in the design of databases and is often performed by database professionals. The concept applies to all relational databases including Microsoft Access and was developed by Edgar Codd for IBM in the early 1970s.This module offers a simplified explanation to database normalisation but should be attempted as the first module in the course as it provides a basis for understanding how to successfully construct queries. Please read the topics in the specified order.
Normalisation of data First normal form Second normal form Third normal form De-normalisation
1. Introduction.
Most modern databases are of the relational database type. To obtain the best results requires an understanding of data relationships and methods of applying it and how to use many other productivity tools provided. A well-designed database must fulfil the following requirements:
· Eliminate or minimise the duplication of data, and allow data to be entered and edited quickly and easily.
· Provide fast data access time in a large database, with clear and effective reporting quickly and easily as each user requires.
· Prevent invalid data from being entered, maintain data integrity, and prevent unauthorised access to confidential information.
· Be User friendly, and allow for future expansion as new needs are identified.
The source of your data can affect the design process. As well as manually entering data into tables you have created, data may be imported directly from an existing database, spreadsheet or other type of data file. Imported tables and data become part of your database and can be structured as required. Alternatively, the database may be linked to external data files, which remain in their original format and location, but are used as if they were part of your database. You will be able to change the data within these files but will not be able to amend the structure of them.
2. Normalisation of data.
One of the purposes of using a relational database is to cut down on duplication of data. This makes it easier to enter, search for, and use the data, and takes up less disk space. A normalised database is easier to maintain and can be modified more readily as requirements change. A large database will also run significantly faster when it is normalised.
Normalisation is basically a case of splitting duplicated data into separate tables and preventing data redundancy where fields will be left blank.
For example, a Contacts table may look like this, where the CoCode field is the primary key.
CoCode
|
CoName
|
Address
|
Location
|
Contact1
|
Title
|
Phone
|
Contact2
|
Title
|
Phone
|
MCW1
|
McWilliams
|
21 Arrow St.
|
Glasgow
|
S.Purdy
|
Buyer
|
534 9214
|
|
|
|
MCW2
|
McWilliams
|
14 Clyde Cres.
|
London
|
J.Taylor
|
Buyer
|
226 8123
|
W.Allen
|
Sales
|
226 8190
|
BRWN
|
Brown & Son
|
196 Wall Rd.
|
Luton
|
A.Green
|
Manager
|
920 1444
|
L.Walters
|
Director
|
920 1172
|
MCHL
|
Mitchell & Co.
|
Old Dover Rd.
|
Dover
|
D.Phillips
|
Buyer
|
876 9339
|
|
|
|
|
Figure 1.
Or like this:
CoCode
|
CoName
|
Address
|
Location
|
Contact1
|
Title
|
Phone
|
MCW1
|
McWilliams
|
21 Arrow St.
|
Glasgow
|
S.Purdy
|
Buyer
|
534 9214
|
MCW2
|
McWilliams
|
14Clyde Cres.
|
London
|
J.Taylor
|
Buyer
|
226 8123
|
MCW2
|
McWilliams
|
14 Clyde Cres.
|
London
|
W.Allen
|
Sales
|
226 8190
|
BRWN
|
Brown & Son
|
196 Wall Rd.
|
Luton
|
A.Green
|
Manager
|
920 1444
|
BRWN
|
Brown & Son
|
196 Wall Rd.
|
Luton
|
L.Walters
|
Director
|
920 1172
|
MCHL
|
Mitchell & Co.
|
Old Dover Rd.
|
Dover
|
D.Phillips
|
Buyer
|
876 9339
|
|
Figure 2.
Both tables contain data duplication of the company details, and the first table would need to be expanded to accommodate the company with the most contacts, and therefore wasting space (i.e. data redundancy). The first step of normalisation is to create a separate table for duplicated fields.
3. First Normal Form.
First Normal Form splits the data into tables, which have no repeating groups or redundant data. If you split it into too many tables initially, some of them can be combined later.
CoCode
|
CoName
|
Address
|
Location
|
MCW1
|
McWilliams
|
21 Arrow St.
|
Glasgow
|
MCW2
|
McWilliams
|
14 Clyde Cres.
|
London
|
BRWN
|
Brown & Son
|
196 Wall Rd.
|
Luton
|
MCHL
|
Mitchell & Co.
|
Old Dover Rd.
|
Dover
|
|
Figure 3 - Companies Table.
Contact
|
Title
|
Phone
|
S.Purdy
|
Buyer
|
534 9214
|
J.Taylor
|
Buyer
|
226 8123
|
A.Green
|
Sales
|
226 8190
|
D.Phillips
|
Manager
|
920 1444
|
W.Allen
|
Director
|
920 172
|
L.Walters
|
Buyer
|
876 9339
|
|
Figure 4 – Contacts Table
The result above does not link people in the Contact table with the companies in the Companies table, so it is necessary to duplicate the Primary key from the Companies table in the Contacts table also.
CoCode
|
Contact
|
Title
|
Phone
|
MCW1
|
S.Purdy
|
Buyer
|
534 9214
|
MCW2
|
J.Taylor
|
Buyer
|
226 8123
|
MCW2
|
A.Green
|
Sales
|
226 8190
|
BRWN
|
D.Phillips
|
Manager
|
920 1444
|
BRWN
|
W.Allen
|
Director
|
920 172
|
MCHL
|
L.Walters
|
Buyer
|
876 9339
|
|
Figure 5 – revised Contacts table
The Primary key in the Contacts table could not be either CoCode or Contact alone because both of these fields can correctly contain duplications, but needs to be a joint key (also known as a Composite key) – CoCode and Contact combined. CoCode in the Contacts table is a Secondary (also known as a Foreign key) linking back to the Companies table.
4. Second Normal Form.
Second Normal Form occurs when all of the non-key fields are uniquely dependent on the Primary key.
In the example below the Contacts table is extended to include Correspondence details - the date and type of contact made, and the name, department and position of the person making the contact.
CoCode
|
Contact
|
Title
|
Phone
|
Date
|
Type
|
Name
|
Dept.
|
Position
|
BRWN
|
D.Phillips
|
Manager
|
920 1444
|
7 May
|
Phone
|
RT
|
Accts
|
Credit Controller
|
BRWN
|
W.Allen
|
Director
|
920 1172
|
7 May
|
Phone
|
RT
|
Accts
|
Credit Controller
|
MCW2
|
J.Taylor
|
Buyer
|
226 8123
|
14 May
|
Letter
|
NWB
|
Sales
|
Manager
|
BRWN
|
D.Phillips
|
Manager
|
920 1444
|
14 May
|
Phone
|
RT
|
Accts
|
Credit Controller
|
MCW1
|
S.Purdy
|
Buyer
|
534 9214
|
16 May
|
Phone
|
JBC
|
Sales
|
Salesman
|
MCW1
|
S.Purdy
|
Buyer
|
534 9214
|
16 May
|
Letter
|
JBC
|
Sales
|
Salesman
|
MCW1
|
S.Purdy
|
Buyer
|
534 9214
|
16 May
|
Phone
|
NWB
|
Sales
|
Manager
|
MCW1
|
S.Purdy
|
Buyer
|
534 9214
|
18 May
|
Phone
|
RT
|
Accts
|
Credit Controller
|
MCHL
|
L.Walters
|
Buyer
|
876 9339
|
20 May
|
Phone
|
NWB
|
Sales
|
Salesman
|
BRWN
|
D.Phillips
|
Manager
|
920 1444
|
20 May
|
Phone
|
RT
|
Accts
|
Credit Controller
|
MCHL
|
L.Walters
|
Buyer
|
876 9339
|
21 May
|
Phone
|
JBC
|
Sales
|
Salesman
|
MCHL
|
L.Walters
|
Buyer
|
876 9339
|
22 May
|
Phone
|
JBC
|
Sales
|
Salesman
|
Figure 6 – Contacts Table
The new data breaks the Second Normal Form rule as the Dept and Position fields are not dependent on the joint key of CoCode and Contact. The details of the person making contact should be held in a separate Staff table.
Name
|
Dept
|
Position
|
RT
|
Accts
|
Credit Controller
|
NWB
|
Sales
|
Manager
|
JBC
|
Sales
|
Salesman
|
|
Figure 7 – Staff Table
The Staff table may be able to use Name as a primary key, or an additional field can be created containing a unique code for each individual. Either field could then be used in the Contacts table as a secondary key. The Contacts table then contains two secondary key fields, one linking to the Companies table and another linking to the Staff table.
5. Third Normal Form.
This is an extension to Second Normal Form in that all of the non-key fields should also depend on the Primary key, including the whole key in a multiple key field.
It is not really practical to keep correspondence type information in the Contacts table, as the contact details will be duplicated each time contact is made. It is better to create a separate table to hold correspondence data.
CoCode
|
Contact
|
Date
|
Type
|
Name
|
BRWN
|
D.Phillips
|
7 May
|
Phone
|
RT
|
BRWN
|
W.Allen
|
7 May
|
Phone
|
RT
|
MCW2
|
J.Taylor
|
14 May
|
Letter
|
NWB
|
BRWN
|
D.Phillips
|
14 May
|
Phone
|
RT
|
MCW1
|
S.Purdy
|
16 May
|
Phone
|
JBC
|
MCW1
|
S.Purdy
|
16 May
|
Letter
|
JBC
|
MCW1
|
S.Purdy
|
16 May
|
Phone
|
NWB
|
MCW1
|
S.Purdy
|
18 May
|
Phone
|
RT
|
MCHL
|
L.Walters
|
20 May
|
Phone
|
NWB
|
BRWN
|
D.Phillips
|
20 May
|
Phone
|
RT
|
MCHL
|
L.Walters
|
21 May
|
Phone
|
JBC
|
MCHL
|
L.Walters
|
22 May
|
Phone
|
JBC
|
Figure 8 – Correspondence Table.
In the Correspondence table only the Date and Type fields contain new data, the other fields are Secondary keys linking back to the parent tables – Contacts and Staff. The effect of this is that we now have four related tables – Companies linking to Contacts, linking to Correspondence, linking to Staff.
6. De-Normalisation.
The effect of normalisation is to create many small tables containing just a few fields, which can in itself create unnecessary duplication of key fields. De-normalisation reverses the process until an acceptable level of duplication and redundancy is achieved.
For example, a Staff database may keep a Staff History file. The current Salary details for each member of staff should rightly be kept in the Staff History file, but when the monthly payroll is run it would have to check the Staff History file for each member, slowing the process down. It is better to duplicate the current Salary in both places. Salary changes will take slightly longer to update because it needs to be changed in both places, but this is not performed as often as the monthly payroll, so it is a worthwhile trade off for speed in calculating the payroll.
|