Lecture Slides available: PDF PowerPoint NormalisationContents
What is normalisation?Normalisation is the process of taking data from a problem and reducing it to a set of relations while ensuring data integrity and eliminating data redundancy
Data should only be stored once and avoid storing data that can be calculated from other data already held in the database. During the process of normalisation redundancy must be removed, but not at the expense of breaking data integrity rules. If redundancy exists in the database then problems can arise when the database is in normal operation:
The removal of redundancy helps to prevent insertion, deletion, and update errors, since the data is only available in one attribute of one table in the database. The data in the database can be considered to be in one of a number of `normal forms'. Basically the normal form of the data indicates how much redundancy is in that data. The normal forms have a strict ordering:
There are other normal forms, such as 4th and 5th normal forms. They are rarely utilised in system design and are not considered further here. To be in a particular form requires that the data meets the criteria to also be in all normal forms before that form. Thus to be in 2nd normal form the data must meet the criteria for both 2nd normal form and 1st normal form. The higher the form the more redundancy has been eliminated. Integrity ConstraintsAn integrity constraint is a rule that restricts the values that may be present in the database. The relational data model includes constraints that are used to verify the validity of the data as well as adding meaningful structure to it:
The rows (or tuples) in a relation represent entities, and each one must be uniquely identified. Hence we have the primary key that must have a unique non-null value for each row.
This constraint involves the foreign keys. Foreign keys tie the relations together, so it is vitally important that the links are correct. Every foreign key must either be null or its value must be the actual value of a key in another relation. Understanding DataSometimes the starting point for understanding data is given in the form of relations and functional dependancies. This would be the case where the starting point in the process was a detailed specification of the problem. We already know what relations are. Functional dependancies are rules stating that given a certain set of attributes (the determinant) determines a second set of attributes. The definition of a functional dependency looks like A->B. In this case B is a single attribute but it can be as many attributes as required (for instance, X->J,K,L,M). In the functional dependency, the determinant (the left hand side of the -> sign) can determine the set of attributes on the right hand side of the -> sign. This basically means that A selects a particular value for B, and that A is unique. In the second example X is unique and selects a particular set of values for J,K,L, and M. It can also be said that B is functionally dependent on A. In addition, a particular value of A ALWAYS gives you a particular value for B, but not vice-versa. Consider this example: R(matric_no, firstname, surname, tutor_number, tutor_name) tutor_number -> tutor_name Here there is a relation R, and a functional dependency that indicates that:
There is actually a second functional dependency for this relation, which can be worked out from the relation itself. As the relation has a primary key, then given this attribute you can determine all the other attributes in R. This is an implied functional dependency and is not normally listed in the list of functional dependents. Extracting understanding It is possible that the relations and the determinants have not yet been defined for a problem, and therefore must be calculated from examples of the data. Consider the following Student table. Student - an unnormalised tablewith repeating groups
The subject/grade pair is repeated for each student. 960145 has 1 pair while 960150 has four. Repeating groups are placed inside another set of parentheses. From the table the following relation is generated: Student(matric_no, name, date_of_birth, ( subject, grade ) ) The repeating group needs a key in order that the relation can be correctly defined. Looking at the data one can see that grade repeats within matric_no (for instance, for 960150, the student has 2 D grades). However, subject never seems to repeat for a single matric_no, and therefore is a candidate key in the repeating group. Whenever keys or dependencies are extracted from example data, the information extracted is only as good as the data sample examined. It could be that another data sample disproves some of the key selections made or dependencies extracted. What is important however is that the information extracted during these exercises is correct for the data being examined. Looking at the data itself, we can see that the same name appears more than once in the name column. The name in conjunction with the date_of_birth seems to be unique, suggesting a functional dependency of: name, date_of_birth -> matric_no This implies that not only is the matric_no sufficient to uniquely identify a student, the student’s name combined with the date of birth is also sufficient to uniquely identify a student. It is therefore possible to have the relation Student written as: Student(matric_no, name, date_of_birth, ( subject, grade ) ) As guidance in cases where a variety of keys could be selected one should try to select the relation with the least number of attributes defined as primary keys. Flattened Tables Note that the student table shown above explicitly identifies the repeating group. It is also possible that the table presented will be what is called a flat table, where the repeating group is not explicitly shown: Student #2 - Flattened Table
The table still shows the same data as the previous example, but the format is different. We have removed the repeating group (which is good) but we have introduced redundancy (which is bad). Sometimes you will miss spotting the repeating group, so you may produce something like the following relation for the Student data. Student(matric_no, name, date_of_birth, subject, grade ) matric_no -> name, date_of_birth This data does not explicitly identify the repeating group, but as you will see the result of the normalisation process on this relation produces exactly the same relations as the normalisation of the version that explicitly does have a repeating group. First Normal Form
Relational databases require that each row only has a single value per attribute, and so a repeating group in a row is not allowed. To remove the repeating group, one of two things can be done:
Flatten table and Extend Primary KeyThe Student table with the repeating group can be written as: Student(matric_no, name, date_of_birth, ( subject, grade ) ) If the repeating group was flattened, as in the Student #2 data table, it would look something like: Student(matric_no, name, date_of_birth, subject, grade ) Although this is an improvement, we still have a problem. matric_no can no longer be the primary key - it does not have an unique value for each row. So we have to find a new primary key - in this case it has to be a compound key since no single attribute can uniquely identify a row. The new primary key is a compound key (matrix_no + subject). We have now solved the repeating groups problem, but we have created other complications. Every repetition of the matric_no, name, and data_of_birth is redundant and liable to produce errors. With the relation in its flattened form, strange anomalies appear in the system. Redundant data is the main cause of insertion, deletion, and updating anomalies. Insertion anomaly:With the primary key including subject, we cannot enter a new student until they have at least one subject to study. We are not allowed NULLs in the primary key so we must have an entry in both matric_no and subject before we can create a new record.
Update anomalyIf the name of a student were changed for example Smith, J. was changed to Green, J. this would require not one change but many one for every subject that Smith, J. studied. Deletion anomalyIf all of the records for the `Databases' subject were deleted from the table,we would inadvertently lose all of the information on the student with matric_no 960145. This would be the same for any student who was studying only one subject and the subject was deleted. Again this problem arises from the need to have a compound primary key. Decomposing the relation
Record
Student
Matric_no remains the key to the Student relation. It cannot be the complete key to the new Record relation - we end up with a compound primary key consisting of matric_no and subject. The matric_no is the link between the two tables - it will allow us to find out which subjects a student is studying . So in the Record relation, matric_no is the foreign key. This method has eliminated some of the anomalies. It does not always do so, it depends on the example chosen
Student and Record are now in First Normal Form. Second Normal FormSecond normal form (or 2NF) is a more stringent normal form defined as: A relation is in 2NF if, and only if, it is in 1NF and every non-key attribute is fully functionally dependent on the whole key. The problems arise when there is a compound key, e.g. the key to the Record relation - matric_no, subject. In this case it is possible for non-key attributes to depend on only part of the key - i.e. on only one of the two key attributes. This is what 2NF tries to prevent. Consider again the Student relation from the flattened Student #2 table: Student(matric_no, name, date_of_birth, subject, grade )
A dependency diagram is used to show how non-key attributes relate to each part or combination of parts in the primary key.
Interestingly this is the same set of relations as when we recognized that there were repeating terms in the table and directly removed the repeating terms. It should not really matter what process you followed when normalizing, as the end result should be similar relations. Third Normal Form3NF is an even stricter normal form and removes virtually all the redundant data :
By definition transitive functional dependency can only occur if there is more than one non-key field, so we can say that a relation in 2NF with zero or one non-key field must automatically be in 3NF.
Summary: 1NF
Summary: 2NF
Summary: 3NF
|
|