Domain and data dependency, Armstrong's axioms, Normal forms, Dependency preservation, Lossless design.
Size: 5.32 MB
Language: en
Added: Aug 08, 2020
Slides: 60 pages
Slide Content
Database Management
System
UNIT - 4
Relational Database Design
Outline...
* Domain and data dependency
* Armstrong’s axioms
* Normal forms
* Dependency preservation
* Lossless design
Todays Topic...
* Domain and data dependency
Student Table
Columns (5)
Attributes:
RollNo
Branch Semester
Title of column
101 Raju CE 3 8 si E
102 Mitesh [al] 3 7 3
Rows or Ge va: = 5 z E
Tuplesor 3
Records (7) 104 Nilesh EE 3 9 2
fi
105 Hitesh cl 3 7 z
106 Tarun ME 3 8 E
107 Suresh CE 3 9 §
—
——_ Degree = No of columns (5) ———————>
* Domain is a set of all possible unique values for a specific column.
+ Domain of Branch attribute is (CE, Cl, ME, EE)
Domain
* Domain is the set of legal values that can be assigned to an attribute.
Each attribute in a database must have a well-defined domain.
* General domains
* Some domains can only be described with a general statement of
what they contain. These are difficult or impossible to analyze
precisely.
* The best we can do is to make them VARCHAR strings that are long
enough to hold any expected value. The examples listed below
provide more context for these domains.
Domain
+ Names of people: We typically show these broken into First (which
might include a middle name or initial) and Last (which is really the
family name—some languages write this first). Depending on your
application, you might have to add attributes for a courtesy title (Mr.,
Ms., Dr., etc.), a suffix (Jr, Ill, etc.) or a nickname ('Tom' for 'Thomas'
and so on).
* Street addresses: Even the format of these can vary widely: '3201
Main St., Apt. 3', 'Sohnmattstraße 14', and 'P.O. Box 8259' are all valid
address strings.
Domain
* Domains with a pattern
* Some domains have at least some pattern in their permitted values.
These might be recognizable in code, for example with a regular
expression, although it is still impossible to insure that every value
that passes a validity check is actually correct.
+ Email addresses: To be valid, an email address must contain a single
@ sign that separates the user name from the server and domain
names. It cannot contain any spaces. Unfortunately, that’s about all
we can check.
Domain
* Domains with a precise pattern
* Few domains conform to a precise pattern that can be analyzed or
specified exactly.
+ United States Social Security numbers: These are always of the form
999-99-9999, where 9 represents any digit.
+ United States Zip codes: which are always of the form 99999 or
99999-9999.
Data dependency
* Data dependency means that one/more attribute uniquely identifies
other attributes of a relation. in more simple terms we can say that
some data values are dependent on other data values in order to get
recognized.
+ Example:
rollno> name
+ In this example rollno. will be unique for each student but two
students can have same name.
* Suppose we want to know about the student whose name is Amit but
there are two students with name Amit, so in this case name doesn’t
uniquely identifies the student(we have have confusion about which
student’s information we want).
Data dependency
» But if we take rollno.(say 101) then we will get information of one
student whose name is Amit and rollno. is 101, hence no confusion.
Therefore name is dependent on rollno.
Pitfall in Relational Database Design
« The goal of a relational-database design is to generate a set of
relational schemas that allows us to store information without
unnecessary redundancy, yet also allows us to retrieve information
easily.
>The bad database design may led to:
+ Repetition of Information.
+ Inability to represent certain information.
>The design goals of relational database design are:
+ Avoid redundant data.
+ Ensure that relationship among attributes are represented.
* Facilitate the checking of updates for violation of database integrity
constraints.
Function Dependency (FD)
» A function dependency is a type of constraint that is a generalization
of the notation ok key.
* Consider a relation schema R, and let a SR, and let B € a. The
functional dependency a > B.
* Consider the schema :
Loan_info = (Loan_no, Branch_name, Customer_name, Amount)
Loan_no > Branch_name
Loan_no > Amount
* There is no functional dependency:
Loan_no > Customer_name
Function Dependency (FD)
* The functional dependency (FD) is denoted as X > Y, where X is the left
hand side or the determinant of FD and Y is right hand side of the FD.
* A functional dependency X > Y is said to be trivial if Y € X.
Room re y Time
Smith 553 A532 40 Mon 11:45
Smith 353 A532 40 Wed 11:45
Clark 356 H940 300 Tue 1:45
Clark 356 H940 300 Thur 2:10
Turner 456 B278 45 Mon 3:05
Turner 456 B278 45 Wed 4:50
* For the above relation, the FD Course > Prof is satisfied. However, the FD
Prof > Couse is not satisfied, as one professor can teach to more than on
course.
Trivial FD
+ Trivial functional dependency
A functional dependency X > Y is trivial if Y, the right hand side of the
functional dependency, is a subset of X.
is non-trivial, as {Employee Phone} is not subset of {Employee ID,
Employee Address}.
Closure FD set
+» The set of functional dependencies that is logically implied by F is called
the closure of F and is written as F*.
* Suppose we are given a relation schema R = (A,B,C,G,H,|) and the set of
functional dependencies :
A 8
ADC
04 + MH
al
B >H
* The functional dependency
ASH
* Proof:
A>BandB>H
A>H
Armstrong’s Axioms
* We can use following three rules to find logically implied functional
dependencies. This collection of rules is called Armstrong’s axioms in honor of
the person who first proposed it.
i. Reflexivity Rules: If a is set of attributes and B Ca, then a > $ holds.
ii. Augmentation Rule: if a > ß holds and r is set of attributes, then ra > rß
holds.
iii. Transitivity Rule: if a > B holds and B > r holds, then a > r holds.
> Additional rules
i. Union rule: if a > B holds, and a > r holds, then a > Br holds.
ii. Decomposition rule: if a > Br holds, then a > B holds and a > r holds.
iii. Pseudo transitivity rule: if a > B holds, and rB> 6 holds, then ar > 6 holds.
Armstrong’s Axioms Example
* Consider schema :
R = (A,B,C,G,H,l) and the set of FDS (A> B, A > C, CG > H, CG > 1,B > H)
Some of members of F* are:
SAH
Proof :
A > Band B > H (By transitivity rule) A> H
“ca FHl
Proof:
CG > Hand CG > 1 (By union rule) CG > HI
* AGI
Proof :
A> Cand CG > | By pseudo transitivity rule AG > |
OR
A > C By augmentation rule AG > CG and CG > | By transitivity rule AG > |
Closure of Attributes
* Given a set of attributes A and a set of functional dependencies, the
closure of the set of attributes A under F, written as A*, is the set of
attributes B that can be derived from A by applying the inference
axioms to the functional dependencies of F, the closure of A is always
nonempty set because A > A by the axioms of reflexivity.
Irreducible Set of FD
+ A functional dependency set S is irreducible if the set has three
following properties.
1. Each right set of a functional dependency of S contains only one
attributes.
2. The left side of a functional dependency of S is irreducible. It means
that no attributes can be discarded from the determinant without
changing the closure S*
3. No functional dependency in S can be discarding from S without
changing the closure of S.
* Sets of FD with these properties are also called canonical or minimal.
Data Redundancy and Update Anomalies
+ A major aim of relational database design is to group attributes into
relations to minimize data redundancy and thereby reduce the file
storage space required by the implemented base relation.
« In staffbranch relation there is redundant data; branch address is
repeated for every member of staff located at the branch.
* In the branch relation the Branch_no is repeated in the staff relation.
Data Redundancy and Update Anomalies
* Relations that have redundant data have problems called update
anomalies, which are classified as:
* Insertion
* Deletion
+ Modification
Insertion anomalies
+ There are two main types of insertion anomalies:
de
To insert the details of new staff located at the branch number B007, we
must enter the correct details of branch number B007 so that the branch
details are consistent with the values for branch number B007 in other
tuples in Staffbranch relation. But above does not suffer from this
inconsistency.
To insert details of new branch that currently has no staff into
staffbaranch relation, it is necessary to enter nulls into the attributes for
Staff, such as Staff_no. But Stff_no is primary key of Staffbranch relation,
and attempting to enter nulls for Staff_no violations entry integrity and is
not allowed. The design of the relations show in above example avoid
this problem because branch details are entered in the Branch relation
separately from the Staff details.
Deletion anomalies
« If only one staff is working at a branch, and if that staff’s information
is deleted, then branch details are also lost, from the database.
* For example, is Staff_no S2’s record is deleted, the branch information
of Branch_no B003 is also lost.
+ The design of the relations in above example avoids this problems,
because Branch tuples are stored separately from Staff tuples and
only attributes Branch_no relates the two relations.
Modification anomalies
* If we want to change the value of one of the attributes of a particular
branch in the Staffbanch relation, for example, the address for branch
number B003, we must update the tuple of all staff located at that
branch. If this information is not carried out, the database will
become inconsistent.
* The above example illustrate that the Staff and Branch relation have
more desired properties than Staffbanch. We can avoid update
anomalies of Staffbranch relation by decomposing the original
relation into to Staff and Branch relation.
Decomposition using FD
+» We can avoid update anomalies by decomposition of the original
relation.
* The relation schema R is decomposed into relation schemas:
{R1,R2,R3,...,Rn} in such a way that
R1IUR2UR3...URn=R
+ “Find all bank branches that have made a loan in an amount less than
Rs. 1000”.
* The result of this query using the lending relation is : Mianus.
+ The result of the same query using the relation Branch_customer |X|
Customer_loan is Downtown and Mianus.
Decomposition using FD
+ From the above two case we can say that the decomposition of
lending relation into Branch_customer and Customer_loan results in
bad database design. It results in loss of information and repetition of
information.
* Because of loss of information, we call the decomposition of lending
relation into Branch_customer and Cutomer_loan a lossy
decomposition or lossy-join decomposition.
Desirable Properties of Decomposition
Lossless join: This properties ensure that any instance of the
original relation can be identified from corresponding instance in
the smaller relations.
Dependency preservation: This property ensure that a constraint
on the original relation can be maintained by simple enforcing
some constraint on each of the smaller relations.
Normalization
+ Normalization is an essential part of database design. A good
understanding of the semantics of data helps the designer to bullied
efficient design using the concept of normalization.
« Purpose of Normalization
+ Minimize redundancy in data.
* Remove insert, delete and update anomaly during database activities.
* Reduce the need to recognized data when it is modified or enhanced.
+ Normalization reduces a complex user view to a set of small and
stable subgroups of relations.
Normalization Forms
1. First normal form (1NF): A relation is said to be in the first normal
form if it is already in un-normalized form and it has no repeating
group.
2. Second normal form (2NF): A relation is said to be in the second
normal form if it is already in the first normal form and it has no
partial dependency.
3. Third normal form (3NF): A relation is said to be in the third normal
form if it is already in second normal form and it is has no
transitivity dependency.
4.
Normalization Forms
Boyce-Codd normal form (BCNF): A relation is said to be in Boyce-
Codd normal form if it is already in third normal form and every
determinant is a candidate key. It is stronger version of 3NF.
Fourth normal form (4NF): A relation is said to be in the fourth
normal form if it is already in BCNF and it has no multivalued
dependency.
. Fifth form (5NF): A relation is said to be in the fifth normal form if it
is already in 3NF and it is has no join dependency.
Terminologies used in Normalization Forms
» Partial dependency:
« In a relation having more than one key field, a subset of non-key fields may
depend on all key fields but another subset/ a particular non-key field may
depend on only one of the key fields such dependency is called partial
dependency.
* Transitive dependency:
«In a relation, there may be dependency among non-key fields. Such
dependency is called as transitive dependency.
* Determinant
+ A determinant is any field on which some other field is fully functionally
dependent.
Terminologies used in Normalization Forms
+ Multivalued dependency:
+ Consider three fields X,Y and Z in a relation. If for each value of X, there is
well-defined set of values of Y and a well-define set of values of Z and the set
of values of Y is independent of the set of values of Z, then multivalued
dependency exists.
X > Y/Z
* Join dependency:
+ Arelation which has a join dependency cannot be decomposed by projection
into other relations without any difficulty and undesirable result.
INF, 2NF, 3NF
Let us consider the Invoice Report of Alpha Book House
Alpha Book House
Pune — 413001
Customer No.: 1052
Customer Name : Beta School of Computer science
Address: Shivaji Nagar, Pune — 01.
Author’s Author’s
ISBN Book Title ame country Qty Price(Rs.) | Amount (Rs.)
81-203-5 Dos PK. Sinha U.S.A 5 250 1250
0-112-6 DBMS Korth U.S.A 6 300 1800
1-213-9 Simulation Gordon US.A 5 100 500
Grand Total 3550
+ The invoice report is represented in a form of relation. The relation is named as invoice. This is
normalized form.
+ The relation is in 1NF if it has no repeating groups.
+ In the invoice relation, the fields in the inner most of parentheses put
together is known as a repeating groups.
+ This redundancy will lead to inconsistency of data. Hence, it is divided into
two sub-relations shown below to remove the repeating group.
2. Customer (Cust_No, Cust_Name, Cust_Add)
3. Customer_Book (Cust_No, ISBN , Title , Authore Name ,
Authore_Country , Qty, Unit_price)
+ Now each of the above relations (relation 2 and 3) is in 1NF.
2NF
+ 2NF removes partial dependency among attributes of relation. In Relation
2, the number of key field is only one and hence there is no scope for
partial dependency.
* The absence of partial dependency in Relation 2 takes it into 2NF without
any modification.
* In Relation 3, the number of key fields are two. The dependency diagram of
Relation 3 is show below.
Author_Name
Author_Country
2NF
* In fig Qty depends on Cust_No and ISBN, but the remaining non-key fields
(Title, Authir_Name, Author_Country, Unit_price) depend only on ISBN,
thus, there exists partial dependency.
* The existence of partial dependency will result into insertion update and
deletion anomaly.
+ Insertion anomaly
+ In Relation 3, if we want to insert data of new-book (ISBN,Title,
Authir_Name, Author_Country, Unit_price) there must be at least one
customer. This means that the data of new book can be entered into
Relation 3 only when the first customer buys the book.
2NF
+ Update anomaly
* In Relation 3, if we want to change any of the non-key fields, like Title,
Author_Name, it will result into inconsistency because the same is
available in more than one record.
* Deletion anomaly
* In Relation 3 , if book is purchased by only one customer, then the book
data will be lost when we delete that record after fully satisfying that
customer order.
* Hence Relation 3 is divided into two relations as show below:
4. Sales (Cust_No, ISBN , Qty)
5. Book_Author (ISBN, Title, Author_Name, Author_Country, Unit_Price)
+ Now each of the above relations (relation 4 and 5) is in 2NF.
3NF
+ 3NF removes transitive dependency among attributes of relation.
* In Relation 4, there is only one non-key field. So, there is no question of
dependency between non-key fields. Thus, there is no transitive
dependency. Hence Relation 4 is in 3NF.
Author_Name
Dependency diagram for Relation 5
3NF
+ In Relation 5, author_country depends on author name. This means that
Relation 5 has transitive dependency.
+ The existence of transitive dependency will result into insert, update and
delete anomaly
+ Insertion anomaly
+ Consider the book company has resident author who are in the process of
developing new books, it will be difficult to include the author’s details in
Relation 5. This means that there should be at least one published book in
insert the details of a resident author.
3NF
+ Update anomaly
« If author’s country is to be modified, then it is necessary to modify number
of tuples as the same data is an number of tuples.
* Deletion anomaly
+ If the only one book of a resident author is not reprinted, then the
respective author’s data will be lost.
+ Hence Relation 5 is divided into two relations as show below:
6. Book (ISBN, Title, Unit_Price , Author_Name)
7. Book_Author (Author_Name, Author_Country)
* In Relation 6, Author_Name is foreign Key.
Boyce Codd normal form (BCNF)
+ It is an advance version of 3NF that’s why it is also referred as 3.5NF.
* BCNF is stricter than 3NF.
+ A table complies with BCNF if it is in 3NF and for every functional dependency
X->Y, X should be the super key of the table.
+ Example: Suppose there is a company wherein employees work in more than one
department. They store the data like this:
hop fem natonatıy lemp_dept Idept type Lee no ot emp |
hooı ustrian Production and planning poor poo |
hooı ustrian bres foot bso |
hoo2 \merican psi and technical support prs hoo |
hoo2 mean Purchasing department bi 34 foo |
Boyce Codd normal form (BCNF)
+ Functional dependencies in the table above:
emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}
+ Candidate key: {emp_id, emp_dept}
* The table is not in BCNF as neither emp_id nor emp_dept alone are keys.
* To make the table comply with BCNF we can break the table in three tables like
this:
* emp_nationality table:
br. fosa |
001 ustrian |
002 merican |
Boyce Codd normal form (BCNF)
* emp_dept table:
roduction and planning 0001 100 |
tores [D001 250
jesign and technical support D134 100
Purchasing department [D134 00 |
* emp_dept_mapping table:
7 |
1001 duction and planning
11001 stores
002 jesign and technical support
1002 rchasing department
Boyce Codd normal form (BCNF)
» Functional dependencies:
emp_id -> emp_nationality
emp_dept -> (dept_type, dept_no_of_emp}
* Candidate keys:
1. For first table: emp_id
2. For second table: emp_dept
3. For third table: {emp_id, emp_dept}
* This is now in BCNF as in both the functional dependencies left side part is a key.
Fourth Normal Form (4NF)
+ It should meet all the requirement of 3NF
+ Attribute of one or more rows in the table should not result in more than one
rows of the same table leading to multi-valued dependencies
* To understand it clearly, consider a table with Subject, Lecturer who teaches each
subject and recommended Books for each subject.
[SUBJECT LECTURER
Mathematics |Alex Maths Book1
Mathematics |Bosco Maths Book2 |
Physics Rose Physics Book |
[Chemistry |Adam Chemistry Book |
.
Fourth Normal Form (4NF)
If we observe the data in the table above it satisfies 3NF.
But LECTURER and BOOKS are two independent entities here. There is no
relationship between Lecturer and Books.
In the above example, either Alex or Bosco can teach Mathematics. For
Mathematics subject, student can refer either 'Maths Book1' or 'Maths Book2".
SUBJECT --> LECTURER
SUBJECT-->BOOKS
This is a multivalued dependency on SUBJECT.
If we need to select both lecturer and books recommended for any of the
subject, it will show up (lecturer, books) combination, which implies lecturer who
recommends which book. This is not correct.
SELECT c.LECTURER, c.BOOKS FROM COURSE c WHERE SUBJECT =
'Mathematics';
Fourth Normal Form (4NF)
* To eliminate this dependency, we divide the table into two as below:
[SUBJECT [LECTURER SUBJECT |Books
Mathematics [alex Mathematics | Maths Book1
Mathematics [Bosco Mathematics | Maths Book2
Physics Rose Physics [Physics Book
[Chemistry [Adam chemistry _ [chemistry Book
Fourth Normal Form (4NF)
+ Now if we want to know the lecturer names and books recommended for any of
the subject, we will fire two independent queries. Hence it removes the multi-
valued dependency and confusion around the data. Thus the table is in 4NF.
--Select the lecturer names
* SELECT c.SUBJECT , c.LECTURER FROM COURSE c WHERE c.SUBJECT
‘Mathematics’;
--Select the recommended book names
+ SELECT c.SUBJECT , c.BOOKS FROM COURSE c WHERE c.SUBJECT
‘Mathematics’;
Fifth Normal Form (SNF)
+ A database is said to be in 5NF, if and only if, It's in 4NF
+ Joining two or more decomposed table should not lose records nor create new
records.
* Consider an example of different Subjects taught by different lecturers and the
lecturers taking classes for different semesters.
[SUBJECT LECTURER [CLASS
Mathematics [Alex SEMESTER 1
Mathematics [Rose SEMESTER 1
Physics Rose SEMESTER 1
Physics Joseph SEMESTER 2
Chemistry Adam SEMESTER 1
Fifth Normal Form (SNF)
+ In above table, Rose takes both Mathematics and Physics class for Semester 1,
but she does not take Physics class for Semester 2. In this case, combination of all
these 3 fields is required to identify a valid data.
+ Imagine we want to add a new class - Semester3 but do not know which Subject
and who will be taking that subject. We would be simply inserting a new entry
with Class as Semester3 and leaving Lecturer and subject as NULL.
+ As we discussed above, it's not a good to have such entries. Moreover, all the
three columns together act as a primary key, we cannot leave other two columns
blank!
+ Hence we have to decompose the table in such a way that it satisfies all the rules
till 4NF and when join them by using keys, it should yield correct record. Here, we
can represent each lecturer's Subject area and their classes in a better way. We
can divide above table into three - (SUBJECT, LECTURER), (LECTURER, CLASS),
(SUBJECT, CLASS)
Fifth Normal Form (SNF)
SNF
SUBJECT LECTURER: [CLASS LECTURER
Mathematics [Alex SEMESTER 1 [Alex
Mathematics |Rose SEMESTER 1 Rose
Rose SEMESTER 1 Rose
Joseph SEMESTER 2 Joseph
Adam [SEMESTER 1 [Adam
[suseer
ISEMESTER1 [Mathematics
SEMESTER1 [Physics
SEMESTER1 [Chemistry
SEMESTER2 [Physics
Fifth Normal Form (SNF)
+ Now, each of combinations is in three different tables. If we need to identify who
is teaching which subject to which semester, we need join the keys of each table
and get the result.
+ For example, who teaches Physics to Semester 1 ?
+ We would be selecting Physics and Semester1 from table 3 above, join with
table1 using Subject to filter out the lecturer names. Then join with table2 using
Lecturer to get correct lecturer name. That is we joined key columns of each table
to get the correct data. Hence there is no lose or new data - satisfying SNF
where t3.Class = 'SEMESTER1' AND t3.SUBJECT= 'PHYSICS‘
AND t3.Subject = t1.Subject AND t3.Class = t2.Class
AND t1.Lecturer = t2.Lecturer;
Dependency-Preserving
° The dependency preservation decomposition is another property of
decomposed relational database schema D in which each functional
dependency X -> Y specified in F either appeared directly in one of the
relation schemas Ri in the decomposed D or could be inferred from the
dependencies that appear in some Ri.
+ Decomposition D = { R1, R2, R3,,.., ‚Rm} of R is said to be dependency-
preserving with respect to F if the union of the projections of Fon each Ri ,
in D is equivalent to F. In other words, R E join of R1, R1 over X.
+ The dependencies are preserved because each dependency in F represents
a constraint on the database. If decomposition is not dependency-
preserving, some dependency is lost in the decomposition.
Dependency-Preserving
+ Example:
+ Let a relation R(A,B,C,D) and seta FDs F={A->B, A->C ,C->D} are given.
A relation R is decomposed into —
R1 = (A, B, C) with FDs F1 = {A ->B, A -> C}, and
R2 = (C, D) with FDs F2 = {C -> D}.
F=F1UF2=(A->B,A->C,C->D)
so, F'=F.
And so, F'+ = F+.
Lossless Decomposition
* Decomposition is lossless if it is feasible to reconstruct relation R from
decomposed tables using Joins.
+ This is the preferred choice. The information will not lose from the relation
when decomposed. The join would result in the same original relation.