DatabasesAndDatawarehousingconcepts1.ppt

deshpandeprajakta40 7 views 80 slides Mar 05, 2025
Slide 1
Slide 1 of 80
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51
Slide 52
52
Slide 53
53
Slide 54
54
Slide 55
55
Slide 56
56
Slide 57
57
Slide 58
58
Slide 59
59
Slide 60
60
Slide 61
61
Slide 62
62
Slide 63
63
Slide 64
64
Slide 65
65
Slide 66
66
Slide 67
67
Slide 68
68
Slide 69
69
Slide 70
70
Slide 71
71
Slide 72
72
Slide 73
73
Slide 74
74
Slide 75
75
Slide 76
76
Slide 77
77
Slide 78
78
Slide 79
79
Slide 80
80

About This Presentation

Databases and data warehouses


Slide Content

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1
Chapter 2
Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 2
Outline

Database Concepts

Steps in Database Design

Entity-Relationship Model

Logical Database Design

Relational Model

Object-Relational Model

Physical Database Design

Data Warehouse Concepts

Logical Data Warehouse Design

Physical Data Warehouse Design

Data Warehouse Architecture

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 3
Steps in Database Design

Requirements specification: Collects information about
users’ needs with respect to the database system

Conceptual design: Builds a user-oriented representation
of the database without any implementation
considerations

Logical design: Translates the conceptual schema from
the previous phase into an implementation model
common to several DBMSs, e.g., relational or object-
relational

Physical design: Customizes the logical schema from the
previous phase to a particular platform, e.g., Oracle or
SQL Server

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 4
Entity Relationship Model: University
Example
Project
Project id
Project acronym
Project name
Description (0,1)
Start date
End date
Budget
Funding agency
(0,n)
Academic staff
Employee no
Name
First name
Last name
Address
Email
Homepage
Research areas (1,n)
Course id
Course name
Level
(1,n)(0,n)
(partial,exclusive)
(0,n)
Start date
End date
Is a prereqHas prereq
Course
Professor
Status
Tenure date (0,1)
/No courses
Assistant
Thesis title
Thesis description
(0,n)(1,1)
Advisor
Semester
Year
Homepage (0,1)
Participates
Prerequisite
(1,1)
(1,n)
Section
CouSec
(0,n)(1,1)
Teaches

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 5
ER Model: Entity and Relationship Types (1)

Entity Relationship Model: Most often used conceptual
model for database design

Entity type: Represents a set of real-world objects of
interest to an application

Entity: object belonging to an entity type

Relationship type: Represents an association between
objects

Relationship: association belonging to a relationship type

Role: Participation of an entity type in a relationship type

Cardinalities in a role: Minimum and maximum number
of times that the entity may participate in the
relationship type

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 6
ER Model: Entity and Relationship Types (2)

Types of roles

Optional vs mandatory: minimum cardinality 0 vs 1

Monovalued vs multivalued: maximum cardinality 1 vs n

Relationship types

Binary vs n-ary: two vs more than two participating entity types

Binary relationship types:

One-to-one, one-to-many, or many-to-many depending on the
maximum cardinality of the two roles

Recursive relationship types: the same entity type
participates more than once in the relationship type

Role names are then necessary to distinguish different roles

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 7
ER Model: Attributes

Attributes: Structural characteristics describing objects
and relationships

Attributes have cardinalities, as for roles

Depending on cardinalities, attributes may be

Optional vs mandatory

Monovalued vs multivalued

Complex attributes: Attributes composed of other
attributes

Derived attributes: Their value for each instance may be
calculated from other elements of the schema

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 8
ER Model: Identifiers and Weak Entity Types

Identifier or key: One or several attributes that uniquely
identify a particular object

Weak entity type: Do not have identifier of their own

Regular entity types are called strong entity types

A weak entity type is dependent on the existence of
another entity type, called the identifying or owner entity
type

An identifying relationship type relates a weak entity
type to its owner

Normal relationship types are called regular relationship types

Partial key: Set of attributes that can uniquely identify
weak entities related to the same owner entity

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 9
ER Model: Generalization (1)

Generalization or is-a relationship: Relates perspectives
of the same concept at different abstraction levels

Supertype: Entity at higher abstraction level

Subtype: Entity at lower abstraction level

Population inclusion: Every instance of the subtype is
also an instance of the supertype

Inheritance: All characteristics (e.g., attributes, roles) of
the supertype are inherited by the subtype

Substitutability: Each time an instance of a supertype is
required, an instance of the subtype can be used instead

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 10
ER Model: Generalization (2)

Several generalization relationships accounting for the
same criterion can be grouped

Types of generalizations

Total vs partial: depending on whether every instance of the
supertype is also an instance of one of the subtypes

Disjoint vs overlapping: depending on whether an instance may
belong to one or several subtypes

Multiple inheritance: A subtype that has several
supertypes

May induce conflicts when a property of the same name is
inherited from two supertypes

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 11
ER Model: Summary of Notation
Professor
Identifier
Other attributes
Attributes
Teaches
Partial Identifier
Other Attributes
Section
(0,1)
(1,1)
(0,n)
(1,n)
Simple attribute
Complex attribute
Attribute 1
Attribute 2
Optional att. (0,1)
Multivalued attr. (1,n)
(1,1)
role name
Entity Type Relationship Type
Weak Entity Type
Attribute typesRole Cardinalities
Identifying
Relationship Type
Generalization
CouSec
(total,exclusive)
(partial,exclusive)
(total,overlapping)
(partial,overlapping)
Generalization types

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 12
Relational Model: University Example
Professor
Employee no
Status
Tenure date (0,1)
No courses
Academic staff
Employee no
First name
Last name
Address
Email
Homepage
Academic staff
Research area
Employee no
Research area
Course id
Course name
Level
Course
Course id
Has prereq
Prerequisite
Project
Project id
Project acronym
Project name
Description (0,1)
Start date
End date
Budget
Funding agency
Employee no
Project id
Start date
End date
Participates
Assistant
Employee no
Thesis title
Thesis description
Advisor
Course id
Semester
Year
Homepage (0,1)
Employee no
Section

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 13
Relational Model (1)

Based on a simple data structure

A relation or table, composed of attributes or columns

Each attribute is defined over a domain or data type

Typical domains: integer, float, date, string, …

Tuple or row: element of the table

Provides several declarative integrity constraints

Not null attribute: a value for the attribute must be
provided

Key: column(s) that uniquely identify one row of the table

Simple vs composite key: depending on whether key is composed
of one or several columns

Primary vs alternate key: one of the keys must be chosen as
primary by the designer, the others are alternate keys

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 14
Relational Model (2)

Referential integrity

Defines a link between two tables (or twice the same one)

A set of attributes in one table (foreign key) references the
primary key of the other table

Values in the foreign key columns must also exist in the primary
key of the referenced table

Check constraint: defines a predicate that must be valid
when inserting or updating a tuple in a relation

Predicate can only involve the tuple being inserted/deleted

All other constraints must be expressed by triggers: an
event-condition-action rule that is automatically
activated when a relation is updated

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 15
Translating from ER to Relational Schemas
(1)

Rule 1: A strong entity type is associated with a table

Table contains the simple monovalued attributes and the simple
component of the monovalued complex attributes of the entity
type

Table also defines not null constraints for mandatory attributes

Identifier of the entity type define the primary key of the table
Academic staff
Employee no
Name
First name
Last name
Address
Email
Homepage
Research areas (1,n)
Academic staff
Employee no
First name
Last name
Address
Email
Homepage

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 16
Translating from ER to Relational Schemas
(2)

Rule 2: A weak entity type is transformed as a strong
entity type, but the associated table also contains the
identifier of the owner entity

A referential integrity constraint relates this identifier to the table
associated with the owner entity type

Primary key of the table: partial identifier of the weak entity type
+ identifier of the owner entity type
Semester
Year
...
Section Course
(1,n)(1,1)
Course id
...
CouSec
Course id
Semester
Year
...
Section

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 17
Translating from ER to Relational Schemas
(3)

Rule 3: A binary one-to-one relationship type is
represented by including the identifier of one of the
participating entity types in the table associated to the
other entity type

A referential integrity constraint relates this identifier to the table
associated with the corresponding entity type

Simple monovalued attributes and simple components of
monovalued complex attributes are also included in the table

Not null constraint for mandatory attributes are also defined
Department
...
Professor
(0,1)(1,1)
Dean
...
Department
...
Dean

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 18
Translating from ER to Relational Schemas
(4)

Rule 4: A binary one-to-many relationship type is
represented by including the identifier of the entity type
with cardinality 1 in the table associated to the other
entity type

A corresponding referential integrity constraint is added

Attributes and not null constraints are also added as in previous
rules
Assistant
...
Professor
(0,n)(1,1)
Advisor
...
Assistant
...
Advisor

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 19
Translating from ER to Relational Schemas
(5)

Rule 5: A binary many-to-many relationship type or an n-
ary relationship type is associated with a table containing
the identifiers of the entity types that participate in all its
roles

Corresponding referential integrity constraints are added

Attributes and not null constraints are also added as in previous
rules

Identifier if any is the key of the relation, but the combination of
all role identifiers can be used as a key
Academic staff Project
Project id
...
Employee no
...
(1,n)(0,n)
Start date
End date
Participates
Employee no
Project id
Start date
End date
Participates

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 20
Translating from ER to Relational Schemas
(6)

Rule 6: A multivalued attribute is associated with a table
containing the identifier of the entity or relationship type
to which it belongs

Corresponding referential integrity constraint is added

Primary key of table is composed of all its attributes
Academic staff
...
Research areas (1,n)
Academic staff
Research area
Employee no
Research area

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 21
Translating from ER to Relational Schemas
(7)

Rule 7: A generalization can be translated in 3 ways

The supertype and the subtype are associated with tables
containing the identifier of the entity type. There is a referential
integrity constraint from the table of the subtype to the table of
the subtype
Academic staff
Employee no
Name
...
Professor
Status
...
Assistant
Thesis title
...
Academic staff
Employee no
Name
...
Professor
Employee no
Status
...
Assistant
Employee no
Thesis title
...

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 22
Translating from ER to Relational Schemas
(8)

Rule 7 (cont.):

Only the supertype is associated with a table, which contains also
the attributes of the subtype (these are optional attributes)

Only the subtype is associated with a table containing all
attributes of the subtype and the supertype
Academic staff
Employee no
Name

Thesis title (0,1)
...
Status
...
Professor
Employee no
Name

Status
...
Assistant
Employee no
Name

Thesis title
...

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 23
Defining Relational Schemas in SQL
create table AcademicStaff as (
EmployeeNo integer primary key,
FirstName character varying (30) not null,
LastName character varying (30) not null,
Address character varying (50) not null,
Email character varying (30) not null,
Homepage character varying (64) not null );
create table AcademicStaffResearchArea as (
EmployeeNo integer not null,
ResearchArea character varying (30) not null,
primary key (EmployeeNo,ResearchArea),
foreign key EmployeeNo references AcademicStaff(EmployeeNo) );

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 24
Redundancies and Update Anomalies
Relation with redundancies may induce anomalies in the
presence of insertions, updates, and deletions

In relation Participates the information about a project is repeated
for each staff member who works on that project

In relation Affiliation the information about a research area is
repeated for all departments to which the staff member is
affiliated
Dependencies and normal forms are used to describe
such anomalies
Employee no
Project id
Start date
End date
Project acronym
Project name
Participates Affiliation
Employee no
Research area
Department id

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 25
Functional Dependencies

Given a relation R and to sets of attributes X and Y in R, a
functional dependency X Y holds if in all tuples of the
relation, each value of X is associated with at most one
value of Y

In Participates above, Project Id {Project acronym, Project name}

A key is a particular case of a functional dependency

A prime attribute is an attribute that is part of a key

A full functional dependency is a dependency X Y in
which the removal of an attribute from X invalidates the
dependency

A dependency X Z is transitive if there is a set of
attributes Y such that X Y and Y Z hold

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 26
Multivalued Dependencies

Given a relation R and to sets of attributes X and Y in R, a
multivalued dependency X Y holds if the value of X
determines a set of values for Y, independently of any
other attributes

In Affiliation above, Employee no Research area and
Employee no Department Id

A multivalued dependency X Y is trivial if either Y
X or X Y = R, otherwise it is nontrivial

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 27
Normal Forms (1)

Normal Form: Integrity constraint certifying that a
relation schema satisfies particular properties

In the relational model the relations are in first normal
form since all attributes are atomic and monovalues

A relation schema is in the second normal form if every
nonprime attribute is functionally dependent on every
key
Employee no
Project id
Start date
End date
Project acronym
Project name
Participates
Employee no
Project id
Start date
End date
Participates
Project id
Project acronym
Project name
Project

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 28
Normal Forms (2)

A relation is in the third normal form if it is in the second
normal form and there are no transitive dependencies
between a key and a nonprime attribute

A relation is in the Boyce-Codd normal form if for every
nontrivial dependency X  Y, X is a key or contains a key
Assistant
Employee no
Thesis title
Thesis description
Advisor id
Advisor first name
Advisor last name
Assistant
Employee no
Thesis title
Thesis description
Advisor id
Professor
Employee no
First name
Last name
Employee no
Project id
Start date
End date
Location
Participates
Employee no
Project id
Start date
End date
Participates
Location
Project id
Location

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 29
Normal Forms (3)

A relation is in the fourth normal form if for every
nontrivial dependency X Y, X is a key or contains a
key
Affiliation
Employee no
Research area
Department id
Affiliation
Employee no
Research area
Research
Employee no
Research area

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 30
Weaknesses of the Relational Model

It provides a very simple data structure, which disallows
multivalued and complex attributes

complex objects are split into several tables performance
problems

The set of types provided by relational DBMSs is very
restrictive (integer, float, string, date, …)

Insufficient for complex application domains

There is no integration of operations with data
structures

It is not possible to directly reference an object by use of
a surrogate or a pointer

links are based on comparison of values performance
problems

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 31
Object-Relational Model: Motivations
Preserves the foundations of the relational model
Organizes data using an object model

Allow attributes to have complex types groups related facts
into a single row
Extensions

Complex and/or multivalued attributes

User-defined types with associated methods

System-generated identifiers

Inheritance among types
Defined in the SQL:1999 and SQL:2003 standards
Current DBMSs (Oracle, …) introduced object-relational
extensions, but do not necessarily comply with the
standard

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 32
Object-Relational Model: Definition (1)

Allow composite types, in addition to predefined types

Complex values may be stored in a column of a table

Row type: structured values (i.e., composite attributes)

Array type: variable-sized vectors of values

Multiset type: Unordered collection of values, with duplicates
permitted (bags)

Composite types can be nested

This is considered an “advanced feature” in the standard

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 33
Object-Relational Model: Definition (2)

Two sorts of user-defined types

Distinct types: Represented internally by a predefined type,
but cannot be mixed in expressions

Does not make sense to compare a person’s name and a SSN

Structured user-defined types:  class declarations in OO
languages

May have attributes, which can be of any SQL type

May have methods, which can be instance or static (class) methods

Attributes are encapsulated through observer and mutator functions

Both of them can be used as a domain of a column, of an
attribute of another type, or a table

Comparison/ordering of values: with user-defined methods

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 34
Object-Relational Model: Definition (3)
SQL:2003 supports single inheritance:

A subtype inherits attributes and methods from its supertype
A subtype can

Define additional attributes and methods

Overload inherited methods: provide another definition with a
different signature

Override inherited methods: provide another implementation
with a the same signature
Final types: cannot have subtypes
Final methods: cannot be redefined
Instantiable type: includes constructor methods for
creating instances

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 35
Object-Relational Model: Definition (4)

Two types of tables

Relational tables: usual ones, where domains for
attributes are all predefined or user-defined types

Typed tables: defined on structured user-defined types

Constraints (e.g. keys) defined in the table declaration

Have a self-referencing column that contains identifiers
(surrogates) generated by the DBMS

Row identifiers can be used for establishing links between tables

Ref type: values are the unique identifiers

Always associated with a specified structured type

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 36
Object-Relational Model: University Example
Academic staff Type
Employee no
Name
First name
Last name
Address
Email
Homepage
Research areas (1,n)
Participates (0,n)
Professor Type
Status
Tenure date (0,1)
No courses
Advises (0,n)
Teaches (0,n)
Project Type
Project id
Project acronym
Project name
Description (0,1)
Start date
End date
Budget
Funding agency
Participates (1,n)
Semester
Year
Homepage (0,1)
Professor
Course
Section Type
Employee
Project
Start date
End date
Participates Type
Assistant Type
Thesis title
Thesis description
Advisor
Course id
Course name
Level
Sections (1,n)
Has prereq (0,n)
Is a prereq (0,n)
Course Type

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 37
Translating from ER to OR Schemas (1)

Rule 1: A strong entity type is associated with a
structured type containing all its attributes

Multivalued attributes defined with the array or the multiset
type

Composite attributes defined with the row or the structured
type

A table of this structured type must be declared

Table defines not null constraints for mandatory attributes

Identifier of the entity type define the primary key of the table
Academic staff
Employee no
Name
First name
Last name
Address
Email
Homepage
Research areas (1,n)
Academic staff Type
Employee no
Name
First name
Last name
Address
Email
Homepage
Research areas (1,n)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 38
Translating from ER to OR Schemas (2)

Rule 2: A weak entity type is transformed as a strong
entity type, but the structured type contains the
identifier of its owner
Semester
Year
Homepage (0,1)
Section Course
(1,n)(1,1)
Course id
...
CouSec
Semester
Year
Homepage (0,1)
Course
Section Type Course Type
Course id
...

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 39
Translating from ER to OR Schemas (3)

Rule 3: A regular relationship type is associated with
structured type containing the identifiers of all
participating entity types, and all attributes of the
relationship

A table of this structured type must be declared

Table defines not null constraints for mandatory attributes

Set of identifiers of entity types define the primary key of the
table

Inverse references may also be defined
Academic staff Project
Project id
...
Employee no
...
(1,n)(0,n)
Start date
End date
Participates
Acad. staff
Project
Start date
End date
Participates
Type
Academic staff
Type

Participates (0,n)
Project
Type

Participates (1,n)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 40
Translating from ER to OR Schemas (4)

Rule 4: Only single inheritance is supported by the model

multiple inheritance must be removed by transforming
some generalization links into binary one-to-one
relationships, to which Rule 3 is applied
PhD Student
Thesis title
...
Assistant
Assistant Type
...
Student
Major
...
PhD Student
Thesis title

Student
Assistant
Assistant Type
...
Student
Major
...
PhD Student

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 41
Defining Object-Relational Schemas in SQL
create type AcademicStaffType as (
EmployeeNo integer,
Name row (
FirstName character varying (30),
LastName character varying (30) ),
Address character varying (50),
Email character varying (30),
Homepage character varying (64),
ResearchAreas character varying (30) multiset,
Participates ref(ParticipatesType) multiset scope Participates
references are checked )
ref is system generated;
create table AcademicStaff of AcademicStaffType (
constraint AcademicStaffPK primary key (EmployeeNo),
Email with options constraint AcademicStaffEmailNN Email not null,
ref is oid system generated );

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 42
Physical Database Design

Specifies how database records are stored, accessed,
and related in order to ensure adequate performance of
a database application

Requires to know the specificities of an application,
properties of data, and usage patterns

Involves analyzing the transactions/queries that run
frequently, that are critical, and the periods of time in
which there will be a high demand on the database (peak
load)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 43
Performance of Database Applications

Factors for measuring performance of database
applications

Transaction throughput: # transaction processed in a time interval

Response time: Elapsed time for the completion of a transaction

Disk storage: Space required to store the database file

A compromise has to be made among these factors

Space-time trade-off: Reduce time to perform an operation by using
more space, and vice versa

Query-update trade-off: Access to data can be more efficient by
imposing some structure upon it

However the more elaborate the structure, the more time is needed
to built it and maintain it when content changes

After initial physical design is done, it is necessary to
monitor it and tune it

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 44
Data Organization

A database is organized in secondary storage into files,
each composed of records, at their turn composed of
fields

In a disk, data is stored in disk blocks (pages)

Set by the operating system during disk formatting

Transfer of data between main memory and disk takes
place in units of disk blocks

DBMSs store data on database blocks (pages)

Selection of a database block depends on several issues

Locking granularity may be at the block level, not the record level

For disk efficiency, database block size must be equal to or be a
multiple of the disk block size

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 45
File Organization

Arrangement of data in a file into records and blocks

Heap files (unordered files): Records placed in the file in
the order as they are inserted

Efficient insertions, slow retrieval

Sequential files (ordered files): Records sorted on the
values of ordering fields

Fast retrieval, slow inserting and deleting

Hash files: Use a hash function that calculates the block
(bucket) of a record based on one or several fields

Collision: Bucket is filled and a new record must be inserted

Fastest retrieval, but collision management degrades
performance

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 46
Indexes

Additional access structures that speed up retrieval of
records in response to search conditions

Provide alternative ways of accessing the records based
on the indexing fields on which the index is constructed

Many different types of indexes

Clustering vs nonclustering: Whether records in the file are
physically ordered according to the fields of the index

Single column vs multiple column, depending on the number of
indexing files

Unique vs nonunique: Whether duplicate values are allowed

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 47
Indexes, Clustering

Types of indexes (cont.)

Sparse or dense: Whether there are index records for all search
values

Single-level vs multilevel: Whether an index is split in several
smaller indexes, with an index to these indexes

Multi-level indexes often implemented by using B-trees of B
+
-trees

Clustering: Tables physically stored together as they
share common columns (cluster key) and are often used
together

Improves data retrieval

Cluster key stored only once  storage efficiency

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 48
Outline

Database Concepts

Steps in Database Design

Entity-Relationship Model

Logical Database Design

Relational Model

Object-Relational Model

Physical Database Design

Data Warehouse Concepts

Logical Data Warehouse Design

Physical Data Warehouse Design

Data Warehouse Architecture

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 49
Data Warehouses: Definition (1)

Operational databases (online transaction processing
systems or OLTP), are not suitable for data analysis

Contain detailed data, do not include historical data, perform
poorly for complex queries due to normalization

Data warehouses address requirements of decision-
making users

A data warehouse is a collection of subject-oriented,
integrated, nonvolatile, and time-varying data to support
data management decisions

Subject-oriented: focus on particular subjects of analysis
(e.g., purchase, inventory, and sales in a retail company)

Operational databases focus on specific functions of an
application

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 50
Data Warehouses: Definition (2)
Integrated: Join together data from several operational
and external systems problems due to differences in
data definition and content

In operational DBs these problems are solved in the design phase
Nonvolatile: Data modification and removal is disallowed
scope of the data is long period of time

Operational DBs keep data for a short period of time, as required
for daily operations
Time-varying: Keep different values for the same informa-
tion and the time when changes to these values occurred

Operational DBs typically do not have temporal support
Online analytical processing (OLAP): Allows decision-
making users to perform interactive analysis of data

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 51
Multidimensional Model

DWs and OLAP use a multidimensional view of data

Represented as a data cube or an hypercube

Dimensions: Perspectives for analyzing data

Cells (facts): Contain measures, values that are to be analyzed
21
12
11
35
Q4
Paris
Nice
Milan
Product (Category)
T
i
m
e

(
Q
u
a
r
t
e
r
)
books
47
Q3
Q2
Q1
Rome
CDs
DVDsgames
1018
30
3226
14
2014 31
12202433
24182814
33252325
measure
values
dimensions
35
27

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 52
Hierarchies

Data granularity: Level of detail of measures

Data analyzed at different granularities (abstraction
levels)

Hierarchies relate low-level (detailed) concepts to higher-
level (general concepts)

Example: Store – City – Region/Province – Country

Given two related levels in a hierarchy, lower level is
called child, higher level is called parent

Instances of these levels are called members

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 53
Kinds of Hierarchies
Balanced: Same number of levels from leaf members to root
Noncovering (ragged): Parent of some members does not
belong to the level immediately above

E.g., small countries do not have Region/Province level
Store level
ParisCity level
Region/
Province level
Country level
Store 1 Store 2
Ile-de-France
France
Nice
Store 3
Store dimension
Rome
Store 10Store 11
Milan
Store 12
Lazio
Italy
All
......
...
...
Lombardy
Provence-Alpes-
Côte d'Azur

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 54
Kinds of Hierarchies

Unbalanced: May not have instances at the lower levels

E.g., some cities do not have stores, sales are through
wholesalers

Parent-child (recursive): Involve twice the same level
Jottard
Hallez Berger
Nguyen
Vancamp Rochat
Weiss
Ivanov
Spiegel Praet
Pirlot Quintin
Claus
Lepage

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 55
Measure Aggregation and Summarizability

Measures are aggregated when using hierarchies for
visualizing data at different abstraction levels

Summarizability conditions ensure correct aggregation

Disjointness of instances: Grouping of instances in a level with
respect to the parent in the next level must result in disjoint sets

Completeness: All instances are included in the hierarchy and
each instance is related to one parent in the next level

Correct use of aggregation functions: Type of measures
determine the kind of aggregation functions that can be applied

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 56
Measure Classification: Additivity
Additive measures (flow or rate measures): Can be
meaningfully summarized using addition along all
dimensions

E.g., sales amount can be summarized when the hierarchies in
Store, Time, and Product dimensions are traversed
Semiadditive measures (stock or level measures): Can be
meaningfully summarized using addition along some
(not all) dimensions

E.g., inventory quantities, can be aggregated in the Store
dimension, but cannot be aggregated in the Time dimension
Nonadditive measures (value-per-unit measures): Cannot
be meaningfully summarized using addition along any
dimension

E.g., item price, cost per unit, exchange rate

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 57
Measure Classification: Aggregation
Complexity

Distributive measures: Defined by an aggregation function
that can be computed in a distributed way

Data is partitioned into n sets, aggregate function applied to each
set, aggregated value is computed by applying a function to these n
subaggregate values

E.g., count, sum, min, max

Algebraic measures: Defined by an aggregation function
that has a bounded number of arguments, each is obtained
by applying a distributive function

E.g., average, computed from sum and count, which are distributive

Holistic measures: Cannot be computed from other
subaggregate values

E.g., median, mode, rank

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 58
OLAP Operations: Roll up

Transforms detailed measures into summarized ones
when one moves up in a hierarchy
Roll-up to the Country level
21
12
11
35
Q4
Paris
Nice
Milan
Product (Category)
T
i
m
e

(
Q
u
a
r
t
e
r
)
books
47
Q3
Q2
Q1
Rome
CDs
DVDsgames
1018
30
323526
2714
2014 31
12202433
24182814
33252325
33
12
11
68
Q4
France
Italy
Product (Category)
T
i
m
e

(
Q
u
a
r
t
e
r
)
books
47
Q3
Q2
Q1
CDs
DVDsgames
3042
30
323526
2714
2014 31
57435139

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 59
OLAP Operations: Drill down

Opposite to the roll-up operation, i.e., it moves from a
more general level to a detailed level in a hierarchy
Drill-down to the
Month level
21
12
11
35
Q4
Paris
Nice
Milan
Product (Category)
T
i
m
e

(
Q
u
a
r
t
e
r
)
books
47
Q3
Q2
Q1
Rome
CDs
DVDsgames
1018
30
323526
2714
2014 31
12202433
24182814
33252325
7
4
8
13
...
Paris
Nice
Milan
Product (Category)
T
i
m
e

(
Q
u
a
r
t
e
r
)
books
...
Mar
Feb
Jan
Rome
CDs
DVDsgames
26
12
1046
84
...... ...
47810
8695
108118
1644 7Dec

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 60
OLAP Operations: Pivot or Rotate

Rotates the axes of a cube to provide an alternative
presentation of the data
21
28
11
14
Q4
Milan
Rome
Paris
Time (Quarter)
books
25
Q3Q2Q1
Nice
CDs
DVDs
games
2726
13
323533
1214
2324 18
10141220
35303231
18113547
S
t
o
r
e

(
C
i
t
y
)
Pivot
21
12
11
35
Q4
Paris
Nice
Milan
Product (Category)
T
i
m
e

(
Q
u
a
r
t
e
r
)
books
47
Q3
Q2
Q1
Rome
CDs
DVDsgames
1018
30
323526
2714
2014 31
12202433
24182814
33252325

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 61
OLAP Operations: Slice

Performs a selection on one dimension of a cube,
resulting in a subcube
Slice on Store.City = ‘Paris’
21
12
11
35
Q4
Paris
Nice
Milan
Product (Category)
T
i
m
e

(
Q
u
a
r
t
e
r
)
books
47
Q3
Q2
Q1
Rome
CDs
DVDsgames
1018
30
323526
2714
2014 31
12202433
24182814
33252325 21
12
11
35
Q4
Product (Category)
T
i
m
e

(
Q
u
a
r
t
e
r
)
books
47
Q3
Q2
Q1
CDs
DVDsgames
1018
30
323526
2714
2014 31

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 62
OLAP Operations: Dice

Defines a selection on two or more dimensions, thus
again defining a subcube
Dice on Store.Country = ‘France’
and Time.Quarter= ‘Q1’ or ‘Q2’
21
12
11
35
Q4
Paris
Nice
Milan
Product (Category)
T
i
m
e

(
Q
u
a
r
t
e
r
)
books
47
Q3
Q2
Q1
Rome
CDs
DVDsgames
1018
30
323526
2714
2014 31
12202433
24182814
33252325
21
11
35
Paris
Nice
Product (Category)
T
i
m
e
(
Q
u
a
r
t
e
r
)
books
Q2
Q1
CDs
DVDsgames
1018
302714
12202433

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 63
Other OLAP Operations

Drill-across: Executes queries involving more than one
cube, which share at least one dimension

E.g., compare sales data in a cube with purchasing data in
another one

Drill-through: Allows one to move from data at the
bottom level in a cube, to data in the operational
systems from which the cube was derived

E.g., for determining the reason for outlier values in a data cube

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 64
Implementations of a Multidimensional
Model

Relational OLAP (ROLAP): Store data in relational
databases, support extensions to SQL and special access
methods to efficiently implement the model and its
operations

Multidimensional OLAP (MOLAP): Store data in special
data structures (e.g., arrays) and implement OLAP
operations in these structures

Better performance than ROLAP for query and aggregation,
less storage capacity than ROLAP

Hybrid OLAP (HOLAP): Combine both technologies,

E.g., detailed data stored in relational databases, aggregations
kept in a separate MOLAP store

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 65
Logical Data Warehouse Design
In ROLAP systems, tables organized in specialized structures
Star schema: One fact table and a set of dimension tables

Referential integrity constraints between fact table and dimension
tables

Dimension tables may contain redundancy in the presence of
hierarchies
Snowflake schema: Avoids redundancy of star schemas by
normalizing dimension tables

Normalized tables optimize storage space, but decrease performance
Starflake schema: Combination of the star and snowflake
schemas, some dimensions normalized, other not
Constellation schema: Multiple fact tables that share
dimension tables

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 66
Logical DW Design: Star Schema
Time
Time key
Date
Event
Weekday flag
Weekend flag
Season
...
Product
Product key
Product number
Product name
Description
Size
Category name
Category descr.
Department name
Department descr.
...
Sales
Product fkey
Store fkey
Promotion fkey
Time fkey
Amount
Quantity
Store
Store key
Store number
Store name
Store address
Manager name
City name
City population
City area
State name
State population
State area
State major activity
...Promotion
Promotion key
Promotion descr.
Discount pct.
Type
Start date
End date
...

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 67
Logical DW Design: Snowflake Schemas
Time
Time key
Date
Event
Weekday flag
Weekend flag
Season
...
Category
Category key
Category name
Description
Department fkey
...
Department
Department key
Department name
Description
...
Product key
Product number
Product name
Description
Size
Category fkey
...
Sales
Product fkey
Store fkey
Promotion fkey
Time fkey
Amount
Quantity
Store
Store key
Store number
Store name
Store address
Manager name
City fkey
...
City
City key
City name
City population
City area
State fkey
...
State
State key
State name
State population
State area
State major activity
...
Promotion
Promotion key
Promotion descr.
Discount pct.
Type
Start date
End date
...
Product

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 68
Logical DW Design: Constellation Schemas
Purchases
Product fkey
Supplier fkey
Order time fkey
Due time fkey
Amount
Quantity
Freight cost
Time
Time key
Date
Event
Weekday flag
Weekend flag
Season
...
Product
Product key
Product number
Product name
Description
Size
Category name
Category descr.
Department name
Department descr.
...
Sales
Product fkey
Store fkey
Promotion fkey
Time fkey
Amount
Quantity
Store
Store key
Store number
Store name
Store address
Manager name
City name
City population
City area
State name
State population
State area
State major activity
...
Promotion
Promotion key
Promotion descr.
Discount pct.
Type
Start date
End date
...
Supplier
Supplier key
Supplier name
Contact person
Supplier address
City name
State name
...

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 69
Physical Data Warehouse Design: Views (1)

Data warehouses are several orders of magnitude larger
than operational database

Physical design is crucial for ensuring adequate
response time for complex queries

View: Table derived from a query

Query may involve base tables or other views

Materialized view : View physically stored in a database

Enhances query performance by precalculating costly operations

This is achieved at the expense of storage space

Updating materialized views: Modifications of underlying
base tables must be propagated into the view

Must be performed incrementally to ensure efficiency

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 70
Physical Data Warehouse Design: Views (2)

In a DW not all possible aggregations can be materialized

Number of aggregations grows exponentially with the number of
dimensions and hierarchies

Selection of materialized views: Identify the queries to be
prioritized, which determines the views to be materialized

DBMSs are providing tools that tune the selection of
materialized views given previous queries

Queries must be rewritten in order to best exploit existing
materialized views to improve response time

Drawback of materialized view approach: Requires one to
anticipate queries to be materialized

DW queries are often ad hoc and cannot always be anticipated

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 71
Physical Data Warehouse Design: Indexes (1)

Traditional indexes are not appropriate for data
warehouses

Designed for OLTP transactions that access a small number of
tuples

Bitmap indexes: For columns with a low number of
distinct values

Each value is associated with a bit vector whose length is the
number of records in the table

i
th
bit is set if i
th
row has that value in the column

Small in size, logical operators used to manipulate them
Professor
PId Name Rank
P1
P2
P3
P4
John Smith
Ada Jones
...
...
Assistant
Associate
Full
Associate
Bill Kock
Ed Low
...
...
Index on Rank
Rec#Assistant
1
Associate
2
3
4
1
Full
0 0
0 1 0
0 0 1
0 1 0
...

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 72
Physical Data Warehouse Design: Indexes (2)

Join indexes: Materialize relational join by keeping pairs
of row identifiers that participate in the join

Index selection: Select indexes to minimize execution
cost of a series of queries and updates

DBMSs are providing tools to assist DBAs to this task
P1
P1,C1,...
P1,C2,...
P2,C1,...
Index on
Client
Index on
Product
Sales
P3,C2,...
P2
P3
...
C1
C2
...
...

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 73
Fragmentation or Partitioning

Divide the contents of a relation into several files that can
be more efficiently processed

Vertical fragmentation: Splits attribute of a table into
groups that are stored independently

E.g., most often used attributes in one partition

Horizontal fragmentation: Divides records of a table into
groups according to a particular criterion

Range partitioning: Partitioned wrt range of values

E.g., partitioning based on time, each partition contains data about a
particular time period (year, month)

Hash partitioning: Partitioned wrt hash function

Combined partitioning: Combine these two methods

E.g., rows first partitioned according to range values, then subdivided
using a hash function

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 74
Data Warehouse Architecture (1)
Operational
databases
External
sources
Internal
sources
OLAP tools
Reporting
tools
Data mining
tools
Data marts
Back-end
tier
OLAP
tier
Front-end
tier
Data
sources
Data warehouse
tier
Statistical
tools
Data staging Metadata
ETL
process
Enterprise
data
warehouse
OLAP
server

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 75
Data Warehouse Architecture (2)

Data sources

Operational databases

Other internal or external sources of information (e.g. files)

Back-end tier

Extraction-Transformation-Loading (ETL) tools for manipulating
data from sources

Data staging area: Intermediate database where manipulation is
done

OLAP tier

OLAP Server: Supports multidimensional data and operations

Front-end tier: Deals with data analysis and visualization

Composed of OLAP tools, reporting tools, statistical tools, data-
mining tools, …

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 76
Back-End Tier

Extraction: Gathers data from multiple heterogeneous data
sources

May be operational databases or files in various formats

May be internal or external to the organization

Uses APIs such as ODBC, JDBC, … for achieving interoperability

Transformation: Modifies data to conform to the data
warehouse format

Cleaning: Removes errors, inconsistencies, format transformation

Integration: Reconciles data from different sources

Aggregation: Summarizes data according to the granularity (level of
detail) of the DW

Loading: Feeds the DW with transformed data

Also includes refreshing the data warehouse at a specified frequency

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 77
Data Warehouse Tier

Enterprise data warehouse: Centralized DW that
encompasses all areas in an organization

Data mart: Specialized DW targeted to a particular
functional area or user group

Their data can be derived from the enterprise DW or collected from
data sources

Metadata repository: Describes the content of the DW

Business metadata: Meaning (semantics) of data, organization
rules, policies, constraints, …

Technical metadata: How data is structured/stored in the computer

Data sources, data warehouse, and data marts: logical and physical
schemas, security information, monitoring information …

ETL process: Data lineage (trace to sources), rules, defaults, refresh and
purging rules, algorithms for summarization, …

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 78
OLAP Tier

OLAP servers that provides multidimensional view from
DWs and data marts

Can be ROLAP, MOLAP, or HOLAP

Most database products provide OLAP extensions and
related tools for manipulating cubes

However, no standardized language for querying data
cubes

Oracle uses Java and query language OLAP DML

SQL Server uses .NET and query language MDX

XMLA (XML for Analysis) aims at providing a common
language for exchanging multidimensional data

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 79
Front-End Tier

OLAP tools: Allow interactive exploration and
manipulation of the warehouse data

Facilitate formulation of ad hoc queries (no prior knowledge of
them)

Reporting tools: Enable production, delivery and
management of reports (paper and web-based)

Use predefined queries

Statistical tools: Used to analyze and visualize the cube
data using statistical methods

Data-mining tools: Allow users to analyze data to
discover patterns, trends, enable predictions

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 80
Variations of the Architecture

Enterprise DW without data marts, or data marts without
enterprise DW

Building enterprise DW is costly, building a data mart is easier

Integrating independently created data marts is costly

No OLAP server: Client tools access directly DW

Extreme situation: No DW either, client tools access sources

Virtual DW: Set of materialized views over data sources

Easier to build, but does not provide a real DW solution

Does not contain historical data, no centralized metadata, no possibility
to clean and transform the data

No data staging area: If the source systems conform very
closely to the data in the DW

Rarely the case in real-world situations
Tags