Data Warehouse Modeling

vivekjv 102,318 views 87 slides Dec 30, 2008
Slide 1
Slide 1 of 87
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51
Slide 52
52
Slide 53
53
Slide 54
54
Slide 55
55
Slide 56
56
Slide 57
57
Slide 58
58
Slide 59
59
Slide 60
60
Slide 61
61
Slide 62
62
Slide 63
63
Slide 64
64
Slide 65
65
Slide 66
66
Slide 67
67
Slide 68
68
Slide 69
69
Slide 70
70
Slide 71
71
Slide 72
72
Slide 73
73
Slide 74
74
Slide 75
75
Slide 76
76
Slide 77
77
Slide 78
78
Slide 79
79
Slide 80
80
Slide 81
81
Slide 82
82
Slide 83
83
Slide 84
84
Slide 85
85
Slide 86
86
Slide 87
87

About This Presentation

No description available for this slideshow.


Slide Content

Data Warehouse
Modeling
Thijs Kupers
Vivek Jonnaganti

Agenda
•Introduction
•Data Warehousing Concepts
•OLAP
•Dimension Modeling
•Conceptual Modeling
•Indexing
•Conclusion

Introduction

The Evolution
•1960 - DSS processing using Fortron or COBOL
•1970 - DBMS systems and the advent of DASD
•1975 - OLTP systems facilitating faster access to data
•1980 - PC/4GL technology and the advent of MIS
•1985 - OLAP systems and separation of analytical
processing from transactional processing
•1994 - Architectured environments with integrated
OLAP engines and tools

What is a Data Warehouse?
•A copy of transaction data specifically structured to Query
and Analysis (Ralph Kimball, 1996)
•A collection of integrated, subject oriented databases
designed to support the DSS function where each unit of
data is relevant at some moment of time (Bill Inmon, 1991)
•The data characteristics of a Data Warehouse are;
•Subject-oriented
•Time-variant
•Non-volatile
•Integrated

What is a Data Warehouse? (cont’d)
•A single, complete and consistent store of data obtained
from a variety of different sources made available to end
users, in what they can understand and use in a business
context (Barry Devlin 1992)
•A process of transforming data into information and making it
available to users in a timely enough manner to make a
difference (Forrester Research 1996)

Data Warehouse Goals/Characteristics
•It must make an organization’s information easily accessible
(slicing and dicing)
•It must present the organization’s information consistently
•It must be adaptive and resilient to change
•It must be a secure bastion that protects our information
assets
•It must serve as the foundation for improved decision making
•The business community must accept the DW, if it is to be
deemed successful

Data Warehouse Applications
•Retail Industry
•Forecasting, Market research, Merchandising etc.
•Manufacturing and distribution
•Sales history/trends, Market demand projects etc.
•Banks
•Spot market trends, Marketing, Credit cards etc.
•Insurance Companies
•Property and casualty fraud etc.
•Health Care Providers
•Fraud detection, Patient matching etc.

Data Warehouse Applications
•Government Agencies
•Auditing tax records, information sharing across
different agencies etc.
•Internet Companies
•Analyzing shopping behavior, CRM etc.
•Telecommunications
•Telemarketing, Product development etc.
•Sports
•Analyzing strategies, Winning player combinations etc.

Data Warehouse Sizes
•Terabyte (10^12) - Walmart (24 TB)
•Petabyte (10^15) - Geographic Information
Systems
•Exabyte (10^18) - National Medical Association
•Zettabyte (10^21) - Weather Images
•Zottabyte (10^24) - Intelligence Agency (Video)

Data Warehousing Concepts

Data Warehouse (OLAP) and OLTP
Characteristics
On-Line Transaction
Processing (OLTP) Data Warehouse
Data Content Current values Historical data, summarized
data, calculated data
Data OrganizationApplication by applicationSubject areas across enterprise
Nature of Data Dynamic Static until refreshed, based on
frequency
Data ManipulationUpdated on a field-by-field
basis
Accessed & manipulated
usually no direct update
Usage
Highly structured, repetitive
processing (Clerical User)
Highly structured, analytical
processing (Knowledge User)
Response Time
Critical (Sub-Second to
several seconds) Several seconds to minutes
Updates vs.
Reports
Real-time Updates,
Batch Reporting
Batch Updates,
Real-time Reporting

Data Warehouse Architecture
Enterprise
Data
Warehouse Data
Mart
Data
Mart
Execution
Systems
• CRM
• ERP
• Legacy
• e-Commerce
•Reporting
Tools
•OLAP
Tools
•Ad Hoc
Query
Tools
•Data
Mining
Tools
•External Data
• Purchased
Market Data
• Spreadsheets
•Oracle
•SQL Server
•Teradata
•DB2
•Custom Tools
•HTML Reports
•Cognos
•Business Objects
•MicroStrategy
•Oracle Discoverer
•Brio
•Data Mining Tools
•Portals
Data and Metadata
Repository Layer
•Informatica PowerMart
•Ab Initio
•Data Stage
•Oracle Warehouse Builder
•Custom programs
•SQL scripts
Extract,
Transformation,
and Load (ETL)
Layer
• Cleanse Data
• Filter Records
• Standardize Values
• Decode Values
• Apply Business Rules
• Householding
• Dedupe Records
• Merge Records
Presentation
Layer
ETL Layer
Operational
Source Systems
Technologies:
Metadata
Repository
ODS
•PeopleSoft
•SAP
•Siebel
•Oracle Applications
•Custom Systems
Data
Mart

Data Warehouse Structure
Departmentally
Structured
Individually
Structured
Data Warehouse
Organizationally
Structured
Data
Information
Highly
Summarized
Lightly
Summarized
Atomic/Detailed

Data Warehouse Architecture Drivers
The requirements that drive the DW architecture are;
•Granularity of data
•Data retention and timeliness
•Reporting capability
•Availability
•Scalability

Data Mart Centric
Data Marts
Data Sources
Data Warehouse

Data Mart Centric
If you end up creating multiple warehouses,
integrating them is a problem

Data Warehouse Centric
Data Marts
Data Sources
Data Warehouse

OLAP

OLAP: 3 Tier DSS
Data Warehouse
Database Layer
Store atomic data
in industry
standard Data
Warehouse.
OLAP Engine
Application Logic Layer
Generate SQL execution
plans in the OLAP engine
to obtain OLAP
functionality.
Decision Support Client
Presentation Layer
Obtain multi-
dimensional reports
from the DSS Client.

OLAP Servers
•Support multidimensional OLAP queries
•Characterized by how the underlying data is stored
•Multidimensional OLAP (MOLAP) Servers
•Data stored in array based structures e.g. Hyperion
Essbase
•Relational OLAP (ROLAP) Servers
•Data stored in relational tables e.g. Microstrategy, IBM
Informix
•Hybrid OLAP (HOLAP) Servers
•Data distributed between relational and specialized
storage e.g. Cognos, Microsoft Analysis Services

OLAP Operations
•Rollup; summarize operations
•E.g. given sales data, summarize sales for last year by
product category and region
•Drill down; get more details
•E.g. given summarized sales as above, find breakup of
sales within each region
•Slice and dice; select and project
•Sales of soft-drinks in Gothenburg over the last quarter
•Pivot; change the view of data

Strengths of OLAP
•It is a powerful visualization tool
•It provides fast, interactive response times
•It is good for analyzing time series
•It can be useful to find some clusters and outliners
•Many vendors offer OLAP tools

Dimensional Modeling

What is Dimensional Modeling?
•Logical design technique that seeks to present the
data in a standard, intuitive framework that allows
for high-performance access.
•Adheres to a discipline that uses the relational
model with some important restrictions.
•Composed of one table with a multi-part key,
called the fact table, and a set of smaller tables
called dimension tables.

DM v/s ER Models
DM ER
Used to design database for
Online Analytical Processing
(OLAP)
Used to design database for Online
Transaction Processing (OLTP)
Support ad hoc end-user queriesSupport defined queries
Intuitive & facilitates high-
performance retrieval of data
Removes redundancy of data
De-normalized Normalized

Fact Tables
•Primary table in the DM
•Each row corresponds to a measurement
•Facts in the fact table are numeric and additive
•Narrow rows with a few columns
•Large number of rows (billions)
•Express many-to-many relationships between
dimensions

Dimension Tables
•Define business in terms already familiar to users
•Implement the user interface to the DW
•Wide rows with lots of descriptive text
•Small tables (about a million rows)
•Joined to fact table by a foreign key
•Heavily indexed
•E.g. of typical dimensions
•time periods, geographic region (markets, cities),
products, customers, salesperson, etc.

Four Step Dimensional Design
Process
•Step 1 - Select the business process to model
•The first step in converting an ER diagram to a set of
DM diagrams is to separate the ER diagram into its
discrete business processes and to model each one
separately.
•Step 2 - Choose The Grain of the Business
Process
•The grain is the fundamental atomic level of data to be
represented in the fact table.

Four Step Dimensional Design
Process (cont’d)
•Step 3 - Designate the Fact Tables
•The third step is to select those many-to-many
relationships in the ER model containing numeric and
additive non-key facts and to designate them as fact
tables.
•Step 4 - Choose the dimensions that will apply to
each fact table record
•This involves de-normalizing all of the remaining tables
into flat tables with single-part keys that connect directly
to the fact tables.

Classic Star Schema Model

Snowflake Schema

Fact Constellation Schema

Slowly Changing Dimensions
•Type 1: Overwrite the value

Slowly Changing Dimensions (cont’d)
•Type 2: Add a Dimension row
•Type 3: Add a Dimension column

Conceptual Modeling

Graph Theory
•Directed, acyclic, weakly connected graph
•Quasi-tree

The Dimensional Fact Model
•Fact Schemes
•Facts
•Measures
•Dimensions
•Hierarchies
Dimension attributes
Non-dimension attributes

The Dimensional Fact Model

Why Formalize?

Why Formalize?
•Give meaning to the model
•Tool support
•Transformation Algorithms
•CASE-Tool (Computer Aided Software Engineering)

Fact Scheme
( )SORNAMf ,,,,,=
•M is a set of measures
•A is a set of dimension attributes
•N is a set of non-dimension attributes
•R is a set of ordered couples, having the form (a
i
,
a
j), indicating the ‘edges’ of the scheme
{}
ji
j
i
aa
NAa
aAa
¹
ÈÎ
ÈÎ
0

Fact Scheme
( )SORNAMf ,,,,,=
•O is a set of optional relationships
•S is a set of aggregation statements, in the form
(m
j
, d
i
, Ω)
ROÌ
()
{ },...,,,,, ORANDMAXCOUNTAVGSUM
fDimd
Mm
i
j
ÎW
Î
Î

Fact Scheme
( )SORNAMf ,,,,,=
•We call the set Dim(f) a dimension pattern. Each
element in Dim(f) is a dimension
() ( ){ }RaaAafDim
ii
Î$Î= ,
0

Fact Scheme
( )SORNAMf ,,,,,=
Pictur e 7

Algorithm
From ER to Conceptual Design
2)Define Facts
3)For each fact
a)Build attribute tree
b)Prune & Graft
c)Define Dimensions
d)Define Measures
e)Define Hierarchies

Sample Schema

Define Facts
•Entity F
•Relationship R between entities E
1…E
n
•Transform R into an entity F
•Frequently updated archives are good candidates for
defining facts
•E.g. Sale
•Not: Store, City
•Each Fact becomes a root in a fact scheme

Transform Relation

Build Attribute Tree
•Each vertex corresponds to an attribute of
the scheme
•Root corresponds to the identifier of F

Build Attribute Tree
root=newVertex(identifier(F));
translate(F, root);

Build Attribute Tree
translate(E,v) {
for each attribute a E | a identifier(E)
addChild(v, newVertex({a}));
for each entity G connected to E by a
relationship R | max(E,R) = 1 {
for each attribute b R
addChild(v, newVertex({b}));
next=newVertex(identifier(G));
addChild(v, next);
translate(G, next);
}
}
Î ¹
Î

Example
translate(E=SALE, v=sale)
addChild(v, qty);
addChild(v, unitPrice);
for G=PURCHASE TICKET
addChild(v, ticketNumber);
translate(PURCHASE TICKET, ticketNumber)
for G=PRODUCT
addChild(v, product);
translate(PRODUCT, product);

Attribute Tree

Attribute Tree
•Label the root with the name of the entity F
instead of his identifier
•Optional relationships not in algorithm
if min(E,R)=0

From ER till Conceptual Design
a)Build attribute tree
b)Prune & Graft
c)Define Dimensions
d)Define Measures
e)Define Hierarchies

Prune & Graft
•Prune or graft to eliminate unnecessary level
of detail
•Pruning: Drop a subtree from the quasi-tree
•Grafting: Vertex contains uninteresting
information but its descendants must be
preserved

Graft
graft(v) {
for each v’ | v’ is father of v
for each v’’ | v’’ is child of v
addChild(v’, v’’);
drop(v);
}

Graft
•1-to-1 relation is a good candidate
•When an optional vertex is grafted, all his
children inherit the optional dash

Prune & Graft

Prune & Graft

Dimensions
•Determines the granularity of fact instances
•Time is a key dimension
•Snapshot
•Temporal

Measures
•Numerical attributes of the attribute tree
•Glossary
•How measure can be calculated from source
scheme
•e.g. qty sold, no. of customers

Hierarchies
•Tree has already a kind of hierarchy
•We can still prune/graft details
•Add new levels for aggregation
•E.g. month-quarter-year
•Identify non-dimension attributes
•E.g. address

Aggregation
•Primary fact instances
•Null assumption
•Zero assumption
•Roll-up
•Sum, Avg, Count, Min, Max, …

Aggregation
•Graphical Notation
•Sum

Multi-Aggregation

Multi-Aggregation
•Order matters
•{week, product}  {month, type}
•Time-Dimension: Min
•Product-Dimension: Sum

Multi-Aggregation

Multi-Aggregation

{ } { } { }typemonthtypeweekproductweek
MINSUM
,,, ¾¾®¾¾¾®¾

{ } { } { }typemonthtypeweekproductweek
MINSUM
,,, ¾¾®¾¾¾®¾

{ } { } { }typemonthtypeweekproductweek
MINSUM
,,, ¾¾®¾¾¾®¾

{ } { } { }typemonthproductmonthproductweek
SUMMIN
,,, ¾¾®¾¾¾®¾

{ } { } { }typemonthproductmonthproductweek
SUMMIN
,,, ¾¾®¾¾¾®¾

{ } { } { }typemonthproductmonthproductweek
SUMMIN
,,, ¾¾®¾¾¾®¾

Indexing

Cost Model
•Cost of answering a query is number of rows
processed
•Subcubes
•Powerset of the dimensions

Cost Model

Indexes
•B-tree indexes to speed up query processing
•E.g. for cube ps, we can construct the
following indexes
•I
ps
•I
sp

Example
•Consider Q
1
:
•Using subcube ps: 0,8M rows
•Using subcube psc: 6M rows
•What if we use index I
sp on subcube ps?
•80 rows
sp
sg
s
ps

Indexes
•Ideal situation
•All subcubes
•All indexes

Algorithms
•Balance space subcubes – indexes
•Greedy Algorithm
•Given a set of queries
•Every step select index/subcube with the
highest benefit

?

References
•Text books
•Ralph Kimball, The Data Warehouse Toolkit, John Wiley and Sons, 1996
•W.H. Inmon, Building the Data Warehouse, Second Edition, John Wiley and
Sons, 1996
•Barry Devlin, Data Warehouse from Architecture to Implementation, Addison
Wesley Longman, Inc 1997
•Research Papers/Whitepapers
•M. Golfarelli, D. Maio, S. Rizzi, The Dimensional Fact Model: a Conceptual
Model for Data Warehouses, International Journal of Cooperative
Information, Vol.7 (issue 2/3), pages 215-247, 1998.
•H. Gupta, V. Harinarayan, A. Rajaraman, J.D. Ullman, Index Selection for
OLAP, Proceedings of the Thirteenth international Conference on Data
Engineering, April 07 - 11, pages 208-219, 1997.
•S. Luján-Mora J. Trujillo. A comprehensive method for data warehouse
design. Proc. DMDW, 2003.

References (cont’d)
•Luján-Mora, S., Trujillo, J., and Song, I. Extending the UML for
Multidimensional Modeling. Lecture Notes In Computer Science, Vol. 2460,
pages 290-304., 2002.
•Husemann, B., Lechtenborger, J., Vossen, G.: Conceptual Data Warehouse
Design.
•In: Proc. of the 2nd. Intl. Workshop on Design and Management of Data
Warehouses (DMDW'2000), Stockholm, pages 3-9, 2000.
•Lehner, W., Albrecht, J., and Wedekind, H. 1998. Normal Forms for
Multidimensional Databases. In Proceedings of the 10th international
Conference on Scientific and Statistical Database Management (July 01 –
03), pages 63-72, 1998.
•Web Articles
•http://en.wikipedia.org/wiki/Data_warehouse
•http://en.wikipedia.org/wiki/Online_analytical_processing
•http://en.wikipedia.org/wiki/OLTP

References (cont’d)
•http://www.sidadelman.com/data_warehouse_applications.htm
•http://infolab.stanford.edu/infoseminar/Archive/FallY97/slides/ncr
•www.cdd.go.th/it/file/DataWarehousing_and_DataMining.pdf
•http://www.ciobriefings.com/whitepapers/StarSchema.asp