Star schema

chandanapriya 11,797 views 6 slides Sep 14, 2010
Slide 1
Slide 1 of 6
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6

About This Presentation

Very basic points of datawarehousing, OLTP, OLAP and star schema.


Slide Content

© YIIT- confidential and proprietary


What is a Data Warehouse? :
Data Warehouses and Data Marts are conceptually different – in scope- they are built using the
exact same methods and procedures,
A data warehouse (or mart) is way of storing data for later retrieval. This retrieval is
almost always used to support decision-making in the organization. That is why many
data warehouses are considered to be DSS (Decision-Support Systems).
Both a data warehouse and a data mart are storage mechanisms
for read-only, historical, aggregated data
Both a data warehouse and a data mart are storage mechanisms
for read-only, historical, aggregated data.
A data warehouse stores current and historical data
OLTP:
OLTP stand for Online Transaction Processing.
This is a standard, normalized database structure.
OLTP is designed for transactions, which means that inserts, updates, and deletes must be fast.

OLAP:

OLAP stands for Online Analytical Processing.
OLAP is a term that means many things to many people.


Difference between OLTP and OLAP:

OLTP OLAP
Current data only Current+historical data
Short database transactions Long database transactions
Online update/insert/delete Batch update/insert/delete
Normalization is promoted Denormalization is promoted
High volume transactions Low volume transactions
Transaction recovery is necessary Transaction recovery is not necessary
Few indexes Many indexes
Many joins Few joins

© YIIT- confidential and proprietary


With normalization, we may also have fewer indexes per table. This means that inserts,
updates, and deletes run faster, because each insert, update, and delete may affect one or
more indexes.
Therefore, with each transaction, these indexes must be updated along with
the table. This overhead can significantly decrease our performance.
There are some disadvantages to an OLTP structure, especially when we go to retrieve the
data for analysis.
For one, we now must utilize joins and query multiple tables to get all
the data we want. Joins tend to be slower than reading from a single table, so we want to
minimize the number of tables in any single query.
One of the advantages of OLTP is also a disadvantage: fewer indexes per table.
In general terms,
the fewer indexes we have, the faster inserts, updates, and deletes will be.
However, again
in general terms, the fewer indexes we have, the slower select queries will run.
Since one of our design goals to speed transactions is to minimize the number
of indexes, we are limiting ourselves when it comes to doing data retrieval.
Creating two separate database structures: an OLTP system for transactions,
and an OLAP system for data retrieval.
Star Schema:
Star Schema is a relational database schema for representing multimensional data. It is the
simplest form of data warehouse schema that contains one or more dimensions and fact tables.
It is called a star schema because the entity-relationship diagram between dimensions and fact
tables resembles a star where one fact table is connected to multiple
dimensions.
The center of the star schema consists of a large fact table and it points towards the dimension
tables.
Steps in designing Star Schema:
Identify a business process for analysis(like sales).
Identify measures or facts (sales dollar).
Identify dimensions for facts(product dimension, location dimension, time dimension, organization
dimension).
List the columns that describe each dimension.(region name, branch name, region name).
Determine the lowest level of summary in a fact table(sales dollar).

© YIIT- confidential and proprietary


Important aspects of Star Schema & Snow Flake Schema:
In a star schema every dimension will have a primary key.
In a star schema, a dimension table will not have any parent table.
Whereas in a snow flake schema, a dimension table will have one or more parent tables.
Hierarchies for the dimensions are stored in the dimensional table itself in star schema.
Whereas hierachies are broken into separate tables in snow flake schema. These hierachies helps
to drill down the data from topmost hierachies to the lowermost hierarchies.
Designing a Star Schema:
First, there is a time element to each one. Second, they all are looking for aggregated data; they are
asking for sums or counts, not individual transactions. Finally, they are looking at data in terms of
“by” conditions.
When I talk about “by” conditions, I am referring to looking at data by certain conditions
For example, if we take the question “On a quarterly and then monthly basis, are Dairy
Product sales cyclical” we can break this down into this: “We want to see total sales by
category (just Dairy Products in this case),by quarter or by month.”
Here we are looking at an aggregated value, the sum of sales, by specific criteria.
When we talk about the way we want to look at data, we usually want to see some sort of
aggregated data. These data are called measures.
These measures are numeric values that are measurable and additive.
We need to look at our measures using those “by” conditions. These “by” conditions are called
dimensions.
When we say we want to know our sales dollars, we almost always mean by day, or by quarter, or
by year.
These by conditions will map into dimensions:
there is almost always a time dimension, and product and geographic dimensions are very
common as well.
Therefore, in designing a star schema, our first order of business is usually to determine
what we want to see (our measures) and how we want to see it (our dimensions).

Mapping Dimensions into Tables

When we start building dimension tables, there are a few rules to keep in mind. First, all dimension
tables should have a single-field primary key.
This key is often just an identity column, consisting of an automatically incrementing number.
(The value of the primary key is meaningless; our information is stored in the other fields.)
These other fields contain the full descriptions of what we are after.

© YIIT- confidential and proprietary


For example, if we have a Product dimension (which is common) we have fields in it that contain
the description, the category name, the sub-category name, etc.
These fields do not contain codes that link us to other tables. Because the fields are the full
descriptions, the dimension tables are often fat; they contain many large fields.
Dimension tables are often short, however. We may have many products, but even so, the
dimension table cannot compare in size to a normal fact table.
Dimension tables are often short, however. We may have many products, but even so, the
dimension table cannot compare in size to a normal fact table.
Our dimension table might look something like this:




Notice that both Category and Subcategory are stored in the table and not linked in through joined
tables that store the hierarchy information.
The hierarchies are contained in the individual dimension tables. No additional tables are needed
to hold hierarchical information.

Fact Table:

A table in a star schema that contains facts and connected to dimensions.
A fact table typically has two types of columns: those that contain facts and those that are foreign
keys to dimension tables.
The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
A fact table might contain either detail level facts or facts that have been aggregated (fact tables
that contain aggregated facts are often instead called summary tables).
A fact table usually contains facts with the same level of aggregation.

Steps in designing Fact Table:

Identify a business process for analysis(like sales).
Identify measures or facts (sales dollar).

© YIIT- confidential and proprietary


Identify dimensions for facts(product dimension, location dimension, time dimension, organization
dimension).
List the columns that describe each dimension.(region name, branch name, region name).
Determine the lowest level of summary in a fact table(sales dollar).

Building the Fact Table:

The Fact Table holds our measures, or facts.
The measures are numeric and additive across some or all of the dimensions.
For example, sales are numeric and we can look at total sales for a product, or category, and we
can look at total sales by any time period.
While the dimension tables are short and fat, the fact tables are generally long and skinny.
They are long because they can hold the number of records represented by the product of the
counts in all the dimension tables.


In this schema, we have product, time and store dimensions. If we assume we have ten years of
daily data, 200 stores, and we sell 500 products, we have a potential of 365,000,000 records (3650
days * 200 stores * 500 products). As you can see, this makes the fact table long.

© YIIT- confidential and proprietary


The fact table is skinny because of the fields it holds. The primary key is made up of foreign keys
that have migrated from the dimension tables.
These fields are just some sort of numeric value. In addition, our measures are also numeric.
Therefore, the size of each record is generally much smaller than those in our dimension tables.
we have many, many more records in our fact table.

Measure Types:

Additive - Measures that can be added across all dimensions.
Non Additive - Measures that cannot be added across all dimensions.
Semi Additive - Measures that can be added across few dimensions and not with others.