CSC612 THIRD LECTURE ON DATA WAREHOUSE.pptx

MrNdlela 13 views 43 slides Aug 18, 2024
Slide 1
Slide 1 of 43
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43

About This Presentation

thhf


Slide Content

DATA WAREHOUSING SCHEMAS DR FASHOTO THIRD LECTURE

DEFINITION OF SCHEMA Schema is a logical description of the entire database. It includes the name and description of records of all record types including all associated data-items and aggregates. Much like a database, a data warehouse also requires to maintain a schema. A database uses relational model, while a data warehouse uses Star, Snowflake, and Fact Constellation schema.

STAR SCHEMA WHAT DID YOU NOTICE IN THE LAST DIAGRAM WE DREW IN OUR LAST LECTURE?

Each dimension in a star schema is represented with only one-dimension table. This dimension table contains the set of attributes. The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.

STAR SCHEMA DIAGRAM

There is a fact table at the center . It contains the keys to each of four dimensions . The fact table also contains the attributes, namely dollars sold and units sold.

Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains the attribute set { location_key , street, city, province_or_state,country }. This constraint may cause data redundancy. For example, “ Matsapha " and “ Manzini " both the cities are in the Manzini region. The entries for such cities may cause data redundancy along the attributes province_or_state and country.

Snowflake Schema Some dimension tables in the Snowflake schema are normalized. The normalization splits up the data into additional tables. Unlike Star schema, the dimensions table in a snowflake schema are normalized. For example, the item dimension table in star schema is normalized and split into two dimension tables, namely item and supplier table.

Snowflake Schema

Due to normalization in the Snowflake schema, the redundancy is reduced and therefore, it becomes easy to maintain and the save storage space.

Fact Constellation Schema A fact constellation has multiple fact tables. It is also known as galaxy schema . The sales fact table is same as that in the star schema.

Fact Constellation Schema Diagram

The shipping fact table has five dimensions, namely item_key , time_key , shipper_key , from_location,to_location . The shipping fact table also contains two measures, namely dollars_cost and units_ shipped. It is also possible to share dimension tables between fact tables. For example, time, item, and location dimension tables are shared between the sales and shipping fact table.

Schema Definition Multidimensional schema is defined using Data Mining Query Language (DMQL). The two primitives, cube definition and dimension definition, can be used for defining the data warehouses and data marts.

Syntax for Cube Definition define cube < cube_name > [ < dimension-list > }: < measure_list >

Syntax for Dimension Definition define dimension < dimension_name > as ( < attribute_or_dimension_list > )

Star Schema Definition define cube sales star [time, item, branch, location]: dollars sold = sum(sales in dollars), units sold = count(*) define dimension time as (time key, day, day of week, month, quarter, year) define dimension item as (item key, item name, brand, type, supplier type) define dimension branch as (branch key, branch name, branch type) define dimension location as (location key, street, city, province or state, country)

Snowflake Schema Definition define cube sales snowflake [time, item, branch, location]: dollars sold = sum(sales in dollars), units sold = count(*) define dimension time as (time key, day, day of week, month, quarter, year) define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type)) define dimension branch as (branch key, branch name, branch type) define dimension location as (location key, street, city (city key, city, province or state, country))

Fact Constellation Schema Definition define cube sales [time, item, branch, location]: dollars sold = sum(sales in dollars), units sold = count(*) define dimension time as (time key, day, day of week, month, quarter, year) define dimension item as (item key, item name, brand, type, supplier type) define dimension branch as (branch key, branch name, branch type) define dimension location as (location key, street, city, province or state,country ) define cube shipping [time, item, shipper, from location, to location]: dollars cost = sum(cost in dollars), units shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper key, shipper name, location as location in cube sales, shipper type) define dimension from location as location in cube sales define dimension to location as location in cube sales

DW - Partitioning Strategy Data Warehouse Partitioning is a technique used in data warehousing to improve query performance and optimize resource utilization. By dividing large tables into smaller, more manageable units called partitions, businesses can significantly reduce processing time and achieve better query response. Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also helps in balancing the various requirements of the system. It optimizes the hardware performance and simplifies the management of data warehouse by partitioning each fact table into multiple separate partitions.

Why is it Necessary to Partition? Partitioning is important for the following reasons − For easy management, To assist backup/recovery, To enhance performance.

Types of DW Partitioning There are two types of partitioning in DW Horizontal partitioning Vertical partitioning splits the data vertically.

Horizontal partitioning Partitioning by Time into Equal Segments Partition by Time into Different-sized Segments Partition on a Different Dimension Partition by Size of Table Partitioning Dimensions Round Robin Partitions

Vertical partitioning can be performed in two ways − Normalization is the standard relational method of database organization. In this method, the rows are collapsed into a single row, hence it reduce space. Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up the access to large table by reducing its size.

Metadata Metadata is data that describes and contextualizes other data. It provides information about the content, format, structure, and other characteristics of data, and can be used to improve the organization, discoverability, and accessibility of data. 

FORMS AND STANDARDS OF METADATA Metadata can be stored in various forms, such as text, XML, or RDF, and can be organized using metadata standards and schemas. There are many metadata standards that have been developed to facilitate the creation and management of metadata, such as Dublin Core, schema.org, and the Metadata Encoding and Transmission Standard (METS). 

METADATA SCHEMAS Metadata schemas define the structure and format of metadata and provide a consistent framework for organizing and describing data. Metadata can be used in a variety of contexts, such as libraries, museums, archives, and online platforms. It can be used to improve the discoverability and ranking of content in search engines and to provide context and additional information about search results.

EXAMPLES OF METADATA File metadata:  This includes information about a file, such as its name, size, type, and creation date. Image metadata:  This includes information about an image, such as its resolution, color depth, and camera settings. Music metadata:  This includes information about a piece of music, such as its title, artist, album, and genre. Video metadata:  This includes information about a video, such as its length, resolution, and frame rate. Document metadata:  This includes information about a document, such as its author, title, and creation date. Database metadata:  This includes information about a database, such as its structure, tables, and fields. Web metadata:  This includes information about a web page, such as its title, keywords, and description.

TYPES OF METADATA Descriptive metadata:   provides information about the content, structure, and format of data. Administrative metadata: provides information about the management and technical characteristics of data Structural metadata:   provides information about the relationships and organization of data,

Provenance metadata:  provides information about the history and origin of data Rights metadata:  provides information about the ownership, licensing, and access controls of data Educational metadata:  provides information about the educational value and learning objectives of data

Categories of Metadata Business Metadata  − It has the data ownership information, business definition, and changing policies. Technical Metadata  − It includes database system names, table and column names and sizes, data types and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices. Operational Metadata  − It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of data migrated and transformation applied on it.

Role of Metadata in DW Metadata acts as a directory. This directory helps the decision support system to locate the contents of the data warehouse. Metadata helps in decision support system for mapping of data when data is transformed from operational environment to data warehouse environment. Metadata helps in summarization between current detailed data and highly summarized data. Metadata also helps in summarization between lightly detailed data and highly summarized data.

Role of metadata Cont’d Metadata is used for query tools. Metadata is used in extraction and cleansing tools. Metadata is used in reporting tools. Metadata is used in transformation tools. Metadata plays an important role in loading functions.

Metadata Repository Metadata repository is an integral part of a data warehouse system . A metadata repository is a database that is used to store metadata about data . A metadata repository can be used to manage, organize, and maintain metadata in a consistent and structured manner, and can facilitate the discovery, access, and use of data.

BENEFITS OF METADATA REPOSITORY Improved data quality:  It can help to ensure that metadata is consistently structured and accurate, which can improve the overall quality of the data. Increased data accessibility:   It can make it easier for users to access data and understand the data, by providing context and information about the data.

Enhanced data integration:  It can facilitate data integration by providing a common place to store and manage metadata from multiple sources. Improved data governance:  It can help to enforce metadata standards and policies, making it easier to ensure that data is being used and managed appropriately.

Enhanced data security:  It can help protect the privacy and security of metadata, by providing controls to restrict access to sensitive or confidential information.

Challenges of Metadata Management Lack of standardization:  Different organizations may use different standards for metadata, which can make it difficult to effectively manage metadata across different sources. Data quality:  Poorly structured or incorrect metadata can lead to problems with data quality, making it more difficult to use and understand the data.

Data integration:  When integrating data from multiple sources, it can be challenging to ensure that the metadata is consistent and aligned across the different sources . Data governance:  Establishing and enforcing metadata standards and policies can be difficult, especially in large organizations with multiple stakeholders.

Data security:  Ensuring the security and privacy of metadata can be a challenge, especially when working with sensitive or confidential information.

REASONS FOR DEVELOPMENT OF DW The end-user’s demand for new system or architecture. Online processing techniques High storage capacity Need for integrated data The need to include unstructured data for analytics purpose in the data mixture.

THANK YOU FOR LISTENING
Tags