Chapter # 2 Data Warehouse Architecture & Models Text Book: Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas 2
Learning Objectives The main objective of this chapter is to let the students know the basic structure of a Data warehouse. Learn about Data warehouse physical architecture and various principles of a Data warehousing. To make students understand a data warehouse model. It also explains various types of multidimensional models and Schemas. 3
Major Topics Structure of a Data Warehouse Data Warehouse Architectures Three Data Warehouse Models Extraction, Transformation and Loading (ETL) Design of Data Warehouse Multidimensional Data Model Data cube Conceptual Modeling of Data Warehouses Dimension Schema 4
Structure of a Data Warehouse Essentially, a data warehouse provides historical data for decision-support applications. Such applications include reporting, online analytical processing (OLAP), executive information systems (EIS), and data mining. Data warehouse is a centralized, integrated repository of information. 5
Here, integrated means cleaned up, merged, and redesigned. This may be more or less complicated Depending on how many systems feed into a warehouse and how widely they differ in handling similar information. But most companies already have repositories of information in their production systems and many of them are centralized. 6
Data Warehouse Architectures : Conceptual View Single-layer –Every data element is stored once only –Virtual warehouse 7
Two-layer Real-time + derived data –Most commonly used approach in industry today 8
Three-layer Architecture: Conceptual View Transformation of real-time data to derived data really requires two steps 9
The architecture consists of various interconnected elements: Operational and external database layer – the source data for the DW Information access layer – the tools the end user access to extract and analyze the data Data access layer – the interface between the operational and information access layers Metadata layer – the data directory or repository of metadata information 10
Additional layers are: Process management layer – the scheduler or job controller Application messaging layer – the “middleware” that transports information around the firm Physical data warehouse layer – where the actual data used in the DSS are located Data staging layer – all of the processes necessary to select, edit, summarize and load warehouse data from the operational and external data bases 11
Components of the Data Warehouse Architecture 12
Data Warehouse Architecture 13
Three Data Warehouse Models Enterprise Warehouse - Collects all of the information about subjects spanning the entire organization. The Central data warehouse - A single physical database contains all of the data for a specific functional area. The distributed data warehouse - The components are distributed across several physical databases. 14
Extraction, Transformation and Loading (ETL) Data extraction - get data from multiple, heterogeneous, and external sources Data cleaning –detect errors in the data and rectify them when possible Data transformation –convert data from legacy or host format to warehouse format Loading –sort, summarize, consolidate, compute views, check integrity, and build indices and partitions Refresh –propagate the updates from the data sources to the warehouse 15
Design of Data Warehouse: A Business Analysis Framework Four views regarding the design of a data warehouse Top-down view It allows selection of the relevant information necessary for the data warehouse Data source view It exposes the information being captured, stored, and managed by operational systems Data warehouse view It consists of fact tables and dimension tables Business query view It shows the perspectives of data in the warehouse from the view of end-user. 16
Multidimensional Data Model From Tables and Spreadsheets to Data Cubes A data warehouse is based on a multidimensional data model which views data in the form of a data cube A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions Dimension tables [ Eg . item ( item_name , brand, type), or time(day, week, month, quarter, year) ] Fact table contains measures (such as dollars_sold ) and keys to each of the related dimension tables In data warehousing literature, an n-D base cube is called a base cuboid . The top most 0-D cuboid , which holds the highest-level of summarization, is called the apex cuboid . The lattice of cuboids forms a data cube. 17
18
Data Cube n-dimensional cube C[A1,A2,…An] is a database with n dimensions as A1,A2…An each of which represents a theme and contains |Ai| number of distinct elements in the dimension Ai. Each distinct element of Ai corresponds to a data row of C. A data cell in the cube C[a1,a2,..an] stores the numeric measures of the data for Ai= ai , for all i . Eg . Data cube C[place, product, year] Data Cell [Peshawar, TV, 4Qtr] stores 314 as numeric measure which represents three dimensions with 3,1 and 4rows respectively. 19
A Sample Data Cube 20
21
Typical Data cube OLAP Operations Roll up (drill-up): Summarize data by climbing up hierarchy or by dimension reduction Drill down(roll down): Reverse of roll-up From higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and Dice: Project and select Pivot (rotate): reorient the cube visualization, 3D to series of 2D planes Drill across: involving (across) more than one fact table Drill through: through the bottom level of the cube to its back-end relational tables (using SQL) 22
23
Conceptual Modeling of Data Warehouses Dimension Schema Modeling data warehouses: dimensions & measures Starschema : A fact table in the middle connected to a set of dimension tables Snowflakeschema : A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Factconstellations : Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 24
Star Schema Easy for users to understand Fast response to queries Simple metadata Supported by many front end tools Less robust to change Slower to build Does not support history 25
Example of STAR Schema 26
Snowflake Schema Direct use by some tools More flexible to change Provides for speedier data loading May become large and unmanageable Degrades query performance More complex metadata 27
28
Fact Constellation Measure of online analytical processing Collection of multiple fact tables Sharing dimension tables Viewed as a collection of stars. Improvement over Star schema . 29