Complete Overview: Data Warehouses, Data Lakes, and Data Marts

Chamod Kandage
Exploring Data Warehouse

Contents
1. Data Lake vs Data Warehouse vs Data Mart
2. Three-Tier Architecture of a Data Warehouse
3. Data Warehouse Modeling / Data Cube
4. Data Warehouse Design and Usage
5. Data Warehouse Implementation
6. References

Data Lake vs Data Warehouse vs Data Mart

Introduction to Data Analytics
●Data analytics, also known as business intelligence, uses strategies and technologies to gain insights from business data.
●Data mining plays a core role in data analytics and business intelligence.
●Data warehouses, OLAP tools, and data lakes are essential for enterprises.
●Data warehouses provide Online Analytical Processing (OLAP) tools for interactive analysis of data.
●Data lakes collect extensive data and integrate metadata so that data exploration can be effective.

Defining a Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data, designed to support management's decision-making process.
●Subject-Oriented: Organized around major subjects, like customer, supplier, product, and sales, rather than day-to-day operations.
●Integrated: Constructed by integrating multiple heterogeneous sources, using data cleaning and integration techniques to ensure consistency.
●Time-Variant: Includes historical data, allowing for analysis over time.
●Nonvolatile: Data is not changed or deleted, only added to.
Data warehouses provide a solid platform for consolidating historical data for analysis. They store structured and cleansed data in a highly organized, schema-based format, serving as the foundation of business intelligence by offering a unified view of data for informed decision-making.

Data Marts
A data mart is a subset of a data warehouse, designed for a specific business unit or department.
●Data marts contain a focused set of data relevant to a particular group's needs.
●Data marts make it easier to access and analyze data for specific functions, such as sales, marketing, or finance.
●Data marts usually contain well-structured and pre-aggregated data, optimized for specific analytical purposes.
●Data marts can be independent or dependent, with dependent data marts sourced from an enterprise data warehouse.

Data Lakes
●A data lake is a single repository of all enterprise data in its natural format.
○It includes relational data, semi-structured data (e.g., XML, CSV, JSON), unstructured data (e.g., emails, PDF files), and binary data (e.g., images, audio, video).
●Data lakes store both raw data and transformed data.
●Data lakes can support all users in an organization, including operational users, analysts, and executives.
●Data lakes are often hosted on a cloud-based or distributed data repository.
●Data lakes follow a more bottom-up, democratic approach and support quick prototyping.
●Many analytical tasks, such as reporting, visualization, analytics, and data mining, can be conducted on data lakes.
●Data lakes can support self-service business intelligence.

Key Differences: Data Warehouse vs. Data Lake
Feature          Data Warehouse                        Data Lake
Data Type        Structured, cleansed                  All types, including raw data
Schema           Predefined                            Schema-on-read
Users            Data analysts and executives          All users in an organization
Structure        Top-down, centralized                 Bottom-up, democratic
Purpose          Decision support, structured queries  Exploration, advanced analytics
Implementation   Time-consuming, resource-intensive    Quicker to implement
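The Schema row above is the sharpest contrast in practice. Below is a minimal Python sketch of schema-on-read, assuming pandas is available and using invented JSON-lines records: the raw data lands with no enforced schema, and structure and types are applied only when the data is read for analysis.

```python
import json
import pandas as pd

# Raw, semi-structured records as they might land in a data lake
# (hypothetical JSON-lines events; the second record has an extra field).
raw_lines = [
    '{"user": "u1", "event": "click", "ts": "2024-05-01T10:00:00"}',
    '{"user": "u2", "event": "view",  "ts": "2024-05-01T10:01:00", "page": "/home"}',
]

# Schema-on-read: parse and type the data only at consumption time.
records = [json.loads(line) for line in raw_lines]
df = pd.DataFrame.from_records(records)   # missing fields become NaN
df["ts"] = pd.to_datetime(df["ts"])       # types are applied by the reader
print(df.dtypes)
```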

Real-World Examples: Data Warehouse
Retail: A retail company uses a data warehouse to combine sales data, inventory data, and customer data to understand purchasing patterns and optimize inventory levels.
Finance: A bank uses a data warehouse to analyze customer transactions, identify fraud, and assess risk.
Healthcare: A hospital uses a data warehouse to analyze patient data, improve treatment plans, and manage resources.
●For example, data warehouses help in the analysis of the development and course of diseases, as well as the impact of different medications.

Real-World Examples: Data Lake
E-commerce: An e-commerce company uses a data lake to store and analyze customer clickstream data, social media data, and product reviews to personalize recommendations and improve marketing campaigns.
●For example, a data scientist may use data sets like purchase transactions, communication documents with related customers, product categories, product reviews, and product images to design a marketing campaign.
Manufacturing: A manufacturing company uses a data lake to store sensor data from machines, production data, and quality data to identify bottlenecks and improve efficiency.
Research: A research organization uses a data lake to store and analyze diverse datasets from multiple sources to discover trends and patterns.

Real-World Examples: Data Mart
Sales: A sales department uses a data mart to analyze sales performance, identify top-performing products, and track customer trends within their specific region.
Marketing: A marketing department uses a data mart to analyze campaign effectiveness, track customer engagement, and optimize their marketing strategies.
Finance: A finance department uses a data mart to analyze financial performance, monitor expenses, and track revenue within their specific business unit.

Three-Tier Architecture of a Data Warehouse

Three-Tier Architecture Overview
A data warehouse commonly uses a three-tier architecture. This architecture provides a framework to manage data from its origin to its use in decision-making. The three-tier approach separates data storage, processing, and analysis, offering flexibility and scalability. This structure supports the mission of data warehousing, which is to make data accessible for analysis and decision support.

Bottom Tier - Warehouse Database Server
The Data Storage Layer:
●This level consists of a mainstream database system, such as a relational database or a key-value store.
●Back-end tools and ETL (extraction, transformation, loading) utilities feed data into this tier.
○These tools are responsible for extracting, cleaning, and transforming data from operational databases and external sources.
○They also handle loading and refreshing data in the warehouse.
●This tier also includes a metadata repository.
○The metadata repository stores information about the data warehouse and its content.
○Metadata includes data definitions and names, timestamps, the source of extracted data, and any missing fields added through data cleaning or integration.

Middle Tier - OLAP Server
This tier is typically implemented using either a Relational OLAP (ROLAP) model or a Multidimensional OLAP (MOLAP) model.
ROLAP Models
Utilize an extended relational DBMS to map operations on multidimensional data to standard relational operations.
MOLAP Models
Use a special-purpose server that directly implements multidimensional data and operations.

Top Tier - Front-End Client Layer
Access and Utilization
●This layer includes tools for querying, reporting, visualization, analysis, and data mining, such as trend analysis and prediction.
●It allows users to interact with the data for informed decision-making.

Data Flow
The Movement of Data
●Data is extracted from operational systems and external sources.
●The extracted data is transformed, cleansed, integrated, and summarized before being placed in the warehouse.
●Data that is not needed in the warehouse is removed during the conversion process.

ETL Process
The Engine of the Data Warehouse
●ETL is the process of extracting data from source systems, transforming it, and loading it into a data warehouse.
●ETL tools perform data extraction, cleaning, transformation, loading, and refreshing of data in the warehouse.
●ETL processes are essential for maintaining the correctness, completeness, and freshness of the data warehouse.

Data Warehouse Modeling / Data Cube

Multidimensional Data Model
●Multidimensional data models are used by data warehouses and OLAP tools to view data in the form of a data cube.
●A data cube is defined by dimensions and facts.
●Dimensions are descriptive attributes that classify and identify events, acting as "entry points" for data analysis. Dimensions can include hierarchies.
○Dimension hierarchies organize data into logical structures with different levels of granularity. For example, a time dimension could include year, month, and day levels.
○Concept hierarchies map low-level concepts to higher-level ones, such as mapping cities to provinces or states, and then to countries (see the sketch after this list).
●Fact tables store numerical measurements, known as "facts," along with foreign keys referencing the related dimension tables.
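As a toy illustration of a concept hierarchy, here is a minimal Python sketch (the mapping tables and place names are invented for the example) that rolls a low-level concept up two levels:

```python
# Hypothetical location hierarchy: city -> province/state -> country.
city_to_province = {"Vancouver": "British Columbia", "Toronto": "Ontario"}
province_to_country = {"British Columbia": "Canada", "Ontario": "Canada"}

def roll_up_city(city: str) -> str:
    """Map a low-level concept (city) to a higher-level one (country)."""
    return province_to_country[city_to_province[city]]

print(roll_up_city("Vancouver"))  # -> Canada
```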

Schemas for Multidimensional Data Models
Star Schema
●A star schema has a central fact table with multiple dimension tables.
●The fact table stores measurements with foreign keys linking to dimensions.
●Dimension tables store descriptive attributes and are directly connected to the fact table.
●This design is simple and efficient for querying and reporting (see the sketch below).
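A minimal sketch of a star schema using Python's built-in sqlite3 module; the table and column names are invented for illustration. One central fact table holds the measures and foreign keys, and a typical star-join query aggregates a measure by dimension attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables hold descriptive attributes; the fact table holds
# measures plus foreign keys pointing at each dimension.
conn.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT, country TEXT);

CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units_sold INTEGER,
    revenue    REAL
);
""")

# A typical star-join query: aggregate a measure by dimension attributes.
rows = conn.execute("""
    SELECT t.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_time t    ON f.time_id = t.time_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY t.year, p.category
""").fetchall()
print(rows)  # empty here, since no rows were loaded
```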

Schemas for Multidimensional Data Models
Snowflake Schema
●A snowflake schema is a normalized variant of the star schema, splitting some dimension tables into sub-dimension tables.
●This reduces redundancy in dimension tables but increases complexity due to more tables and joins.
●While it saves some storage, the space savings are negligible compared to the size of the fact table.
●More joins are needed, which may reduce query performance.

Schemas for Multidimensional Data Models
Fact Constellation Schema (Galaxy Schema)
●A fact constellation uses multiple fact tables sharing dimension tables.
●Each fact table is connected to its own set of dimension tables.
●Shared dimensions reduce redundancy and ensure consistency across fact tables.
●This schema is suitable for complex data warehousing needs with multiple business processes.
●Each fact table has measures specific to it.

Data Cube
●A data cube allows data to be modeled and viewed in multiple dimensions, enabling efficient computation of aggregations across various sets of dimensions.
●A data cube consists of a lattice of cuboids, each representing a different level of summarization of the multidimensional data.
●Cuboids represent different groupings of dimensions used for data analysis.
●Full materialization involves computing all possible cuboids, while partial materialization computes only a subset (see the sketch after this list).
○Iceberg cubes store only the cube cells with an aggregate value above a specified minimum threshold.
○Shell fragments compute cuboids involving a small number of dimensions.
●Data cube computation:
○Efficient computation of data cubes is crucial for data warehouse systems.
○Techniques like quotient cubes can reduce redundancy.
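To make partial materialization concrete, here is a rough Python sketch (pandas assumed; the data and threshold are invented) that enumerates every cuboid in a two-dimension lattice and keeps only the iceberg cells whose aggregate meets a minimum threshold:

```python
import itertools
import pandas as pd

sales = pd.DataFrame({
    "product": ["A", "A", "B", "B", "B"],
    "region":  ["East", "West", "East", "East", "West"],
    "amount":  [10, 20, 30, 40, 5],
})

dims = ["product", "region"]
min_agg = 30  # iceberg condition: keep cells with SUM(amount) >= 30

# The lattice of cuboids: (), (product), (region), (product, region).
for k in range(len(dims) + 1):
    for group in itertools.combinations(dims, k):
        if group:
            agg = sales.groupby(list(group))["amount"].sum()
        else:
            agg = pd.Series({"all": sales["amount"].sum()})  # apex cuboid
        kept = agg[agg >= min_agg]  # materialize only qualifying cells
        print(group or "(apex)", kept.to_dict())
```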

OLAP Operations
OLAP (Online Analytical Processing) tools provide interactive analysis of multidimensional data; a small sketch of these operations follows the list below.
●Roll-up generalizes data to higher levels of aggregation, for example, from city to country.
●Drill-down allows users to explore data at finer levels of granularity, for example, from country to city.
●Other OLAP operations include slice and dice, which restrict the analysis to specific attributes and/or objects.
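A rough pandas sketch of these operations on a toy fact table (data invented; groupby stands in for aggregation at a chosen granularity):

```python
import pandas as pd

sales = pd.DataFrame({
    "country": ["US", "US", "DE", "DE"],
    "city":    ["NYC", "LA", "Berlin", "Munich"],
    "month":   ["2024-01", "2024-01", "2024-02", "2024-02"],
    "amount":  [100.0, 150.0, 80.0, 120.0],
})

# Drill-down view: totals at the finer (country, city) granularity.
by_city = sales.groupby(["country", "city"])["amount"].sum()

# Roll-up: generalize from city to country by dropping the city level.
by_country = sales.groupby("country")["amount"].sum()

# Slice: fix a single dimension value.
jan_slice = sales[sales["month"] == "2024-01"]

# Dice: restrict several dimensions at once.
us_jan_dice = sales[(sales["country"] == "US") & (sales["month"] == "2024-01")]

print(by_country)
```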

Data Cube Implementation
Multidimensional views can be implemented using:
●Relational OLAP (ROLAP): uses relational databases to store and manage data.
●Multidimensional OLAP (MOLAP): uses special-purpose servers that directly implement multidimensional data and operations.
●Hybrid OLAP (HOLAP): combines aspects of both ROLAP and MOLAP.

Integration with Data Mining
●Data mining techniques can be integrated with OLAP operations to enhance knowledge discovery at multiple levels of abstraction.
●OLAP operations can be adapted for exploratory data mining using operations on concept lattices.
●Statistical modeling in data warehouses can be used for pattern discovery (e.g., outliers).

Data Warehouse Design and Usage

Introduction to Design
Data warehouses are essential for business intelligence and decision support.
●They consolidate data from various sources to provide a unified view for analysis.
●Data warehouses support both structured and ad hoc queries, analytical reporting, and decision-making processes.
●The core mission of a data warehouse is to publish an organization’s data assets to support decision-making effectively.
●Data warehouses are not simply copies of operational systems; they transform data for analytical purposes.

Data Warehouse Components
●The basic components of a data warehousing system include:
○Data migration processes.
○The warehouse itself.
○Access tools.
●A typical data warehouse has a "back room" where data is prepared and a "front room" for query and analysis.
●The back room is responsible for extracting, cleaning, conforming, and delivering data.
●The front room is dedicated to data access for end-users.
●ETL (Extraction, Transformation, and Loading) processes are crucial for populating a data warehouse.

Data Warehouse Design: Key Considerations
Data warehouse design involves several key aspects:
●Requirements analysis: Understanding user needs and business objectives is fundamental.
●Data modeling: Creating a structure that suits analytical needs.
●Data quality: Ensuring data is accurate, complete, and consistent.
●ETL process design: Developing efficient workflows for extracting, transforming, and loading data.
●Metadata management: Metadata describes the data and the processes that act upon the data.

ETL Processes in Detail
ETL processes are responsible for extracting data from source systems, transforming it, and loading it into the data warehouse (see the sketch below).
●Data sources are often heterogeneous, including relational databases, flat files, and online transaction records.
●Extraction: Data is extracted from source systems using specialized tools.
●Transformation: Data is cleaned, formatted, integrated, and summarized before being loaded into the warehouse.
●Loading: Transformed data is loaded into the data warehouse.
●ETL processes are not merely data movement; they prepare data for analytical use.
●The ETL process requires a "staging area" where data can be transformed.
●ETL design should consider:
○Handling large data volumes.
○Data quality and consistency.
○Performance and efficiency.
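A minimal end-to-end sketch of the three steps, in Python with invented source data and table names; sqlite3 stands in for the warehouse and an in-memory CSV for the source system:

```python
import csv
import io
import sqlite3

# --- Extract: pull raw rows from a source (an in-memory CSV here). ---
source = io.StringIO("order_id,customer,amount\n1,alice,100.5\n2,Bob,\n3, carol ,42\n")
raw_rows = list(csv.DictReader(source))

# --- Transform: clean and conform the data; reject unusable records. ---
staged = []
for row in raw_rows:
    amount = row["amount"].strip()
    if not amount:
        continue  # data-quality rule: drop rows missing the measure
    staged.append((int(row["order_id"]),
                   row["customer"].strip().title(),  # standardize names
                   float(amount)))

# --- Load: write the conformed rows into the warehouse table. ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", staged)
print(conn.execute("SELECT * FROM fact_orders").fetchall())
# -> [(1, 'Alice', 100.5), (3, 'Carol', 42.0)]
```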

Data Usage and Analysis
●Data warehouses facilitate OLAP (Online Analytical Processing), enabling interactive analysis of data.
●OLAP operations enable users to analyze data at different granularities, such as roll-up and drill-down.
●Data warehouses support a variety of analytical techniques:
○Querying and reporting.
○Data mining.
○Integration with other data analysis tools.
●Data warehouses are used for decision support, providing insights into business performance.
●They help organizations understand past trends and make informed decisions for the future.

Data Refreshment
●Data warehouse refreshment is about keeping the content of the data warehouse up to date.
○This process can be challenging due to data volume, data quality issues, and performance needs.
●The ETL workflow is designed to address data refreshment needs.
○The ETL workflow has a lifecycle involving reverse engineering, logical and physical design, and software construction.
○Efficient workflow design and management are essential for keeping the data warehouse current and accurate.

Data Warehouse Implementation

Data Warehousing: The Foundation
Data warehouses serve as the foundation of business intelligence.
●They provide a simple and concise view of subject issues by excluding data that is not essential for decision support.
●Data is extracted from operational systems and transformed before being placed in the warehouse.
●The transformation process involves reformatting, cleansing, integrating, and summarizing data.
●A data warehouse is constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records.

Core Components of a Data Warehousing System
●The basic components include:
○Data migration
○The warehouse itself
○Access tools
●Data is extracted from operational systems, transformed, and then loaded into the warehouse.
●Operational data that is not needed for decision support is removed during the conversion process.
●The ETL (Extraction, Transformation, Loading) process is critical for preparing data for the warehouse.

ETL Workflow Challenges
●Key factors that can cause issues in ETL workflows include:
○Large data volumes
○Data quality problems
○Performance
○Changes in data sources
○Changes in the data warehouse itself
●ETL processes are data intensive and can involve dozens of sources, complex cleaning and transformation activities, and loading facilities.
●The development of an in-house ETL process can consume a large amount of resources in a data warehouse project.

ETL Workflow Modeling
●A modeling approach is based on the life cycle of ETL processes, which includes:
○Reverse engineering and requirements collection
○Logical design
○Tuning and physical design
○Software construction
●The ETL conceptual model focuses on documenting the specifics of the data sources in relation to the data warehouse.
●It is constructed in the early stages of a data warehouse project.
●The ETL model is not just another process/workflow model, but is used to formalize the data sources.

Data Quality and Requirements
●Data quality is a critical aspect of a data warehouse.
●A data warehouse design method should support complete and consistent elicitation of functional and nonfunctional requirements.
●Functional requirements consider the queries issued in applications, while nonfunctional requirements comprise data structures and data quality.
●The process of requirements elicitation should be based on the quality of the data.
●Requirements must be matched with the available data.

Data Source Selection
●Data source selection is a cornerstone of the data warehouse implementation methodology.
●Interviews with data-producer users are carried out to determine the quality of data in the data sources.
●The goal is to match user requirements with the available data.
●A comprehensive methodology includes steps for scenario definition, information gathering, requirements integration, data source selection, and document generation.

Data Warehouse Refreshment
●Data warehouse refreshment involves methods for designing efficient workflows of tasks within the ETL processes.
●It addresses challenges such as:
○Huge data volumes
○Ensuring data quality
○High performance
○Adapting workflows after changes in data sources
○Adapting to changes in the data warehouse itself
●The life cycle of the data warehouse and its ETL process includes reverse engineering of sources, logical design, tuning and physical design, and software construction.
●The aim is to optimize the design and implementation of ETL workflows.

Data Staging Area
●The staging area is where data is written to disk at major checkpoints of the ETL data flow.
●Staging data structures are needed to meet the various needs of the ETL process.
●Metadata is essential for supporting the staging area, covering planning, design standards, and data descriptions.
●The staging area supports the four major checkpoints of the ETL data flow: extracted, cleaned, conformed, and delivered data.
●Metadata for the ETL process includes logical data maps, calculations for derived data, and source database descriptions.

Metadata Management
●Metadata is crucial for a data warehouse.
●It includes data lineage, business definitions, technical definitions, and process descriptions.
●Metadata helps in understanding and managing data transformations.
●Business definitions are needed by the ETL team to give context to the data they are loading.
●A simple business definition matrix includes source system, data warehouse element, and business definition (a toy example follows).
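Such a matrix can be kept as simply as a list of records; the entries below are invented purely to illustrate the three columns the slide names:

```python
# Hypothetical rows of a business definition matrix: each record ties a
# source-system field to its warehouse element and its business meaning.
business_definition_matrix = [
    {"source_system": "CRM.customers.cust_nm",
     "warehouse_element": "dim_customer.customer_name",
     "business_definition": "Legal name of the customer on record."},
    {"source_system": "POS.sales.amt",
     "warehouse_element": "fact_sales.revenue",
     "business_definition": "Gross sale amount in local currency."},
]
```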

Conclusion
●Implementing a data warehouse is a complex but essential task for business intelligence.
●A successful implementation requires careful planning, a robust ETL process, a focus on data quality, and effective metadata management.
●The ETL process is a critical component of a data warehouse.
●The main goal of a data warehouse is to provide decision support by publishing the organization's data.

References
Han, J., Pei, J. and Tong, H. (2023). Data Mining: Concepts and Techniques. 4th ed. Cambridge, MA: Morgan Kaufmann.
Kimball, R. and Caserta, J. (2004). The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Indianapolis: Wiley.
Wrembel, R. and Koncilia, C. (2007). Data Warehouses and OLAP: Concepts, Architectures, and Solutions. Hershey, PA: IRM Press.
Elmasri, R. and Navathe, S.B. (2016). Fundamentals of Database Systems. 7th ed. Boston: Pearson.

Thank You