Introduction to ETL Data Warehousing.ppt

email2nalinikant 10 views 24 slides Sep 11, 2024
Slide 1
Slide 1 of 24
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24

About This Presentation

ETL process for Data Warehousing


Slide Content

ETL
The process of updating the
data warehouse.

Recent Developments in
Data Warehousing: A Tutorial
Hugh J. Watson
Terry College of Business
University of Georgia
[email protected]
http://www.terry.uga.edu/~hwatson/dw_tutorial.ppt

Two Data Warehousing
Strategies
Enterprise-wide warehouse, top
down, the Inmon methodology
Data mart, bottom up, the Kimball
methodology
When properly executed, both result
in an enterprise-wide data warehouse

The Data Mart Strategy
The most common approach
Begins with a single mart and architected marts
are added over time for more subject areas
Relatively inexpensive and easy to implement
Can be used as a proof of concept for data
warehousing
Can perpetuate the “silos of information” problem
Can postpone difficult decisions and activities
Requires an overall integration plan

The Enterprise-wide Strategy
A comprehensive warehouse is built initially
An initial dependent data mart is built using
a subset of the data in the warehouse
Additional data marts are built using
subsets of the data in the warehouse
Like all complex projects, it is expensive,
time consuming, and prone to failure
When successful, it results in an integrated,
scalable warehouse

Data Sources and Types
Primarily from legacy, operational systems
Almost exclusively numerical data at the
present time
External data may be included, often
purchased from third-party sources
Technology exists for storing unstructured
data and expect this to become more
important over time

Extraction, Transformation, and
Loading (ETL) Processes
The “plumbing” work of data
warehousing
Data are moved from source to
target data bases
A very costly, time consuming part of
data warehousing

Recent Development:
More Frequent Updates
Updates can be done in bulk and
trickle modes
Business requirements, such as
trading partner access to a Web site,
requires current data
For international firms, there is no
good time to load the warehouse

Recent Development:
Clickstream Data
Results from clicks at web sites
A dialog manager handles user interactions.
An ODS (operational data store in the data
staging area) helps to custom tailor the
dialog
The clickstream data is filtered and parsed
and sent to a data warehouse where it is
analyzed
Software is available to analyze the
clickstream data

Data Extraction
Often performed by COBOL routines
(not recommended because of high program
maintenance and no automatically generated
meta data)
Sometimes source data is copied to the target
database using the replication capabilities of
standard RDMS (not recommended because of
“dirty data” in the source systems)
Increasing performed by specialized ETL
software

Sample ETL Tools
Teradata Warehouse Builder from Teradata
DataStage from Ascential Software
SAS System from SAS Institute
Power Mart/Power Center from Informatica
Sagent Solution from Sagent Software
Hummingbird Genio Suite from
Hummingbird Communications

Reasons for “Dirty” Data
Dummy Values
Absence of Data
Multipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys,
Non-Unique Identifiers
Data Integration Problems

Data Cleansing
Source systems contain “dirty data” that must
be cleansed
ETL software contains rudimentary data
cleansing capabilities
Specialized data cleansing software is often
used. Important for performing name and
address correction and householding functions
Leading data cleansing vendors include Vality
(Integrity), Harte-Hanks (Trillium), and Firstlogic
(i.d.Centric)

Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating

Parsing
Parsing locates and identifies
individual data elements in the
source files and then isolates these
data elements in the target files.
Examples include parsing the first,
middle, and last name; street number
and street name; and city and state.

Correcting
Corrects parsed individual data
components using sophisticated data
algorithms and secondary data
sources.
Example include replacing a vanity
address and adding a zip code.

Standardizing
Standardizing applies conversion
routines to transform data into its
preferred (and consistent) format
using both standard and custom
business rules.
Examples include adding a pre name,
replacing a nickname, and using a
preferred street name.

Matching
Searching and matching records
within and across the parsed,
corrected and standardized data
based on predefined business rules
to eliminate duplications.
Examples include identifying similar
names and addresses.

Consolidating
Analyzing and identifying
relationships between matched
records and consolidating/merging
them into ONE representation.

Data Staging
Often used as an interim step between data
extraction and later steps
Accumulates data from asynchronous sources using
native interfaces, flat files, FTP sessions, or other
processes
At a predefined cutoff time, data in the staging file is
transformed and loaded to the warehouse
There is usually no end user access to the staging file
An operational data store may be used for data
staging

Data Transformation
Transforms the data in accordance
with the business rules and standards
that have been established
Example include: format changes,
deduplication, splitting up fields,
replacement of codes, derived values,
and aggregates

Data Loading
Data are physically moved to the data
warehouse
The loading takes place within a “load
window”
The trend is to near real time updates
of the data warehouse as the
warehouse is increasingly used for
operational applications

Meta Data
Data about data
Needed by both information technology
personnel and users
IT personnel need to know data sources and
targets; database, table and column names;
refresh schedules; data usage measures; etc.
Users need to know entity/attribute
definitions; reports/query tools available;
report distribution information; help desk
contact information, etc.

Recent Development:
Meta Data Integration
A growing realization that meta data is
critical to data warehousing success
Progress is being made on getting vendors
to agree on standards and to incorporate
the sharing of meta data among their tools
Vendors like Microsoft, Computer
Associates, and Oracle have entered the
meta data marketplace with significant
product offerings
Tags