Introduction to Data Warehouse

7,391 views 52 slides May 19, 2023
Slide 1
Slide 1 of 52
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51
Slide 52
52

About This Presentation

Data Warehouse – Introduction, characteristics, architecture, scheme and modelling, Differences between operational database systems and data warehouse.


Slide Content

UBN204A62
Mr.T.Somasundaram
Assistant Professor
Department Of Management
KristuJayantiCollege (Autonomous), Bengaluru

UNIT 1: DATA WAREHOUSE
DataWarehouse–Introduction,
characteristics,architecture,scheme
andmodelling,Differencesbetween
operationaldatabasesystemsand
datawarehouse.

Let’s go for thinking
Data Set (Combo)Item Set
Transaction / Purchasing
Frequent items

This is what need of Data Warehouse

What is Data Warehouse?
DataWarehouseisanenvironment,notaproduct.
Dataisoftenscatteredacrossdifferentdatabase,itneedDWto
getcompleteinformation.
Itisaimedateffectiveintegrationofoperationaldatabasesthat
enablesstrategicuseofdata.

Introduction:
DataWarehousingintegratesdataandinformationcollected
fromvarioussourcesintoonecomprehensivedatabase.
(E.g.)Customerinformationfromorganization’spoint-of-sale
systems,itsmailinglists,websiteandcommentcards,etc.
DataWarehouseisacentralizedstoragesystemorcentral
repositoryforstoring,analyzinginformationandinterpretingof
datainordertofacilitatebetterdecisionmaking.
Adatawarehouseisatypeofdatamanagementsystemthat
facilitatesandsupportsbusinessintelligence(BI)activities,
specificallyanalysis.
Itisprimarilydesignedtofacilitatesearchesandanalyses
usuallycontainlargeamountsofhistoricaldata.
Data Warehousing -Introduction

Investment&InsuranceCompanies–toanalyzecustomer&
markettrendsandallieddatapatterns.
RetailChains–usedformarketinganddistributiontotract
items,examinepricingpoliciesandanalyzebuyingtrendsof
customers.
Healthcare–togeneratetreatmentreports,sharedatawith
insurancecompanies&medicalunits.
Airline–operationpurposelikecrewassignment,route
profitability,frequentflyerprogrampromotions,etc.
Banking–tomanageresourcesavailableondeskeffectively.
PublicSector–usedforintelligencegathering,tomaintain&
analyzetaxrecords,healthpolicyrecords,etc.
Telecommunication–usedforproductpromotions,sales
decisionsandtomakedistributiondecisions.
Data Warehouse Usage -Examples

Data Warehouse system has another name

Definition:
“Adatawarehouseisasingle,completeandconsistentstoreof
dataobtainedfromavarietyofsourcesandmadeavailabletoendusers
inawaytheycanunderstandanduseinabusinesscontext.”
“Adatawarehouseisacollectionofcorporateinformation
deriveddirectlyfromoperationalsystemsandsomeexternaldata
sources.”
“Adatawarehouseisasubject-oriented,integrated,time-
variantandnon-volatilecollectionofdatainsupportofmanagement’s
decisionmakingprocess”. -WilliamH.Inmon
Adatawarehousecanbedefinedasacollectionof
organizationaldataandinformationextractedfromoperational
sourcesandexternaldatasources.
Data Warehouse -Definition

DataWarehouseworksasacentralrepository,whereinformation
arrivesfromoneormoredatasources.
Dataflowsintodatawarehousefromtransactionalsystemand
otherrelationaldatabases.
Datamaybestructured,semi-structuredandunstructureddata.
Itisprocessed,transformedandingestedthroughBItools,SQL
clientsandspreadsheets.
Datawarehousecontainmultipledatabasesanddataisorganized
intotablesandcolumnsineachdatabase.
Datastoredinvarioustablesdescribedbyschema&Querytoolsuse
theschematodeterminewhichdatatablestoaccessandanalyze.
Datastoredincolumndescribethedatasuchasinteger,datafield,
string,etc.
Datawarehousingmakesdataminingpossible.
How Data Warehouse works?

1.Subject-Oriented:
Itprovidestopicwiseinformationratherthantheoverallprocessesof
abusiness.(i.e.)sales,inventory,promotion,etc.
2.Integrated:
DWisdevelopedbyintegratingdatafromvariedsourceintoa
consistentformat.
Itisstoredindatawarehouseinaconsistentmannerintermsof
naming,format&coding,whichfacilitatesdataanalysis.
3.Non-Volatile:
Dataonceenteredintoadatawarehousemustremainunchanged&
alldataisreadonly.
4.TimeVariant:
Datastoredinadatawarehouseisdocumentedwithanelementof
time,eitherexplicitlyorimplicitly.
Itisexhibitedinprimarykeywithelementoftimelikeday,week,etc.
Characteristics of Data Warehouse

1.EnterpriseDataWarehouse(EDW):
Itservesasakeyorcentraldatabasethatfacilitatesdecision
supportservicesthroughouttheenterprise.
Itprovidesaccesstocross-organizationalinformation,
offersaunifiedapproachtodatarepresentation.
2.OperationalDataStore(ODS):
Itispreferredforroutineactivitieslikestoringemployee
records.
3.DataMart:
Itisasubsetofadatawarehousebuilttomaintaina
particulardepartment,regionorbusinessunit.
Everydepartmentofabusinesshasacentralrepositoryor
datamarttostoredata.
Types of Data Warehouse

Data Warehouse Vs Data base
CharacteristicsData Warehouse Transactional Database
Suitable
workloads
Analytics, reporting, big
data
Transaction processing
Data source
Data collected and
normalized from many
sources
Data captured as-is from a
single source, such as a
transactional system
Data capture
Bulk write operations
typically on a
predetermined batch
schedule
Optimized for continuous
write operations as new
data is available to
maximize transaction
throughput

Data Warehouse Vs Data base
CharacteristicsData Warehouse Transactional Database
Data
normalization
Denormalizedschemas,
such as the Star schema
or Snowflake schema
Highly normalized, static
schemas
Data storage
Optimized for simplicity
of access and high-speed
query performance using
columnar storage
Optimized for high
throughout write
operations to a single row-
oriented physical block
Data access
Optimized to minimize
I/O and maximize data
throughput
High volumes of small
read operations

Data Warehouse Vs Data lake
Characteristics Data Warehouse Data Lake
Data
Relational data from
transactional systems,
operational databases, and line
of business applications
Alldata, including
structured, semi-
structured, and
unstructured
Schema
Often designed prior to the data
warehouseimplementation but
also can be written at the time of
analysis (schema-on-write or
schema-on-read)
Written at the time of
analysis (schema-on-
read)
Price/Performance
Fastest query results using local
storage
Query results getting
faster using low-cost
storage and decoupling
of compute and storage

Data Warehouse Vs Data lake
CharacteristicsData Warehouse Data Lake
Data quality
Highly curated data that
serves as the central
version of the truth
Any data that may or may not
be curated (i.e. raw data)
Users
Business analysts, data
scientists, and data
developers
Business analysts (using
curated data), data scientists,
data developers, data
engineers, and data architects
Analytics
Batch reporting, BI, and
visualizations
Machine learning,
exploratory analytics, data
discovery, streaming,
operational analytics, big
data, and profiling

Thefourcomponentsofdatawarehouseare–
1.LoadManager(FrontComponent):
Itperformswithalltheoperationsassociatedwiththeextractionand
loadofdataintowarehouse.
2.WarehouseManager:
Itperformsoperationsassociatedwiththemanagementofdatainthe
warehouse.
Itperformsoperationslikeanalysisofdatatoensureconsistency,
creationofindexes&views,transformation,mergingofsourcedata.
3.QueryManager(BackendComponent):
Itperformsalloperationrelatedtomanagementofuserqueries.
4.End-Useraccesstools:
Itiscategorizedinto5groupslikedatareporting,querytools,
applicationdevelopmenttools,EIStools,OLAPtoolsanddatamining
tools.
Data Warehouse –Components

DataWarehouseapplicationscanbecategorizedas:
Queryandreportingtools.
ApplicationDevelopmenttools.
DataMiningtools.
OLAPtools.
Somepopulardatawarehousetoolsare-
Xplenty.
AmazonRedshift.
Teradata.
Oracle12c.
Informatica.
IBMInfosphere.
Cloudera.
Panoply.
Data Warehouse –Tools

Thereareseveralbenefitsofdatawarehouseforenduserslike–
Improveddataconsistency,dataqualityandaccuracy.
Betterbusinessdecisions.
Easieraccesstoenterprisedataforend-users.
Betterdocumentationofdata.
Historicaldataanalysis.
Reducedcomputercostsandhigherproductivity.
Enablingend-userstoaskad-hocqueriesorreportswithout
deterringtheperformanceofoperationalsystems.
Collectionorconsolidationofrelateddatafromvarioussources
intoaplace.
Benefits of Data Warehouse

Itallowsbusinessuserstoquicklyaccesscriticaldatafromsome
sources.
Itprovidesconsistentinformationonvariouscross-functional
activities.
Ithelpstointegratemanysourcesofdatatoreducestresson
productionsystem.
Ithelpstoreducetotalturnaroundtimeforanalysis&reporting.
Itstoresalargeamountofhistoricaldataandhelpsusersto
analysedifferenttimeperiods.
Restructuring&Integrationmakeiteasierforusertousefor
reporting&analysis.
Advantages of Data Warehouse

DataWarehousecanbeoutdatedrelativelyquickly.
Difficulttomakechangesindatatypesandranges,data
sourceschema,indexesandqueries.
Datawarehouseseemeasy,butactually,itistoocomplexfor
averageusers.
Sometimewarehouseuserswilldevelopdifferentbusiness
rules.
Organizationneedtospendlotsoftheirresourcesfor
trainingandimplementationpurpose.
Thisisnotanidealoptionforunstructureddata.
Disadvantages of Data Warehouse

1.BasicStructure(SingleTier):
Structure of Data Warehouse (Architecture)
Data
Warehouse
Other Data inputs
Update new data
Cleaning
Reformatting
OLAP
DSS
Data
mining
DB

Sample Data Warehouse Architecture

2.GenericTwoLevel(TwoTier):
Two Tier Architecture
Sources
Sources
(file)
Sources
(DB)
Sources
(External)
Data
Warehouse
Translation &
Integration
Decision Support
Environment

TwoTier(Level)Architecture:
Two-layerarchitectureisoneoftheDataWarehouselayers
whichseparatesphysicallyavailablesourcesanddata
warehouse.Thisarchitectureisnotexpandableandalso
notsupportingalargenumberofend-users.Italsohas
connectivityproblemsbecauseofnetworklimitations.
Atwo-tierarchitectureincludesastagingareaforalldata
sources,beforethedatawarehouselayer.Byaddinga
stagingareabetweenthesourcesandthestoragerepository,
youensurealldataloadedintothewarehouseiscleansed
andintheappropriateformat.

3.GenericThreeLevel(ThreeTier):
Three Tier Architecture
Sources
Sources
(file)
Sources
(DB)
Sources
(External)
Expanded
Data
Warehouse
Translation &
Integration
Decision Support Environment
Operational Environment
Selection &
Integration
Data
Management
Data
Management

ThreeTier(Level)Architecture:
ThisisthemostwidelyusedArchitectureofDataWarehouse.
ItconsistsoftheTop,MiddleandBottomTier.
a)BottomTier:
ItisthedatabaseoftheDatawarehouseservers.
Itisusuallyarelationaldatabasesystem.
Dataiscleansed,transformed,andloadedintothislayerusingback-
endtools.
Itincludes–
i)DataExtraction–getdatafrommultiple,heterogeneous,andexternal
sources.
ii)Datacleaning-detecterrorsinthedataandrectifythemwhen
possible.
iii)Datatransformation-convertdatafromlegacyorhostformatto
warehouseformat.
iv)Load-sort,summarize,consolidate,computeviews,checkintegrity,
andbuildindicesandpartitions.
v)Refresh-propagatetheupdatesfromthedatasourcestothewarehouse.

b)MiddleTier:
Itistheapplicationlayergivinganabstractedviewofthe
database.
ThemiddletierinDatawarehouseisanOLAPserver
whichisimplementedusingeitherROLAPorMOLAP
model.
Thislayeralsoactsasamediatorbetweentheend-userand
thedatabase.
Itarrangesthedatatomakeitmoresuitableforanalysis.
c)Top-Tier:
Itrepresentthefront-endclientlayer.
Itiswheretheuseraccessesandinteractswiththedata.
ItcouldbeQuerytools,reportingtools,managedquery
tools,AnalysistoolsandDataminingtools.

Datawarehousemodels:
1.EnterpriseWarehouseModels:
“Anenterprisewarehousemodelscollectsallofthe
informationaboutsubjectsspanningtheentireorganization”.
Itprovidescorporate-widedataintegration,usuallyoneor
moreoperationalsystemsorexternalinformationproviders
andiscross-functionalscope.
Itconsistsofsummarizeddataanditsrangeinsizefrom
gigabytes,terabytesorbeyond.
Itisimplementedontraditionalmainframes,computersuper
servers,parallelarchitectureplatforms.
Ittakesyearstodesignandbuild.
Data Warehouse Models

2.DataMart:
“Adatamartcontainsasubsetofcorporate-widedatathat
isofvaluetoaspecificgroupofusers”.
(E.g.)Marketingdatamartcontainsdetailsaboutcustomer,item
andsales.
Itisimplementedonlow–costdepartmentalservers(i.e.)
Unix/Linux.
Itinvolvecomplexintegrationinlongrun.
Itiscategorizedintotwosourceofdata–
i)Independentdatamarts–datacapturedfromoneormore
operationalsystemsorexternalinformationorgeographicarea.
ii)Dependentdatamarts–itissourceddirectlyfromenterprise
datawarehouses.
Data Warehouse Models

3.VirtualWarehouseModels:
“AVirtualwarehousemodelsisasetofviewsover
operationaldatabases”.
Itiseasytobuildbutrequiresexcesscapacityon
operationaldatabaseservers.
a)Top-downapproach–itisasystematicsolutionand
minimizesintegrationproblems.
Itisexpensive,takeslongtimetodevelop,lacks
flexibilityduetodifficultyinachievingconsistency.
b)Bottom-downapproach–itisusedtodesign,
developmentanddeploymentofindependentdatamarts
providesflexibility,lowcostandrapidreturnof
investment.
Data Warehouse Models

Recommended approach for data
warehouse development
Define a high –level corporate data model
Data
mart
Data
mart
Distributed
data marts
Multi tier
data
warehouse
Enterprise
data
warehouse
Model RefinementModel Refinement

Metadata:
“Metadataaredataaboutdata.Itrepresentswarehouseobjects
whenusedindatawarehouse.Itiscreatedforthedatanamesand
definitionsofgivendatawarehouse”.
Itiscreatedandcapturedforanyextracteddata,sourceofextracted
dataanditisaddedbycleaningorintegrationprocess.
Metadatarepositoryareplacedinthebottomofthedatawarehouse
architecture.
Itisusedasdirectorytohelpdecisionsupportsystemanalystlocate
thecontentsofdatawarehouse.
Itserveasaguidetoalgorithmsusedforsummarization.
Metadatashouldbestoredandmanagedpersistently.
Ametadatarepositoryconsistsof–
a)Datawarehousestructure–itincludesdatawarehouseschema,
view,dimensions,hierarchiesandderiveddatadefinitionsaswellas
datamartlocationsandcontents.
Meta Data Models

b)Operationalmetadata–itincludedatalineage(i.e.)historyof
migrateddataandsequenceoftransformations),currencyofdata
andmonitoringinformation(errorreportsandaudittrails).
c)Algorithmsusedforsummarization–includesmeasureand
dimensiondefinitionalgorithms,dataonpartitions,subjectareas,
aggregation,summarizationandpredefinedqueriesandreports.
d)Mappingfromoperationalenvironmenttodatawarehouse–
includesdatabasesandtheircontents,descriptions,data
partitions,dataextraction,cleaningandsecurity.
e)Datarelatedtosystemperformance–includesindicesand
profilesthatimprovedataaccessandretrievalperformance,
schedulingofupdateandreplicationcycles.
f)Businessmetadata–itincludesbusinesstermsanddefinitions,
dataownershipinformationandchargingpolicies.
Meta Data Models

DatawarehouseModeling:
DatawarehouseandOLAPtoolsarebasedona
multidimensionaldatamodel.
Thismodelviewsintheformofadatacube.
SchemasforMultidimensionalModel:
“Adatacubeallowsdatatobemodeledandviewedin
multipledimensions.Itisdefinedbydimensionsand
facts”.
*Dimensions–itisperspectivesorentitieswithrespectto
whichanorganizationwantstokeeprecords.
(E.g.)Salesdatawarehousehasdimensionsliketime,item,
branchandlocation.
Itisspecifiedbyusersorexpertsandadjustdata.
Data Warehouse Modelling

*Facts–itisnumericmeasuresandanalyzetherelationships
betweendimensions.
(E.g.)Salesdatawarehouseincludesalesamountintermsof
dollarssold,unitssold.
Thedatacubeisametaphorformultidimensionaldatastorage.
Theactualstorageofsuchdatamaydifferfromitslogical
representation.
Thedataatdifferentdegreesofsummarizationisreferredas
Cuboid.
Thecuboidthatholdsthelowestlevelofsummarizationis
calledthebasecuboid.
Thecuboidthatholdsthehighestlevelofsummarizationis
calledtheapexcuboid.
Data Warehouse Modelling

Stars,SnowflakesandFactConstellations:Schemasfor
MultidimensionalDatamodels
ERdatamodeliscommonlyusedindesignofrelational
databases,wheredatabaseschemaconsistsofasetofentitiesand
therelationshipsbetweenthem.
Themostpopulardatamodelforadatawarehouseis
multidimensionaldatathatconsistsofStarschema,Snowflake
schemaandFactconstellationschema.
a)StarSchema:
Itestablishestherelationshipsbetweenthefacttableandanyone
ofthedimensiontables.
Ithassinglefacttableconnectedtodimensiontableslikeastar.
Ithasonefacttableandisassociatedwithnumerousdimensions
tableanddepictsastar.
Schemas

(E.g.)StarSchemadiagramforsalesdataofacompanywith
respecttofourdimensions:time,item,branch&location.
Eachdimensionrepresentedwithonlyonedimensiontable.
Dimensiontablecontainssetofattributes.
Facttableatcentre&itcontainskeytoeachoffour
dimensions.
Facttablecontainsattributeslike,dollarssold&unitssold.
Schemas

b)SnowflakeSchema:
Itistheextensionofstarschema.
Largedimensiontablesarenormalizedintomultiple
tables.
Thenormalizationsplitupthedataintoadditionaltables.
Thereisrelationshipbetweenthedimensionstables&it
hastodojoinstofetchthedata.
Indiagram,itemdimensiontableisnormalizedandsplitinto
twodimensiontablesnamely,itemandsuppliertable.
Itemdimensiontablecontainsattributesitem_key,
item_name,type,brandandsupplier_key.
Supplierdimensiontablecontainstheattributes
supplier_keyandsupplier_table.
Schemas

(E.g.)SnowflakeSchemadiagramforsalesdataofacompany:
Schemas

c)Factconstellation:
Factconstellationisameasureofonlineanalyticalprocessing,
whichiscollectionmultiplefacttablessharingdimension
tables,viewedasacollectionofstars.
Itconsistsofmultiplefacttables&thisschemaisalsoknown
asgalaxyschema.
Infollowingdiagram,itconsistsoftwofacttables,salesand
shipping.
Salefacttableissameasstarschema.
Shippingfacttablehasfivedimensionsnamely,item_key,
time_key,shipper_key,location.
Itcontainstwomeasuresnamelydollarssoldandunitssold.
Itispossiblefordimensiontabletosharebetweenfacttables.
Schemas

(E.g.)FactConstellationSchemadiagramforsalesdataofa
company:
Schemas

Difference between Star, Snowflake & Fact
Constellation Schema
Star Schema
Snowflake
Schema Fact Schema
Elements
Single Fact Table
connected to
multiple dimension
tables with no sub-
dimension tables
Single Fact Table
connects to
multiple dimension
tables that connects
to multiple sub-
dimension tables
Multiple Fact
Tables connects to
multiple dimension
tables that connects
to multiple sub-
dimension tables
NormalizationDenormalized Normalized Normalized
Number of
Dimensions
Multiple dimension
tables map to a
single Fact Table
Multiple dimension
tables map to
multiple dimension
tables
Multiple dimension
tables map to
multiple Fact
Tables
Data
Redundancy High Low Low

Difference between Star, Snowflake & Fact
Schema
Star Schema
Snowflake
Schema Fact Schema
Performance
Fewer foreign keys
resulting in
increased
performance
Decreased
performance
compared to Star
Schema from
higher number of
foreign keys
Decreased
performance
compared to Star
and Snowflake.
Used for complex
data aggregation.
Complexity
Simple, designed to
be easy to
understand
More complicated
compared to Star
Schema –can be
more challenging
to understand
Most complicated
to understand.
Reserved for highly
complex data
structures

Difference between Star, Snowflake & Fact
Schema
Star Schema
Snowflake
Schema Fact Schema
Storage
Usage
Higher disk space
due to data
redundancy
Lower disk space
due to limited data
redundancy
Low disk space
usage compared to
the level of
sophistication due
to the limited data
redundancy
Design
Limitations
One Fact Table
only, no sub-
dimensions
One Fact Table
only, multiple sub-
dimensions are
permitted
Multiple Fact
Tables permitted,
only first level
dimensions are
permitted

OperationalDatabasesystems:
Operationalsystemassistacompanyororganizationinitsday-to-
daybusiness.
Itsapplicationsanddataarehighlystructureandprovideimmediate
focusonbusinessfunctionswiththehelpofOLTP.
Itrequiredtosupportalargenumberoftransactiononadaily
basis.
Operationaldatastoressmall,focusingthedatabaseonspecific
businessareaandeliminatingdatabaseoverheadinareassuchas
indexes.
DataWarehousesystem:
Thissystemisorganizedaroundthetrendsorpatternsinthose
eventssetbyoperationalsystems.
Datawarehousefocusonbusinessneedsandrequirements.
Itdevelopideasforchangingthebusinessrulestomakethese
eventsmoreeffective.
Operational Data base systems and Data
warehouse

Themultiplepurposesare–
Itminimisestheimpactofreportingandcomplexquery
processingonoperationalsystems.
Itpreservesoperationaldataforreuseafterthatdatahas
beenpurgedfromtheoperationalsystems.
Itmanagesthedatabasedontime,allowingtheuserto
lookbackandseehowthecompanylookedinthepast
versusthepresent.
Itprovidesadatastorethatcanbemodifiedtoconformto
thewaytheusersviewthedata.
Itunifiesthedatawithinacommonbusinessdefinition,
offeringoneversionofreality.
Operational Data base systems and Data
warehouse

Difference between Operational Data base
systems and Data warehouse
DifferenceOperational SystemData Warehouse
system
Size &
Content
Small Large
PerformanceSpeed in natureSlow, as therequest
of data is based on
query
Content focusSmall work areasCross functional
areas
Tools Typicalstructure and
deals with less no. of
tools
Varioustools and
supports the types of
data