Data mining and data warehousing notes
tinamaheswariktm2004
54 slides
Jan 03, 2025
What is Data and Information?
Data is an individual unit that contains raw material which does not carry any specific meaning.
Information is a group of data that collectively carries a logical meaning. Data doesn't depend on information.
Information depends on data. Data is measured in bits and bytes.
Information is measured in meaningful units like time, quantity, etc.
Data Warehouse
A data warehouse is like a relational database designed for analytical needs. It functions on the basis of OLAP (Online Analytical Processing). It is a central location where consolidated data from multiple locations (databases) is stored.
What is Data Warehousing?
Data warehousing is the act of organizing and storing data in a way that makes its retrieval efficient and insightful. It is also described as the process of transforming data into information.
Fig: Data warehousing process
Data Warehouse Characteristics
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.
Subject-Oriented:
A data warehouse can be used to analyze a particular subject area. Ex: “Sales” can be a particular subject.
Integrated:
A data warehouse integrates data from multiple data sources.
Time-Variant:
Historical data is kept in a data warehouse.
Ex: One can retrieve data from 3 months, 6 months, 12 months, or even earlier from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept.
Non-Volatile:
Once data is in the data warehouse, it will not change. So historical data in a data warehouse should never be altered.
Data Warehouse Architecture
The architecture of a data warehouse typically involves the data sources plus three main tiers: the bottom tier (data warehouse and data marts), the middle tier (OLAP server), and the top tier (front-end tools). Each layer plays a crucial role in the overall system.
1. Data Sources
● Purpose: This is where the raw data originates.
● Sources:
○ Operational databases (e.g., ERP systems, CRM systems).
○ External sources (e.g., web data, third-party data feeds).
● Processes:
○ Data is extracted from these sources and prepared for integration into the data warehouse.
2. Bottom Tier: Data Warehouse
● Definition: A centralized repository that stores integrated and processed data from multiple sources.
● Components:
○ ETL Process (Extract, Transform, Load):
■ Extract: Pulls raw data from source systems.
■ Transform: Converts data into a standardized format.
■ Load: Stores the transformed data in the warehouse.
○ Metadata Repository:
■ Stores information about the data (e.g., data definitions, mappings, lineage).
○ Monitor and Integrator:
■ Manages the data refresh and update processes.
● Data Marts:
○ Subsets of the data warehouse designed for specific business functions or departments (e.g., finance, sales).
○ Enable faster and more targeted analysis.
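The three ETL steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source systems, their field names, and the date formats are hypothetical, and the "warehouse" is a plain list rather than a real database.

```python
from datetime import datetime

# Hypothetical raw rows from two source systems with inconsistent formats.
crm_rows = [{"customer": " Alice ", "amount": "120.50", "date": "2024-01-15"}]
erp_rows = [{"customer": "BOB", "amount": "80", "date": "15/02/2024"}]


def extract():
    """Extract: pull raw rows from each source system."""
    return crm_rows + erp_rows


def parse_date(text):
    """Try each assumed source date format and return an ISO date string."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def transform(rows):
    """Transform: standardize names, numeric types, and date formats."""
    return [
        {
            "customer": row["customer"].strip().title(),
            "amount": float(row["amount"]),
            "date": parse_date(row["date"]),
        }
        for row in rows
    ]


def load(rows, warehouse):
    """Load: append the cleaned rows to the warehouse table (a list here)."""
    warehouse.extend(rows)
    return warehouse


warehouse = load(transform(extract()), [])
```

After the run, both rows share one format (trimmed title-case names, float amounts, ISO dates), which is the kind of consistency the Transform step is meant to provide.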
3. Middle Tier: OLAP Server
● Purpose: Enables efficient querying and analytical processing.
● Key Features:
○ Organizes data into multidimensional views (e.g., cubes) for analysis.
○ Supports operations such as slicing, dicing, drilling down, and rolling up.
● Functionality:
○ Serves preprocessed data to front-end tools for faster performance.
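As a rough sketch of two of the operations named above, the following Python treats a small list of fact rows as a cube with three dimensions. The data, the dimension names, and the helper functions `roll_up` and `slice_cube` are all hypothetical; a real OLAP server would precompute and index these aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales).
facts = [
    ("North", "Laptop", "Q1", 100),
    ("North", "Phone", "Q1", 150),
    ("South", "Laptop", "Q1", 80),
    ("North", "Laptop", "Q2", 120),
    ("South", "Phone", "Q2", 90),
]
DIMENSIONS = ("region", "product", "quarter")


def roll_up(rows, keep):
    """Roll up: total the sales over every dimension not named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


def slice_cube(rows, dimension, value):
    """Slice: fix one dimension to a single value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in rows if row[i] == value]


by_region = roll_up(facts, keep=("region",))
q1_by_product = roll_up(slice_cube(facts, "quarter", "Q1"), keep=("product",))
```

Drilling down is the reverse of rolling up: calling `roll_up` with more dimensions in `keep` (for example region and quarter) returns finer-grained totals.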
4. Top Tier: Front-End Tools
● Purpose: Provides users with interfaces to interact with the data.
● Tools:
○ Analysis Tools: For in-depth exploration and discovery.
○ Query Tools: For running specific queries.
○ Reporting Tools: For generating summaries and dashboards.
○ Data Mining Tools: For uncovering patterns and trends using algorithms.
● Users: Business analysts, decision-makers, and data scientists rely on these tools for insights.
Key Processes
1. ETL (Extract, Transform, Load):
○ Critical for maintaining high-quality data in the warehouse.
2. OLAP (Online Analytical Processing):
○ Supports fast, interactive, multidimensional data analysis.
3. Data Refreshing:
○ Keeps the data warehouse updated with the latest changes from source systems.
Importance of Data Warehousing for the Process of Data Mining
Data warehousing plays a critical role in the process of data mining by providing a well-organized and efficient environment for analyzing large amounts of data. Here’s why it is essential:
1. Centralized Data Repository
● A data warehouse integrates data from multiple sources (e.g., operational databases, external feeds) into a single location.
● This ensures consistency and provides a unified dataset for mining patterns, trends, and insights.
2. Data Quality and Cleanliness
● Before data is loaded into a warehouse, it goes through the ETL process (Extract, Transform, Load), which ensures:
○ Removal of errors and duplicates.
○ Standardization of formats (e.g., date formats, unit measures).
● Clean and reliable data is crucial for accurate mining results.
3. Historical Data Availability
● Data warehouses store historical data over long periods.
● This enables mining algorithms to uncover patterns and trends over time, such as customer purchase behaviors or seasonal trends.
4. Support for Multidimensional Analysis
● Data in warehouses is often organized into OLAP cubes, enabling multidimensional views for analysis.
● This makes it easier for data mining tools to explore relationships and correlations between variables (e.g., sales by region, product, and time).
5. High-Performance Querying
● A data warehouse is designed for read-intensive operations, unlike transactional databases.
● This means data mining tools can efficiently query large volumes of data without slowing down operational systems.
6. Scalability for Big Data
● Data warehouses are built to handle large datasets, making them ideal for mining large and complex data.
● As data grows, warehouses can scale to accommodate it, ensuring uninterrupted mining processes.
7. Better Decision-Making
● The results of data mining, such as patterns and predictions, are only as good as the data they analyze.
● With accurate, comprehensive, and well-organized data from the warehouse, businesses can trust the outcomes of data mining.
Example:
In retail, a data warehouse might store sales, inventory, and customer data from multiple stores over several years. Data mining can then analyze this data to:
● Identify buying patterns.
● Predict future sales trends.
● Suggest products to upsell or cross-sell.
Introduction to Data Mining: Kinds of Data
In traditional data warehousing and data mining, the types of data play a critical role in determining the techniques and tools used for analysis.
1. Structured Data
● Definition: Data organized into rows and columns, typically stored in relational databases.
● Examples:
○ Sales records (e.g., Product ID, Quantity, Price).
○ Customer details (e.g., Name, Age, Address).
● Importance in Data Mining:
○ Easy to analyze using SQL and data mining techniques like clustering, classification, and association rule mining.
2. Semi-Structured Data
● Definition: Data that does not fit into a strict tabular format but has some organizational properties.
● Examples:
○ XML and JSON files.
○ Web server logs or social media posts.
● Importance in Data Mining:
○ Used for extracting and processing data patterns where structure is inconsistent.
3. Unstructured Data
● Definition: Data without a predefined format or organization.
● Examples:
○ Text data (emails, reports).
○ Multimedia (images, videos, audio files).
● Importance in Data Mining:
○ Requires specialized techniques like natural language processing (NLP), image processing, and video analysis.
4. Transactional Data
● Definition: Data generated from daily business transactions or activities.
● Examples:
○ Online purchases (e.g., Amazon orders).
○ ATM or bank transactions.
● Importance in Data Mining:
○ Useful for finding patterns (e.g., frequent itemsets in market basket analysis).
○ Helps detect fraud or unusual activities.
5. Temporal Data
● Definition: Data that is time-dependent or associated with a time dimension.
● Examples:
○ Stock market prices over time.
○ Weather data logs.
● Importance in Data Mining:
○ Time-series analysis is used to uncover trends and patterns and to make predictions (e.g., forecasting sales).
6. Spatial Data
● Definition: Data that contains geographical or spatial information.
● Examples:
○ GPS data from mobile devices.
○ Land use maps or satellite imagery.
● Importance in Data Mining:
○ Used for location-based analysis, urban planning, and geographic pattern discovery.
7. Sequential Data
● Definition: Data in which the order of elements is important.
● Examples:
○ Clickstream data (e.g., website navigation paths).
○ Biological data (e.g., DNA sequences).
● Importance in Data Mining:
○ Sequence mining techniques are used to discover patterns like customer behavior or gene structures.
8. Multimedia Data
● Definition: Data that includes images, audio, video, or combinations of these formats.
● Examples:
○ Medical images like X-rays or MRIs.
○ Videos from surveillance systems.
● Importance in Data Mining:
○ Requires advanced techniques like deep learning, audio-video recognition, and content-based retrieval.
9. Metadata
● Definition: Data about data, which describes other datasets.
● Examples:
○ File properties (e.g., size, type, creation date).
○ Social media tags (e.g., hashtags, geotags).
● Importance in Data Mining:
○ Helps organize, retrieve, and understand the content or structure of datasets.
How Kinds of Data Fit into Traditional Data Warehousing
● Structured data is the primary focus of data warehousing, stored in tables for efficient querying and analysis.
● Semi-structured and unstructured data are increasingly integrated into warehouses using modern tools for advanced analysis.
● Historical and transactional data are stored in data warehouses to enable pattern and trend discovery over time.
Key Patterns Discovered Through Data Mining
Data mining involves analyzing large datasets to uncover meaningful patterns and insights. These patterns are critical for making informed decisions in various industries such as healthcare, finance, retail, and more.
1. Association Patterns
● Definition: Identifies relationships between variables in a dataset.
● Examples:
○ Market Basket Analysis: Discovering that "customers who buy bread often buy butter."
○ Online shopping recommendations: "People who purchased a smartphone often buy a case."
● Use Cases: Retail and e-commerce for cross-selling and upselling products.
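A rule such as "bread implies butter" is usually scored with support and confidence. The sketch below computes both for a handful of hypothetical baskets; the transactions and function names are illustrative only, and a full algorithm such as Apriori would additionally search for the frequent itemsets rather than test one given rule.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
```

Here three of the five baskets contain both items, and three of the four baskets with bread also contain butter, so the rule has support 3/5 and confidence 3/4.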
2. Classification Patterns
● Definition: Assigns data into predefined categories or classes.
● Examples:
○ Predicting whether a loan applicant is "high-risk" or "low-risk."
○ Classifying emails as "spam" or "not spam."
● Use Cases: Fraud detection, customer segmentation, and medical diagnosis.
3. Clustering Patterns
● Definition: Groups similar data points together without predefined categories.
● Examples:
○ Identifying customer segments based on purchasing behavior.
○ Grouping patients with similar medical histories or symptoms.
● Use Cases: Customer profiling, market segmentation, and image analysis.
4. Sequential Patterns
● Definition: Identifies recurring sequences or patterns in data over time.
● Examples:
○ Analyzing shopping behavior: "Customers who buy smartphones often purchase accessories within a week."
○ Analyzing website navigation paths to optimize user experience.
● Use Cases: Web usage mining, recommendation systems, and biological sequence analysis.
5. Prediction Patterns
● Definition: Forecasts future trends based on historical data.
● Examples:
○ Predicting future sales based on past trends.
○ Anticipating customer churn in subscription services.
● Use Cases: Sales forecasting, financial market predictions, and weather forecasting.
6. Outlier Detection
● Definition: Identifies unusual or anomalous data points that differ significantly from the rest of the dataset.
● Examples:
○ Detecting fraudulent credit card transactions.
○ Identifying defective products in manufacturing.
● Use Cases: Fraud detection, quality control, and cybersecurity.
7. Time-Series Patterns
● Definition: Uncovers trends, seasonal variations, and recurring patterns in time-ordered data.
● Examples:
○ Tracking stock market trends over time.
○ Analyzing electricity usage patterns during peak and off-peak hours.
● Use Cases: Energy consumption analysis, trend forecasting, and inventory management.
8. Correlation Patterns
● Definition: Identifies relationships or dependencies between variables.
● Examples:
○ Finding a correlation between weather conditions and ice cream sales.
○ Discovering how advertising spending affects product sales.
● Use Cases: Business strategy planning and understanding customer behavior.
9. Summarization Patterns
● Definition: Provides a compact and concise representation of data for better understanding.
● Examples:
○ Summarizing sales data into average daily revenue.
○ Summarizing customer demographics for a region.
● Use Cases: Generating executive-level reports and dashboards.
10. Behavior Patterns
● Definition: Discovers typical behavior trends of individuals or groups.
● Examples:
○ Tracking customers' purchase behavior over time.
○ Identifying usage patterns of an app or website.
● Use Cases: Personalization in marketing and improving user experience.
Technologies and Applications for Data Mining in Data Warehousing
1. Data Warehousing Tools
● Purpose: Store, organize, and manage large amounts of data for easy access and analysis.
● Examples:
○ ETL (Extract, Transform, Load) Tools: Talend, Informatica, Microsoft SSIS.
○ Data Warehouse Platforms: Amazon Redshift, Snowflake, Google BigQuery.
● Role in Data Mining: Provides clean, integrated, and historical data for mining processes.
2. OLAP (Online Analytical Processing)
● Purpose: Supports multidimensional analysis of data from different perspectives.
● Examples:
○ Pivot tables, slice-and-dice, drill-down, and roll-up operations.
● Role in Data Mining: Helps discover trends and patterns by summarizing large datasets.
3. Machine Learning Algorithms
● Purpose: Automates the discovery of patterns and insights.
● Examples:
○ Classification algorithms (e.g., Decision Trees, Naïve Bayes).
○ Clustering algorithms (e.g., K-Means, Hierarchical Clustering).
○ Association rule mining (e.g., Apriori, FP-Growth).
● Role in Data Mining: Facilitates prediction, clustering, and rule discovery.
4. Data Visualization Tools
● Purpose: Present mining results in an intuitive, visual format.
● Examples:
○ Tableau, Power BI, QlikView.
● Role in Data Mining: Helps interpret insights effectively through dashboards, charts, and graphs.
5. Big Data Technologies
● Purpose: Process and analyze massive datasets that traditional tools cannot handle.
● Examples:
○ Apache Hadoop, Apache Spark.
● Role in Data Mining: Enables mining of large-scale data in real-time or batch processes.
6. SQL and Query Tools
● Purpose: Extract and query data for mining.
● Examples:
○ MySQL, PostgreSQL, Oracle SQL.
● Role in Data Mining: Provides access to data and enables pre-processing before applying mining techniques.
7. Artificial Intelligence (AI) and Deep Learning
● Purpose: Extract complex patterns and predictions from large datasets.
● Examples:
○ Neural networks, NLP techniques, reinforcement learning.
● Role in Data Mining: Enhances mining accuracy and handles unstructured data like text and images.
Applications of Data Mining in Data Warehousing
1. Retail and E-commerce
● Uses:
○ Market Basket Analysis: Finding products often purchased together (e.g., bread and butter).
○ Customer Segmentation: Grouping customers based on buying habits.
○ Recommendation Systems: Personalized product suggestions.
● Examples:
○ Amazon's "Customers also bought" feature.
2. Banking and Finance
● Uses:
○ Fraud Detection: Identifying unusual transaction patterns.
○ Credit Scoring: Predicting loan defaults based on customer profiles.
○ Risk Management: Forecasting financial risks using historical data.
● Examples:
○ Detecting credit card fraud using clustering and anomaly detection.
3. Healthcare
● Uses:
○ Disease Diagnosis: Classifying patients based on symptoms and medical history.
○ Treatment Optimization: Analyzing patient outcomes to recommend effective treatments.
○ Health Risk Prediction: Predicting chronic conditions based on lifestyle data.
● Examples:
○ Predicting diabetes risk using patient data.
4. Telecommunications
● Uses:
○ Churn Prediction: Identifying customers likely to switch providers.
○ Network Optimization: Analyzing network performance data to improve quality.
○ Usage Patterns: Understanding customer usage for targeted marketing.
● Examples:
○ Telecom companies using clustering for customer segmentation.
5. Manufacturing
● Uses:
○ Quality Control: Detecting defective products.
○ Demand Forecasting: Predicting future product demand using sales data.
○ Process Optimization: Identifying inefficiencies in production workflows.
● Examples:
○ Predictive maintenance using sensor data.
6. Education
● Uses:
○ Student Performance Analysis: Predicting student success or failure.
○ Personalized Learning: Tailoring learning resources based on student behavior.
○ Dropout Prediction: Identifying at-risk students.
● Examples:
○ E-learning platforms analyzing user data to suggest courses.
7. Transportation and Logistics
● Uses:
○ Route Optimization: Finding the most efficient delivery routes.
○ Traffic Management: Predicting and managing congestion.
○ Demand Forecasting: Predicting passenger flow for better resource allocation.
● Examples:
○ Ride-hailing services like Uber using real-time data for dynamic pricing.
8. Government and Public Services
● Uses:
○ Crime Analysis: Identifying patterns to prevent crimes.
○ Tax Fraud Detection: Analyzing tax return anomalies.
○ Social Program Efficiency: Evaluating the impact of public initiatives.
● Examples:
○ Predictive policing using historical crime data.
Major Issues in Data Mining: Getting to Know Your Data – Data Objects and Attribute Types
Understanding your data is a critical step in the data mining process. To extract meaningful patterns, it is essential to comprehend the structure and types of data objects and attributes.
1. Data Objects
● Definition: Data objects are entities about which data is collected, stored, and analyzed. They represent rows or records in a dataset.
● Examples:
○ In a sales dataset, each row could represent a customer or a transaction.
○ In a student database, each row might represent an individual student.
Key Characteristics of Data Objects:
● Attributes: The properties or characteristics of a data object (e.g., age, income, product purchased).
● Relationship with Attributes: A data object is described using one or more attributes.
2. Attributes
Attributes (also called variables or features) define the properties of a data object. They are organized into different types, which influence how the data is analyzed.
Types of Attributes:
1. Nominal (Categorical) Attributes:
○ Definition: Represents categories or labels with no inherent order.
○ Examples:
■ Gender (Male, Female, Non-Binary).
■ Product categories (Electronics, Clothing, Furniture).
○ Characteristics:
■ Operations: Equality comparison (e.g., "Is A equal to B?").
■ No mathematical computation (e.g., no "greater than" or "less than").
2. Ordinal Attributes:
○ Definition: Represents categories with a meaningful order or rank, but the differences between ranks are not defined.
○ Examples:
■ Customer satisfaction (Low, Medium, High).
■ Educational qualifications (High School, Bachelor's, Master's, PhD).
○ Characteristics:
■ Operations: Equality and order comparisons (e.g., "Is A greater than B?").
■ Differences between ranks are not quantifiable.
3. Interval Attributes:
○ Definition: Represents numeric values where differences are meaningful, but there is no true zero point.
○ Examples:
■ Temperature (in Celsius or Fahrenheit).
■ Calendar dates (e.g., 2000, 2020).
○ Characteristics:
■ Operations: Addition, subtraction, comparison.
■ No true zero (e.g., 0°C does not mean "no temperature").
4. Ratio Attributes:
○ Definition: Represents numeric values with meaningful differences and a true zero point.
○ Examples:
■ Age, income, weight, height.
■ Sales revenue or number of units sold.
○ Characteristics:
■ Operations: Addition, subtraction, multiplication, and division.
■ True zero allows for ratios (e.g., "Twice as much").
3. Major Issues in Understanding Data Objects and Attributes
While working with data, the following challenges often arise:
a. Missing Data:
● Problem: Some attributes may have missing values.
● Impact: Can distort analysis or lead to incorrect results.
● Solution: Use techniques like imputation, deletion, or prediction to handle missing values.
b. Noisy Data:
● Problem: Data contains errors, outliers, or inconsistencies.
● Impact: Noise can obscure real patterns and introduce bias.
● Solution: Apply data cleaning techniques like smoothing or outlier detection.
c. Data Diversity:
● Problem: Data often comes in different formats (text, images, numeric values) and types (nominal, ordinal, interval, ratio).
● Impact: Each type requires specific analysis techniques.
● Solution: Preprocess and transform data to make it compatible with mining methods.
d. High Dimensionality:
● Problem: Datasets with too many attributes (features) make analysis complex.
● Impact: Increases computation time and can reduce model accuracy (the curse of dimensionality).
● Solution: Use dimensionality reduction techniques like PCA or feature selection.
e. Data Redundancy:
● Problem: Repetitive or duplicate attributes can inflate dataset size unnecessarily.
● Impact: Leads to inefficiencies in storage and processing.
● Solution: Remove or combine redundant attributes through correlation analysis.
f. Scalability:
● Problem: Large datasets require high computational power.
● Impact: Mining large datasets can be time-consuming and resource-intensive.
● Solution: Use distributed computing frameworks like Hadoop or Spark.
4. Importance of Understanding Data Objects and Attributes
● Choosing the right data mining technique depends on the type of data and attributes.
● Proper handling of data types ensures meaningful analysis, accurate results, and reliable decision-making.
Statistical Descriptions of Data
Statistical descriptions of data help summarize and understand the characteristics of datasets. These techniques provide insights into the distribution, central tendency, spread, and relationships within data, which are essential for data analysis and mining.
1. Types of Statistical Descriptions
a. Descriptive Statistics
Descriptive statistics summarize and describe the features of a dataset. These are divided into:
1. Measures of Central Tendency: Indicate the center of the data.
2. Measures of Dispersion: Show how data points spread around the center.
2. Measures of Central Tendency
2. Median:
● Definition: The middle value when the data is ordered. With an even number of values, it is the average of the two middle values.
● Example: For {4, 6, 8, 10}, the median = (6 + 8) / 2 = 7.
For {3, 5, 7}, the median = 5.
3. Mode:
● Definition: The most frequently occurring value in the dataset.
● Example: For {2, 2, 4, 6, 6, 6, 8}, the mode = 6.
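The median and mode examples above can be checked with Python's standard `statistics` module. This is a small sketch that reuses the example datasets from the text; the variable names are illustrative.

```python
import statistics

# Even number of values: the median is the average of the two middle values.
median_even = statistics.median([4, 6, 8, 10])

# Odd number of values: the median is the single middle value.
median_odd = statistics.median([3, 5, 7])

# The mode is the most frequently occurring value.
mode_value = statistics.mode([2, 2, 4, 6, 6, 6, 8])
```

The three results are 7, 5, and 6, matching the worked examples.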
3. Measures of Dispersion
1. Range:
○ Definition: The difference between the highest and lowest values.
○ Formula: Range = Max value − Min value
Example: For {3, 7, 10, 15}, Range = 15 − 3 = 12.
4. Interquartile Range (IQR):
○ Definition: The range of the middle 50% of the data.
○ Formula: IQR = Q3 − Q1
Where Q1 = lower quartile (25th percentile), Q3 = upper quartile (75th percentile).
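A brief sketch of both measures, using the range example from the text. Note that several conventions exist for computing quartiles on small samples, so the IQR from `statistics.quantiles` (which uses its default "exclusive" method here) may differ slightly from a hand calculation or from other tools.

```python
import statistics

data = [3, 7, 10, 15]

# Range: highest value minus lowest value.
value_range = max(data) - min(data)

# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median), and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

# IQR: spread of the middle 50% of the data.
iqr = q3 - q1
```

Passing `method="inclusive"` to `statistics.quantiles` selects the other built-in convention, which treats the data as the whole population rather than a sample.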
4. Shape of Data Distribution
1. Skewness:
○ Definition: Measures the asymmetry of a data distribution.
○ Types:
■ Positive Skew: Longer tail on the right (e.g., income data).
■ Negative Skew: Longer tail on the left.
■ Symmetrical: No skew (e.g., the bell-shaped normal distribution).
○ Example: In exam scores, a positive skew may indicate most students scored lower, with a few high scorers.
2. Kurtosis:
○ Definition: Measures the "tailedness" of data.
○ Types:
■ High Kurtosis: Data has heavy tails (outliers).
■ Low Kurtosis: Data has light tails.
5. Data Visualization for Statistical Description
Statistical summaries are often supported by visual tools:
1. Histograms: Show frequency distribution of data.
2. Box Plots: Visualize data spread, outliers, and quartiles.
3. Scatter Plots: Show relationships between two variables.
4. Bar Charts and Pie Charts: Represent categorical data.
6. Applications of Statistical Descriptions
● Understanding Data Characteristics: Identifies trends, patterns, and anomalies.
● Preparing Data for Mining: Summarizes data before applying algorithms.
● Decision Making: Helps make informed decisions in fields like healthcare, business, and education.
What methods are used to estimate data similarity and dissimilarity in data mining, and how do they aid in the mining process?
In data mining, similarity and dissimilarity measures are used to compare data objects or instances to determine how alike or different they are. These measures are essential for tasks like clustering, classification, and anomaly detection, where grouping similar data points or distinguishing between different ones is required.
1. Similarity and Dissimilarity
● Similarity: A measure of how alike two data objects are. It often ranges from 0 to 1, where 1 means the objects are identical, and 0 means they are completely different.
● Dissimilarity: A measure of how different two data objects are. It is often represented as a distance, with higher values indicating greater dissimilarity.
2. Methods for Estimating Data Similarity and Dissimilarity
i. Euclidean Distance (for Numerical Data)
● Definition: Euclidean distance is the straight-line distance between two points in a multi-dimensional space. It is one of the most commonly used measures for numerical data.
Fig: Euclidean Distance
Example: If you want to find how similar two products are based on their prices and sizes, you can calculate their Euclidean distance.
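For two points x and y with n attributes, the Euclidean distance is the square root of the sum of the squared differences (x_i − y_i)² over all attributes. A minimal sketch follows; the two product vectors are hypothetical (price, size) pairs chosen so the result is easy to verify by hand.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance: sqrt(sum((a_i - b_i) ** 2))."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical products described as (price, size).
product_a = (3.0, 4.0)
product_b = (0.0, 0.0)
distance = euclidean_distance(product_a, product_b)
```

In practice the attributes are usually rescaled first, because an attribute with a large numeric range (such as price) would otherwise dominate the distance.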
ii. Manhattan Distance (for Numerical Data)
● Definition: Also called City Block Distance, Manhattan distance calculates the sum of the absolute differences between the corresponding attributes of two data objects.
Fig: Manhattan Distance
Usage: Used when the data consists of numerical values, especially when the variables represent distances or paths in a grid-like structure.
Example: It can be useful in applications like pathfinding in logistics or grid-based problems, where movement is restricted to horizontal and vertical directions.
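Following the definition above, the Manhattan distance is the sum of |x_i − y_i| over all attributes. The sketch below uses hypothetical grid coordinates for a warehouse and a customer.

```python
def manhattan_distance(a, b):
    """City-block distance: sum(|a_i - b_i|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical grid positions (column, row).
warehouse_cell = (1, 2)
customer_cell = (4, 6)
blocks = manhattan_distance(warehouse_cell, customer_cell)
```

The result counts 3 blocks horizontally plus 4 vertically, which is longer than the straight-line (Euclidean) distance of 5 between the same two cells.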
iii. Cosine Similarity (for Text Data or High-Dimensional Data)
● Definition: Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. It is commonly used for text data represented as word vectors.
Fig: Cosine similarity
Usage: It is widely used in text mining and document similarity comparisons, such as comparing articles, books, or user preferences in recommendation systems.
Example: In a recommendation system, cosine similarity is used to measure how similar two users’ preferences are based on the items they have rated.
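Cosine similarity is the dot product of the two vectors divided by the product of their lengths: cos(θ) = (A · B) / (‖A‖ ‖B‖). A minimal sketch, with hypothetical rating vectors for two users over the same three items:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical ratings for the same three items (0 means not rated).
user_a = [1, 0, 1]
user_b = [1, 1, 0]
similarity = cosine_similarity(user_a, user_b)
```

Because only the angle matters, a user who rates everything twice as high as another (for example `[1, 2, 3]` and `[2, 4, 6]`) still gets a similarity of about 1.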
iv. Jaccard Similarity (for Categorical Data)
● Definition: Jaccard similarity is used for comparing two sets of categorical data and measures the ratio of the intersection over the union of the sets.
Usage: Useful when the data consists of binary or categorical variables, such as yes/no responses or the presence/absence of certain attributes.
Example: In market basket analysis, Jaccard similarity can be used to find how similar two customers' shopping baskets are based on the products they bought.
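In set notation, J(A, B) = |A ∩ B| / |A ∪ B|. The sketch below applies this to two hypothetical shopping baskets. Returning 1.0 for two empty sets is a convention chosen here for illustration, not something the definition fixes.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical baskets for two customers.
basket_1 = {"bread", "milk", "eggs"}
basket_2 = {"bread", "milk", "jam"}
similarity = jaccard_similarity(basket_1, basket_2)
```

The baskets share 2 products out of 4 distinct products overall, giving a similarity of 0.5.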
v. Hamming Distance (for Binary Data)
● Definition: Hamming distance counts the number of positions at which the corresponding values in two binary vectors differ.
Formula: It is simply the number of positions at which two binary strings of equal length differ.
Usage: Used for binary data, such as error detection in coding, or in matching boolean attributes.
Example: Hamming distance can be applied in comparing two DNA sequences of equal length or in error detection in transmitted data.
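A minimal sketch of the count described above. It works on any two equal-length sequences, so the same function handles bit strings and the DNA example; the sample strings are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


sent = "10110"
received = "10011"
bit_errors = hamming_distance(sent, received)

dna_mismatches = hamming_distance("GATTACA", "GACTATA")
```

Both pairs differ in exactly two positions.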
3. How These Measures Aid in the Mining Process
i. Clustering
● Similarity and dissimilarity measures are critical in clustering algorithms like k-means, hierarchical clustering, and DBSCAN. These algorithms group data objects into clusters based on how similar they are.
Example: In customer segmentation, similarity measures help group customers with similar purchasing behaviors into the same clusters, allowing companies to target them with personalized marketing.
ii. Classification
● Measures of similarity can be used to classify new data points by comparing them to existing labeled data in techniques like k-nearest neighbors (k-NN).
Example: In spam detection, the similarity of a new email to previously classified emails helps in determining whether it’s spam or not.
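To make the k-NN idea concrete, the sketch below labels a new point by majority vote among its k closest labeled points, using Euclidean distance. The two features (number of links, number of exclamation marks) and the training emails are hypothetical; real spam filters use far richer features.

```python
import math
from collections import Counter


def knn_classify(query, labeled_points, k=3):
    """Return the majority label among the k points nearest to `query`."""
    nearest = sorted(labeled_points, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical emails described as (number of links, number of exclamation marks).
training = [
    ((8, 6), "spam"),
    ((7, 9), "spam"),
    ((9, 7), "spam"),
    ((1, 0), "not spam"),
    ((0, 1), "not spam"),
    ((2, 1), "not spam"),
]

label_a = knn_classify((6, 7), training)
label_b = knn_classify((1, 1), training)
```

An odd value of k avoids tied votes when there are two classes; swapping `math.dist` for another measure (for example cosine similarity on word vectors) changes which neighbors count as "near".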
iii. Anomaly Detection
● Dissimilarity measures are used to detect anomalies or outliers in a dataset. Data objects that have significantly different measures compared to the rest of the dataset are flagged as anomalies.
Example: In fraud detection, transactions that are dissimilar from normal behavior patterns (e.g., unusual spending amounts or locations) can be flagged for further investigation.
iv. Recommender Systems
● Similarity measures are the foundation of recommendation systems that suggest products, movies, or books to users based on their previous preferences or behaviors.
● Example: Cosine similarity can be used to recommend movies to users based on how similar their preferences are to those of other users.
Role of Data Visualization in Data Mining
Data visualization is a crucial step in the data mining process. It helps to transform complex data into graphical formats that are easier to understand, interpret, and analyze. By representing data visually, patterns, trends, and relationships within the data become more apparent, which is essential for making informed decisions.
1. Understanding and Interpreting Data
● Simplifies Complex Data: Raw data can be difficult to interpret, especially with large datasets. Data visualization tools help present the data in a more digestible format by using charts, graphs, and plots.
○ Example: A scatter plot can quickly show the relationship between two variables, such as sales and advertising budget, making it easier to identify trends.
● Identifying Patterns and Trends: Visualization allows for immediate recognition of patterns, trends, and anomalies in the data. It makes the underlying structure of the data clear and accessible.
○ Example: A line graph of stock prices over time helps to visualize trends, such as upward or downward movements.
2. Enhancing Data Exploration
● Exploratory Data Analysis (EDA): During the early stages of data mining, visualizations support exploration of the data, allowing data scientists to test hypotheses and understand the structure of the dataset.
○ Example: Histograms can reveal the distribution of data points, helping analysts determine if data is normally distributed or skewed.
● Dimensionality Reduction: In datasets with many variables (high-dimensional data), techniques like principal component analysis (PCA) reduce the number of dimensions while retaining important features, so the data can be plotted and analyzed more easily.
○ Example: A 3D scatter plot can represent complex data with multiple variables in a reduced, more understandable form.
3. Detecting Outliers and Anomalies
● Outlier Detection: Data visualization is effective in detecting outliers, that is, data points that deviate significantly from other observations. These outliers can sometimes indicate errors or interesting insights.
○ Example: A box plot shows the interquartile range and highlights any data points that fall outside the "whiskers" as potential outliers.
● Data Quality Assessment: By visualizing data distributions, analysts can assess the quality of data and detect issues like missing values, inconsistencies, or errors.
○ Example: A heatmap of missing data can indicate patterns of missing values across different features.
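The box plot example above can also be expressed numerically. A common convention (Tukey's rule, assumed here because the text does not say how the whiskers are placed) puts the whisker limits at 1.5 × IQR beyond the quartiles and flags anything outside them. The transaction amounts and the function name are hypothetical.

```python
import statistics


def whisker_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical transaction amounts with one unusually large value.
amounts = [10, 12, 11, 13, 12, 11, 14, 95]
flagged = whisker_outliers(amounts)
```

Only the value 95 is flagged. Plotting libraries may compute quartiles with a slightly different convention, so borderline points can be classified differently than in this sketch.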
4. Facilitating Model Selection and Evaluation
● Model Comparison: Data visualization helps in comparing the performance of different models by visualizing evaluation metrics such as accuracy, precision, recall, or error rates.
○ Example: A ROC curve (Receiver Operating Characteristic curve) visualizes the performance of a classification model, allowing the selection of the best model.
● Visualizing Clusters: For clustering algorithms like K-means, visualization helps to assess how well the data has been clustered and whether the clusters make sense.
○ Example: A 2D or 3D plot can show clusters of data points, helping to determine if the clusters are well-separated or overlapping.
5. Communicating Results to Stakeholders
● Making Data Accessible: Data visualizations play a key role in communicating findings to stakeholders, especially non-technical audiences. Well-designed visualizations make it easier for decision-makers to understand the insights from data mining results.
○ Example: Dashboards with interactive visualizations allow executives to explore data in real-time and make decisions based on visual data analysis.
● Storytelling with Data: Data visualization aids in creating a narrative from the data. By combining visual elements like charts and graphs, analysts can tell a compelling story that conveys the insights effectively.
○ Example: A bar chart comparing sales before and after a marketing campaign can show the impact of the campaign clearly.
6. Tools for Data Visualization
There are several tools and software packages used in data mining for creating visualizations:
● Tableau: A powerful data visualization tool for creating interactive dashboards and reports.
● Power BI: Microsoft's business analytics tool for data visualization and sharing insights across organizations.
● Matplotlib and Seaborn (Python libraries): Used for creating static, animated, and interactive plots in Python.
● D3.js: A JavaScript library used to create interactive data visualizations for the web.
Data Preprocessing: Quality Data
Data preprocessing is an essential step in the data mining process. It involves preparing and cleaning data before it can be analyzed. The goal is to improve the quality of the data so that the results of data mining are accurate, reliable, and meaningful.
What is Quality Data?
Quality data refers to data that is accurate, complete, and consistent. It is data that can be trusted for analysis and decision-making. Poor-quality data can lead to misleading results and incorrect conclusions, which is why preprocessing is crucial. The main characteristics of quality data include:
1. Accuracy: Data should be correct and free from errors.
2. Completeness: All required data should be present, with no missing values.
3. Consistency: Data should be consistent across different sources and formats.
4. Timeliness: Data should be up-to-date and relevant for the analysis.
5. Relevance: Data should be directly related to the problem being solved.
6. Uniqueness: Data should be free from duplicates.
Common Data Quality Issues:
Before data can be used for analysis, it’s important to address several common issues that can affect data quality:
1. Missing Data:
○ Some data entries may be incomplete, with missing values for certain attributes (e.g., age or income).
○ Solution: Techniques like imputation (filling in missing values with the mean, median, or most frequent value) or deleting rows/columns with too many missing values can be applied.
2. Noise (Errors or Outliers):
○ Noise refers to random errors or anomalies in the data that do not represent true patterns (e.g., incorrect values or extreme outliers).
○ Solution: Data cleaning techniques, such as smoothing or outlier detection, help remove or correct noisy data.
3. Duplicate Data:
○ Sometimes, the same data is repeated multiple times (e.g., duplicate records of a customer).
○ Solution: Duplicate records can be identified and removed during preprocessing.
4. Inconsistent Data:
○ Data collected from different sources or formats may be inconsistent. For example, the same attribute might have different units (e.g., "kg" and "grams").
○ Solution: Data standardization or normalization can be applied to make the data consistent.
5. Irrelevant Data:
○ Data may contain unnecessary information that does not contribute to solving the problem.
○ Solution: Feature selection helps identify and keep only relevant data attributes for the analysis.
StepsinDataPreprocessingforQualityData
1.DataCleaning:
○Handlemissingdata,removeduplicates,andcorrecterrors.
○Example:Ifsomecustomerrecordshavemissingages,youcanfillin
thosemissingvalueswiththeaverageage.
2.DataTransformation:
○Standardizeornormalizedatatobringdifferentfeaturesintoasimilar
rangeorformat.
○Example:Ifyouhavedataforweightinkilogramsandheightin
centimeters,convertingbothtothesameunit(e.g.,kilogramsandmeters)
ensuresconsistency.
3.DataReduction:
○Reducethesizeofthedatasetbyremovingirrelevantorredundantdata.
○Example:Ifthedatasetcontainsafeaturelike"favoritecolor"thatdoesn’t
affecttheanalysis,itcanbedropped.
4.DataIntegration:
○Combinedatafromdifferentsourcesintoasingledataset,ensuring
consistencyandavoidingconflicts.
○Example:Integratingsalesdatafromdifferentregionsintoonedatasetfor
analysis.
Slide 27
Why is Quality Data Important?
● Accuracy of Results: High-quality data leads to more accurate and reliable data mining results.
● Better Decision-Making: Clean and well-prepared data helps businesses and organizations make better decisions.
● Improved Efficiency: When data is clean and well-organized, it is easier and faster to analyze.
Data Preprocessing: Data Cleaning
Data cleaning is a crucial step in data preprocessing. It involves fixing or removing incorrect, incomplete, or irrelevant data from a dataset to make it ready for analysis. Without proper data cleaning, any analysis or mining could lead to inaccurate or misleading results.
What is Data Cleaning?
Data cleaning is the process of identifying and correcting errors or inconsistencies in the data. This helps ensure the data is accurate, complete, and consistent, which is essential for making reliable conclusions and predictions from the data.
Common Data Cleaning Tasks
1. Handling Missing Data
○ Sometimes, certain values in a dataset are missing. This can happen if data was not recorded or if there were errors during data collection.
○ Ways to Handle Missing Data:
■ Remove missing data: If only a small portion of the dataset has missing values, you can remove those rows or columns.
■ Impute missing data: You can fill in missing values using estimates such as the mean, median, or the most common value.
■ Use algorithms that handle missing data: Some algorithms can handle missing data without needing to fill it in manually.
○ Example: If some people's ages are missing in a survey, you could fill in the missing ages with the average age from the rest of the data.
2. Removing Duplicates
○ Duplicate records occur when the same information appears multiple times in the dataset.
○ Solution: Identify and remove duplicate rows to prevent them from skewing the analysis.
○ Example: If a customer's information appears more than once, you should keep only one entry to avoid overcounting.
3. Fixing Inconsistent Data
Slide 28
○ Inconsistent data occurs when similar data is stored in different formats or units, making it difficult to analyze.
○ Solution: Standardize the data to ensure consistency across the dataset.
○ Example: If you have height data recorded both in centimeters and inches, you would convert all values to one unit (e.g., centimeters) for uniformity.
4. Correcting Errors
○ Sometimes, data contains errors due to mistakes made during data entry (e.g., typing mistakes or incorrect values).
○ Solution: Correct these errors by checking against reliable sources or applying logical rules to detect out-of-range or impossible values.
○ Example: If someone's age is recorded as 200 years, this is clearly an error and should be corrected.
5. Dealing with Outliers
○ Outliers are data points that are significantly different from other values. They can distort analysis if not handled properly.
○ Solution: Identify outliers and decide whether to remove or adjust them based on their impact on the analysis.
○ Example: If you're analyzing income data and find one entry with an income of $1 million when most incomes are under $50,000, you may choose to remove or adjust that data point.
6. Handling Noise
○ Noise refers to random errors or variations that don't reflect the true data pattern. It can be caused by incorrect measurement or other random factors.
○ Solution: Use techniques like smoothing or filtering to reduce noise in the data.
○ Example: If sensor data from a machine is fluctuating wildly without any real pattern, smoothing the data helps remove these random fluctuations.
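The cleaning tasks above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed method: the records, the 0 to 120 valid age range, and the "more than 3 times the median" outlier rule are all assumptions made for the example.

```python
from statistics import mean, median

# Hypothetical survey records; None marks a missing value.
records = [
    {"id": 1, "age": 25, "income": 42000},
    {"id": 2, "age": None, "income": 38000},
    {"id": 3, "age": 200, "income": 45000},   # impossible age (data entry error)
    {"id": 3, "age": 200, "income": 45000},   # exact duplicate of the previous row
    {"id": 4, "age": 31, "income": 1000000},  # possible outlier
    {"id": 5, "age": 40, "income": 47000},
]

def remove_duplicates(rows):
    """Keep only the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def blank_impossible_ages(rows, low=0, high=120):
    """Treat out-of-range ages as missing (the valid range is an assumption)."""
    return [
        {**r, "age": r["age"] if r["age"] is not None and low <= r["age"] <= high else None}
        for r in rows
    ]

def impute_age(rows):
    """Fill missing ages with the mean of the known ages."""
    fill = mean(r["age"] for r in rows if r["age"] is not None)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in rows]

def flag_income_outliers(rows, factor=3.0):
    """Flag incomes above `factor` times the median (a simple illustrative rule)."""
    mid = median(r["income"] for r in rows)
    return [{**r, "income_outlier": r["income"] > factor * mid} for r in rows]

cleaned = flag_income_outliers(
    impute_age(blank_impossible_ages(remove_duplicates(records)))
)
for row in cleaned:
    print(row)
```

Flagging the outlier instead of deleting it keeps the decision (remove or adjust) with the analyst, as the notes suggest.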
Why is Data Cleaning Important?
● Improves Accuracy: Cleaning the data ensures that the results of analysis or mining are accurate and reliable.
● Reduces Errors: Data cleaning helps to eliminate errors, outliers, and inconsistencies that could distort conclusions.
● Prepares Data for Analysis: Clean data makes it easier to apply data mining techniques and algorithms, ensuring better performance and results.
Tools for Data Cleaning
There are several tools and software packages that can help with data cleaning:
● Excel/Google Sheets: Basic tools like Excel can be used to identify and remove duplicates or fill in missing data.
Slide 29
● Python Libraries: Python libraries such as pandas and numpy offer functions for handling missing data, removing duplicates, and cleaning data efficiently.
● Data Cleaning Software: Tools like OpenRefine and Trifacta help automate and simplify the cleaning process for large datasets.
Data Preprocessing: Data Integration
Data integration is a crucial step in data preprocessing. It involves combining data from different sources into a single unified dataset, making it easier to analyze. This step is important because data is often stored in various formats or across multiple systems, and for data mining to be effective, it needs to be in one place and in a consistent format.
What is Data Integration?
Data integration is the process of merging data from multiple sources to create a comprehensive and consistent dataset. This step is essential because:
● Different data sources may provide useful information, but if they are not integrated properly, it becomes difficult to analyze them together.
● Data can come from different databases, files, sensors, or applications, and each source might store data in different formats.
Why is Data Integration Important?
1. Combining Data from Different Sources:
○ Data often comes from multiple systems, such as sales data from a store's database, customer data from a CRM system, and product data from an inventory system.
○ Data integration allows you to bring all this information together into one dataset, making analysis easier.
2. Better Insights:
○ By combining data from various sources, you can get a more complete picture of the situation, leading to better insights.
○ Example: If you combine sales data with customer feedback, you can understand how customer satisfaction affects sales.
3. Consistency:
○ Data integration ensures that data from different sources is consistent and can be analyzed together without conflicts or discrepancies.
○ For example, it resolves issues where customer names might be stored in different formats across systems (e.g., "John Doe" vs. "Doe, John").
Challenges in Data Integration
1. Data Format Differences:
Slide 30
○ Data from different sources might be in different formats, such as text files, spreadsheets, or databases, which need to be standardized.
○ Solution: Data conversion tools or techniques are used to convert data into a common format.
2. Data Redundancy:
○ Sometimes, the same information is recorded in multiple places, leading to duplicate data.
○ Solution: Identify and remove duplicates to ensure that each piece of data is unique.
3. Data Inconsistencies:
○ Data from different sources might have inconsistencies, like different units or naming conventions (e.g., one system using "kg" for weight and another using "lbs").
○ Solution: Data transformation techniques (like converting all weights to kilograms) ensure consistency.
4. Missing Data:
○ Different sources may have missing values, and integrating these sources could lead to incomplete data.
○ Solution: Techniques such as imputation (filling in missing values with estimates) or using data cleaning tools can address this issue.
Steps in Data Integration
1. Identifying Data Sources:
○ The first step in integration is identifying all the relevant data sources that need to be combined.
○ These can include databases, external files, or even data collected from web services.
2. Data Matching:
○ Data from different sources needs to be matched, meaning identifying which data in one source corresponds to data in another.
○ Example: Matching customer IDs from two different databases to combine their purchase history and contact information.
3. Data Transformation:
○ This involves converting data into a common format and structure so that it can be easily combined.
○ Example: Converting all date fields to the same format (e.g., YYYY-MM-DD).
4. Data Cleaning:
○ Remove duplicates, fix errors, and handle missing data during the integration process to ensure the dataset is clean and accurate.
5. Data Consolidation:
○ Once all data sources are matched, transformed, and cleaned, they are consolidated into one unified dataset.
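The integration steps above (match on customer ID, transform names and dates to a common format, remove duplicates, consolidate) can be sketched as follows. The two sources, their field names, and the assumed input date formats (ISO and DD/MM/YYYY) are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical sources: a CRM export and a sales system with mixed date formats.
crm_customers = [
    {"customer_id": 101, "name": "Doe, John"},
    {"customer_id": 102, "name": "Asha Rao"},
]
sales_orders = [
    {"customer_id": 101, "order_date": "03/01/2025", "amount": 250.0},  # DD/MM/YYYY
    {"customer_id": 101, "order_date": "2025-01-03", "amount": 250.0},  # same order, ISO
    {"customer_id": 102, "order_date": "15/02/2025", "amount": 90.0},
]

def standardize_name(name):
    """Convert 'Last, First' into 'First Last'; leave other names unchanged."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        return f"{first} {last}"
    return name

def standardize_date(text):
    """Return the date as YYYY-MM-DD, trying each assumed input format in turn."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text}")

def integrate(customers, orders):
    """Match orders to customers by ID, transform, de-duplicate, and consolidate."""
    names = {c["customer_id"]: standardize_name(c["name"]) for c in customers}
    unified, seen = [], set()
    for order in orders:
        key = (order["customer_id"], standardize_date(order["order_date"]), order["amount"])
        if key in seen:  # the same order recorded twice
            continue
        seen.add(key)
        unified.append({
            "customer_id": key[0],
            "name": names.get(key[0]),
            "order_date": key[1],
            "amount": key[2],
        })
    return unified

unified = integrate(crm_customers, sales_orders)
for row in unified:
    print(row)
```

Note that duplicates only become visible after the dates are transformed to a common format, which is one reason transformation comes before cleaning in the steps listed.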
Slide 31
Tools for Data Integration
● ETL Tools (Extract, Transform, Load): These are software tools used to extract data from various sources, transform it into the correct format, and load it into a central system.
○ Examples: Talend, Apache NiFi, Informatica, and Microsoft SQL Server Integration Services (SSIS).
● Database Management Systems (DBMS): Systems like MySQL, Oracle, and PostgreSQL help manage and integrate data from multiple sources into one unified system.
Data Preprocessing: Data Reduction
Data reduction is an important step in data preprocessing. It involves reducing the amount of data while maintaining the most important information. This helps make the analysis faster and more efficient, especially when dealing with large datasets.
What is Data Reduction?
Data reduction refers to techniques used to reduce the size of the dataset while retaining the relevant patterns and information. Large datasets can be difficult to handle, analyze, and store, so data reduction helps make the data more manageable without losing key insights.
Why is Data Reduction Important?
1. Improves Efficiency: Reducing the amount of data speeds up processing and analysis, making it less resource-intensive.
2. Reduces Storage Needs: Smaller datasets require less memory and storage space.
3. Simplifies Analysis: A smaller, well-reduced dataset is easier to work with and can still provide useful insights.
4. Faster Decision-Making: By focusing on the most relevant data, businesses can make quicker decisions.
Techniques for Data Reduction
There are several ways to reduce data, depending on the nature of the dataset and the analysis needs. Here are the most common techniques:
1. Dimensionality Reduction
● Definition: This technique reduces the number of features (variables or attributes) in the dataset while preserving as much information as possible.
Slide 32
● How It Works:
○ For example, in a dataset with many variables (like height, weight, age, income, etc.), dimensionality reduction tries to find a smaller set of important variables that still capture the main patterns in the data.
● Popular Methods:
○ Principal Component Analysis (PCA): A technique that transforms the original features into a smaller set of uncorrelated components.
○ Linear Discriminant Analysis (LDA): A method used to find a linear combination of features that best separates the data into different classes.
● Example: A dataset of customer details may have features like "age," "location," "purchase history," and more. PCA can reduce these features into a smaller set of components that capture most of the information.
2. Data Aggregation
● Definition: Data aggregation involves combining multiple rows of data into a single row by averaging or summing the values.
● How It Works: This reduces the number of data points while preserving the overall patterns.
● Example: If you have sales data for each day of the month, you can aggregate this data to show only the total sales for each week or month, reducing the number of records.
3. Sampling
● Definition: Sampling involves selecting a smaller, representative subset of the original dataset.
● How It Works: Instead of using the entire dataset, you use a smaller sample that reflects the characteristics of the full dataset. Sampling is especially useful when dealing with huge datasets.
● Types of Sampling:
○ Random Sampling: Randomly selecting a subset of the data.
○ Stratified Sampling: Ensuring the sample contains proportionate amounts of different classes or categories.
● Example: If a company has data for millions of customers, a sample of 1,000 customers might be enough to get an idea of customer behavior.
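As a rough illustration of the two sampling types, the sketch below draws a simple random sample and a stratified sample from a made-up customer list. The "segment" field, the 10% sampling fraction, and the fixed seed are assumptions for the example.

```python
import random
from collections import defaultdict

def random_sample(rows, k, seed=0):
    """Simple random sampling: every row has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

def stratified_sample(rows, key, fraction, seed=0):
    """Stratified sampling: sample the same fraction from every category of `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # keep at least one row per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 1,000 customers, of which every tenth is "premium".
customers = [
    {"id": i, "segment": "premium" if i % 10 == 0 else "regular"}
    for i in range(1000)
]

simple = random_sample(customers, 100)
stratified = stratified_sample(customers, "segment", 0.1)

premium_in_stratified = sum(1 for c in stratified if c["segment"] == "premium")
print(len(simple), len(stratified), premium_in_stratified)
```

The stratified sample preserves the 10% share of premium customers exactly, whereas the simple random sample only approximates it.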
4. Data Compression
● Definition: Data compression reduces the size of the data by encoding it more efficiently, without losing important information.
● How It Works: Compression algorithms remove redundant or unnecessary parts of the data.
● Example: Text or image data can be compressed to save storage space, making it easier to handle.
5. Feature Selection
Slide 33
● Definition: Feature selection involves identifying and keeping only the most important features (variables) in the dataset, and removing irrelevant or redundant ones.
● How It Works: This reduces the number of features, making the analysis simpler and faster without losing key information.
● Example: If you have a dataset with 10 features, but only 4 are important for the analysis, feature selection will remove the irrelevant ones.
Benefits of Data Reduction
● Faster Analysis: Less data means faster processing time for data mining algorithms.
● Better Performance: Reduced data can improve the performance of machine learning models, making them easier to train and less prone to overfitting.
● Cost-Effective: Less storage and memory are needed to store the reduced dataset, making it cheaper to manage.
Data Preprocessing: Data Transformation
Data transformation is an important step in data preprocessing. It involves changing the format, structure, or values of data to make it suitable for analysis. The goal of data transformation is to prepare data in a way that improves its quality, consistency, and usability, especially for data mining tasks.
What is Data Transformation?
Data transformation refers to the process of converting data from its raw form into a format that can be easily analyzed. This can include several actions, such as changing the data's scale, converting data types, or combining multiple datasets. Transformation helps make the data more consistent, comparable, and ready for further analysis.
Why is Data Transformation Important?
1. Improves Consistency: Different data sources might use different formats, scales, or units. Transformation makes sure everything is in a common format.
2. Enhances Data Quality: Transformation can help deal with missing values, incorrect data, or outliers.
3. Prepares Data for Modeling: Machine learning algorithms and data mining models often require data to be transformed into specific formats or ranges.
Slide 34
Types of Data Transformation
1. Normalization (Scaling Data):
○ Definition: Changing the scale of data to ensure that it falls within a specific range, usually 0 to 1.
○ When to Use: When features (columns) have different units or scales, such as height in meters and weight in kilograms.
○ Example: If you have data on people's heights (150 cm to 200 cm) and weights (50 kg to 100 kg), you might normalize the data so that all values are scaled between 0 and 1.
2. Standardization:
○ Definition: Transforming data to have a mean of 0 and a standard deviation of 1.
○ When to Use: When features are on different scales and the algorithm is sensitive to scale (e.g., algorithms like k-means or support vector machines). Unlike normalization, the result is not confined to a fixed range.
○ Formula: Z = (X − μ) / σ
Where X is the original value, μ is the mean, and σ is the standard deviation.
○ Example: If exam scores are between 40 and 90, standardization expresses each score as the number of standard deviations it lies above or below the mean, so scores near the average map to values close to 0.
3. Discretization:
○ Definition: Converting continuous data into discrete categories or bins.
○ When to Use: When dealing with continuous variables (e.g., age, income) and you want to simplify or categorize the data.
○ Example: If you have ages ranging from 1 to 100, you might discretize this into categories like "Child," "Teenager," "Adult," and "Senior."
4. Encoding Categorical Data:
○ Definition: Converting categorical data (such as "Yes" or "No") into numeric values that machine learning algorithms can understand.
○ Types of Encoding:
■ Label Encoding: Assigning each category a unique integer (e.g., "Male" = 0, "Female" = 1).
■ One-Hot Encoding: Creating binary columns for each category (e.g., for a color feature with values "Red," "Blue," and "Green," you create three columns with binary values indicating the presence of each color).
○ Example: If you have a column for "City" with values like "New York," "London," and "Tokyo," you can encode these into numbers or binary columns for easier analysis.
5. Aggregation:
○ Definition: Combining data from multiple rows or columns into a single value.
○ When to Use: When you need to summarize data, such as calculating the average or total for groups.
Slide 35
○ Example: If you have sales data for each day, you might aggregate it by month to get total sales per month.
6. Feature Construction:
○ Definition: Creating new features by combining or transforming existing ones.
○ When to Use: To derive additional useful information from the data.
○ Example: If you have columns for "height" and "weight," you might create a new feature for "BMI" (Body Mass Index) to better represent a person's physical condition.
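Several of the transformations above are short enough to write out directly. The sketch below shows min-max normalization, standardization with Z = (X − μ) / σ, one-hot encoding, and BMI as a constructed feature. The function names and sample values are hypothetical, and the population standard deviation is used as an assumption (a sample standard deviation is an equally common choice).

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values to the range 0..1 (assumes the values are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Apply Z = (X - mu) / sigma so the result has mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Create one binary column per category, as a list of dicts."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]

def bmi(weight_kg, height_cm):
    """Feature construction: BMI = weight in kg / (height in m) squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

heights_cm = [150, 160, 175, 200]
exam_scores = [40, 55, 70, 90]

scaled = min_max_normalize(heights_cm)
z_scores = standardize(exam_scores)
colors = one_hot(["Red", "Blue", "Green", "Red"])

print(scaled)
print([round(z, 2) for z in z_scores])
print(colors[0])
print(round(bmi(70, 175), 1))
```

Label encoding is omitted here because it reduces to a dictionary lookup; one-hot encoding is usually preferred when the categories have no natural order.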
Data Preprocessing: Discretization and Concept Hierarchy Generation
In data preprocessing, discretization and concept hierarchy generation are techniques used to convert continuous or complex data into simpler forms that are easier to analyze, especially for data mining tasks.
1. Discretization
Discretization is the process of converting continuous data (numeric values) into discrete categories or intervals. For example, instead of representing age as a specific number like 23, you might categorize it as "20-30 years."
Why Do We Need Discretization?
● Some data mining algorithms work better with categorical data (e.g., decision trees).
● Converting continuous data into categories makes it easier to analyze and find patterns.
How Does Discretization Work?
There are different ways to discretize data:
1. Equal Width Binning:
○ The range of values is divided into equal-sized intervals.
○ Example: If the data ranges from 0 to 100 and you want 5 intervals, each bin will have a width of 20 (0-20, 21-40, 41-60, etc.).
2. Equal Frequency Binning:
○ The data is divided into bins so that each bin has the same number of data points.
○ Example: If you have 100 data points and want 5 bins, each bin will contain 20 data points.
3. Clustering-based Discretization:
Slide 36
○ The data is grouped into clusters, and each cluster is treated as a category.
○ Example: Grouping age data into categories like "young," "middle-aged," and "old" based on similar characteristics.
Example of Discretization:
If we have the following data about student grades:
● 95, 82, 63, 45, 72
After discretizing into four grade bands (the cut points are chosen by hand, so the bands are not of equal width), we might get:
● 95 → "A" (90-100)
● 82 → "B" (70-89)
● 63 → "C" (50-69)
● 45 → "D" (0-49)
● 72 → "B" (70-89)
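To make the binning methods concrete, the following sketch applies equal-width binning, equal-frequency binning, and the hand-chosen grade bands to the five grades above. The function names are hypothetical, and bins are reported as zero-based indexes rather than labels.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bin indexes so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def grade_band(score):
    """Custom cut points matching the grade example: A, B, C, D."""
    if score >= 90:
        return "A"
    if score >= 70:
        return "B"
    if score >= 50:
        return "C"
    return "D"

grades = [95, 82, 63, 45, 72]

width_bins = equal_width_bins(grades, 3)      # range 45..95 split into 3 equal intervals
freq_bins = equal_frequency_bins(grades, 5)   # one value per bin here
bands = [grade_band(g) for g in grades]

print(width_bins)
print(freq_bins)
print(bands)
```

With three equal-width intervals over 45 to 95, both 95 and 82 fall in the top bin, which differs from the A/B split produced by the custom grade bands.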
2. Concept Hierarchy Generation
Concept hierarchy generation is the process of organizing data attributes (or features) into hierarchical levels, ranging from more general to more specific. This is typically used for categorical data to allow for a higher-level view of the data.
Why is Concept Hierarchy Important?
● It helps in generalizing or simplifying the data by grouping similar concepts.
● It allows data to be viewed at different levels of abstraction, which is helpful in tasks like decision making and pattern discovery.
How Does Concept Hierarchy Work?
1. Hierarchical Structure:
○ At the top, you have more general categories (e.g., "Animals").
○ As you move down, the categories become more specific (e.g., "Mammals", "Reptiles").
2. Generating a Hierarchy:
○ You can generate a concept hierarchy manually based on knowledge or use automatic algorithms to group similar items.
○ Example: If you have a dataset with the "Location" attribute, a concept hierarchy might look like:
■ Levels (general to specific): Country → State → City
■ Example path: USA → California → San Francisco
3. Conceptualization:
○ Concept hierarchies help you move from specific data points to broader categories, allowing for more abstract analysis.
Slide 37
○ Example: Instead of looking at individual product categories like "Shampoo," "Toothpaste," and "Soap," you might group them under a higher-level category like "Personal Care Products."
Example of Concept Hierarchy:
For a dataset of "Products Sold," a concept hierarchy might look like:
● Level 1 (General): Products
● Level 2 (More Specific): Electronics, Clothing, Groceries
● Level 3 (Specific Products): TV, Laptop, T-shirt, Jeans, Apple, Banana
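One simple way to represent such a hierarchy in code is a lookup table from each specific item to its parent category. The sketch below uses the three levels from the example; the list of sold items is made up for illustration.

```python
from collections import Counter

# Level 3 -> Level 2 mapping taken from the "Products Sold" example.
PRODUCT_TO_CATEGORY = {
    "TV": "Electronics", "Laptop": "Electronics",
    "T-shirt": "Clothing", "Jeans": "Clothing",
    "Apple": "Groceries", "Banana": "Groceries",
}

def generalize(items, level):
    """Replace each item with its ancestor at the requested hierarchy level."""
    if level == 3:                      # specific products
        return list(items)
    if level == 2:                      # product categories
        return [PRODUCT_TO_CATEGORY[item] for item in items]
    if level == 1:                      # everything is a product
        return ["Products" for _ in items]
    raise ValueError("level must be 1, 2, or 3")

# Hypothetical transactions recorded at the most specific level.
sold = ["TV", "Apple", "Laptop", "Jeans", "Banana", "TV"]

by_category = Counter(generalize(sold, 2))
by_top_level = Counter(generalize(sold, 1))

print(by_category)
print(by_top_level)
```

Counting at level 2 instead of level 3 is the same move from specific data points to broader categories described above.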
Why Are These Techniques Important in Data Preprocessing?
● Simplify Data: Discretization and concept hierarchy generation help simplify complex data, making it easier to analyze and understand.
● Improved Analysis: By grouping data into categories or hierarchies, it is easier to detect patterns, relationships, and trends in the data.
● Enhance Modeling: Many data mining algorithms work more effectively with categorical or hierarchical data, helping improve model performance.
Data Warehouse and OLAP Technology
Data Modeling Using Cubes and OLAP
Data modeling is an important part of data analysis and data mining. It helps organize and structure data to make it easier to analyze and gain insights. One popular method for modeling data is through Cubes and OLAP (Online Analytical Processing).
1. What is Data Modeling?
Data modeling is the process of designing how data is stored, organized, and accessed. In the context of data mining and analysis, we want to organize the data in a way that makes it easy to explore and analyze from different perspectives.
2. What is OLAP (Online Analytical Processing)?
OLAP is a technology used for analyzing large amounts of data quickly. It allows users to interactively explore and analyze data from multiple dimensions. OLAP systems are designed to help in decision-making by summarizing data in an easy-to-understand format.
Key Features of OLAP:
Slide 38
● Multidimensional Data: OLAP organizes data in a multi-dimensional view (like a cube) where each dimension represents a different perspective of the data.
● Interactive Analysis: Users can "slice," "dice," and "pivot" the data to view it from different angles.
● Fast Querying: OLAP systems are optimized for querying large datasets quickly.
3. What are Data Cubes?
A data cube is a multi-dimensional array used in OLAP to represent data. Imagine a cube where each side represents a different attribute (or dimension) of the data. Each cell in the cube contains a value, usually the result of aggregating or summarizing data across multiple dimensions.
Example of a Data Cube:
Imagine you have a sales dataset that includes three dimensions:
● Product: Different products being sold (e.g., TV, Laptop, Phone)
● Time: Sales data over different periods (e.g., months, years)
● Region: Different geographic locations (e.g., North, South, East, West)
In this case, the data cube could have:
● Rows for products (e.g., TV, Laptop, Phone)
● Columns for time (e.g., January, February, March)
● Depth for regions (e.g., North, South, East, West)
The data cube would allow you to easily find information like total sales for each product in each region for a specific month.
4. Operations in OLAP
In OLAP, there are several important operations that help you explore and analyze the data in the cube:
Slide 39
1. Slice: This operation fixes one dimension of the cube to a single value and shows the remaining dimensions as a 2D slice of the data.
○ Example: You might slice the data to view sales for January across all products and regions.
2. Dice: This operation selects specific values on two or more dimensions and shows the matching subset of the data as a smaller cube.
○ Example: You might dice the cube to view sales for Laptops in the North region during January and February.
3. Pivot (Rotate): This operation allows you to rotate the cube to view the data from a different perspective.
○ Example: You might pivot the cube to swap the time dimension with the region dimension to see how sales vary by region across different months.
4. Drill Down / Drill Up: These operations allow you to view the data in more detail (drill down) or at a higher level of aggregation (drill up, also called roll up).
○ Example: You can drill down from yearly sales to monthly sales or drill up from monthly sales to quarterly sales.
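The four operations can be imitated on a tiny in-memory cube to see what each one returns. This is only a sketch: the cube is a plain dictionary keyed by (product, month, region), the sales figures are invented, and real OLAP engines implement these operations very differently.

```python
from collections import defaultdict

DIMS = ("product", "month", "region")

# Hypothetical cube: (product, month, region) -> sales amount.
cube = {
    ("TV", "Jan", "North"): 10, ("TV", "Feb", "North"): 12,
    ("Laptop", "Jan", "North"): 7, ("Laptop", "Feb", "North"): 9,
    ("Laptop", "Jan", "South"): 5, ("Phone", "Mar", "East"): 4,
}
QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1"}

def slice_cube(cells, dim, value):
    """Slice: fix one dimension to a single value and drop it, leaving a 2D view."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cells.items() if k[i] == value}

def dice_cube(cells, **allowed):
    """Dice: keep cells whose coordinates are in the allowed sets (a smaller cube)."""
    return {
        k: v for k, v in cells.items()
        if all(k[DIMS.index(d)] in values for d, values in allowed.items())
    }

def pivot(view):
    """Pivot: swap the two axes of a 2D view."""
    return {(b, a): v for (a, b), v in view.items()}

def roll_up_to_quarter(cells):
    """Drill up: aggregate monthly cells into quarterly totals."""
    totals = defaultdict(int)
    for (product, month, region), v in cells.items():
        totals[(product, QUARTER[month], region)] += v
    return dict(totals)

january = slice_cube(cube, "month", "Jan")
laptops_north = dice_cube(cube, product={"Laptop"}, region={"North"}, month={"Jan", "Feb"})
by_region_first = pivot(january)
quarterly = roll_up_to_quarter(cube)

print(january)
print(laptops_north)
print(by_region_first)
print(quarterly)
```

Drill down is the reverse of the roll-up shown here: it requires the more detailed monthly cells to still be available.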
5. Benefits of Using Cubes and OLAP
● Efficiency: OLAP cubes provide fast query performance by pre-aggregating data, which makes analysis faster even with large datasets.
● Multidimensional View: With OLAP, you can view data from multiple perspectives (dimensions), helping you identify trends and patterns that wouldn't be obvious in a flat table.
● User-Friendly: OLAP allows users to interactively explore data without needing to write complex queries, making it easy for non-technical users to analyze the data.
6. Example of OLAP in Action
Slide 40
Let's say you are analyzing sales data for a retail company. You can use OLAP to:
● Slice the data to view the sales of specific products in a certain time period.
● Dice the data to look at sales of laptops in the East region for January and February.
● Pivot the data to see sales by region rather than by time period.
● Drill down into the monthly sales data to understand which specific months had the highest sales.
This flexibility in viewing and analyzing the data is one of the main strengths of OLAP.
Data Warehousing (DWH) Design and Usage
1. What is Data Warehouse Design?
Data warehouse design refers to the process of creating the architecture and structure of the data warehouse to store and organize data in an efficient way.
The goal is to ensure that data can be accessed and analyzed easily and quickly.
Key components of data warehouse design include:
a. Data Source:
● Data comes from different sources such as operational databases, external systems, or flat files.
● Example: Data from sales transactions, customer databases, and inventory management systems.
b. Data Staging:
● Before data enters the data warehouse, it goes through a staging area where it's cleaned and transformed. This is to ensure that the data is accurate and in the right format.
● Example: Removing duplicates, fixing errors, or converting data types (e.g., converting dates into a standard format).
c. Data Modeling:
● This involves organizing data in the warehouse so that it's easy to retrieve and analyze. Two common types of data models are:
1. Star Schema: In this model, there is a central fact table (containing the main data, like sales) connected to multiple dimension tables (containing related data, like customer, time, and product).
2. Snowflake Schema: A more normalized version of the star schema, where the dimension tables are further broken down into additional sub-tables.
Slide 41
● Example: In a sales data warehouse, the fact table could store total sales figures, while dimension tables store information about customers, products, and time.
d. Data Storage:
● Data is stored in a way that makes it easy to retrieve for analysis. This involves choosing the right storage technology like relational databases, columnar databases, or cloud-based solutions.
● Example: Storing data in tables that allow for fast querying.
2. Usage of Data Warehouse (DWH)
A data warehouse is used for a variety of purposes, primarily to support decision-making, reporting, and analysis. Here's how it's used:
a. Decision Support:
● Organizations use data warehouses to support decision-making by providing easy access to historical and current data in one place. This allows business leaders to analyze trends and make informed decisions.
● Example: A retailer might use a data warehouse to analyze sales trends over the last few years to decide on future inventory purchases.
b. Reporting and Business Intelligence (BI):
● Data warehouses are used to create reports and dashboards that help businesses track their performance and key metrics. Tools like Power BI, Tableau, or Excel can be used to generate insights from the data stored in the warehouse.
● Example: A finance department might generate monthly profit and loss reports from the data warehouse to evaluate the company's financial health.
c. Data Analysis:
● Data mining, which involves extracting patterns and knowledge from large data sets, is often done using a data warehouse. Analysts use the data warehouse to find insights that may not be immediately apparent.
● Example: A marketing team could analyze customer purchasing patterns to identify which products are popular among different age groups or locations.
d. Historical Data:
● A data warehouse stores large amounts of historical data, which is important for analyzing long-term trends, forecasting, and decision-making.
● Example: A company may store several years of sales data in the warehouse to analyze long-term performance, compare yearly growth, or predict future sales.
Slide 42
3. Benefits of Data Warehousing
● Centralized Data Storage: All data is stored in one place, making it easier to manage and access.
● Improved Reporting: Users can generate reports and insights quickly and accurately.
● Data Consistency: The data is cleaned, transformed, and integrated, ensuring it is consistent across different departments and systems.
● Faster Decision-Making: By having all historical and current data in one place, decision-makers can quickly access the information they need to make faster, more informed decisions.
4. Challenges of Data Warehousing
● Data Integration: Combining data from different sources can be complex, especially if the data formats and structures are different.
● Data Quality: Ensuring the data is accurate, complete, and up-to-date can be time-consuming.
● Cost and Maintenance: Building and maintaining a data warehouse can be expensive, requiring both hardware and software resources.
Primary differences between star, snowflake, and fact constellation schemas in Data Warehousing
In Data Warehousing, schemas define the structure of data and how it is stored. The three main types of schemas are Star Schema, Snowflake Schema, and Fact Constellation Schema.
1. Star Schema:
- Structure: The star schema is the simplest and most common. It has a central fact table connected directly to several dimension tables, creating a star-like shape.
- Fact Table: The fact table contains numeric data (like sales, quantities) and foreign keys that link to dimension tables.
- Dimension Tables: These store descriptive information (e.g., product details, dates, customers) that adds context to the data in the fact table.
Slide 43
- Advantage: Easy to understand and query.
- Disadvantage: Can lead to data redundancy because dimension tables are not normalized.
Fig: Star Design
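A star schema can be tried out with Python's built-in sqlite3 module. The sketch below creates one fact table and two dimension tables, then runs a typical analytical query that joins the fact table directly to each dimension. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, month TEXT, year INTEGER);

-- The fact table holds numeric measures plus foreign keys to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    amount     REAL
);

INSERT INTO dim_product VALUES (1, 'TV', 'Electronics'), (2, 'T-shirt', 'Clothing');
INSERT INTO dim_date    VALUES (1, 'January', 2025), (2, 'February', 2025);
INSERT INTO fact_sales  VALUES (1, 1, 2, 800.0), (1, 2, 1, 400.0), (2, 1, 5, 50.0);
""")

# Total sales amount per category and month: one join per dimension.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id    = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

for row in rows:
    print(row)
```

The category is stored as text in every product row, which is the redundancy mentioned above; a snowflake design would move it into its own table at the cost of an extra join.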
2. Snowflake Schema:
- Structure: The snowflake schema is a more normalized version of the star schema. The dimension tables are broken down into smaller tables, resembling a snowflake shape.
- Fact Table: Similar to the star schema, but dimension tables are divided into sub-tables to remove redundancy.
- Dimension Tables: Dimension tables are normalized (split into multiple related tables) to reduce duplication.
- Advantage: Reduces data redundancy and storage space.
- Disadvantage: Queries are more complex and can take longer to execute than on a star schema, because they need more joins.
Slide 44
Fig: Snowflake schema
3. Fact Constellation Schema:
- Structure: This schema is also called a galaxy schema. It consists of multiple fact tables that share dimension tables. This is useful for handling complex data and multiple subject areas.
- Fact Tables: There are multiple fact tables, each representing different business processes (e.g., sales, inventory) that share dimensions like time, location, or product.
- Dimension Tables: Shared dimension tables provide flexibility and help analyze data across different fact tables.
- Advantage: Supports multiple data marts and complex queries across various processes.
- Disadvantage: More complex to design and maintain than the other schemas.
Fig: Fact Constellation Schema
Slide 45
How is a Data Warehouse designed for effective OLAP implementation and usage?
Designing a Data Warehouse for effective OLAP (Online Analytical Processing) implementation and usage involves several important steps to ensure that the system is optimized for fast and complex queries, as well as multidimensional data analysis.
1. Identify Business Requirements:
- Objective: The first step is to understand the business goals and data needs. What kind of reports and analyses do the users need? These requirements help define the structure of the data warehouse.
- Example: A retail company might need to analyze sales trends by region, product, and time period.
2. Choose an OLAP Model:
- There are two main types of OLAP systems: ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP).
- ROLAP uses relational databases to store data in tables and can handle large amounts of data.
- MOLAP stores data in multidimensional cubes, providing faster query performance but requiring more storage.
- Choosing the right OLAP model depends on the data volume and performance needs.
3. Design the Data Warehouse Schema:
- Choose a schema that suits the business requirements:
- Star Schema: Simplifies queries by having a central fact table surrounded by dimension tables.
- Snowflake Schema: Normalizes the dimensions into multiple related tables, reducing data redundancy.
- Fact Constellation Schema: Supports multiple fact tables, enabling complex analyses across different business areas.
- This schema defines how data will be organized and stored in the data warehouse.
4. Data Extraction, Transformation, and Loading (ETL):
Slide 46
- ETL Process: Data is extracted from various sources, cleaned and transformed to match the schema, and then loaded into the data warehouse.
- Ensure that data is accurate, consistent, and clean before it enters the warehouse. This process ensures the data is ready for OLAP operations.
5. Multidimensional Data Modeling:
- Dimensions and Measures: Data is organized into dimensions (e.g., time, location, product) and measures (e.g., sales, profit) to support analysis.
- OLAP Cubes: Data is arranged into OLAP cubes, which allow users to slice and dice the data (view it from different angles) and drill down (view more detailed data) or roll up (view aggregated data).
- Example: A sales OLAP cube might have dimensions like time, product, region, and measures like total sales or profit.
6. Indexing and Aggregation:
- Precompute Aggregations: Precalculate and store aggregated data (e.g., total sales per region per year). This helps speed up queries by avoiding real-time calculations.
- Indexing: Use appropriate indexes on the fact and dimension tables to improve query performance. Indexes allow faster data retrieval by quickly locating the needed rows.
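The idea behind precomputed aggregations can be shown with a small sketch: totals per region and year are computed once, and later queries read the stored total instead of rescanning every detailed row. The fact rows and function names are hypothetical, and a real warehouse would store such summaries as tables or materialized views, not as a Python dictionary.

```python
from collections import defaultdict

# Hypothetical detailed fact rows: (region, year, sales amount).
fact_rows = [
    ("North", 2024, 100.0),
    ("North", 2024, 50.0),
    ("South", 2024, 70.0),
    ("North", 2025, 30.0),
]

def build_summary(rows):
    """Precompute total sales per (region, year), e.g. once after each ETL load."""
    totals = defaultdict(float)
    for region, year, amount in rows:
        totals[(region, year)] += amount
    return dict(totals)

summary = build_summary(fact_rows)

def total_sales(region, year):
    """Answer a query from the precomputed summary with a single lookup."""
    return summary.get((region, year), 0.0)

print(total_sales("North", 2024))
print(total_sales("South", 2024))
```

The trade-off is that the summary must be rebuilt or updated whenever new fact rows are loaded.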
7. Ensure Scalability and Performance:
- Design the data warehouse to handle growing data volumes and increased user queries. Ensure that it can scale up by adding more storage or processing power as needed.
- Use techniques like partitioning large tables into smaller chunks or optimizing the schema to ensure faster query responses.
8. Security and Access Control:
- Implement proper security measures to ensure that only authorized users can access specific data. This may involve setting up user roles, permissions, and data encryption.
- OLAP systems should allow controlled access to sensitive information while still enabling analysis.
Slide 47
9. Regular Maintenance and Optimization:
- Continuously monitor the system and perform maintenance tasks like updating indexes, reprocessing OLAP cubes, and ensuring data accuracy.
- Optimization: Periodically review and optimize the schema, indexes, and ETL processes to keep the data warehouse running efficiently.
This structured approach ensures that the data warehouse is well-prepared for OLAP, allowing businesses to make informed, data-driven decisions.
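The schema, ETL, aggregation, and indexing steps above can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustrative sketch: the table names (dim_time, dim_product, dim_region, fact_sales, agg_sales_region_year), the sample rows, and the cleaning rules are assumptions made for the example, not part of the original notes.

```python
import sqlite3

# Hypothetical source rows, as they might arrive from an operational system.
raw_rows = [
    {"date": "2023-03-12", "product": " laptop model a ", "region": "east", "amount": "1200.50"},
    {"date": "2023-03-15", "product": "Phone Model B", "region": "East", "amount": "800"},
    {"date": "2023-04-02", "product": "Laptop Model A", "region": "West", "amount": "1150.00"},
    {"date": "2023-04-09", "product": "Phone Model B", "region": "West", "amount": None},
]

conn = sqlite3.connect(":memory:")

# Star schema: one central fact table surrounded by dimension tables,
# plus indexes on the fact table's foreign keys.
conn.executescript("""
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER,
                           UNIQUE (year, month));
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE fact_sales (
        time_id INTEGER REFERENCES dim_time(time_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        region_id INTEGER REFERENCES dim_region(region_id),
        amount REAL NOT NULL
    );
    CREATE INDEX idx_fact_time ON fact_sales(time_id);
    CREATE INDEX idx_fact_region ON fact_sales(region_id);
""")


def dim_key(table, key_col, columns, values):
    """Insert a dimension member if it is new, then return its surrogate key."""
    placeholders = ", ".join("?" for _ in values)
    conn.execute(
        f"INSERT OR IGNORE INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
        values,
    )
    where = " AND ".join(f"{col} = ?" for col in columns)
    return conn.execute(f"SELECT {key_col} FROM {table} WHERE {where}", values).fetchone()[0]


# ETL: extract each raw row, clean and transform it, then load it.
loaded = 0
for row in raw_rows:
    if row["amount"] is None:  # reject rows with a missing measure
        continue
    year, month, _day = (int(part) for part in row["date"].split("-"))
    product = row["product"].strip().title()  # normalize whitespace and case
    region = row["region"].strip().title()
    conn.execute(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        (
            dim_key("dim_time", "time_id", ("year", "month"), (year, month)),
            dim_key("dim_product", "product_id", ("name",), (product,)),
            dim_key("dim_region", "region_id", ("name",), (region,)),
            float(row["amount"]),
        ),
    )
    loaded += 1

# Precomputed aggregation: total sales per region per year.
conn.execute("""
    CREATE TABLE agg_sales_region_year AS
    SELECT r.name AS region, t.year AS year, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_region r ON r.region_id = f.region_id
    JOIN dim_time t ON t.time_id = f.time_id
    GROUP BY r.name, t.year
""")

totals = dict(conn.execute("SELECT region, total_sales FROM agg_sales_region_year").fetchall())
print(loaded, totals)
```

Queries that only need yearly totals per region can then read the small summary table instead of scanning the fact table. A production warehouse would refresh such a table as part of the regular maintenance described in step 9.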
The process of data generalization using AOI (Attribute-Oriented Induction) in a Data Warehouse.
Data Generalization using AOI (Attribute-Oriented Induction) is a process used in Data Warehousing to summarize large datasets into higher-level concepts for easier analysis. It helps reduce the complexity of data by transforming detailed information into more abstract representations, which is useful for identifying patterns and trends.
Data Generalization:
- Data Generalization involves taking low-level data (detailed, raw data) and summarizing it into higher-level concepts (generalized data) to make it easier to analyze and understand.
- The goal is to convert large amounts of data into a more manageable, summarized form while preserving important patterns and trends.
What is AOI (Attribute-Oriented Induction)?
- Attribute-Oriented Induction (AOI) is a technique used to perform data generalization. It systematically replaces specific values in a dataset with general concepts by looking at the attributes (columns) of the data.
- This is especially helpful for OLAP operations and data mining when you want to explore the data at different levels of abstraction.
Steps in the Data Generalization Process Using AOI:
1. Select the Relevant Data:
- First, choose the subset of data you want to generalize based on specific criteria (e.g., select sales data for a particular region or time period).
Slide 48
- Example: If you're analyzing sales data, you might focus on attributes like product, region, and sales amount.
2. Set the Generalization Threshold:
- Define the threshold level for generalization. This threshold determines how much the data will be generalized, i.e., how many levels of abstraction will be applied. In practice it is often expressed as the maximum number of distinct values an attribute may keep after generalization.
- Example: You may want to generalize dates from individual days to months or years, and products from specific items to broader categories.
3. Attribute Generalization:
- AOI focuses on generalizing the attributes in the dataset. For each attribute (column), replace detailed values with higher-level concepts.
- Example:
- Replace specific product names ("Laptop Model A") with a general category ("Electronics").
- Replace specific cities ("New York", "Los Angeles") with a general region ("USA").
4. Generalization Operators:
- AOI uses different operators to generalize the data:
- Concept Hierarchies: Replace values with higher-level concepts using predefined hierarchies. For instance, the hierarchy for dates could be: Day → Month → Year.
- Attribute Removal: If an attribute still has too many distinct values and no concept hierarchy is available to generalize it, or if it is irrelevant to the analysis, it may be removed.
- Example:
- Replace individual transaction dates (e.g., "March 12, 2023") with the month ("March 2023") or the year ("2023").
5. Summarization and Aggregation:
- Once generalization is applied to the attributes, summarize the data by aggregating values, such as summing sales or averaging profits.
- Example: If you generalized from daily sales to monthly sales, sum all the sales for each month.
6. Generate a Generalized Table:
Slide 49
- After the generalization process, the result is a generalized table with fewer rows and columns, representing a summary of the original data.
- This table provides insights at a higher level of abstraction, which is useful for decision-making.
- Example: Instead of analyzing sales for each product sold each day, you now have summarized sales data by product category and month.
7. Perform OLAP or Data Mining:
- The generalized data can now be used for OLAP operations (e.g., roll-up, drill-down) or further data mining to identify patterns and trends at a more abstract level.
- Example: You can use this generalized data to analyze trends in sales across different regions or time periods.
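The AOI steps above can be sketched in plain Python as follows. This is a simplified illustration, not a full AOI implementation: the concept hierarchies, the sample transactions, the attribute names, and the threshold of 2 distinct values per attribute are all assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical concept hierarchies (illustrative mappings only).
PRODUCT_CATEGORY = {
    "Laptop Model A": "Electronics",
    "Phone Model B": "Electronics",
    "Office Chair": "Furniture",
}
CITY_COUNTRY = {"New York": "USA", "Los Angeles": "USA", "Toronto": "Canada"}

# Each attribute maps to an ordered list of "climb one level" functions.
HIERARCHIES = {
    "date": [lambda day: day[:7], lambda month: month[:4]],  # Day -> Month -> Year
    "product": [PRODUCT_CATEGORY.__getitem__],               # Item -> Category
    "city": [CITY_COUNTRY.__getitem__],                      # City -> Country
}
ATTRIBUTES = ["txn_id", "date", "product", "city"]  # "amount" is the measure

# Step 1: the selected task-relevant data (hypothetical sample).
transactions = [
    {"txn_id": "T1", "date": "2023-03-12", "product": "Laptop Model A", "city": "New York", "amount": 1200.0},
    {"txn_id": "T2", "date": "2023-03-20", "product": "Phone Model B", "city": "Los Angeles", "amount": 800.0},
    {"txn_id": "T3", "date": "2023-03-25", "product": "Office Chair", "city": "New York", "amount": 150.0},
    {"txn_id": "T4", "date": "2023-04-02", "product": "Laptop Model A", "city": "Toronto", "amount": 1100.0},
    {"txn_id": "T5", "date": "2023-04-18", "product": "Office Chair", "city": "Los Angeles", "amount": 175.0},
]


def attribute_oriented_induction(rows, threshold=2):
    """Generalize each attribute until it has at most `threshold` distinct values."""
    rows = [dict(row) for row in rows]  # work on a copy
    kept = []
    for attr in ATTRIBUTES:
        levels = list(HIERARCHIES.get(attr, []))
        # Steps 3-4: climb the concept hierarchy while the attribute is too detailed.
        while len({row[attr] for row in rows}) > threshold and levels:
            climb = levels.pop(0)
            for row in rows:
                row[attr] = climb(row[attr])
        # Attribute removal: still too many values and nothing left to climb.
        if len({row[attr] for row in rows}) <= threshold:
            kept.append(attr)
    # Step 5: merge identical generalized rows, keeping a count and a sales total.
    merged = defaultdict(lambda: [0, 0.0])
    for row in rows:
        key = tuple(row[attr] for attr in kept)
        merged[key][0] += 1
        merged[key][1] += row["amount"]
    # Step 6: the generalized table, one tuple per merged row.
    table = [key + (count, total) for key, (count, total) in sorted(merged.items())]
    return kept, table


kept_attributes, generalized = attribute_oriented_induction(transactions)
print(kept_attributes)
for generalized_row in generalized:
    print(generalized_row)
```

In this sketch the transaction ID is dropped because it has no hierarchy, dates stop at the month level, and the five original rows collapse into four generalized rows, each carrying a count and a total sales amount.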
What are the benefits of using OLAP for business decision-making, and how does it enhance data insights?
Benefits of Using OLAP for Business Decision-Making:
1. Multidimensional Data Analysis:
- OLAP allows businesses to analyze data in multiple dimensions, such as time, product, location, and customer. This means they can view the same data from different angles and get deeper insights.
- Example: A retail company can analyze sales by product category, region, and time period to identify the best-selling products in specific regions over different months.
2. Fast Query Performance:
- OLAP is optimized for fast, complex queries on large datasets. Unlike traditional transactional databases, which might take a long time to process complex queries, OLAP systems are designed to return aggregated results quickly, often by relying on precomputed aggregations.
- Example: Managers can quickly generate reports on total sales for the last quarter across all stores without waiting for long processing times.
Slide 50
3. Data Summarization and Aggregation:
- OLAP allows businesses to summarize and aggregate data, making it easier to work with large volumes of information. This is helpful for quickly identifying trends and patterns.
- Example: Instead of viewing individual sales transactions, businesses can view total sales by region or average profit by product category.
4. Supports "Slice and Dice" Operations:
- OLAP allows users to perform "slice and dice" operations, where they can break down data into smaller parts or view specific sections of the data. A slice fixes one dimension to a single value, while a dice selects values on two or more dimensions.
- Example: A business can "slice" data to look at sales for one specific region or "dice" data to compare sales across different product categories and time periods simultaneously.
5. Drill-Down and Roll-Up Functionality:
- OLAP supports drill-down and roll-up operations, which allow users to view data at different levels of detail.
- Drill-Down: Zooming in to view more detailed data.
- Roll-Up: Zooming out to view summarized data.
- Example: A user can drill down from yearly sales data to view monthly or daily sales. Similarly, they can roll up to see quarterly or yearly totals.
6. Historical Data Analysis:
- OLAP systems store historical data, allowing businesses to perform trend analysis over time. This helps them identify patterns, predict future performance, and make informed decisions.
- Example: A company can compare sales trends over the past five years to forecast future demand and plan inventory accordingly.
7. Improved Decision-Making:
- By providing access to accurate, up-to-date, and well-organized data, OLAP helps decision-makers make better, more informed decisions. It allows them to base their decisions on facts rather than assumptions.
Slide 51
- Example: A manager can analyze customer data to understand buying behavior and make decisions about product pricing or promotions based on actual data insights.
8. Interactive and User-Friendly Interface:
- OLAP tools often come with easy-to-use interfaces that allow non-technical users to explore and analyze data without needing to write complex queries. This democratizes access to data and makes it easier for decision-makers across the business to use.
- Example: A marketing manager can create a report on customer segmentation by age and income level using drag-and-drop features, without needing help from the IT department.
9. Real-Time Analysis:
- Some OLAP systems support real-time data analysis, meaning businesses can make decisions based on the most current data available. This is particularly important in fast-moving industries where up-to-date information is crucial.
- Example: In an e-commerce business, decision-makers can monitor live sales data during a promotion and adjust strategies on the go if necessary.
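The slice, dice, roll-up, and drill-down operations listed above can be illustrated with a tiny in-memory cube in Python. This is only a sketch: the dimension names, the sample fact rows, and the helper functions (aggregate, slice_cube, dice_cube) are hypothetical and do not correspond to the API of any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, category, sales).
DIMENSIONS = ("year", "quarter", "region", "category")
facts = [
    (2023, "Q1", "East", "Electronics", 100.0),
    (2023, "Q1", "West", "Electronics", 80.0),
    (2023, "Q2", "East", "Furniture", 50.0),
    (2023, "Q2", "West", "Electronics", 70.0),
    (2024, "Q1", "East", "Electronics", 120.0),
    (2024, "Q1", "East", "Furniture", 60.0),
]


def aggregate(rows, by):
    """Sum the sales measure, grouped by the given dimensions."""
    positions = [DIMENSIONS.index(dim) for dim in by]
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[pos] for pos in positions)] += row[-1]
    return dict(totals)


def slice_cube(rows, dim, value):
    """Slice: fix a single dimension to one value."""
    pos = DIMENSIONS.index(dim)
    return [row for row in rows if row[pos] == value]


def dice_cube(rows, **allowed):
    """Dice: keep rows whose values fall in the allowed set on several dimensions."""
    return [
        row for row in rows
        if all(row[DIMENSIONS.index(dim)] in values for dim, values in allowed.items())
    ]


# Roll-up: summarize to yearly totals.
yearly = aggregate(facts, ["year"])
# Drill-down: break the yearly totals into quarters.
quarterly = aggregate(facts, ["year", "quarter"])
# Slice: only the East region, viewed by product category.
east_by_category = aggregate(slice_cube(facts, "region", "East"), ["category"])
# Dice: Electronics sales in 2023 for both regions, viewed by region.
electronics_2023 = aggregate(
    dice_cube(facts, year={2023}, category={"Electronics"}, region={"East", "West"}),
    ["region"],
)

print(yearly)
print(quarterly)
print(east_by_category)
print(electronics_2023)
```

Real OLAP engines typically answer such requests from precomputed aggregates rather than by rescanning the fact rows, which is where the fast query performance in benefit 2 comes from.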
How OLAP Enhances Data Insights:
- Consolidates Data: OLAP integrates data from various sources (sales, marketing, finance, etc.) into a single platform, providing a comprehensive view of the business.
- Identifies Hidden Patterns: By analyzing data from different perspectives and at various levels of detail, OLAP helps uncover hidden trends and patterns that might not be visible in raw data.
- Supports Predictive Analysis: Historical data stored in OLAP systems can be used for forecasting and predicting future trends, helping businesses to anticipate market changes.
- Customization of Reports: OLAP allows users to create custom reports and dashboards tailored to specific business needs, ensuring that the insights are relevant to the questions being asked.