Data Science Unit 01 PPT - SPPU Sem 6.pdf

ThejasviniBoorla 15 views 85 slides Mar 11, 2025

About This Presentation

Data Science unit 1 - SPPU


Slide Content

DATA SCIENCE
UNIT -I

PRACTICAL

COURSE OUTCOMES

Types of Data Science Jobs
Learning data science opens up a variety of job roles in this domain.
The main job roles are given below:
1. Data Scientist
2. Data Analyst
3. Data Engineer
4. Data Architect
5. Data Administrator
6. Business Analyst
7. Business Intelligence Manager

UNIT 1: INTRODUCTION TO DATA SCIENCE
◼ Basics and need of Data Science, Applications of Data Science,
◼ Relationship between Data Science and Information Science,
◼ Business intelligence versus Data Science,
◼ Data: Data Types, Data Collection.
◼ Need of Data wrangling, Methods: Data Cleaning, Data Integration,
Data Reduction, Data Transformation, and Data Discretization.

DATA SCIENCE
◼ Data Science is the area of study which involves extracting insights
from vast amounts of data using various scientific methods,
algorithms, and processes.
◼ It helps you to discover hidden patterns from the raw data.
◼ The term Data Science has emerged because of the evolution of
mathematical statistics, data analysis, and big data.
◼ Data Science is a field that allows you to extract knowledge from
structured or unstructured data.

DATA SCIENCE

Need of Data Science
1. Informed Decision Making:
◼ Enables organizations to make data-driven decisions for better outcomes.
2. Big Data Handling:
◼ Extracts insights from vast and complex datasets, which traditional methods
struggle to manage.
3. Business Innovation:
◼ Drives innovation, leading to new opportunities and maintaining a competitive
advantage.

Need of Data Science
4. Operational Efficiency:
◼ Enhances operational efficiency by optimizing processes, reducing costs, and
improving productivity.
5. Personalization and Customer Experience:
◼ Utilizes data for personalized products, services, and marketing, improving
customer experience.
6. Risk Management and Fraud Detection:
◼ Plays a vital role in risk assessment, fraud detection, and credit scoring.

Need of Data Science
7. Healthcare Advancements:
◼ Contributes to advancements in healthcare through data analysis, personalized
medicine, and disease prediction.
8. Scientific Research:
◼ Accelerates scientific discoveries by analyzing large datasets in various research
fields.
9. Supply Chain Optimization:
◼ Optimizes supply chain processes, leading to inventory efficiency and cost
savings.
10. Smart Cities and Urban Planning:
◼ Facilitates the development of smart cities through data analysis for better
urban planning and resource management.

APPLICATIONS
1. Healthcare:
◼ Predictive Analytics: Predicting patient outcomes and identifying potential
health risks.
◼ Disease Surveillance: Monitoring and predicting the spread of diseases.
◼ Personalized Medicine: Tailoring treatment plans based on individual patient
data.
2. Finance:
◼ Fraud Detection: Identifying unusual patterns to detect and prevent fraudulent
activities.
◼ Credit Scoring: Assessing creditworthiness of individuals and businesses.
◼ Algorithmic Trading: Using algorithms to make investment decisions.

APPLICATIONS
3. Retail:
◼ Recommendation Systems: Providing personalized product recommendations
to customers.
◼ Inventory Management: Optimizing stock levels based on demand forecasts.
◼ Customer Segmentation: Dividing customers into groups for targeted
marketing.
4. Manufacturing:
◼ Predictive Maintenance: Anticipating equipment failures to reduce downtime.
◼ Quality Control: Analyzing production data to ensure product quality.
◼ Supply Chain Optimization: Streamlining logistics and inventory management.

APPLICATIONS
5. Telecommunications:
◼ Churn Prediction: Identifying customers at risk of leaving the service.
◼ Network Optimization: Enhancing network performance and efficiency.
◼ Customer Experience Analysis: Improving services based on customer feedback
and behavior.
6. Marketing:
◼ Customer Segmentation: Dividing customers into groups for targeted campaigns.
◼ Social Media Analytics: Analyzing social media data to understand customer
sentiment.

APPLICATIONS
7. Education:
◼ Adaptive Learning Platforms: Personalizing educational content based on student
performance.
◼ Student Retention: Identifying factors influencing student dropout rates.
◼ Performance Analysis: Analyzing data to improve teaching methodologies.
8. Energy:
◼ Predictive Maintenance for Infrastructure: Monitoring and maintaining energy
infrastructure.
◼ Smart Grids: Optimizing energy distribution for efficiency.
◼ Energy Consumption Forecasting: Predicting future energy demand.

APPLICATIONS
9. Transportation:
◼ Route Optimization: Finding the most efficient routes for vehicles.
◼ Demand Forecasting: Predicting transportation demand for better planning.
◼ Traffic Management: Analyzing traffic patterns for improved city planning.
10. Government:
◼ Crime Prediction: Predicting areas with high likelihood of criminal activities.
◼ Public Health Monitoring: Tracking and managing public health crises.
◼ Policy Planning: Analyzing data for evidence-based policy decision-making.

Relationship between Data Science and Information Science
◼ Data Science and Information Science are closely related fields but focus on
different aspects of handling and utilizing information.
◼ Data Science:
◼ Primarily deals with extracting insights and knowledge from structured and
unstructured data. It involves a combination of statistics, machine learning, and
domain expertise to analyze and interpret data.
◼ Information Science:
◼ Focuses on the organization, classification, retrieval, and dissemination of
information. It encompasses a broader view, including the study of information
systems, knowledge management, and the design of information structures.

Scope
◼ Data Science: Primarily focuses on the extraction of knowledge and insights
from data to support decision-making and predictions.
◼ Information Science: Encompasses a broader spectrum, including the study of
information processes, systems, and the effective use of information in various
domains.

Methods and Techniques
◼ Data Science: Utilizes statistical analysis, machine learning algorithms, data
modeling, and programming to extract patterns and insights from data.
◼ Information Science: Emphasizes the organization and retrieval of information,
often involving the design and management of databases, information systems,
and knowledge repositories.

Data vs Information
◼ Data is something raw and, on its own, meaningless: an object that, when
analyzed or converted to a useful form, becomes information.
◼ Information is also defined as “data that are endowed with meaning and
purpose.”
◼ For example, the number “480,000” is a data point. But when we add an
explanation that it represents the number of deaths per year in the USA from
cigarette smoking, it becomes information.

Applications: Information Science
1. Library Science: Organization, cataloging, and classification.
2. Information Retrieval Systems: Search engine development, algorithm design.
3. Database Management: Creation, maintenance, and optimization.
4. Knowledge Management: Capture, organization, and distribution.
5. Information Architecture: User-friendly structure design.
6. Digital Asset Management: Organization of digital media assets.
7. Document Management: Tracking, access control, versioning.
8. Health Information Management: Patient record organization.
9. Records Management: Lifecycle management and compliance.
10. Digital Preservation: Strategies for preserving digital content.

Overlap
◼ There is an overlap between Data Science and Information Science, especially in
areas such as information retrieval, data management, and the development of
information systems.

Business Intelligence vs Data Science
◼ Business intelligence (BI) is a set of strategies and technologies enterprises use to
analyze business information and transform it into actionable insights that inform
strategic and tactical business decisions.
◼ BI tools access and analyze data sets and present analytical findings in reports,
summaries, dashboards, graphs, charts, and maps to provide users with detailed
intelligence about the state of the business.

Business Intelligence versus Data Science
Factors | Business Intelligence | Data Science
Concept | It is a collection of processes, tools, and technologies that help a business with data analysis. | It consists of mathematical and statistical models used for processing the data, discovering hidden patterns, and predicting future actions based on those patterns.
Data | It deals mainly with structured data. | It accepts both structured and unstructured data.
Flexibility | Data sources should be planned before the visualization. | Data sources can be added anytime based on the requirements.
Approach | It has both statistical and visual approaches toward data analysis. | Graph analysis, NLP, machine learning, neural networks, and other methods can be used to process the data.
Expertise | It is made for business users to visualize raw business information without any technical knowledge. | It requires sound knowledge of data analysis and programming.
Complexity | For a single user, compared to data science, business intelligence is much simpler to use and visualize data. | Data science is much more complex when compared to business intelligence.
Tools | Business intelligence tools include MS Excel, Power BI, SAS BI, MicroStrategy, IBM Cognos, Throughput, and more. | Some of the most popular Data Science tools are Python, Hadoop, Spark, R, TensorFlow, and more.

Data
1. Data Types
a. Structured data
b. Unstructured data
2. Data Collection
a. Open Data
b. Social Media Data
c. Multimodal Data
d. Data Storage and Presentation

Data Types: Structured data
1. Structured data is the most important data type.
2. It is highly organized information that can be seamlessly included in a database
and readily searched via simple search operations.
3. Someone would have to collect, store, and present the data in such a format.

Example

Data Types: Unstructured Data
◼ Unstructured data is data that does not have a pre-defined data model or format,
making it less organized and more challenging to analyze using traditional methods.
◼ Examples of unstructured data include text documents, emails, social media posts,
images, videos, audio recordings, etc.
◼ Challenges with Unstructured Data:
◼ The lack of structure makes compiling and organizing unstructured data a
time- and energy-consuming task.
◼ Structured data, by contrast, is akin to machine language, in that it makes
information much easier for computers to parse.

Data Collection: Open Data
a. Open Data
◼ Data should be freely available in a public domain.
◼ Can be used by anyone as they wish, without restrictions from copyright,
patents, or other mechanisms of control.
◼ List of principles associated with open data:
1. Public
2. Accessible
3. Described
4. Reusable
5. Complete
6. Timely

Data Collection: Social Media Data & Multimodal Data
b. Social Media Data
◼ Social media has become a gold mine for collecting data to analyze for research
or marketing purposes.
◼ This is facilitated by the Application Programming Interface (API) that social
media companies provide to researchers and developers.
c. Multimodal Data
◼ Multimodal data refers to data that involves multiple modes or types of
information.
◼ Each mode represents a distinct form of data, such as text, images, audio, video,
or other types, and these modes are often interconnected.

Data Collection: Multimodal Data
1. Text: Involves written or spoken language, such as documents, articles, transcripts,
and textual information.
2. Image: Represents visual content, including photographs, graphics, and other visual
elements.
3. Audio: Involves sound or spoken words, captured in audio files, recordings, or
other formats.
4. Video: Integrates moving images and audio, often depicting dynamic scenes or
events.
5. Sensor Data: Captures information from various sensors, such as temperature
sensors, accelerometers, or environmental sensors.
6. Geospatial Data: Involves location-based information, including maps, GPS
coordinates, and spatial data.

Data Storage and Presentation
◼ Depending on its nature, data is stored in various formats.
◼ If data is structured, it is common to store and present it in some kind of delimited
way.
◼ That means various fields and values of the data are separated using delimiters, such
as commas or tabs.
◼ Data Formats:
1. CSV (Comma-Separated Values)
2. TSV (Tab-Separated Values)
3. XML (eXtensible Markup Language)
4. RSS (Really Simple Syndication)
5. JSON (JavaScript Object Notation)

CSV (Comma-Separated Values)
◼ CSV (Comma-Separated Values) format is the most common import and export
format for spreadsheets and databases.
◼ For example, Depression.csv is a dataset that is available at UF Health, UF
Biostatistics for downloading.
◼ The dataset represents the effectiveness of different treatment procedures on
separate individuals with clinical depression.
◼ Any spreadsheet program such as Microsoft Excel or Google Sheets can readily
open a CSV file and display it correctly most of the time.

CSV File Format
treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3
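
As a rough illustration of why a delimited format is convenient, the sketch below reads the rows shown above with Python's standard csv module and averages the diff column for each treatment. The rows are embedded as a string only to keep the example self-contained, and the helper name mean_diff_by_treatment is made up for this sketch.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the Depression.csv excerpt shown on the slide.
CSV_TEXT = """treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3
"""


def mean_diff_by_treatment(text):
    """Return the mean of the 'diff' column for each value of 'treat'."""
    groups = defaultdict(list)
    # DictReader uses the first line as the header, so fields are read by name.
    for row in csv.DictReader(io.StringIO(text)):
        groups[row["treat"]].append(int(row["diff"]))
    return {treat: sum(values) / len(values) for treat, values in groups.items()}


means = mean_diff_by_treatment(CSV_TEXT)
for treat, value in means.items():
    print(f"{treat}: {value:.2f}")
```

With a real file, io.StringIO(text) would be replaced by an opened file object.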

TSV (Tab-Separated Values)
◼TSV (Tab-Separated Values) files are used for raw data and can be imported into
and exported from spreadsheet software.
◼Tab-separated values files are essentially text files, and the raw data can be viewed
by text editors, though such files are often used when moving raw data between
spreadsheets.
Name<TAB>Age<TAB>Address
Ryan<TAB>33<TAB>1115 W Franklin
Paul<TAB>25<TAB>Big Farm Way
Jim<TAB>45<TAB>W Main St
Samantha<TAB>32<TAB>28 George St
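
The same csv module can read tab-separated data when told which delimiter to use. This is a minimal sketch that assumes the sample rows are Ryan, Paul, Jim, and Samantha with one address each, and that <TAB> stands for a real tab character.

```python
import csv
import io

# The slide's sample table, with <TAB> written as an actual tab character.
TSV_TEXT = (
    "Name\tAge\tAddress\n"
    "Ryan\t33\t1115 W Franklin\n"
    "Paul\t25\tBig Farm Way\n"
    "Jim\t45\tW Main St\n"
    "Samantha\t32\t28 George St\n"
)

# delimiter="\t" is the only change needed compared with reading CSV.
rows = list(csv.DictReader(io.StringIO(TSV_TEXT), delimiter="\t"))
ages = {row["Name"]: int(row["Age"]) for row in rows}
print(ages)
```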

XML (eXtensible Markup Language)
◼ XML (eXtensible Markup Language) was designed to be both human and machine
readable, and can thus be used to store and transport data.
◼ In the real world, computer systems and databases contain data in incompatible
formats.
◼ As the XML data is stored in plain text format, it provides a software and hardware
independent way of storing data.
◼ This makes it much easier to create data that can be shared by different
applications.

XML File Format
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="information science" cover="hardcover">
    <title lang="en">Social Information Seeking</title>
    <author>Chirag Shah</author>
    <year>2017</year>
    <price>62.58</price>
  </book>
  <book category="data science" cover="paperback">
    <title lang="en">Hands-On Introduction to Data Science</title>
    <author>Chirag Shah</author>
    <year>2019</year>
    <price>50.00</price>
  </book>
</bookstore>
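
One way to see the "machine readable" side of XML is to parse the bookstore document with Python's standard xml.etree.ElementTree module. The sketch below assumes the document uses plain straight quotes around attribute values; parse_books is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="information science" cover="hardcover">
    <title lang="en">Social Information Seeking</title>
    <author>Chirag Shah</author>
    <year>2017</year>
    <price>62.58</price>
  </book>
  <book category="data science" cover="paperback">
    <title lang="en">Hands-On Introduction to Data Science</title>
    <author>Chirag Shah</author>
    <year>2019</year>
    <price>50.00</price>
  </book>
</bookstore>
"""


def parse_books(xml_text):
    """Turn each <book> element into a plain dictionary."""
    # Parsing bytes lets the parser honour the encoding named in the declaration.
    root = ET.fromstring(xml_text.encode("utf-8"))
    books = []
    for book in root.findall("book"):
        books.append({
            "category": book.get("category"),   # attribute value
            "title": book.findtext("title"),    # child element text
            "year": int(book.findtext("year")),
            "price": float(book.findtext("price")),
        })
    return books


books = parse_books(XML_TEXT)
for entry in books:
    print(entry)
```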

RSS & JSON
◼ RSS (Really Simple Syndication)
◼ It is a format used to share data between services, and it is defined in
XML 1.0.
◼ JSON (JavaScript Object Notation)
◼ It is a lightweight data-interchange format.
◼ It is not only easy for humans to read and write, but also easy for machines to
parse and generate. It is based on a subset of the JavaScript Programming
Language.
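
To make the JSON description concrete, here is a small sketch using Python's standard json module. The record is a hypothetical one, loosely modelled on a book entry, chosen only for illustration.

```python
import json

# A hypothetical record; keys and values are illustrative only.
book = {
    "title": "Hands-On Introduction to Data Science",
    "author": "Chirag Shah",
    "year": 2019,
    "price": 50.0,
    "tags": ["data science", "paperback"],
}

# Serialize to JSON text (easy for humans to read) ...
text = json.dumps(book, indent=2)
print(text)

# ... and parse it back (easy for machines to parse).
restored = json.loads(text)
```

The round trip returns an equal dictionary, which is what makes JSON convenient for exchanging data between programs.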

Data Pre-processing
◼ Data in the real world is often dirty; that is, it is in need of being cleaned up before
it can be used for a desired purpose. This is often called data pre-processing.
◼ Here are some of the factors that indicate that data is not clean or ready to
process:
◼ Incomplete:
◼ When some of the attribute values are lacking, certain attributes of interest
are lacking, or attributes contain only aggregate data.
◼ Noisy:
◼ When data contains errors or outliers. For example, some of the data
points in a dataset may contain extreme values that can severely affect the
dataset’s range.

Data Pre-processing
◼ Inconsistent:
◼ Data contains discrepancies in codes or names.
◼ For example, if the “Name” column for registration records of employees
contains values other than alphabetical letters, or if records do not start with a
capital letter, discrepancies are present.
◼ The most important tasks involved in data pre-processing are:
◼ Data Cleaning
◼ Data Integration
◼ Data Transformation
◼ Data Reduction
◼ Data Discretization

Data Cleaning
◼ Since there are several reasons why data could be “dirty,” there are just as
many ways to “clean” it.
◼ There are three key methods that describe ways in which data may be
“cleaned,” or better organized, or scrubbed of potentially incorrect, incomplete,
or duplicated information.
1. Data Munging
2. Handling Missing Data
3. Smooth Noisy Data

Data Munging
◼ Often, the data is not in a format that is easy to work with.
◼ For example, it may be stored or presented in a way that is hard to process.
◼ Thus, we need to convert it to something more suitable for a computer to
understand.
◼ To accomplish this, there is no specific scientific method.
◼ The approaches to take are all about manipulating or wrangling (or munging) the
data to turn it into something that is more convenient or desirable.
◼ This can be done manually, automatically, or, in many cases, semi-automatically.
◼ Consider the following text recipe.
◼ “Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix.”
◼ This can be turned into a table.
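
As a sketch of what a semi-automatic munging step might look like, the code below turns the recipe sentence into table rows. The column names (ingredient, quantity, unit) and the parsing rules are assumptions made for this illustration; they fit this one sentence and are not a general recipe parser.

```python
import re

RECIPE = "Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix."

# Assumed mapping from quantity words to numbers (only what this sentence needs).
NUMBER_WORDS = {"a": 1, "one": 1, "two": 2, "three": 3}


def recipe_to_rows(text):
    """Split the sentence into one row per ingredient."""
    body = re.sub(r"^Add\s+", "", text)            # drop the leading verb
    body = re.sub(r"\s+in the mix\.?$", "", body)  # drop the trailing phrase
    rows = []
    for part in re.split(r",\s*(?:and\s+)?", body):
        words = part.split()
        quantity = NUMBER_WORDS[words[0].lower()]
        if "of" in words:
            # e.g. "three cloves of garlic": unit before "of", ingredient after
            idx = words.index("of")
            unit = " ".join(words[1:idx])
            ingredient = " ".join(words[idx + 1:])
        else:
            # e.g. "two diced tomatoes": last word is the ingredient
            unit = " ".join(words[1:-1])
            ingredient = words[-1]
        rows.append({"ingredient": ingredient, "quantity": quantity, "unit": unit})
    return rows


rows = recipe_to_rows(RECIPE)
for row in rows:
    print(row)
```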

Handling Missing Data
◼ Sometimes data may be in the right format, but some of the values are missing.
◼ Consider a table containing customer data in which some of the home phone
numbers are absent. This could be due to the fact that some people do not have
home phones; instead they use their mobile phones as their primary or only phone.
◼ Other times data may be missing due to problems with the process of collecting
data, or an equipment malfunction. Or, comprehensiveness may not have been
considered important at the time of collection.
◼ Furthermore, some data may get lost due to system or human error while storing
or transferring the data.
◼ Strategies to combat missing data include ignoring that record, using a global
constant to fill in all missing values, imputation, inference-based solutions (Bayesian
formula or a decision tree), etc.
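
Three of the strategies listed above (ignoring the record, filling with a global constant, and imputation) can be sketched in a few lines. The customer records and helper names below are hypothetical, and mean imputation is just one common form of imputation.

```python
from statistics import mean

# Hypothetical customer table; None marks a missing value.
CUSTOMERS = [
    {"name": "Ryan", "age": 30, "home_phone": "555-0100"},
    {"name": "Paul", "age": None, "home_phone": None},
    {"name": "Jim", "age": 40, "home_phone": None},
    {"name": "Samantha", "age": 50, "home_phone": "555-0103"},
]


def drop_incomplete(rows, field):
    """Strategy 1: ignore records where the field is missing."""
    return [r for r in rows if r[field] is not None]


def fill_constant(rows, field, constant):
    """Strategy 2: replace every missing value with one global constant."""
    return [{**r, field: constant if r[field] is None else r[field]} for r in rows]


def impute_mean(rows, field):
    """Strategy 3: impute missing numeric values with the mean of observed ones."""
    observed = [r[field] for r in rows if r[field] is not None]
    return fill_constant(rows, field, mean(observed))


with_phone = drop_incomplete(CUSTOMERS, "home_phone")
phones_filled = fill_constant(CUSTOMERS, "home_phone", "unknown")
ages_imputed = impute_mean(CUSTOMERS, "age")
print([r["age"] for r in ages_imputed])
```

Which strategy is appropriate depends on why the values are missing; dropping records is simplest but discards data.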

Smooth Noisy Data
◼ There are times when the data is not missing, but it is corrupted for some reason.
◼ This is, in some ways, a bigger problem than missing data.
◼ Data corruption may be a result of faulty data collection instruments, data entry
problems, or technology limitations.
◼ For example, a digital thermometer measures temperature to one decimal place
(e.g., 70.1°F), but the storage system drops the decimal part.
◼ So, now we have 70.1°F and 70.9°F both stored as 70°F. This may not seem like a
big deal, but for humans a 99.4°F temperature means you are fine, and 99.8°F
means you have a fever, and if our storage system represents both of them as 99°F,
then it fails to differentiate between healthy and sick persons!

Smooth Noisy Data
◼ There is no one way to remove noise, or smooth out the noisiness in the data.
◼ However, there are some steps to try. First, you should identify or remove
outliers.
◼ For example, records of previous students who sat for a data science examination
show all students scored between 70 and 90 points, barring one student who
received just 12 points.
◼ It is safe to assume that the last student’s record is an outlier (unless we have a
reason to believe that this anomaly is really an unfortunate case for a student!).
◼ Second, you could try to resolve inconsistencies in the data.
◼ For example, all entries of customer names in the sales data should follow the
convention of capitalizing all letters, and you could easily correct them if they are
not.

Smooth Noisy Data
◼ Simple Moving Average: Averages a set of data points within a specified
window, providing a smoothed representation of the underlying trend.
◼ Exponential Moving Average: Gives more weight to recent data points,
smoothing a time series while emphasizing the most recent trends.
◼ Z-Score: Measures how many standard deviations a data point is from the mean,
identifying outliers based on their deviation from the average.
◼ Inter Quartile Range (IQR): Defines a range between the first and third
quartiles, detecting outliers based on values outside this range.
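
The two moving averages can be sketched in plain Python as follows. The sample series, the window size of 3, and the smoothing factor alpha = 0.5 are arbitrary choices for illustration; the exponential form used here is the common recursive one, s_t = alpha * x_t + (1 - alpha) * s_(t-1), seeded with the first value.

```python
def simple_moving_average(values, window):
    """Average each run of `window` consecutive points."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def exponential_moving_average(values, alpha):
    """Weight recent points more heavily; alpha in (0, 1] controls how much."""
    smoothed = [values[0]]  # seed with the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical noisy series with one spike (30).
series = [10, 12, 11, 30, 12, 11]
sma = simple_moving_average(series, window=3)
ema = exponential_moving_average(series, alpha=0.5)
print(sma)
print(ema)
```

Both outputs pull the spike toward its neighbours; a larger window or a smaller alpha smooths more strongly.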

Z-Score
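
The z-score of a value x is commonly written z = (x - mean) / standard deviation. The sketch below applies it to hypothetical exam scores like those in the earlier example (most between 70 and 90, one at 12). The cut-off of 2 standard deviations is an assumption; 3 is also widely used, and with so few points a single extreme value inflates the standard deviation enough that a cut-off of 3 would miss it.

```python
from statistics import mean, pstdev

# Hypothetical exam scores: eight between 70 and 90, plus one score of 12.
SCORES = [72, 75, 78, 80, 82, 85, 88, 90, 12]


def z_scores(values):
    """Return (x - mean) / population standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical: nothing deviates
    return [(v - mu) / sigma for v in values]


def z_outliers(values, threshold=2.0):
    """Flag values whose absolute z-score exceeds the (assumed) threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


outliers = z_outliers(SCORES)
print(outliers)
```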

IQR
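
The interquartile range is IQR = Q3 - Q1. A widely used convention, assumed here rather than taken from the slides, treats values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as outliers. The sketch uses statistics.quantiles from the Python standard library on the same hypothetical exam scores.

```python
from statistics import quantiles

# Hypothetical exam scores: eight between 70 and 90, plus one score of 12.
SCORES = [72, 75, 78, 80, 82, 85, 88, 90, 12]


def iqr_outliers(values, k=1.5):
    """Return the (low, high) fences and the values falling outside them."""
    # n=4 yields the three quartile cut points; "inclusive" treats the data
    # as the whole population rather than a sample.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return (low, high), [v for v in values if v < low or v > high]


fences, outliers = iqr_outliers(SCORES)
print(fences, outliers)
```

Because quartiles are based on ranks, the fences are much less affected by the extreme value than the mean and standard deviation are.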

IQR

Data Integration
◼ To be as efficient and effective for various data analyses as possible, data from
various sources commonly needs to be integrated.
◼ The following steps describe how to integrate multiple databases or files.
1. Combine data from multiple sources into a coherent storage place (e.g., a single
file or a database).
2. Engage in schema integration, or the combining of metadata from different
sources.

Data Integration
3. Detect and resolve data value conflicts. For example:
a. A conflict may arise when different sources hold different attributes and
values for the same real-world entity.
b. Reasons for this conflict could be different representations or different
scales; for example, metric vs. British units.

Data Integration
4. Address redundant data in data integration. Redundant data is commonly
generated in the process of integrating multiple databases. For example:
a. The same attribute may have different names in different databases.
b. One attribute may be a “derived” attribute in another table; for example,
annual revenue.
c. Correlation analysis may detect instances of redundant data.
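
The integration steps above can be sketched on two tiny, hypothetical sources that describe the same customers under different attribute names (cust_id vs. customer_id) and in different units (centimetres vs. inches). All field names and values are invented for illustration.

```python
INCH_TO_CM = 2.54

# Hypothetical source A: metric units, key named "cust_id".
SOURCE_A = [
    {"cust_id": 1, "name": "Ryan", "height_cm": 180.0},
    {"cust_id": 2, "name": "Paul", "height_cm": 170.0},
]
# Hypothetical source B: inches, key named "customer_id".
SOURCE_B = [
    {"customer_id": 2, "height_in": 67.0, "city": "Pune"},
    {"customer_id": 3, "height_in": 70.0, "city": "Mumbai"},
]


def integrate(a_rows, b_rows):
    """Merge both sources into one table keyed by customer_id."""
    merged = {}
    for row in a_rows:
        # Schema integration: map "cust_id" onto the shared name "customer_id".
        merged[row["cust_id"]] = {
            "customer_id": row["cust_id"],
            "name": row["name"],
            "height_cm": row["height_cm"],
        }
    for row in b_rows:
        target = merged.setdefault(row["customer_id"], {"customer_id": row["customer_id"]})
        target["city"] = row["city"]
        # Value conflict: prefer source A's height when present; otherwise
        # convert inches to centimetres so one scale is used throughout.
        target.setdefault("height_cm", round(row["height_in"] * INCH_TO_CM, 1))
    return [merged[key] for key in sorted(merged)]


integrated = integrate(SOURCE_A, SOURCE_B)
for record in integrated:
    print(record)
```

Preferring source A on conflict is a policy choice made for this sketch; a real pipeline would need an explicit rule for which source to trust.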

Data Transformation
◼ Data must be transformed so it is consistent and readable (by a system). The
following five processes may be used for data transformation.
1. Smoothing: Remove noise from data.
2. Aggregation: Summarization, data cube construction.
3. Generalization: Concept hierarchy climbing.

Data Transformation
4. Normalization: Values are scaled to fall within a small, specified range.
Some of the techniques that are used for accomplishing normalization are:
a. Min–max normalization.
b. Z-score normalization.
c. Normalization by decimal scaling.
5. Attribute or feature construction: New attributes constructed from the
given ones.

Aggregation
◼ Aggregation is a process in data analysis that involves combining and summarizing
data from multiple sources or rows into a single value.
◼ It is often used for creating summary statistics, constructing data cubes, or deriving
insights from large datasets.
◼ There are various aggregation functions that can be applied depending on the
nature of the data and the desired summary.
◼ Two common techniques involving aggregation are summarization and data cube
construction.

Aggregation: Summarization
◼ Summarization is a specific form of aggregation where the goal is to provide a
condensed overview of key characteristics in a dataset.
◼ Summary statistics, such as mean, median, mode, range, and standard deviation, are
often used to capture essential features of the data.
◼ Summarization is crucial for understanding the central tendencies and variability
within a dataset.
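
The summary statistics named above are all available in, or easily built from, Python's standard statistics module. The data values below are hypothetical, and the sketch uses the population standard deviation (pstdev); statistics.stdev would give the sample version instead.

```python
from statistics import mean, median, mode, pstdev

# Hypothetical measurements.
DATA = [2, 4, 4, 4, 5, 5, 7, 9]


def summarize(values):
    """Condense a list of numbers into a handful of summary statistics."""
    return {
        "mean": mean(values),                # central tendency
        "median": median(values),            # middle value, robust to outliers
        "mode": mode(values),                # most frequent value
        "range": max(values) - min(values),  # spread between the extremes
        "stdev": pstdev(values),             # population standard deviation
    }


summary = summarize(DATA)
print(summary)
```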

Aggregation: Data Cube Construction
◼ Definition: A data cube is a multi-dimensional representation of data that allows for the
analysis of information along multiple dimensions.
◼ Dimensions: Data cubes have dimensions, which are the categorical variables along which
data is analyzed (e.g., time, geography, product).
◼ Measures: Measures are the numeric values being analyzed (e.g., sales, revenue).
◼ Aggregation along Dimensions: Data cubes involve aggregating measures along
different dimensions to provide a comprehensive view of the data.
◼ OLAP (Online Analytical Processing): Data cubes are often associated with OLAP
systems, where users can interactively explore and analyze data in a multidimensional way.

Data Cube Operations: Roll-Up
◼ The roll-up operation (also known as drill-up or aggregation operation) performs
aggregation on a data cube, either by climbing up a concept hierarchy or by
dimension reduction.
◼ Roll-up is like zooming out on the data cube.

Data Cube Operations: Drill-Down
◼ The drill-down operation (also called roll-down) is the reverse operation of roll-up.
◼ Drill-down is like zooming in on the data cube.
◼ It navigates from less detailed data to more detailed data.

Data Cube Operations: Slice & Dice
◼ A slice is a subset of the cube corresponding to a single value for one or more
members of a dimension.
◼ For example, a slice operation is executed when the customer wants a selection on
one dimension of a three-dimensional cube, resulting in a two-dimensional slice.
◼ The dice operation describes a subcube by performing a selection on two or more
dimensions.

Data Cube Operations: Pivot
◼ The pivot operation is also called a rotation.
◼ Pivot is a visualization operation that rotates the data axes in view to provide an
alternative presentation of the data.
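
The cube operations can be illustrated without an OLAP system by treating a list of rows as the cells of a small cube. This is a minimal sketch: the fact table (quarter, city, product, sales) and the function names are hypothetical, and real OLAP engines work quite differently internally.

```python
from collections import defaultdict

# Hypothetical fact table: each row is one cell of a (quarter, city, product) cube.
FACTS = [
    {"quarter": "Q1", "city": "Pune", "product": "Phone", "sales": 10},
    {"quarter": "Q1", "city": "Pune", "product": "Laptop", "sales": 5},
    {"quarter": "Q1", "city": "Mumbai", "product": "Phone", "sales": 7},
    {"quarter": "Q2", "city": "Pune", "product": "Phone", "sales": 12},
    {"quarter": "Q2", "city": "Mumbai", "product": "Laptop", "sales": 4},
]


def roll_up(rows, keep):
    """Sum sales over every dimension not listed in `keep` (dimension reduction)."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in keep)] += r["sales"]
    return dict(totals)


def slice_cube(rows, dimension, value):
    """Slice: fix a single value on one dimension."""
    return [r for r in rows if r[dimension] == value]


def dice_cube(rows, selections):
    """Dice: select allowed values on two or more dimensions."""
    return [r for r in rows if all(r[d] in allowed for d, allowed in selections.items())]


def pivot(rows, row_dim, col_dim):
    """Pivot: lay out totals with one dimension as rows and another as columns."""
    table = defaultdict(dict)
    for (row_key, col_key), total in roll_up(rows, [row_dim, col_dim]).items():
        table[row_key][col_key] = total
    return dict(table)


by_city = roll_up(FACTS, ["city"])
q1_slice = slice_cube(FACTS, "quarter", "Q1")
pune_phones = dice_cube(FACTS, {"city": {"Pune"}, "product": {"Phone"}})
city_by_quarter = pivot(FACTS, "city", "quarter")
print(by_city, city_by_quarter, sep="\n")
```

Calling pivot(FACTS, "quarter", "city") instead swaps rows and columns, which is the rotation the pivot operation refers to. Drill-down is the reverse of roll_up: keeping more dimensions returns finer-grained totals.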

Pivot

Generalization: Concept Hierarchy Climbing.
◼ Generalization is the process of summarizing detailed and specific data into more
abstract and generalized forms. It involves moving up in the hierarchy to a higher
level of abstraction (that is, less detail).
◼ Concept hierarchy climbing is the process of navigating up the levels of a concept
hierarchy to access more general or summarized data.
◼ Example: Starting from daily sales data, climbing the concept hierarchy would
involve aggregating to monthly, quarterly, and yearly levels.
◼ Allows users to view data at different levels of abstraction based on their analytical
needs.
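
The daily-to-yearly example can be sketched as follows. The sales figures and the ISO date keys are hypothetical; the function name climb is made up for this illustration.

```python
from collections import defaultdict

# Hypothetical daily sales keyed by ISO date (YYYY-MM-DD).
DAILY_SALES = {
    "2024-01-05": 100,
    "2024-01-20": 150,
    "2024-02-03": 200,
    "2024-04-11": 50,
}


def climb(daily, level):
    """Aggregate daily totals up to the 'month', 'quarter', or 'year' level."""
    def key(date):
        year, month, _day = date.split("-")
        if level == "month":
            return f"{year}-{month}"
        if level == "quarter":
            return f"{year}-Q{(int(month) - 1) // 3 + 1}"
        if level == "year":
            return year
        raise ValueError(f"unknown level: {level}")

    totals = defaultdict(int)
    for date, amount in daily.items():
        totals[key(date)] += amount
    return dict(totals)


monthly = climb(DAILY_SALES, "month")
quarterly = climb(DAILY_SALES, "quarter")
yearly = climb(DAILY_SALES, "year")
print(monthly, quarterly, yearly, sep="\n")
```

Each step up the hierarchy yields fewer, more general rows while the grand total stays the same.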

Min–max normalization
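
The slide's formula did not survive extraction. The standard definition of min–max normalization maps a value v to v' = (v - min) / (max - min) * (new_max - new_min) + new_min, which is what the sketch below assumes. The sample values and the default target range [0, 1] are illustrative choices.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined when all values are equal")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


# Hypothetical feature values.
values = [20, 30, 50, 100]
scaled = min_max_normalize(values)
rescaled = min_max_normalize(values, new_min=-1.0, new_max=1.0)
print(scaled)
print(rescaled)
```

Because the minimum and maximum define the scale, a single extreme value can compress all other values into a narrow band.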

Normalization by Decimal Scaling
◼ Decimal scaling is one method of normalization where the values are scaled by a power of 10.
◼ The goal is to bring all the values within a feature to a similar scale without changing their relative
proportions.
◼ The formula for decimal scaling normalization is:
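
The formula itself is missing from the extracted slide. The usual textbook definition, assumed here, is v' = v / 10^j, where j is the smallest non-negative integer such that the largest absolute scaled value is below 1. The sample values are hypothetical.

```python
def decimal_scale(values):
    """Divide every value by 10**j, with j the smallest integer making max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


# Hypothetical feature values ranging from -986 to 917.
j, scaled = decimal_scale([-986, 45, 917])
print(j, scaled)
```

Here the largest absolute value is 986, so j = 3 and every value is divided by 1000.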

Example

Data Reduction
◼ Data reduction is a key process in which a reduced representation of a dataset that
produces the same or similar analytical results is obtained.
◼ Two of the most common techniques used for data reduction are:
1. Data Cube Aggregation
2. Dimensionality Reduction

Dimensionality Reduction
◼ Dimensionality reduction is a technique used in machine learning and data analysis
to reduce the number of features or variables in a dataset.
◼ High-dimensional datasets, where the number of features is large, can pose
challenges such as increased computational complexity, overfitting, and difficulty in
visualization.
◼ Dimensionality reduction methods aim to capture the most important information
in the data while reducing its dimensionality.
◼ Methods: Principal Component Analysis (PCA), Autoencoders, t-Distributed
Stochastic Neighbor Embedding (t-SNE)

Overfitting
◼ Overfitting in machine learning is when a model learns the training data too well,
capturing noise and specific details that don't generalize to new data.
◼ It leads to high accuracy on training but poor performance on unseen data.

Data Discretization
◼ We are often dealing with data that are collected from processes that are
continuous, such as temperature, ambient light, and a company’s stock price.
◼ But sometimes we need to convert these continuous values into more manageable
parts.
◼ This mapping is called discretization.

Data Discretization
◼ There are three types of attributes involved in discretization:
1. Nominal: Values from an unordered set
2. Ordinal: Values from an ordered set
3. Continuous: Real numbers
◼ To achieve discretization, divide the range of continuous attributes into intervals.
◼ For instance, we could decide to split the range of temperature values into cold,
moderate, and hot, or the price of company stock into above or below its market
valuation.
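
The temperature example can be sketched with the standard bisect module. The cut points (60°F and 80°F) are assumptions picked for illustration; the slides do not say where cold ends or hot begins.

```python
from bisect import bisect_right

# Assumed interval boundaries in °F: below 60 is cold, 60 up to 80 is moderate,
# 80 and above is hot.
CUT_POINTS = [60.0, 80.0]
LABELS = ["cold", "moderate", "hot"]


def discretize(value):
    """Map a continuous temperature to the label of the interval it falls in."""
    # bisect_right counts how many cut points are <= value, which is the
    # index of the matching interval.
    return LABELS[bisect_right(CUT_POINTS, value)]


readings = [45.2, 59.9, 60.0, 72.5, 80.0, 99.8]
labels = [discretize(t) for t in readings]
print(list(zip(readings, labels)))
```

The resulting labels form an ordinal attribute: cold, moderate, and hot have a meaningful order even though the exact temperatures are no longer kept.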

Nominal Values
◼ In statistics and mathematics, nominal values are categorical data that represent
different categories or groups, but the order among these categories is not
meaningful.
◼ Nominal data can only be classified and cannot be ranked or ordered.
◼ Here's an example to illustrate nominal values:
◼ Example: Colors of Cars
◼ Consider a dataset that records the colors of cars in a parking lot. The colors are
nominal because they represent different categories, but there is no inherent order
among them.

Nominal Values
◼ The possible colors might include:
◼ Red
◼ Blue
◼ Green
◼ Black
◼ White
◼ In this case, you can categorize the cars based on their colors, but you cannot say
that one color is "greater" or "higher" than another in any meaningful way.
◼ The assignment of colors to the cars is arbitrary, and there is no inherent order or
ranking associated with the colors.

Ordinal Values
◼ In contrast to nominal data, ordinal values represent categories with a meaningful order or
ranking.
◼ However, the intervals between the values are not necessarily uniform or measurable.
◼ Here's an example to illustrate ordinal values:
◼ Imagine a survey that collects customer satisfaction ratings for a product or service. The
ratings are on a scale from 1 to 5:
1. Very Dissatisfied
2. Dissatisfied
3. Neutral
4. Satisfied
5. Very Satisfied

Ordinal Values
◼ In this case, the values have a clear order, with "Very Dissatisfied" being the lowest
level of satisfaction and "Very Satisfied" being the highest.
◼ However, the intervals between the satisfaction levels are not necessarily uniform
or quantifiable.
◼ The difference in satisfaction between "Dissatisfied" and "Neutral" may not be the
same as the difference between "Satisfied" and "Very Satisfied."
◼ The data is ordinal because there is a meaningful order to the satisfaction levels, but
the intervals between the categories are subjective and may not be consistently
measurable.

Continuous Values
◼ Continuous data involves real numbers and represents measurements that can take
any value within a given range.
◼ Unlike discrete data, which consists of distinct and separate values, continuous data
can have an infinite number of possible values.
◼ Real numbers can include decimals and fractions, allowing for a continuous
spectrum.
◼ Here's an example:
◼ Example: Height Measurement
◼ Consider measuring the height of individuals. Heights are continuous data because a
person's height can take any value within a certain range.

Continuous Values
◼ You could measure someone's height as 165.2 cm, and it's conceivable that the next
measurement might be 165.201 cm, with an infinite number of possible values
between them.
◼ In this case, height is a continuous variable because it can be measured with a high
level of precision, and there is no limit to the number of decimal places that could
be considered.
◼ Continuous data is often associated with measurements in the physical world,
where values can be as precise as the measuring instruments allow.