MLT Unit 4


Unsupervised Learning
Clustering – K-means – K-mode – K-median – Hierarchical clustering – DBSCAN – Principal Component Analysis – Independent Component Analysis

ADVANCED LEARNING
Sampling – Basic sampling methods – Monte Carlo – Gibbs Sampling – Computational Learning theory – Reinforcement learning – Markov Decision Processes.

INTRODUCTION
What Is Clustering?
• Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.
• Intra-Cluster Similarity: points within the same cluster are highly similar.
• Inter-Cluster Dissimilarity: points from different clusters are significantly different.
• Centroid: the center of a cluster, often representing the "average" of all points in the cluster.

Types of Clustering
1. Connectivity-based clustering: this type of clustering algorithm builds clusters based on the connectivity between the data points.
2. Centroid-based clustering: this type of clustering algorithm forms clusters around the centroids of the data points.
3. Distribution-based clustering: this type of clustering algorithm is modeled using statistical distributions. It assumes that the data points in a cluster are generated from a particular probability distribution, and the algorithm aims to estimate the parameters of the distribution to group similar data points into clusters.
4. Density-based clustering: this type of clustering algorithm groups together data points that lie in high-density concentrations and separates points in low-density regions. The basic idea is that it identifies regions in the data space that have a high density of data points and groups those points together into clusters.

K-MEANS CLUSTERING
• The K-means algorithm is an unsupervised algorithm for partitioning a dataset into 'k' groups based on the similarity of features, where k < n.
• Data points within the same group (cluster) are more similar to each other than to those in other groups.
• Doesn't require labeled data.
• The centroid is the central point of a cluster, calculated as the mean of all points in the cluster.
• K-means aims to minimize the sum of squared errors within clusters:

  J = \sum_{j=1}^{K} \sum_{x_n \in S_j} \lVert x_n - u_j \rVert^2

where x_n is a vector representing the n-th data point, u_j is the geometric centroid of the data points in S_j, K is the number of clusters, and S_j is the set of points in cluster j.
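As an illustration only, the objective above can be evaluated directly with NumPy; the array names, the example data, and the cluster assignment below are assumptions made for this sketch, not part of the slides.

import numpy as np

def kmeans_sse(X, labels, centroids):
    """Sum of squared distances of each point to its assigned centroid (the K-means objective J)."""
    return sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(len(centroids)))

# Hypothetical data: two groups around x = 1 and x = 10
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([X[labels == 0].mean(axis=0), X[labels == 1].mean(axis=0)])
print(kmeans_sse(X, labels, centroids))  # 16.0 for this assignment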

Common Distance Measures
The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters. They include:
1. The Euclidean distance (also called 2-norm distance), given by:
   d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}
2. The Manhattan distance (also called taxicab norm or 1-norm), given by:
   d(x, y) = \sum_{i=1}^{m} |x_i - y_i|
3. Minkowski distance
4. Cosine similarity
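A minimal sketch of these four measures for NumPy vectors; the function names and test vectors are illustrative, not taken from the slides.

import numpy as np

def euclidean(x, y):           # 2-norm distance
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):           # 1-norm / taxicab distance
    return np.sum(np.abs(x - y))

def minkowski(x, y, p=3):      # general p-norm; p=1 gives Manhattan, p=2 gives Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_similarity(x, y):   # 1 means same direction, 0 means orthogonal
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(x, y), manhattan(x, y), minkowski(x, y), cosine_similarity(x, y))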

Steps of the K-means Algorithm
• Initialization:
  • Choose the number of clusters, K.
  • Randomly select K initial centroids (either randomly chosen data points or randomly generated points).
• Assignment Step:
  • Assign each data point to the nearest centroid using a distance metric (usually Euclidean distance).
  • This forms K clusters.
• Update Step:
  • Calculate the new centroid for each cluster by finding the mean of all data points assigned to that cluster.
• Convergence Check:
  • Repeat the assignment and update steps until:
  • The centroids no longer change significantly, or
  • A maximum number of iterations is reached.
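The steps above translate almost line for line into NumPy. The following is a minimal sketch under simplifying assumptions (random initialisation from the data, Euclidean distance, empty clusters not handled), not a production implementation.

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: new centroid = mean of the points assigned to that cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(K)])
        # Convergence check: stop when centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids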

Advantages of K-means
• Simple & Fast: easy to implement and computationally efficient, especially for large datasets.
• Scalable: works well with high-dimensional data.
• Flexible: can be adapted with different distance measures.

Limitations of K-means
• Requires K: you need to specify the number of clusters in advance.
• Sensitive to Initialization: poor initialization can lead to suboptimal solutions (addressed by K-means++ initialization).
• Assumes Spherical Clusters: works best when clusters are of similar size and density.
• Outlier Impact: sensitive to outliers, which can distort the centroid positions.

Applications of K-means
• Image Compression: reducing the number of colors in an image.
• Customer Segmentation: grouping customers based on purchasing behavior.
• Anomaly Detection: identifying outliers in data.
• Document Clustering: organizing documents with similar content.

K-means Example Dataset
The following is a 2D dataset with 6 data points representing (x, y) coordinates:

Point  X   Y
A      1   2
B      1   4
C      1   0
D      10  2
E      10  4
F      10  0

• Step 1: Choose the Number of Clusters (K)
  • Let's set K = 2 (expecting two clusters based on the data distribution).
• Step 2: Initialize Centroids
  • Randomly select two points as initial centroids:
  • Centroid 1 (C1) = A (1, 2)
  • Centroid 2 (C2) = D (10, 2)

• Step 3: Assign Points to the Nearest Centroid
Using the Euclidean distance formula:
  d = sqrt{(x2 - x1)^2 + (y2 - y1)^2}
• Distance Calculations:
  • Point A (1, 2):
    • To C1 (1, 2): d = 0 (same point)
    • To C2 (10, 2): sqrt{(10-1)^2 + (2-2)^2} = 9
  • Point B (1, 4):
    • To C1: d = sqrt{(1-1)^2 + (4-2)^2} = 2
    • To C2: d = sqrt{(10-1)^2 + (4-2)^2} ≈ 9.22

  • Point C (1, 0):
    • To C1: sqrt{(1-1)^2 + (0-2)^2} = 2
    • To C2: sqrt{(10-1)^2 + (0-2)^2} ≈ 9.22
  • Point D (10, 2):
    • To C1: d = 9
    • To C2: d = 0
  • Point E (10, 4):
    • To C1: sqrt{(10-1)^2 + (4-2)^2} ≈ 9.22
    • To C2: sqrt{(10-10)^2 + (4-2)^2} = 2
  • Point F (10, 0):
    • To C1: sqrt{(10-1)^2 + (0-2)^2} ≈ 9.22
    • To C2: sqrt{(10-10)^2 + (0-2)^2} = 2

• Cluster Assignments (Iteration 1):
  • Cluster 1 (C1): A (1, 2), B (1, 4), C (1, 0)
  • Cluster 2 (C2): D (10, 2), E (10, 4), F (10, 0)
• Step 4: Update Centroids
  • Calculate the mean position for each cluster.
  • New C1: ((1+1+1)/3, (2+4+0)/3) = (1, 2)  // no change from the initial centroid
  • New C2: ((10+10+10)/3, (2+4+0)/3) = (10, 2)  // no change from the initial centroid
• Step 5: Check for Convergence
  • Since the centroids didn't change, the algorithm converges and clustering is complete.
• Final Clusters:
  • Cluster 1: A (1, 2), B (1, 4), C (1, 0)
  • Cluster 2: D (10, 2), E (10, 4), F (10, 0)
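As a cross-check, the same grouping can be reproduced with scikit-learn (assuming it is available); fixing init to the centroids chosen in Step 2 mirrors the worked example, and this snippet is only a sketch of that idea.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
# Use the Step 2 centroids A (1, 2) and D (10, 2) as the explicit initialization
km = KMeans(n_clusters=2, init=np.array([[1, 2], [10, 2]]), n_init=1).fit(X)
print(km.labels_)           # e.g. [0 0 0 1 1 1]: A, B, C together; D, E, F together
print(km.cluster_centers_)  # [[ 1.  2.] [10.  2.]]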

K-Mode Clustering
K-mode clustering is an unsupervised machine-learning technique used to group a set of data objects into a specified number of clusters based on their categorical attributes. The algorithm is called "K-Mode" because it uses modes (i.e., the most frequent values) instead of means or medians to represent the clusters.

Why K-Mode Clustering?
When K-means clustering is applied to categorical data that has been converted into a numerical form, it does not give good results, especially for high-dimensional data. So, some changes are made for categorical data:
• Replace Euclidean distance with a dissimilarity metric.
• Replace the mean by the mode for cluster centers.
• Apply a frequency-based method in each iteration to update the mode.
Because it uses the MODE, this is called K-MODE clustering.

K-Modes vs K-Means
• Type of Data: K-Means works with continuous (numerical) data; K-Modes is designed for categorical data.
• Measure of Similarity: K-Means uses distance measures (e.g., Euclidean distance); K-Modes uses dissimilarity measures (total mismatches).
• Similarity Interpretation: in K-Means, the lesser the distance, the more similar the data points are; in K-Modes, the lesser the dissimilarity (mismatches), the more similar the data points are.
• Centroid Update Mechanism: K-Means updates centroids using the mean of the data points; K-Modes updates them using the mode of the data points.
• Application / Use Case: K-Means is suitable for clustering numerical data, such as heights or weights; K-Modes is suitable for clustering categorical data, such as colors, brands, or genres.

Dissimilarity Measurement Between Two Data Objects
• The K-modes algorithm clusters categorical data by partitioning a dataset into a predefined number of clusters.
• Each cluster is defined by its mode, which represents the most frequent categorical value in that cluster.
• The algorithm calculates the distance between data objects using similarity and dissimilarity measurements.
• For K-modes, the Hamming distance is used as the dissimilarity measure.
• The Hamming distance counts the number of categorical attributes that differ between two data objects.
• Let x and y be two categorical data objects defined by m features or attributes. Then

  d(x, y) = \sum_{j=1}^{m} \delta(x_j, y_j)

where \delta(x_j, y_j) = 0 if x_j = y_j, and 1 otherwise.
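A minimal sketch of this matching (Hamming-style) dissimilarity between two categorical records; the attribute values used here are made up for illustration.

def dissimilarity(x, y):
    """Number of attributes on which two categorical records differ (Hamming-style mismatch count)."""
    return sum(a != b for a, b in zip(x, y))

p1 = ("blonde", "amber", "fair")      # hypothetical (hair color, eye color, skin color)
p2 = ("brunette", "gray", "brown")
print(dissimilarity(p1, p2))          # 3: all three attributes differ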

How Does the K-Modes Algorithm Work?
Unlike hierarchical clustering methods, we need to specify K upfront.
• Pick K observations at random and use them as leaders/clusters.
• Calculate the dissimilarities and assign each observation to its closest cluster.
• Define new modes for the clusters.
• Repeat steps 2–3 until no re-assignment is required.
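Putting these steps together, a bare-bones K-modes loop might look like the sketch below. It is an assumption-laden illustration (the mismatch count is re-defined inline, ties in the mode are broken by Counter order rather than randomly, and the example records are invented), not the algorithm as implemented in any particular library.

import random
from collections import Counter

def dissimilarity(x, y):
    # number of mismatching attribute values between two categorical records
    return sum(a != b for a, b in zip(x, y))

def k_modes(data, K, max_iters=10, seed=0):
    random.seed(seed)
    leaders = random.sample(data, K)                  # step 1: pick K observations as leaders
    for _ in range(max_iters):
        clusters = [[] for _ in range(K)]
        for row in data:                              # step 2: assign to the closest leader
            j = min(range(K), key=lambda k: dissimilarity(row, leaders[k]))
            clusters[j].append(row)
        new_leaders = [                               # step 3: column-wise mode of each cluster
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*c)) if c else leaders[k]
            for k, c in enumerate(clusters)
        ]
        if new_leaders == leaders:                    # stop when no leader changes
            break
        leaders = new_leaders
    return leaders, clusters

# Hypothetical categorical records: (hair color, eye color, skin color)
people = [("brunette", "amber", "fair"), ("blonde", "gray", "brown"),
          ("brunette", "hazel", "fair"), ("black", "brown", "dark"),
          ("brunette", "amber", "fair"), ("black", "brown", "dark")]
print(k_modes(people, K=3))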

Example: Imagine we have a dataset with information about the hair color, eye color, and skin color of persons. We aim to group them based on the available information. Hair color, eye color, and skin color are all categorical variables. (The dataset table of observations P1–P8 appears on the slide.)
Number of clusters (K) = 3

Step 1: Pick K = 3 observations at random and use them as the initial leaders/clusters (the chosen leaders are shown in the slide's table).

Step 2:
Iteratively compare the cluster leaders to each of the observations. Similar attribute values give 0, dissimilar values give 1.

Comparing leader/cluster P1 to the observation P1 gives 0 dissimilarities.
Comparing leader/cluster P1 to the observation P2 gives 3 (1 + 1 + 1) dissimilarities.

Likewise, calculate all the dissimilarities, put them in a matrix (shown on the slide), and assign each observation to its closest cluster (the cluster with the least dissimilarity).
After step 2, the observations P1, P2, P5 are assigned to Cluster 1; P3, P7 are assigned to Cluster 2; and P4, P6, P8 are assigned to Cluster 3.

Step 3:
The mode is simply the most frequently occurring value.
Mark the observations according to the cluster they belong to. Observations of Cluster 1 are marked in yellow, Cluster 2 in brick red, and Cluster 3 in purple.

Considering one cluster at a time, for each feature, look for the mode and update the new leaders.
Explanation: Cluster 1 observations (P1, P2, P5) have brunette as the most observed hair color, amber as the most observed eye color, and fair as the most observed skin color.
Note: If values occur equally often, pick the mode at random. In this case, the observations of Cluster 2 (P3, P7) have one occurrence each of brown and fair skin color, so brown was chosen as the mode.
Below are the new leaders after the update.
Repeat steps 2–4.

After obtaining the new leaders, again calculate the dissimilarities between the observations and the newly obtained leaders.

Comparing Cluster 1 to the observation P1 gives 1 dissimilarity.
Comparing Cluster 1 to the observation P2 gives 2 dissimilarities.

Likewise, calculate all the dissimilarities and put them in a matrix. Assign each observation to its closest cluster.
The observations P1, P2, P5 are assigned to Cluster 1; P3, P7 are assigned to Cluster 2; and P4, P6, P8 are assigned to Cluster 3.
Since there are no changes in the assignment of observations, we can stop here.

Hierarchical Clustering

What is Hierarchical Clustering?
• Hierarchical clustering is a connectivity-based clustering model used to group data points based on their similarity or distance.
• The model assumes that data points closer to each other are more similar than those farther apart.
• The clustering process operates on the principle of proximity, grouping points into clusters in a step-by-step manner.

What is the Distance Measure?
The distance measure determines the similarity between two elements and influences the shape of the clusters. Some of the ways we can calculate distance measures include:
• Euclidean distance measure
• Squared Euclidean distance measure
• Manhattan distance measure
• Cosine distance measure

Euclidean Distance Measure
• The most common method to calculate a distance measure is to determine the straight-line distance between two points. Let's say we have a point P and a point Q: the Euclidean distance is the direct distance between the two points.
• As the points may have more than two dimensions, we compute the squared difference along each dimension, sum them, and then take the square root of that sum to get the actual distance between the points.
The formula for the distance between two points is shown below:

  d(P, Q) = \sqrt{\sum_{i} (P_i - Q_i)^2}

Squared Euclidean Distance Measurement
This is identical to the Euclidean measurement method, except we don't take the square root at the end. Whether the points are farther apart or closer together, the differences in distances can be computed faster with the squared Euclidean measurement. While the plain Euclidean method gives the exact distance, omitting the square root makes no difference when we only need to know which distance is smaller and which is larger, and removing it makes the computation faster.
The formula is shown below:

  d(P, Q) = \sum_{i} (P_i - Q_i)^2

Manhattan Distance Measurement
This method is a simple sum of the horizontal and vertical components, i.e., the distance between two points measured along axes at right angles. It is different because we are not looking at the direct line, and in certain cases the individual distances measured will give a better result. Since it is faster, the squared Euclidean method is used more often; but when using the Manhattan distance, you measure the X difference and the Y difference and take the absolute value of each.
The formula is shown below:

  d(P, Q) = \sum_{i} |P_i - Q_i|

Cosine Distance Measure
The cosine distance measures the angle between two vectors. As the two vectors separate, the cosine distance becomes greater. This method is similar to the Euclidean distance measure, and you can expect to get similar results with both of them.
Note: The Manhattan measurement method will produce a very different result. It can end up with bias if the data is very skewed or if both sets of values have a dramatic size difference.
The formula is shown below:

  \cos(\theta) = \frac{P \cdot Q}{\lVert P \rVert \, \lVert Q \rVert}, \qquad \text{cosine distance} = 1 - \cos(\theta)
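For reference, SciPy already provides all four measures; a short sketch on made-up points (note that squared Euclidean is simply the Euclidean distance without the final square root, and distance.cosine returns the cosine distance, i.e. 1 minus the cosine similarity):

from scipy.spatial import distance

p, q = [1, 2], [4, 6]
print(distance.euclidean(p, q))    # 5.0
print(distance.sqeuclidean(p, q))  # 25.0
print(distance.cityblock(p, q))    # 7  (Manhattan / taxicab)
print(distance.cosine(p, q))       # cosine distance = 1 - cosine similarity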

The Role of the Dendrogram
• A dendrogram is a tree-like diagram produced by hierarchical clustering.
• Bottom of the dendrogram: represents individual data points.
• Top of the dendrogram: represents the largest cluster, which includes all data points.
• The dendrogram shows the hierarchical relationships between clusters.
• By slicing the dendrogram at various heights, different numbers of clusters can be obtained.
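Slicing the dendrogram at a chosen height can be done with scipy.cluster.hierarchy.fcluster; the data and cut height below are made-up values for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
Z = linkage(X, method='ward')
# Cut the tree so that only merges below distance 5 are kept
labels = fcluster(Z, t=5, criterion='distance')
print(labels)  # two clusters: the three left points and the three right points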

How is the Dendrogram Constructed?
• Iterative Process:
  • Clusters are either merged (agglomerative clustering) or split (divisive clustering) based on their similarity or distance.
  • This is repeated until:
    • All data points form a single cluster, or
    • The predefined number of clusters is achieved.
• Similarity or Distance Measures:
  • Common measures include Euclidean distance, Manhattan distance, or cosine similarity.
  • These measures determine how closely related the data points or clusters are.

Types of Hierarchical Clustering
• Agglomerative (Bottom-Up):
  • Starts with each data point as its own cluster.
  • Gradually merges the closest clusters until one large cluster is formed.
• Divisive (Top-Down):
  • Starts with all data points in a single cluster.
  • Recursively splits clusters into smaller clusters based on dissimilarity.

Hierarchical Agglomerative Clustering
• Hierarchical agglomerative clustering is also known as the bottom-up approach.
• It produces a structure that is more informative than the unstructured set of clusters returned by flat clustering.
• This clustering algorithm does not require us to prespecify the number of clusters.
• Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all the data.

Agglomerative Clustering Steps
1. Compute the dissimilarity matrix using a particular distance metric.
2. Assign each data point to a cluster.
3. Merge the clusters based on a linkage criterion for the similarity between clusters.
4. Update the distance matrix.
5. Repeat steps 3 and 4 until a single cluster remains or a stopping criterion is met.

Agglomerative Clustering Steps
Algorithm:
given a dataset (d1, d2, d3, ... dN) of size N
# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about the primary diagonal,
    # we compute only the lower part of the primary diagonal
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)
each data point is a singleton cluster
repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains
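A direct, unoptimised translation of this pseudocode into Python is sketched below; it assumes single linkage (minimum pairwise point distance) and is O(n³), so it is purely illustrative.

import numpy as np

def naive_agglomerative(X):
    """Repeatedly merge the two closest clusters (single linkage) until one remains.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(X))]           # each point starts as a singleton cluster
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: minimum distance between any point of a and any point of b
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        history.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]        # merge cluster b into cluster a
        del clusters[b]
    return history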

Hierarchical Agglomerative Clustering: Example
1. Consider each letter as its own cluster and calculate the distance between each cluster.
2. In the second step, similar clusters are merged together.
3. For example, clusters (B) and (C) are similar, so they are merged. The same happens with clusters (D) and (E).
4. After this, we have the clusters [(A), (BC), (DE), (F)].
5. We recalculate the distances and merge the closest clusters, which are (DE) and (F), to create [(A), (BC), (DEF)].
6. Repeating this process, clusters (DEF) and (BC) are merged into [(A), (BCDEF)].
7. Finally, the last two clusters are merged into one: [(ABCDEF)].

Hierarchical Divisive Clustering
• It is also known as the top-down approach.
• This algorithm also does not require prespecifying the number of clusters.
• Top-down clustering requires a method for splitting a cluster that contains the whole data, and proceeds by splitting clusters recursively until individual data points have been split into singleton clusters.

Hierarchical Divisive Clustering Steps
1. Start with all data points of a dataset of size N (d1, d2, d3, ... dN) in one cluster.
2. Split the cluster into two dissimilar or heterogeneous clusters by using a flat clustering method such as the k-means algorithm.
3. Repeat step 2, choosing the best cluster to split and removing the outliers from the least cohesive cluster after each iteration.
4. Stop when each example is in its own single cluster; otherwise repeat step 3.

Hierarchical Divisive Clustering
Algorithm:
1. Given a dataset (d1, d2, d3, ... dN) of size N.
2. At the top we have all data in one cluster.
3. The cluster is split using a flat clustering method, e.g., K-Means.
4. Repeat:
5.   Choose the best cluster among all the clusters to split.
6.   Split that cluster using the flat clustering algorithm.
7. Until each data point is in its own singleton cluster.
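A rough sketch of this top-down procedure, assuming scikit-learn's KMeans with k = 2 as the flat splitter and simply choosing the largest remaining cluster as the "best" cluster to split; outlier removal and other refinements from the steps above are omitted.

import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters):
    clusters = [np.arange(len(X))]                     # start with all points in one cluster
    while len(clusters) < n_clusters:
        # "best" cluster to split: here, simply the largest remaining cluster
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[members])  # flat 2-way split
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
print(divisive(X, 2))  # e.g. the three left points and the three right points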

Computing the Distance Matrix
While merging two clusters, we check the distance between every pair of clusters and merge the pair with the least distance / most similarity. But how is that distance determined? There are different ways of defining the inter-cluster distance/similarity. Some of them are:
• Min Distance: find the minimum distance between any two points of the clusters.
• Max Distance: find the maximum distance between any two points of the clusters.
• Group Average: find the average distance between every pair of points from the two clusters.
• Ward's Method: the similarity of two clusters is based on the increase in squared error when the two clusters are merged.
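These linkage criteria correspond to the method argument of scipy.cluster.hierarchy.linkage ('single' for min distance, 'complete' for max distance, 'average' for group average, 'ward' for Ward's method). A short sketch comparing them on the same made-up data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    # force the tree into two flat clusters so the methods can be compared
    print(method, fcluster(Z, t=2, criterion='maxclust'))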

Computing the Distance Matrix
For example, if we group the given data using different methods, we may get different results (comparison figure shown on the slide).

Implementation Code and Output

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# randomly chosen dataset
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Perform hierarchical clustering
Z = linkage(X, 'ward')

# Plot dendrogram
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point')
plt.ylabel('Distance')
plt.show()

Hierarchical Agglomerative vs Divisive Clustering
• Approach: agglomerative clustering is simpler, starting with all data points as singletons and merging them; divisive clustering is more complex, as it requires a flat clustering method as a subroutine for splitting clusters.
• Efficiency: agglomerative clustering has O(n³) time complexity without optimizations; divisive clustering is more efficient if a complete hierarchy is not generated.
• Time complexity: agglomerative is O(n³) for the naive approach, O(n² log n) with priority queues, and can be reduced to O(n²) with optimizations; divisive can be linear (O(n)) with efficient flat algorithms like K-Means when a fixed number of levels is used.
• Accuracy: agglomerative is less accurate, as decisions are based on local patterns or nearby points and early decisions cannot be changed; divisive is more accurate, as it considers the global distribution of data for top-level partitioning.
• Decision basis: agglomerative makes decisions based on local patterns without considering the overall distribution of the data at first; divisive makes top-level decisions considering the global data distribution.
• Neighbor dependence: agglomerative relies heavily on local neighbor points to make decisions; divisive doesn't rely on neighboring points in the same way.

K-means vs Hierarchical Clustering
• Number of Clusters: in K-means it is pre-defined and specified by "K"; in hierarchical clustering the number of clusters can range from one to the total number of data observations.
• Dataset Size: K-means is more effective for larger datasets; hierarchical clustering is better suited for smaller datasets.
• Cluster Shape: K-means works well for spheroidal (round) clusters; hierarchical clustering is less effective for spheroidal clusters.
• Scalability: K-means is highly scalable for large datasets; hierarchical clustering has limited scalability due to higher computational complexity.
• Algorithm Type: K-means is partitional clustering (divides the data into K groups); hierarchical clustering creates a tree-like structure.
