Machine Learning: SVM, KNN, K-Means (Module 2)


About This Presentation

SUPPORT VECTOR MACHINE, KNN, K-MEANS and CLUSTERING


Slide Content

MACHINE LEARNING (INTEGRATED) (21ISE62)
Dr. Shivashankar
Professor
Department of Information Science & Engineering
GLOBAL ACADEMY OF TECHNOLOGY
Ideal Homes Township, Rajarajeshwari Nagar, Bengaluru - 560 098

Course Outcomes
After completion of the course, the student will be able to:
• Illustrate regression techniques and the decision tree learning algorithm.
• Apply SVM, ANN and KNN algorithms to solve appropriate problems.
• Apply Bayesian techniques and derive effective learning rules.
• Illustrate the performance of AI and ML algorithms using evaluation techniques.
• Understand reinforcement learning and its application to real-world problems.
Text Books:
1. Tom M. Mitchell, Machine Learning, McGraw Hill Education, India Edition 2013.
2. Ethem Alpaydın, Introduction to Machine Learning, MIT Press, Second Edition.
3. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Pearson, First Impression, 2014.

MODULE-2
SUPPORT VECTOR MACHINE
• Support Vector Machine (SVM) is one of the most popular supervised learning algorithms. It is used as a classification as well as a regression prediction tool that applies machine learning theory to maximize predictive accuracy while automatically avoiding overfitting to the data.
• SVM can be defined as a system that uses a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm.
• The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put a new data point in the correct category in the future.
• This best decision boundary is called a hyperplane.
• SVM became famous when, using pixel maps as input, it gave accuracy comparable to the best classifiers.
• SVM was developed by Vladimir Vapnik and co-workers; its foundations date to the 1960s-70s and the modern formulation to the 1990s.
• The SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
• The SVM algorithm finds the closest data points of the lines from both classes.
• These points are called support vectors.
• The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
• The hyperplane with maximum margin is called the optimal hyperplane.

Cont…
The SVM algorithm can be used for face detection, image classification, text categorization, etc.
Types of SVM:
• Linear SVM: used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a linear SVM classifier.
• Non-linear SVM: used for non-linearly separable data. If a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a non-linear SVM classifier.
Fig. 2.1: Concept of the SVM technique

Examples of Bad Decision Boundaries
Fig. 3: Examples of bad decision boundaries (two panels, each showing Class 1 and Class 2 separated by a poorly placed boundary)

Linearly Separable Case
If a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, the classifier used is called a linear SVM classifier, and the classification problem is binary (two-class) classification.
Binary classification can be viewed as the task of separating classes in feature space.
Hyperplane: f(x) = w^T x + b
where
• w: weight vector
• x: input vector
• b: bias or offset value
Fig 2.2: Linearly separable classification
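As a quick illustration of the decision function above, here is a minimal Python sketch (an added example; w and b are set to the values that the worked Problem 1 later in this module arrives at) that evaluates f(x) = w·x + b for a few points and reports the margin width 2/||w||:

    import numpy as np

    # Hyperplane parameters: w.x + b = 0 (values match Problem 1 below)
    w = np.array([1.0, 0.0])
    b = -3.0

    points = np.array([[2.0, -2.0], [4.0, 1.0], [5.0, 0.0]])
    for x in points:
        f = w @ x + b                  # signed decision value f(x) = w.x + b
        label = 1 if f >= 0 else -1    # class from the sign of f(x)
        print(x, round(f, 2), label)

    # The margin (gutter) between H1: w.x+b=+1 and H2: w.x+b=-1 is 2/||w||
    print("margin width:", 2.0 / np.linalg.norm(w))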

Cont..
Define the hyperplanes H such that:
w·xᵢ + b ≥ +1 when yᵢ = +1
w·xᵢ + b ≤ -1 when yᵢ = -1
H1 and H2 are the margins:
H1: w·xᵢ + b = +1
H2: w·xᵢ + b = -1
The points on the margins H1 and H2 are the tips of the support vectors.
The plane H0 is the median in between, where w·xᵢ + b = 0.
d+ = the shortest distance to the closest positive point.
d- = the shortest distance to the closest negative point.
The margin (gutter) of a separating hyperplane is d+ + d-.

Maximizing the Margin
We want a classifier with as big a margin as possible.
Recall the distance from a point (x0, y0) to a line Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A² + B²).
The distance between H1 (or H2) and the separating hyperplane is |w·x + b| / ||w|| = 1 / ||w||.
The distance between H1 and H2 is therefore 2 / ||w||.
In order to maximize the margin, we need to minimize ||w||, with the condition that there are no data points between H1 and H2:
xᵢ·w + b ≥ +1 when yᵢ = +1
xᵢ·w + b ≤ -1 when yᵢ = -1
These can be combined into yᵢ(xᵢ·w + b) ≥ 1.

Constrained Optimization Problem
• The problem of finding the optimal hyperplane is an optimization problem and can be solved by optimization techniques.
• It can be solved by the Lagrange multiplier method (αᵢ), which can be formulated as:
w = Σᵢ₌₁ᵐ αᵢ xᵢ yᵢ
• α: the Lagrange multiplier; we need a Lagrange multiplier αᵢ for each of the constraints.
• The xᵢ and yᵢ involved are called the support vectors.

Cont…
Problems:
1. Draw the hyperplane for the given data points (1,1), (2,1), (1,-1), (2,-1), (4,0), (5,1), (5,-1), (6,0) using SVM, and classify the new data point (2,-2).
Solution:
1. Plot the graph.
Select the vectors that support the boundary: s1 = (2,1), s2 = (2,-1), s3 = (4,0).
s1, s2 and s3 are support vectors because these are the closest data points to the separating boundary (around x = 3 on the x-axis).
2. To provide a vector representation, we need to add a bias to all support vectors. Here we assume bias = 1, so our support vectors become:
s̄1 = (2, 1, 1), s̄2 = (2, -1, 1), s̄3 = (4, 0, 1)

Cont…
3. Consider one part of the support vectors as positive and the other as negative. Here, s1 and s2 are -ve and s3 is +ve.
4. Our objective is to find an optimal hyperplane, which means we need to find the values of w and b of the optimal hyperplane:
f(x) = w·x + b = 0
5. To find the optimal hyperplane, we use the Lagrange multiplier method (α). Now let us compute w and b, which determine the optimal hyperplane.
According to the Lagrange equation,
w = Σᵢ₌₁ᵐ αᵢ xᵢ yᵢ
Here, xᵢ and yᵢ are the support vectors s1, s2 and s3.
Let us substitute the support vectors in the constraint equations:
α1 s̄1·s̄1 + α2 s̄1·s̄2 + α3 s̄1·s̄3 = -1
α1 s̄2·s̄1 + α2 s̄2·s̄2 + α3 s̄2·s̄3 = -1
α1 s̄3·s̄1 + α2 s̄3·s̄2 + α3 s̄3·s̄3 = +1

Cont…
Let us substitute the values of s̄1, s̄2 and s̄3:
α1 (2,1,1)·(2,1,1) + α2 (2,1,1)·(2,-1,1) + α3 (2,1,1)·(4,0,1) = -1
α1 (2,-1,1)·(2,1,1) + α2 (2,-1,1)·(2,-1,1) + α3 (2,-1,1)·(4,0,1) = -1
α1 (4,0,1)·(2,1,1) + α2 (4,0,1)·(2,-1,1) + α3 (4,0,1)·(4,0,1) = +1
Therefore,
6α1 + 4α2 + 9α3 = -1
4α1 + 6α2 + 9α3 = -1
9α1 + 9α2 + 17α3 = +1
After solving the above equations, we get:
α1 = -3.25, α2 = -3.25, α3 = 3.5

Cont…
Now let us find w, i.e.
w = Σ αᵢ s̄ᵢ
w = -3.25 (2,1,1) - 3.25 (2,-1,1) + 3.5 (4,0,1) = (1, 0, -3)
Therefore, in the hyperplane equation f(x) = w·x + b,
w = (1, 0) and the offset or bias b = -3.
5. Plot the hyperplane.
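The linear algebra above is easy to verify numerically. The following NumPy sketch is an added check, not part of the original slides: it builds the Gram matrix of the augmented support vectors, solves for the α values, recovers w and b, and classifies the new point (2, -2):

    import numpy as np

    # Augmented support vectors (bias component 1 appended) and their labels
    S = np.array([[2, 1, 1],
                  [2, -1, 1],
                  [4, 0, 1]], dtype=float)
    y = np.array([-1.0, -1.0, 1.0])   # s1, s2 are -ve; s3 is +ve

    alphas = np.linalg.solve(S @ S.T, y)   # Gram-matrix system G @ alpha = y
    print(alphas)                          # -> [-3.25 -3.25  3.5 ]

    w_aug = alphas @ S                     # w = sum_i alpha_i * s_i (augmented)
    print(w_aug)                           # -> [ 1.  0. -3.], i.e. w = (1, 0), b = -3

    x_new = np.array([2.0, -2.0])
    print(w_aug[:2] @ x_new + w_aug[2])    # f(2,-2) = -1 -> class -1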

Cont…
Since b = -3, the hyperplane is drawn at +3 on the positive side, and since w = (1, 0), the hyperplane is drawn parallel to the y-axis.
Now let us classify the new data point (2, -2). We know that:
w·x + b ≥ 0 ---> belongs to class +1
w·x + b < 0 ---> belongs to class -1
Let us substitute the values in the above equation:
y = w·x + b
y = (1, 0)·(2, -2) - 3
y = 2 - 0 - 3 = -1
Therefore, the new data point (2, -2) belongs to class -1.

Cont…
Problem 2:
Draw the hyperplane for the given data points using SVM: positively labelled data points (3,1), (3,-1), (5,1), (5,-1) and negatively labelled data points (1,0), (0,1), (0,-1), (-1,0), and classify accordingly.
Solution:
Select the vectors that support the boundary: s1 = (1,0), s2 = (3,1), s3 = (3,-1).
Each vector is augmented with bias 1, so our support vectors become:
s̄1 = (1, 0, 1), s̄2 = (3, 1, 1), s̄3 = (3, -1, 1)
Solving the corresponding system gives:
α1 = -3.5, α2 = 0.75, α3 = 0.75
w = -3.5 (1,0,1) + 0.75 (3,1,1) + 0.75 (3,-1,1) = (1, 0, -2)
So w = (1, 0) and the offset or bias b = -2.

Non-Linear SVM or Non-linearly Separable Case
• If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line.
• So to separate these data points, we need to add one more dimension. For linear data we have used two dimensions, x and y, so for non-linear data we will add a third dimension z. It can be calculated as:
z = x² + y² -------(1)
• We must use a non-linear SVM (i.e. we need to convert data from one feature space to another). For the non-linearly separable case, one such mapping is:
Φ(x1, x2) = (4 - x2 + |x1 - x2|, 4 - x1 + |x1 - x2|)  if x1² + x2² > 2
Φ(x1, x2) = (x1, x2)  otherwise
Fig. 11: Non-linear data points. Fig. 12: After adding the 3rd axis, the best hyperplane for a non-linear SVM.

Conti…
Problem 1: Draw the hyperplane for the given data points, positively labelled data points (2,2), (2,-2), (-2,-2), (-2,2) and negatively labelled data points (1,1), (1,-1), (-1,-1), (-1,1), using a non-linear SVM, and classify accordingly.
Solution:
1. Plot the graph.
2. Non-linearly separable case:
• From the plotted graph, no separating hyperplane exists in the input space.
• We must use a non-linear SVM (i.e. we need to convert data from one feature space to another) with the mapping Φ defined above.

Conti…
By applying the non-linear mapping, convert the given data points into the new feature space.
The positive examples (2,2), (2,-2), (-2,-2), (-2,2) map to (2,2), (10,6), (6,6), (6,10).
The negative examples (1,1), (1,-1), (-1,-1), (-1,1) satisfy x1² + x2² ≤ 2, so they remain (1,1), (1,-1), (-1,-1), (-1,1).
3. Now plot the graph for the obtained new data points.
Now we can easily identify the support vectors:
s1 = (1,1), s2 = (2,2)
Each vector is augmented with 1 as bias input:
s̄1 = (1, 1, 1) and s̄2 = (2, 2, 1)

Conti..
According to the Lagrange equation,
w = Σᵢ₌₁ᵐ αᵢ xᵢ yᵢ
Here, xᵢ and yᵢ are the support vectors s1 and s2.
Let us substitute the support vectors in the constraint equations:
α1 s̄1·s̄1 + α2 s̄1·s̄2 = -1
α1 s̄1·s̄2 + α2 s̄2·s̄2 = +1
After substituting the s̄1 and s̄2 values and simplifying the above equations:
3α1 + 5α2 = -1
5α1 + 9α2 = +1
Therefore, α1 = -7 and α2 = 4.
w = Σ αᵢ s̄ᵢ
w = -7 (1,1,1) + 4 (2,2,1) = (1, 1, -3)
Therefore the hyperplane is y = w·x + b, with w = (1, 1) and bias b = -3.
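As with the linear case, this worked example can be checked numerically. The sketch below is an added illustration: it applies the mapping Φ from the slides to all eight points and then solves the 2x2 system for the two support vectors:

    import numpy as np

    def phi(x1, x2):
        # Feature mapping from the slides, applied when x1^2 + x2^2 > 2
        if x1**2 + x2**2 > 2:
            return (4 - x2 + abs(x1 - x2), 4 - x1 + abs(x1 - x2))
        return (x1, x2)

    pos = [(2, 2), (2, -2), (-2, -2), (-2, 2)]
    neg = [(1, 1), (1, -1), (-1, -1), (-1, 1)]
    print([phi(*p) for p in pos])   # -> (2,2), (10,6), (6,6), (6,10)
    print([phi(*p) for p in neg])   # -> unchanged

    # Augmented support vectors: s1 = (1,1) negative, s2 = (2,2) positive
    S = np.array([[1, 1, 1],
                  [2, 2, 1]], dtype=float)
    y = np.array([-1.0, 1.0])
    alphas = np.linalg.solve(S @ S.T, y)
    print(alphas)      # -> [-7.  4.]
    print(alphas @ S)  # -> [ 1.  1. -3.], i.e. w = (1, 1), b = -3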

Support Vector Machine Terminology
• Hyperplane: the decision boundary, chosen so that the margin between the closest points of different classes is as large as possible. In the case of linear classification, it is a linear equation, i.e. wx + b = 0.
• Support Vectors: the closest data points to the hyperplane, which play a critical role in deciding the hyperplane and margin.
• Margin: the distance between the support vectors and the hyperplane. The main objective of the SVM algorithm is to maximize the margin. A wider margin indicates better classification performance.
• Kernel: the mathematical function used in SVM to map the original input data points into high-dimensional feature spaces. Some of the common kernel functions are linear, polynomial and radial basis function (RBF).
• Hard Margin: also called the maximum-margin hyperplane; a hyperplane that properly separates the data points of different categories without any misclassifications.
• Soft Margin: when the data is not perfectly separable or contains outliers, SVM permits a soft-margin technique. It finds a compromise between increasing the margin and reducing violations.
• Hinge Loss: a typical loss function in SVMs. It penalizes incorrect classifications and margin violations.

How Does the Support Vector Machine Algorithm Work?
• The best way to understand the SVM algorithm is through the SVM classifier.
• The hyperplane is chosen based on margin: the hyperplane providing the maximum margin between the two classes is selected.
• These margins are calculated using data points known as support vectors. Support vectors are those data points that are near to the hyperplane and help in positioning it.

Cont…
If the functioning of the SVM classifier is to be understood mathematically, it can be understood in the following way:
Step 1: The SVM algorithm predicts the classes. One of the classes is identified as 1 while the other is identified as -1.
Step 2: As with all machine learning algorithms, the business problem is converted into a mathematical equation involving unknowns. These unknowns are then found by converting the problem into an optimization problem.
Step 3: The loss function (also called a cost function) has cost 0 when no class is incorrectly predicted. If this is not the case, then the error/loss is calculated.
Step 4: As is the case with most optimization problems, weights are optimized by calculating the gradients using calculus, viz. partial derivatives.
Step 5: The gradients are updated using only the regularization parameter when there is no classification error, while the loss function is also used when misclassification happens.

Important Concepts in SVM
• Support vectors are those data points on whose basis the margins are calculated and maximized.
• The number of support vectors, or the strength of their influence, is one of the hyper-parameters.
Fig. 2: Support vectors, margin and classes

Cont…
Hard Margin:
• A hard margin refers to a decision boundary that makes sure all the data points are classified correctly.
• While this leads to the SVM classifier not making any error, it can also cause the margins to shrink, thus defeating the whole purpose of running an SVM algorithm.
Soft Margin:
• A soft-margin SVM introduces flexibility by allowing some margin violations (misclassifications) to handle cases where the data is not perfectly separable.

SVM Implementation in Python
In Python, an SVM classifier can be developed using the sklearn library.
Step 1: Load the important libraries.
    import pandas as pd
    import numpy as np
    import sklearn
    from sklearn import svm
    from sklearn.model_selection import train_test_split
    from sklearn import metrics
Step 2: Import the dataset and extract the X variables and Y separately.
    df = pd.read_csv("mydataset.csv")
    X = df.loc[:, ['Var_X1', 'Var_X2', 'Var_X3', 'Var_X4']]
    Y = df[['Var_Y']]
Step 3: Divide the dataset into train and test.
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=123)
Step 4: Initialize the SVM classifier model.
    svm_clf = svm.SVC(kernel='linear')

Cont…
Step 5: Fit the SVM classifier model.
    svm_clf.fit(X_train, y_train)
Step 6: Come up with predictions.
    y_pred_test = svm_clf.predict(X_test)
Step 7: Evaluate the model's performance.
    metrics.accuracy_score(y_test, y_pred_test)
    metrics.precision_score(y_test, y_pred_test)
    metrics.recall_score(y_test, y_pred_test)

Advantages & Disadvantages of SVM
Advantages:
• It is one of the most accurate machine learning algorithms.
• It is a versatile algorithm and can solve a range of problems, including linear and non-linear problems, binary, binomial and multi-class classification problems, along with regression problems.
• SVM uses the concept of margins and tries to maximize the separation between two classes; this reduces the chances of model overfitting, making the model highly stable.
• SVM is known for its computation speed and memory management. It uses less memory, especially when compared to the deep learning algorithms with which SVM often competes.
Disadvantages:
• While SVM is fast and can work in high dimensions, Naïve Bayes still provides faster predictions in high dimensions. Also, SVM takes a relatively long time during the training phase.
• Compared to other linear algorithms such as linear regression, SVM is not highly interpretable, especially when using kernels that make SVM non-linear. Thus, it isn't easy to assess how the independent variables affect the target variable.

Cont…
Applications of SVM:
• Text categorization
• Semantic role labeling (predicate, agent, ...)
• Image classification
• Image segmentation
• Hand-written character recognition
Characteristics of SVM:
• Based on supervised learning methods
• Used for classification or regression analysis
• A non-probabilistic binary linear classifier
• Representation of the examples as points in space
• Examples of the separate categories are divided by a clear gap that is as wide as possible
• New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall
• Performs linear classification

K-Nearest Neighbour
• The k-Nearest Neighbors (KNN) algorithm is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point.
• It is one of the most popular and simplest classification and regression classifiers used in machine learning today.
• The nearest neighbors of an instance are defined in terms of the standard Euclidean distance. More precisely, let an arbitrary instance x be described by the feature vector (a1(x), a2(x), ..., an(x)).
The distance between two instances xᵢ and xⱼ is defined to be d(xᵢ, xⱼ), where
d(xᵢ, xⱼ) = sqrt( Σᵣ₌₁ⁿ (aᵣ(xᵢ) - aᵣ(xⱼ))² )
The KNN real-valued target function can be defined as:
f(x) = ( Σᵢ₌₁ᵏ wᵢ f(xᵢ) ) / ( Σᵢ₌₁ᵏ wᵢ ), where wᵢ = 1 / d(x_q, xᵢ)²
Fig 2.1: K-NN example

Cont…
Euclidean distance between A and B = sqrt( (X2 - X1)² + (Y2 - Y1)² )

K-Nearest Neighbor (KNN) Algorithm for Machine Learning
• KNN is one of the simplest machine learning algorithms, based on the supervised learning technique.
• The KNN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
• The KNN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the KNN algorithm.
• The KNN algorithm can be used for regression as well as for classification, but mostly it is used for classification problems.
• KNN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
• The KNN algorithm at the training phase just stores the dataset, and when it gets new data it classifies that data into the category most similar to the new data.

How does K-NN work?
The K-NN working can be explained on the basis of the following algorithm:
Step 1: Select the number K of neighbors.
Step 2: Calculate the Euclidean distance from the new point to the data points.
Step 3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step 4: Among these K neighbors, count the number of data points in each category.
Step 5: Assign the new data point to the category for which the number of neighbors is maximum.
Step 6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category; a minimal sketch of these steps appears below the figure.
Fig. 2.13: K-NN for best classifier
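The steps above map directly onto a few lines of Python. The following is a minimal sketch of Steps 1-5 with made-up sample points (not from the slides): compute Euclidean distances, take the K nearest, and vote.

    import math
    from collections import Counter

    def knn_classify(train, query, k=3):
        # train: list of ((x, y), label); query: (x, y)
        # Steps 2-3: sort training points by Euclidean distance to the query
        by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
        # Step 4: count labels among the k nearest neighbors
        votes = Counter(label for _, label in by_dist[:k])
        # Step 5: assign the category with the maximum number of neighbors
        return votes.most_common(1)[0][0]

    # Made-up example data
    train = [((1, 1), "A"), ((2, 1), "A"), ((2, 2), "A"),
             ((6, 5), "B"), ((7, 6), "B"), ((6, 7), "B")]
    print(knn_classify(train, (2.5, 2.0), k=3))   # -> "A"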

Cont…
• Firstly, we will choose the number of neighbors, i.e. the K value.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
Euc-dist[(x1, y1); (x2, y2)] = sqrt( (x2 - x1)² + (y2 - y1)² )
• By calculating the Euclidean distance we get the nearest neighbors, say three nearest neighbors in category A and two nearest neighbors in category B.
• As the 3 nearest neighbors are from category A, this new data point must belong to category A.

Why do we need a K-NN Algorithm?
• Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie?
• To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
Fig. 11: The importance of KNN

Advantages and Disadvantages of the KNN Algorithm
Advantages:
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective if the training data is large.
Disadvantages:
• We always need to determine the value of K, which may be complex at times.
• The computation cost is high because of calculating the distance between the new data point and all the training samples.

Applications of K-Nearest Neighbor
1. Credit scoring: The KNN algorithm compares an individual's credit rating to others with comparable characteristics to help calculate their credit rating.
2. Loan approval: Similar to credit scoring, the k-nearest neighbor technique is useful in detecting people who are more likely to default on loans by comparing their attributes to those of similar people.
3. Data preprocessing: Datasets can have many missing values. Missing data imputation is a procedure that uses the KNN algorithm to estimate missing values.
4. Healthcare: KNN has also had application within the healthcare industry, making predictions on the risk of heart attacks and prostate cancer. The algorithm works by calculating the most likely gene expressions.
5. Stock price prediction: The KNN algorithm is useful in estimating the future value of stocks based on previous data, since it has a knack for anticipating the prices of unknown entities.
6. Recommendation systems: KNN can be used in recommendation systems since it can help locate people with comparable traits. It can be used in an online video streaming platform, for example, to propose content that a user is more likely to view based on what other users watch.
7. Computer vision: The KNN algorithm is used for image classification. It is important in a variety of computer vision applications since it can group comparable data points together, such as cats and dogs in separate classes.
8. Easy to implement: Given the algorithm's simplicity and accuracy, it is one of the first classifiers that a new data scientist will learn.

Conti..
Problem 1: From the given dataset, find whether (x, y) = (170, 57) belongs to the underweight or normal class. Assume K = 3.
Solution:
Find the Euclidean distance: d = sqrt( (x2 - x1)² + (y2 - y1)² )
d1 = sqrt( (170 - 167)² + (57 - 51)² ) = sqrt(3² + 6²) = sqrt(45) = 6.70
d2 = sqrt(12² + 5²) = sqrt(169) = 13
and so on.

Height (cm) | Weight (kg) | Class
167 | 51 | Underweight
182 | 62 | Normal
176 | 69 | Normal
173 | 64 | Normal
172 | 65 | Normal
174 | 56 | Underweight
169 | 58 | Normal
173 | 57 | Normal
170 | 55 | Normal
170 | 57 | ?

Conti..
Since K = 3, take the 3 smallest-distance ranks. The smallest distances are:
• (169, 58), distance 1.414: Normal
• (170, 55), distance 2: Normal
• (173, 57), distance 3: Normal
Hence all 3 nearest points are Normal, so (170, 57) belongs to the Normal class (verified with scikit-learn below).

Height (cm) | Weight (kg) | Class | Distance
167 | 51 | Underweight | 6.7
182 | 62 | Normal | 13
176 | 69 | Normal | 13.4
173 | 64 | Normal | 7.6
172 | 65 | Normal | 8.2
174 | 56 | Underweight | 4.1
169 | 58 | Normal | 1.414 (rank 1)
173 | 57 | Normal | 3 (rank 3)
170 | 55 | Normal | 2 (rank 2)
170 | 57 | Normal (predicted) | -
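This prediction can be reproduced with scikit-learn's built-in KNN classifier; the check below is an added illustration using the table above:

    from sklearn.neighbors import KNeighborsClassifier

    # (height, weight) -> class, from the table above
    X = [[167, 51], [182, 62], [176, 69], [173, 64], [172, 65],
         [174, 56], [169, 58], [173, 57], [170, 55]]
    y = ["Underweight", "Normal", "Normal", "Normal", "Normal",
         "Underweight", "Normal", "Normal", "Normal"]

    knn = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean distance by default
    knn.fit(X, y)
    print(knn.predict([[170, 57]]))             # -> ['Normal']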

Conti..
Problem 2: From the given dataset, find whether (x, y) = (157, 54) belongs to the Medium or Longer class. Assume K = 3.
Solution:
Find the Euclidean distance: d = sqrt( (x2 - x1)² + (y2 - y1)² )

Sl. No. | Height | Weight | Target
1 | 150 | 50 | Medium
2 | 155 | 55 | Medium
3 | 160 | 60 | Longer
4 | 161 | 59 | Longer
5 | 158 | 65 | Longer
6 | 157 | 54 | ?

Sl. No. | Height | Weight | Target | Distance
1 | 150 | 50 | Medium | 8.06
2 | 155 | 55 | Medium | 2.24 (rank 1)
3 | 160 | 60 | Longer | 6.71 (rank 3)
4 | 161 | 59 | Longer | 6.40 (rank 2)
5 | 158 | 65 | Longer | 11.05
6 | 157 | 54 | ? | -

Conti..
From the table and K = 3, the 3 nearest distances are 2.24 (Medium), 6.40 (Longer) and 6.71 (Longer).
f(x_q) = argmax over v in V of Σᵢ₌₁ᵏ δ(v, f(xᵢ)), where δ(a, b) = 1 if a = b and δ(a, b) = 0 if a ≠ b.
Compare Medium with 2.24 (M), 6.40 (L) and 6.71 (L):
δ(M, M) + δ(M, L) + δ(M, L) = 1 + 0 + 0 = 1
Compare Longer with 2.24 (M), 6.40 (L) and 6.71 (L):
δ(L, M) + δ(L, L) + δ(L, L) = 0 + 1 + 1 = 2
Since Longer scores 2, (157, 54) belongs to the Longer class.
However, the distance 2.24 is much smaller than 6.71 and 6.40, so Medium could be considered; this motivates the distance-weighted NN variants:
1. Discrete-valued target function
2. Real-valued target function

Conti..
Discrete-valued function:
f(x_q) = argmax over v in V of Σᵢ₌₁ᵏ wᵢ δ(v, f(xᵢ)), where wᵢ = 1 / d(x_q, xᵢ)²
W.r.t. Medium:
f(x_q) = 0.199·δ(M, M) + 0.022·δ(M, L) + 0.024·δ(M, L) = 0.199·1 + 0.022·0 + 0.024·0 = 0.199
W.r.t. Longer:
f(x_q) = 0.199·δ(L, M) + 0.022·δ(L, L) + 0.024·δ(L, L) = 0.199·0 + 0.022·1 + 0.024·1 = 0.046
Since 0.199 > 0.046, the new instance is classified as Medium.

Sl. No. | Height | Weight | Target | Distance | 1/distance²
1 | 150 | 50 | Medium | 8.06 | -
2 | 155 | 55 | Medium | 2.24 (rank 1) | 0.199
3 | 160 | 60 | Longer | 6.71 (rank 3) | 0.022
4 | 161 | 59 | Longer | 6.40 (rank 2) | 0.024
5 | 158 | 65 | Longer | 11.05 | -
6 | 157 | 54 | Medium (predicted) | - | -

Conti..
Real-valued target function (the targets are now the real values in the table below):
f(x_q) = ( Σᵢ₌₁ᵏ wᵢ f(xᵢ) ) / ( Σᵢ₌₁ᵏ wᵢ ), where wᵢ = 1 / d(x_q, xᵢ)²
f(x_q) = (0.199·1.2 + 0.022·1.8 + 0.024·2.1) / (0.199 + 0.022 + 0.024) = 0.329 / 0.245 ≈ 1.34

Sl. No. | Height | Weight | Target | Distance | 1/distance²
1 | 150 | 50 | 1.5 | 8.06 | -
2 | 155 | 55 | 1.2 | 2.24 (rank 1) | 0.199
3 | 160 | 60 | 1.8 | 6.71 (rank 3) | 0.022
4 | 161 | 59 | 2.1 | 6.40 (rank 2) | 0.024
5 | 158 | 65 | 1.7 | 11.05 | -
6 | 157 | 54 | ≈ 1.34 (predicted) | - | -
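Both distance-weighted variants can be confirmed with a small sketch (an added illustration using the three nearest neighbors from the table above):

    import math

    query = (157, 54)
    # ((height, weight), class label, real-valued target) for the 3 nearest neighbors
    neighbors = [((155, 55), "Medium", 1.2),
                 ((161, 59), "Longer", 2.1),
                 ((160, 60), "Longer", 1.8)]

    weights = [1 / math.dist(query, p) ** 2 for p, _, _ in neighbors]

    # Discrete-valued: weighted vote per class
    votes = {}
    for w, (_, label, _) in zip(weights, neighbors):
        votes[label] = votes.get(label, 0.0) + w
    print(votes)   # Medium ~0.199, Longer ~0.046 -> Medium

    # Real-valued: weighted average of the targets
    print(sum(w * t for w, (_, _, t) in zip(weights, neighbors)) / sum(weights))  # ~1.34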

Conti..
Problem 3: Build the centroid classifier for the given data, and for the test instance (6, 5), predict the class.
Solution:
Step 1: Compute the mean/centroid of each class. There are 2 classes, A and B.
Centroid of class A = ((3+5+4)/3, (1+2+3)/3) = (12/3, 6/3) = (4, 2)
Centroid of class B = ((7+6+8)/3, (6+7+5)/3) = (21/3, 18/3) = (7, 6)
Step 2: Calculate the Euclidean distance between the test instance (6, 5) and each centroid.

X | Y | Class
3 | 1 | A
5 | 2 | A
4 | 3 | A
7 | 6 | B
6 | 7 | B
8 | 5 | B

Conti..
Euc-dist[(x1, y1); (x2, y2)] = sqrt( (x2 - x1)² + (y2 - y1)² )
Class A: d[(6,5); (4,2)] = sqrt( (4-6)² + (2-5)² ) = 3.6
Class B: d[(6,5); (7,6)] = sqrt( (7-6)² + (6-5)² ) = 1.414
The test instance has the smaller distance to class B.
Hence, the class of this test instance is predicted as B.
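The nearest-centroid rule is equally short in NumPy; this added sketch computes both class means and picks the closer one:

    import numpy as np

    A = np.array([[3, 1], [5, 2], [4, 3]], dtype=float)
    B = np.array([[7, 6], [6, 7], [8, 5]], dtype=float)
    test = np.array([6.0, 5.0])

    centroids = {"A": A.mean(axis=0), "B": B.mean(axis=0)}   # (4, 2) and (7, 6)
    dists = {c: np.linalg.norm(test - m) for c, m in centroids.items()}
    print(dists)                       # A: ~3.61, B: ~1.41
    print(min(dists, key=dists.get))   # -> 'B'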
Problem 4: Given the following training instances in the table, each having two attributes (x1 and x2), compute the class label for the test instance t1 = (3, 7) using 3 nearest neighbors (k = 3).

Training instance | x1 | x2 | Output
i1 | 7 | 7 | 0
i2 | 7 | 4 | 0
i3 | 3 | 4 | 1
i4 | 1 | 4 | 1

Conti..
Euc-dist = d = sqrt( (x1 - y1)² + (x2 - y2)² )

Distance to t1 = (3, 7) | Value | Neighbor rank
d1 = sqrt( (7-3)² + (7-7)² ) | 4 | 3
d2 = sqrt( (7-3)² + (4-7)² ) | 5 | 4
d3 = sqrt( (3-3)² + (4-7)² ) | 3 | 1
d4 = sqrt( (1-3)² + (4-7)² ) | 3.6 | 2
For K = 3, we consider i3 (rank 1), i4 (rank 2) and i1 (rank 3).
Their outputs are 1, 1 and 0, so for K = 3 the output for t1 = (3, 7) is 1.
Using inverse-square distance votes gives the same answer; the highest vote is 0.11, whose output is 1, so the output is 1:

d | d² | Vote = 1/d² | Rank
4 | 16 | 1/16 = 0.06 | 3
5 | 25 | 1/25 = 0.04 | 4
3 | 9 | 1/9 = 0.11 | 1
3.6 | 12.96 | 1/12.96 = 0.08 | 2

Conti..
Problem 5: Apply a KNN classifier to predict diabetic patients with the given features BMI and Age, given the following training examples. Assume K = 3. Test example: BMI = 43.6, Age = 40, Sugar = ?

BMI | Age | Sugar
33.6 | 50 | 1
26.6 | 30 | 0
23.4 | 40 | 0
43.1 | 67 | 0
35.3 | 23 | 1
35.9 | 67 | 1
36.7 | 45 | 1
25.7 | 46 | 0
23.3 | 29 | 0
31 | 56 | 1

Conti..
Solution:
First calculate the distance between the test instance and each training instance. Test example: BMI = 43.6, Age = 40, Sugar = ?
Euc-dist = d = sqrt( (x2 - x1)² + (y2 - y1)² ), e.g. d1 = sqrt( (43.6 - 33.6)² + (40 - 50)² ) = 14.14
The three nearest neighbors (ranks 1-3) have Sugar = 1, 1 and 0, so by majority vote the test example BMI = 43.6, Age = 40 is predicted Sugar = 1.

BMI | Age | Sugar | Distance to new | Rank
33.6 | 50 | 1 | 14.14 | 2
26.6 | 30 | 0 | 19.72 | 5
23.4 | 40 | 0 | 20.20 | 6
43.1 | 67 | 0 | 27.00 | 9
35.3 | 23 | 1 | 18.92 | 4
35.9 | 67 | 1 | 28.08 | 10
36.7 | 45 | 1 | 8.52 | 1
25.7 | 46 | 0 | 18.88 | 3
23.3 | 29 | 0 | 23.09 | 8
31 | 56 | 1 | 20.37 | 7

Cont…
Problem 6: Given the training data, predict the class of the following new example using KNN for K = 5: age <= 30, income = medium, student = yes, credit rating = fair.

Age | Income | Student | Credit rating | Buys computer
<=30 | High | No | Fair | No
<=30 | High | No | Excellent | No
31..40 | High | No | Fair | Yes
>40 | Medium | No | Fair | Yes
>40 | Low | Yes | Fair | Yes
>40 | Low | Yes | Excellent | No
31..40 | Low | Yes | Excellent | Yes
<=30 | Medium | No | Fair | No
<=30 | Low | Yes | Fair | Yes
>40 | Medium | Yes | Fair | Yes
<=30 | Medium | Yes | Excellent | Yes
31..40 | Medium | No | Excellent | Yes
31..40 | High | Yes | Fair | Yes
>40 | Medium | No | Excellent | No

Cont…
Solution:
• For the similarity measure, use a weighted match of attribute values:
similarity = ( Σᵢ₌₁⁴ wᵢ · δ(aᵢ, bᵢ) ) / 4
• where δ(aᵢ, bᵢ) = 1 if aᵢ = bᵢ and 0 otherwise.
• aᵢ and bᵢ are the values of age, income, student or credit rating.
• The weights are all 1, except for income, where the weight is 2.
• Now classify the new example using KNN for K = 5: age <= 30, income = medium, student = yes, credit rating = fair.
• For RID = 1 (class = no), the similarity to the new example is:
(1·1 + 2·0 + 1·0 + 1·1) / 4 = 0.5
(age <= 30 matches, income high does not match medium, student no does not match yes, credit rating fair matches.)

Cont…

Age | Income | Student | Credit rating | Buys computer | RID | Class | Similarity
<=30 | High | No | Fair | No | 1 | No | 0.5
<=30 | High | No | Excellent | No | 2 | No | 0.25
31..40 | High | No | Fair | Yes | 3 | Yes | 0.25
>40 | Medium | No | Fair | Yes* | 4 | Yes | 0.75
>40 | Low | Yes | Fair | Yes | 5 | Yes | 0.5
>40 | Low | Yes | Excellent | No | 6 | No | 0.25
31..40 | Low | Yes | Excellent | Yes | 7 | Yes | 0.25
<=30 | Medium | No | Fair | No | 8 | No | 1
<=30 | Low | Yes | Fair | Yes* | 9 | Yes | 0.75
>40 | Medium | Yes | Fair | Yes* | 10 | Yes | 1
<=30 | Medium | Yes | Excellent | Yes* | 11 | Yes | 1
31..40 | Medium | No | Excellent | Yes | 12 | Yes | 0.5
31..40 | High | Yes | Fair | Yes | 13 | Yes | 0.5
>40 | Medium | No | Excellent | No | 14 | No | 0.5

Cont…
• Therefore, among the five nearest neighbors (RID and similarity values: 4: 0.75, 8: 1, 9: 0.75, 10: 1, 11: 1), four are from class Yes and one from class No.
• Hence, the KNN classifier predicts buys computer = yes. A sketch of this weighted matching appears below.
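The weighted attribute-matching similarity is straightforward to code. The sketch below is an added illustration of the rule used in Problem 6 (weight 2 on income, 1 elsewhere):

    # Training rows: (age, income, student, credit) -> class, from the table above
    rows = [
        (("<=30", "High", "No", "Fair"), "No"),
        (("<=30", "High", "No", "Excellent"), "No"),
        (("31..40", "High", "No", "Fair"), "Yes"),
        ((">40", "Medium", "No", "Fair"), "Yes"),
        ((">40", "Low", "Yes", "Fair"), "Yes"),
        ((">40", "Low", "Yes", "Excellent"), "No"),
        (("31..40", "Low", "Yes", "Excellent"), "Yes"),
        (("<=30", "Medium", "No", "Fair"), "No"),
        (("<=30", "Low", "Yes", "Fair"), "Yes"),
        ((">40", "Medium", "Yes", "Fair"), "Yes"),
        (("<=30", "Medium", "Yes", "Excellent"), "Yes"),
        (("31..40", "Medium", "No", "Excellent"), "Yes"),
        (("31..40", "High", "Yes", "Fair"), "Yes"),
        ((">40", "Medium", "No", "Excellent"), "No"),
    ]
    weights = (1, 2, 1, 1)   # income counts double

    def similarity(a, b):
        # Weighted fraction of matching attribute values
        return sum(w * (x == y) for w, x, y in zip(weights, a, b)) / 4

    new = ("<=30", "Medium", "Yes", "Fair")
    top5 = sorted(rows, key=lambda r: similarity(r[0], new), reverse=True)[:5]
    labels = [label for _, label in top5]
    print(labels, "->", max(set(labels), key=labels.count))   # four Yes, one No -> Yes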

Clustering: K-Means
• The task of grouping data points based on their similarity with each other is called clustering or cluster analysis.
• This method is defined under the branch of unsupervised learning, which aims at gaining insights from unlabelled data points.
• Cluster analysis divides the data into groups (clusters) that are meaningful, useful, or both.
• For instance, clustering can be regarded as a form of classification in that it creates a labeling of objects with class (cluster) labels.

K-Means
• K-Means clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning or data science.
• K-means clustering assigns data points to one of K clusters depending on their distance from the centers of the clusters.
• It starts by randomly assigning the cluster centroids in the space.
• Then each data point is assigned to one of the clusters based on its distance from the centroid of the cluster.
• After assigning each point to one of the clusters, new cluster centroids are computed.
• This process runs iteratively until it finds good clusters.
• Here, K defines the number of pre-defined clusters that need to be created in the process: if K = 2 there will be two clusters, for K = 3 there will be three clusters, and so on.
• Hence, each cluster has data points with some commonalities/similarities, and each is away from the other clusters.

The Basic K-Means Algorithm
• First, we randomly initialize K points, called means or cluster centroids.
• We categorize each item to its closest mean, and we update the mean's coordinates, which are the averages of the items categorized in that cluster so far.
• We repeat the process for a given number of iterations, and at the end we have our clusters.
Basic K-means algorithm (a minimal sketch follows the steps):
Step 1: Select the number K to decide the number of clusters.
Step 2: Select K random points or centroids (they can be other than the input dataset).
Step 3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step 4: Calculate the variance and place a new centroid in each cluster.
Step 5: Repeat the third step, i.e. reassign each data point to the new closest centroid of its cluster.
Step 6: If any reassignment occurred, go to Step 4; otherwise FINISH.
Step 7: The model is ready.
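The following is a minimal NumPy sketch of these steps, assuming random initial centroids chosen from the data (an added illustration, not the slides' own code; it also assumes no cluster ever becomes empty):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        # Basic K-means: returns (centroids, labels) for data X of shape (n, d)
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), k, replace=False)]   # Step 2: random centroids
        for _ in range(n_iter):
            # Step 3: assign each point to its closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 4: recompute each centroid as the mean of its cluster
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):         # Step 6: no change -> stop
                break
            centroids = new_centroids
        return centroids, labels

    # 1-D demo with the data of Problem 1 below
    X = np.array([[2], [3], [4], [10], [11], [12], [20], [25], [30]], dtype=float)
    centroids, labels = kmeans(X, k=2)
    print(centroids.ravel(), labels)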

Strengths and Weaknesses
Strengths:
• K-means is simple and can be used for a wide variety of data types.
• It is also quite efficient, even though multiple runs are often performed.
• This algorithm is very easy to understand and implement.
• This algorithm is efficient, robust and flexible.
• If the clusters are distinct and spherical, it gives the best results.
Weaknesses:
• This algorithm needs prior specification of the number of cluster centers, that is, the value of K.
• It cannot handle outliers and noisy data, as the centroids get deflected.
• It does not work well with very large datasets, as it takes huge computational time.

Cont…
Problem 1: Divide the given sample data into two clusters using the K-means algorithm: S = {2, 3, 4, 10, 11, 12, 20, 25, 30}. Given K = 2, for the new data point 15, identify the cluster it belongs to.
Solution:
1. Choose 2 random centroids from the given dataset: C1 = 4, C2 = 12.
2. Find the distance between the given samples and the centroids, and put each sample in the nearest cluster.
3. Repeat the same for all data points.
Cluster K1 = {2, 3, 4} (|2-4| = 2, |3-4| = 1, |4-4| = 0, |10-4| = 6, ...; |2-12| = 10, |3-12| = 9, |4-12| = 8, |10-12| = 2, ...), so 2, 3 and 4 are placed in cluster 1, as their distance is nearest to C1 = 4, and
Cluster K2 = {10, 11, 12, 20, 25, 30}.
4. Compute new centroids:
K1 = {2, 3, 4}, C1 = (2+3+4)/3 = 3
K2 = {10, 11, 12, 20, 25, 30}, C2 = (10+11+12+20+25+30)/6 = 18
So C1 = 3 and C2 = 18.

Cont…
5. Find the new clustering with C1 = 3 and C2 = 18:
K1 = {2, 3, 4, 10}, C1 = (2+3+4+10)/4 = 4.75; K2 = {11, 12, 20, 25, 30}, C2 = (11+12+20+25+30)/5 = 19.6
6. Find the new clustering with C1 = 4.75 and C2 = 19.6:
K1 = {2, 3, 4, 10, 11, 12}, C1 = (2+3+4+10+11+12)/6 = 7; K2 = {20, 25, 30}, C2 = (20+25+30)/3 = 25
7. Find the new clustering with C1 = 7 and C2 = 25:
K1 = {2, 3, 4, 10, 11, 12} and K2 = {20, 25, 30}
Since the clustering and centroid values remain the same, the given dataset is divided into 2 clusters:
K1 = {2, 3, 4, 10, 11, 12} and K2 = {20, 25, 30}, with centroids C1 = 7 and C2 = 25.
8. Identify the cluster for the new data point 15:
Distance between 15 and C1: |15 - 7| = 8
Distance between 15 and C2: |15 - 25| = 10
Since the distance between 15 and C1 is less, the new data point 15 belongs to cluster C1 (centroid 7).

Cont…
Problem 2: Divide the following data points into two clusters using K-means and identify which cluster (5, 4) belongs to.

X | 1 | 2 | 2 | 3 | 4 | 5
Y | 1 | 1 | 3 | 2 | 3 | 5

Solution:
Step 1: Choose 2 random cluster centers: C1 = (2, 1) and C2 = (2, 3).
Step 2: Find the distance between the two cluster centers and each data point (apply Euclidean distance).
For data point (1, 1) and C1 (2, 1): d = sqrt( (1-2)² + (1-1)² ) = 1
(2, 1) and (2, 1): d = 0
(2, 3) and (2, 1): d = 2, and so on.

Data points | Distance from C1 (2,1) | Distance from C2 (2,3) | New cluster
(1,1) | 1 | 2.24 | C1
(2,1) | 0 | 2 | C1
(2,3) | 2 | 0 | C2
(3,2) | 1.41 | 1.41 | C1
(4,3) | 2.83 | 2 | C2
(5,5) | 5 | 3.61 | C2

Cont…
Step 3: Cluster 1 of C1 = {(1,1), (2,1), (3,2)}; cluster 2 of C2 = {(2,3), (4,3), (5,5)}.
Step 4: Recalculate the cluster centers:
C1 = (1/3)[(1,1) + (2,1) + (3,2)] = (1/3)(6, 4) = (2, 1.33)
C2 = (1/3)[(2,3) + (4,3) + (5,5)] = (1/3)(11, 11) = (3.67, 3.67)
Step 5: Repeat step 2 until we get the same cluster centers or the same cluster elements.

Data points | Distance from C1 (2,1.33) | Distance from C2 (3.67,3.67) | New cluster
(1,1) | 1.05 | 3.78 | C1
(2,1) | 0.33 | 3.15 | C1
(2,3) | 1.67 | 1.8 | C1
(3,2) | 1.204 | 1.8 | C1
(4,3) | 2.605 | 0.75 | C2
(5,5) | 4.74 | 1.88 | C2

Cont…
Cluster 1 of C1 = {(1,1), (2,1), (2,3), (3,2)}; cluster 2 of C2 = {(4,3), (5,5)}.
Step 6: Recalculate the cluster centers:
C1 = (1/4)[(1,1) + (2,1) + (2,3) + (3,2)] = (1/4)(8, 7) = (2, 1.75)
C2 = (1/2)[(4,3) + (5,5)] = (1/2)(9, 8) = (4.5, 4)
Step 7: Repeat step 2 until we get the same cluster centers or the same cluster elements.
Step 8: Cluster 1 of C1 = {(1,1), (2,1), (2,3), (3,2)}; cluster 2 of C2 = {(4,3), (5,5)}.
Since the cluster elements are the same as in the previous iteration, stop.
Finally, for the new point (5, 4): distance to C1 (2, 1.75) = sqrt(9 + 5.06) ≈ 3.75 and distance to C2 (4.5, 4) = 0.5, so (5, 4) belongs to cluster C2.

Data points | Distance from C1 (2,1.75) | Distance from C2 (4.5,4) | New cluster
(1,1) | 1.25 | 4.61 | C1
(2,1) | 0.75 | 3.9 | C1
(2,3) | 1.25 | 2.69 | C1
(3,2) | 1.03 | 2.5 | C1
(4,3) | 2.36 | 1.12 | C2
(5,5) | 4.42 | 1.12 | C2

Cont…
Problem 3: Use K-means clustering to cluster the following data into two groups. Data points: {2, 4, 10, 12, 3, 20, 30, 11, 25}; the initial cluster centroids are M1 = 4 and M2 = 11.
Solution: Initial centroids: M1 = 4, M2 = 11.
The distance is calculated by d(x2, x1) = |x2 - x1|.
Therefore, C1 = {2, 4, 3}, M1 = (2+4+3)/3 = 3
C2 = {10, 12, 20, 30, 11, 25}, M2 = (10+12+20+30+11+25)/6 = 18
So the new centroids are M1 = 3, M2 = 18.

Data points | Distance to M1 (4) | Distance to M2 (11) | Cluster
2 | 2 | 9 | C1
4 | 0 | 7 | C1
10 | 6 | 1 | C2
12 | 8 | 1 | C2
3 | 1 | 8 | C1
20 | 16 | 9 | C2
30 | 26 | 19 | C2
11 | 7 | 0 | C2
25 | 21 | 14 | C2

Cont…
Current centroids: M1 = 3, M2 = 18.
Therefore, C1 = {2, 4, 10, 3} and C2 = {12, 20, 30, 11, 25}.
So the new centroids are M1 = (2+4+10+3)/4 = 4.75 and M2 = (12+20+30+11+25)/5 = 19.6.

Data points | Distance to M1 (3) | Distance to M2 (18) | Cluster | New cluster
2 | 1 | 16 | C1 | C1
4 | 1 | 14 | C1 | C1
10 | 7 | 8 | C2 | C1
12 | 9 | 6 | C2 | C2
3 | 0 | 15 | C1 | C1
20 | 17 | 2 | C2 | C2
30 | 27 | 12 | C2 | C2
11 | 8 | 7 | C2 | C2
25 | 22 | 7 | C2 | C2

Cont…
Current centroids: M1 = 4.75, M2 = 19.6.
Therefore, C1 = {2, 4, 10, 11, 12, 3} and C2 = {20, 30, 25}.
So the new centroids are M1 = 7 and M2 = 25.

Data points | Distance to M1 (4.75) | Distance to M2 (19.6) | Cluster | New cluster
2 | 2.75 | 17.6 | C1 | C1
4 | 0.75 | 15.6 | C1 | C1
10 | 5.25 | 9.6 | C1 | C1
12 | 7.25 | 7.6 | C2 | C1
3 | 1.75 | 16.6 | C1 | C1
20 | 15.25 | 0.4 | C2 | C2
30 | 25.25 | 10.4 | C2 | C2
11 | 6.25 | 8.6 | C2 | C1
25 | 20.25 | 5.4 | C2 | C2

Cont…
Current centroids: M1 = 7, M2 = 25.
Therefore, the final clusters are:
• C1 = {2, 4, 10, 11, 12, 3}
• C2 = {20, 30, 25}

Data points | Distance to M1 (7) | Distance to M2 (25) | Cluster | New cluster
2 | 5 | 23 | C1 | C1
4 | 3 | 21 | C1 | C1
10 | 3 | 15 | C1 | C1
12 | 5 | 13 | C1 | C1
3 | 4 | 22 | C1 | C1
20 | 13 | 5 | C2 | C2
30 | 23 | 5 | C2 | C2
11 | 4 | 14 | C1 | C1
25 | 18 | 0 | C2 | C2

Cont…
Problem 4: Use K-means clustering, supposing that the data mining task is to cluster the points into 3 clusters, where the data points are A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). Suppose initially we assign A1, B1 and C1 as the center of each cluster, respectively.
Solution: Initial centroids: A1 = (2,10), B1 = (5,8), C1 = (1,2).
The distance is calculated by d(p1, p2) = sqrt( (x2 - x1)² + (y2 - y1)² ).
Therefore, cluster 1 = {(2,10)}, cluster 2 = {(8,4), (5,8), (7,5), (6,4), (4,9)}, cluster 3 = {(2,5), (1,2)}.
So the new centroids are: A1 = (2, 10), B1 = (6, 6) and C1 = (1.5, 3.5).

Data point | Dist. to (2,10) | Dist. to (5,8) | Dist. to (1,2) | Cluster
A1 (2,10) | 0 | 3.61 | 8.06 | 1
A2 (2,5) | 5 | 4.24 | 3.16 | 3
A3 (8,4) | 8.49 | 5 | 7.28 | 2
B1 (5,8) | 3.61 | 0 | 7.21 | 2
B2 (7,5) | 7.07 | 3.61 | 6.71 | 2
B3 (6,4) | 7.21 | 4.12 | 5.39 | 2
C1 (1,2) | 8.06 | 7.21 | 0 | 3
C2 (4,9) | 2.24 | 1.41 | 7.62 | 2

Cont…
Current centroids: A1 = (2,10), B1 = (6,6), C1 = (1.5,3.5).
Therefore, cluster 1 = {(2,10), (4,9)}, cluster 2 = {(8,4), (5,8), (7,5), (6,4)}, cluster 3 = {(2,5), (1,2)}.
So the new centroids are: A1 = (3, 9.5), B1 = (6.5, 5.25) and C1 = (1.5, 3.5).

Data point | Dist. to (2,10) | Dist. to (6,6) | Dist. to (1.5,3.5) | Cluster | New cluster
A1 (2,10) | 0 | 5.66 | 6.52 | 1 | 1
A2 (2,5) | 5 | 4.12 | 1.58 | 3 | 3
A3 (8,4) | 8.49 | 2.83 | 6.52 | 2 | 2
B1 (5,8) | 3.61 | 2.24 | 5.7 | 2 | 2
B2 (7,5) | 7.07 | 1.41 | 5.7 | 2 | 2
B3 (6,4) | 7.21 | 2.00 | 4.53 | 2 | 2
C1 (1,2) | 8.06 | 6.46 | 1.58 | 3 | 3
C2 (4,9) | 2.24 | 3.61 | 6.04 | 2 | 1

Cont…
Current centroids: A1 = (3, 9.5), B1 = (6.5, 5.25), C1 = (1.5, 3.5).
Therefore, the new centroids are: A1 = (3.67, 9), B1 = (7, 4.33) and C1 = (1.5, 3.5).

Data point | Dist. to (3,9.5) | Dist. to (6.5,5.25) | Dist. to (1.5,3.5) | Cluster | New cluster
A1 (2,10) | 1.12 | 6.54 | 6.52 | 1 | 1
A2 (2,5) | 4.61 | 4.51 | 1.58 | 3 | 3
A3 (8,4) | 7.43 | 1.95 | 6.52 | 2 | 2
B1 (5,8) | 2.5 | 3.13 | 5.7 | 2 | 1
B2 (7,5) | 6.02 | 0.56 | 5.7 | 2 | 2
B3 (6,4) | 6.26 | 1.35 | 4.53 | 2 | 2
C1 (1,2) | 7.76 | 6.39 | 1.58 | 3 | 3
C2 (4,9) | 1.12 | 4.51 | 6.04 | 1 | 1

Cont…
Current centroids: A1 = (3.67, 9), B1 = (7, 4.33), C1 = (1.5, 3.5).
Therefore, the final clusters are:
Cluster 1 = {(2,10), (5,8), (4,9)}
Cluster 2 = {(8,4), (7,5), (6,4)}
Cluster 3 = {(2,5), (1,2)}

Data point | Dist. to (3.67,9) | Dist. to (7,4.33) | Dist. to (1.5,3.5) | Cluster | New cluster
A1 (2,10) | 1.94 | 7.56 | 6.52 | 1 | 1
A2 (2,5) | 4.33 | 5.04 | 1.58 | 3 | 3
A3 (8,4) | 6.62 | 1.05 | 6.52 | 2 | 2
B1 (5,8) | 1.67 | 4.18 | 5.70 | 1 | 1
B2 (7,5) | 5.21 | 0.67 | 5.70 | 2 | 2
B3 (6,4) | 5.52 | 1.05 | 4.53 | 2 | 2
C1 (1,2) | 7.49 | 6.44 | 1.58 | 3 | 3
C2 (4,9) | 0.33 | 5.55 | 6.04 | 1 | 1

Hierarchical Clustering
• Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabeled datasets into clusters.
• It is a connectivity-based clustering model that groups together data points that are close to each other, based on a measure of similarity or distance.
• The assumption is that data points that are close to each other are more similar or related than data points that are farther apart.
• It is based on the idea of creating a hierarchy of clusters, where each cluster is made up of smaller clusters that can be further divided into even smaller clusters.
• This hierarchical structure makes it easy to visualize the data and identify patterns within the data.
Hierarchical clustering is of two types:
• Agglomerative clustering
• Divisive clustering

Agglomerative Clustering
• Agglomerative clustering is a type of data clustering method used in unsupervised learning.
• It begins with N groups, each initially containing one entity, and then the two most similar groups merge at each stage until there is a single group containing all the data.
• It is an iterative process that groups similar objects into clusters based on some measure of similarity.
• It uses a bottom-up approach for dividing data points into clusters.
• The algorithm begins by assigning each object to its own cluster.
• It then uses a distance metric to determine the similarity between objects and clusters.
• If two clusters have similar elements, they are merged together into a larger cluster.
• This continues until all objects are grouped into one final cluster.

Agglomerative Hierarchical Clustering Algorithm
• Step 1: Consider each data point as a single cluster and calculate the distance of one cluster from all the other clusters.
• Step 2: Merge the most comparable clusters into a single cluster. Say cluster (B) and cluster (C) are very similar to each other; we merge them, and similarly clusters (D) and (E), and at last we get the clusters [(A), (BC), (DE), (F)].
• Step 3: Recalculate the proximity according to the algorithm and merge the two nearest clusters ([(DE), (F)]) together to form the new clusters [(A), (BC), (DEF)].
• Step 4: Repeating the same process, the clusters DEF and BC are comparable and are merged together to form a new cluster. We are now left with the clusters [(A), (BCDEF)].
• Step 5: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
A short SciPy sketch of this procedure appears below.
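In practice the whole merge sequence comes from SciPy. The sketch below is an added illustration using the data of Exercise Problem 1 later in this module; the linkage method is selectable as 'single', 'complete' or 'average' (the linkages used in this module):

    from scipy.cluster.hierarchy import linkage, fcluster

    data = [[18], [22], [25], [27], [42], [43]]    # 1-D points as single-feature rows
    Z = linkage(data, method='single', metric='euclidean')
    print(Z)   # each row: [cluster_i, cluster_j, merge_distance, new_cluster_size]

    # Cut the hierarchy at a distance threshold to obtain flat clusters
    print(fcluster(Z, t=5, criterion='distance'))  # -> {18,22,25,27} vs {42,43}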

Cont…
Average linkage clustering uses the average formula, i.e. the distance between two clusters A and B is:
d(A, B) = avg{ d(x, y) : x ∈ A, y ∈ B }
d(A, B) = ( Σ over x ∈ A, y ∈ B of d(x, y) ) / ( |A| · |B| )
Fig. 9: Concept of agglomerative clustering

Key Issues in Hierarchical Clustering
Lack of a Global Objective Function:
• Agglomerative hierarchical clustering techniques use various criteria to decide locally, at each step, which clusters should be merged (or split, for divisive approaches). This approach yields clustering algorithms that avoid the difficulty of attempting to solve a hard combinatorial optimization problem.
• They do not have problems with local minima or difficulties in choosing initial points.
Ability to Handle Different Cluster Sizes:
• There are two approaches: weighted, which treats all clusters equally, and unweighted, which takes the number of points in each cluster into account.
• Treating clusters of unequal size equally gives different weights to the points in different clusters, while taking the cluster size into account gives points in different clusters the same weight.
Merging Decisions are Final:
• Agglomerative hierarchical clustering algorithms tend to make good local decisions about combining two clusters, since they can use information about the pairwise similarity of all points.
• However, once made, a merging decision cannot be undone; this prevents a local optimization criterion from becoming a global optimization criterion.

Advantages and Disadvantages of the Agglomerative Hierarchical Clustering Algorithm
Advantages:
1. Performance: It is effective at discerning structure from the shape of the data and returns accurate results.
2. Easy: It is easy to use and provides better user guidance with good community support. Much content and good documentation are available for a better user experience.
3. More approaches: There are two approaches with which datasets can be trained and tested: agglomerative and divisive.
4. Performance on small datasets: Hierarchical clustering algorithms are effective on small datasets and return accurate and reliable results with lower training and testing time.
Disadvantages:
1. Time complexity: As many iterations and calculations are involved, the time complexity of hierarchical clustering is high. In some cases, it is one of the main reasons for preferring K-means clustering.
2. Space complexity: As many calculations are associated with every step, the space complexity of the algorithm is very high. Due to this, the memory footprint of the model must be considered while implementing hierarchical clustering. In such cases, we prefer K-means clustering.
3. Poor performance on large datasets: When training a hierarchical clustering algorithm on large datasets, the training process takes so much time and space that it results in poor performance.

Exercise Problems
Problem 1: Consider the following set of 6 one-dimensional data points: 18, 22, 25, 42, 27, 43. Merge the clusters using minimum (single-link) distance and update the proximity matrix accordingly. Show the proximity matrix at each iteration.
Solution:

 | 18 | 22 | 25 | 27 | 42 | 43
18 | 0 | 4 | 7 | 9 | 24 | 25
22 | 4 | 0 | 3 | 5 | 20 | 21
25 | 7 | 3 | 0 | 2 | 17 | 18
27 | 9 | 5 | 2 | 0 | 15 | 16
42 | 24 | 20 | 17 | 15 | 0 | 1
43 | 25 | 21 | 18 | 16 | 1 | 0

Since the minimum distance is 1 (between 42 and 43), merge 42 and 43:

 | 18 | 22 | 25 | 27 | 42,43
18 | 0 | 4 | 7 | 9 | 24
22 | 4 | 0 | 3 | 5 | 20
25 | 7 | 3 | 0 | 2 | 17
27 | 9 | 5 | 2 | 0 | 15
42,43 | 24 | 20 | 17 | 15 | 0

In this matrix, since 2 is the minimum distance, merge (25, 27).

Exercise Problems (cont.)

 | 18 | 22 | 25,27 | 42,43
18 | 0 | 4 | 7 | 24
22 | 4 | 0 | 3 | 20
25,27 | 7 | 3 | 0 | 15
42,43 | 24 | 20 | 15 | 0

Since 3 is the minimum distance, merge 22 with (25, 27): {22, (25, 27)}.

 | 18 | 22,25,27 | 42,43
18 | 0 | 4 | 24
22,25,27 | 4 | 0 | 15
42,43 | 24 | 15 | 0

Since 4 is the minimum distance, merge 18 with {22, (25, 27)}: [18, {22, (25, 27)}].
Draw the dendrogram for the merged data points.

Problems
Problem 2: For the given dataset, find the clusters using the single-link technique. Use Euclidean distance and draw the dendrogram.
Solution:
Step 1: Compute the distance matrix using Euclidean distance. Let A = (x1, y1) and B = (x2, y2). Then the Euclidean distance between the two points is:
d(A, B) = sqrt( (x2 - x1)² + (y2 - y1)² )

Sample No. | X | Y
P1 | 0.40 | 0.53
P2 | 0.22 | 0.38
P3 | 0.35 | 0.32
P4 | 0.26 | 0.19
P5 | 0.08 | 0.41
P6 | 0.45 | 0.30

Conti..
d(P1, P2) = sqrt( (0.22 - 0.40)² + (0.38 - 0.53)² ) = 0.23
d(P1, P3) = sqrt( (0.35 - 0.40)² + (0.32 - 0.53)² ) = 0.22
d(P2, P3) = sqrt( (0.35 - 0.22)² + (0.32 - 0.38)² ) = 0.14, and so on.
Step 2: Merge the two closest members. Here, the minimum value is 0.10, and hence we combine P3 and P6 (0.10 appears in the P6 row, P3 column). Now form the cluster of elements corresponding to the minimum value and update the distance matrix.

 | P1 | P2 | P3 | P4 | P5 | P6
P1 | 0
P2 | 0.23 | 0
P3 | 0.22 | 0.14 | 0
P4 | 0.37 | 0.19 | 0.13 | 0
P5 | 0.34 | 0.14 | 0.28 | 0.23 | 0
P6 | 0.24 | 0.24 | 0.10 | 0.22 | 0.39 | 0

Conti..
(P3, P6)
Merge the two closest members of the two clusters. The minimum value is now 0.13, and hence we combine (P3, P6) with P4: {(P3, P6), P4}.

 | P1 | P2 | P3,P6 | P4 | P5
P1 | 0
P2 | 0.23 | 0
P3,P6 | 0.22 | 0.14 | 0
P4 | 0.37 | 0.19 | 0.13 | 0
P5 | 0.34 | 0.14 | 0.28 | 0.23 | 0

 | P1 | P2 | P3,P6,P4 | P5
P1 | 0
P2 | 0.23 | 0
P3,P6,P4 | 0.22 | 0.14 | 0
P5 | 0.34 | 0.14 | 0.28 | 0

Conti..
Now combine P2 and P5: [{(P3, P6), P4}, (P2, P5)].
Now update the matrix and merge (P2, P5) with (P3, P6, P4):
([{(P3, P6), P4}, (P2, P5)], P1)
Now we have reached the solution.

 | P1 | P2,P5 | P3,P6,P4
P1 | 0
P2,P5 | 0.23 | 0
P3,P6,P4 | 0.22 | 0.14 | 0

 | P1 | P2,P5,P3,P6,P4
P1 | 0
P2,P5,P3,P6,P4 | 0.22 | 0

Conti
The dendrogram for this solution has leaf order P3, P6, P4, P2, P5, P1.
Dendrogram of the cluster formed for the group P1, P2, P3, P4, P5 and P6.

Conti..
Problem 3: Given the one-dimensional dataset {1, 5, 8, 10, 2}, use the agglomerative clustering algorithm with complete link and Euclidean distance to establish a hierarchical grouping relationship. Using a cutting threshold of 5, how many clusters are there? What is the membership of each group?
Solution:
Euclidean distance = sqrt( (x2 - x1)² + (y2 - y1)² ); for 1-dimensional data, Euc-dist = |x2 - x1|.
Apply the 1-D Euclidean distance to calculate the matrix, labelling the values 1, 5, 8, 10, 2 as points 1-5:

 | 1 | 5 | 8 | 10 | 2
1 | 0 | 4 | 7 | 9 | 1
5 | 4 | 0 | 3 | 5 | 3
8 | 7 | 3 | 0 | 2 | 6
10 | 9 | 5 | 2 | 0 | 8
2 | 1 | 3 | 6 | 8 | 0

(points) | 1 | 2 | 3 | 4 | 5
1 | 0 | 4 | 7 | 9 | 1
2 | 4 | 0 | 3 | 5 | 3
3 | 7 | 3 | 0 | 2 | 6
4 | 9 | 5 | 2 | 0 | 8
5 | 1 | 3 | 6 | 8 | 0

Conti..
From the distance matrix, the distance between points 1 and 5 (values 1 and 2) is smallest, i.e. 1. So merge {1, 5}.
Now recalculate the distances (complete link takes the maximum):
d(2, {1,5}) = max{d(2,1), d(2,5)} = max(4, 3) = 4
d(3, {1,5}) = max{d(3,1), d(3,5)} = max(7, 6) = 7
d(4, {1,5}) = max{d(4,1), d(4,5)} = max(9, 8) = 9
From the matrix, the distance between points 3 and 4 is now smallest, i.e. 2; hence they merge together to form the cluster {3, 4}.
Using the complete link, we have the distances between the different points/clusters as follows:
d({1,5}, {3,4}) = max{d({1,5},3), d({1,5},4)} = max(7, 9) = 9
d(2, {3,4}) = max{d(2,3), d(2,4)} = max(3, 5) = 5
Thus we can update the distance matrix, where row 2 corresponds to point 2 and rows 1 and 3 correspond to clusters {1,5} and {3,4}, as follows:

 | 1,5 | 2 | 3 | 4
1,5 | 0 | 4 | 7 | 9
2 | 4 | 0 | 3 | 5
3 | 7 | 3 | 0 | 2
4 | 9 | 5 | 2 | 0

 | 1,5 | 2 | 3,4
1,5 | 0 | 4 | 9
2 | 4 | 0 | 5
3,4 | 9 | 5 | 0

Conti..
Following the same procedure, we merge point 2 with the cluster {1, 5} (at distance 4) to form {1, 2, 5} and update the distance matrix as follows.
With the cutting threshold of 5, the final merge (at distance 9) does not happen, so there are two clusters: {1, 2, 5} (values 1, 5, 2) and {3, 4} (values 8, 10). After increasing the distance threshold to 9, all clusters would merge.
Fig 12: Dendrogram for the given dataset

 | {1,5},2 | {3,4}
{1,5},2 | 0 | 9
{3,4} | 9 | 0
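SciPy can confirm this cut (an added check, not from the slides): complete linkage over the five values, then a flat cut at distance 5:

    from scipy.cluster.hierarchy import linkage, fcluster

    values = [[1], [5], [8], [10], [2]]            # points 1..5 of the problem
    Z = linkage(values, method='complete')         # complete-link agglomerative clustering
    print(fcluster(Z, t=5, criterion='distance'))  # -> two clusters: {1, 5, 2} and {8, 10}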

Conti..
Problem 4: Given the dataset {a, b, c, d, e} and the following distance matrix, construct a dendrogram by average-linkage hierarchical clustering using the agglomerative method.
Solution:
Average linkage clustering uses the average formula, i.e. the distance between two clusters A and B is:
d(A, B) = ( Σ over x ∈ A, y ∈ B of d(x, y) ) / ( |A| · |B| )

 | a | b | c | d | e
a | 0 | 9 | 3 | 6 | 11
b | 9 | 0 | 7 | 5 | 10
c | 3 | 7 | 0 | 9 | 2
d | 6 | 5 | 9 | 0 | 8
e | 11 | 10 | 2 | 8 | 0

Conti..
Dataset: {a, b, c, d, e}
Initial clustering (singleton sets): C1 = {a}, {b}, {c}, {d}, {e}
From the table, the minimum distance is the distance between the clusters {c} and {e}: d({c}, {e}) = 2.
We merge {c} and {e} to form the cluster {c, e}.
The new set of clusters: C2 = {a}, {b}, {d}, {c, e}

 | a | b | c,e | d
a | 0 | 9 | ? | 6
b | 9 | 0 | ? | 5
c,e | ? | ? | 0 | ?
d | 6 | 5 | ? | 0

Conti..
Let us compute the distance of {c, e} from the other clusters:
d({c,e}, {a}) = avg{d(c,a), d(e,a)} = (3 + 11) / 2 = 7
d({c,e}, {b}) = avg{d(c,b), d(e,b)} = (7 + 10) / 2 = 8.5
d({c,e}, {d}) = avg{d(c,d), d(e,d)} = (9 + 8) / 2 = 8.5
Now update the table. From the C2 table, the minimum distance is the distance between the clusters {b} and {d}: d({b}, {d}) = 5.
We merge {b} and {d} to form the cluster {b, d}.
The new set of clusters, C3: {a}, {c, e}, {b, d}

 | a | b | c,e | d
a | 0 | 9 | 7 | 6
b | 9 | 0 | 8.5 | 5
c,e | 7 | 8.5 | 0 | 8.5
d | 6 | 5 | 8.5 | 0

Conti..
Let us compute the distance of {b, d} from the other clusters:
d({b,d}, {a}) = avg{d(b,a), d(d,a)} = (9 + 6) / 2 = 7.5
d({b,d}, {c,e}) = avg{d(b,c), d(b,e), d(d,c), d(d,e)} = (7 + 10 + 9 + 8) / (2·2) = 8.5

 | a | b,d | c,e
a | 0 | 7.5 | 7
b,d | 7.5 | 0 | 8.5
c,e | 7 | 8.5 | 0

Conti..
From the table, the minimum distance is the distance between the clusters {a} and {c, e}: d({a}, {c,e}) = 7.
We merge {a} and {c, e} to form the cluster {a, c, e}.
The new set of clusters, C4: {a, c, e}, {b, d}
Let us compute the distance of {a, c, e} from the other cluster:
d({a,c,e}, {b,d}) = avg{d(a,b), d(a,d), d(c,b), d(c,d), d(e,b), d(e,d)} = (9 + 6 + 7 + 9 + 10 + 8) / (3·2) ≈ 8.17
Fig 11: Dendrogram for the dataset {a, b, c, d, e}

 | a,c,e | b,d
a,c,e | 0 | 8.17
b,d | 8.17 | 0

Divisive Clustering
• Divisive clustering is also a type of hierarchical clustering that is used to create clusters of data points.
• It is an unsupervised learning algorithm that begins by placing all the data points in a single cluster and then progressively splits the clusters until each data point is in its own cluster.
• It is useful for analyzing datasets that may have complex structures or patterns, as it can help identify clusters that may not be obvious at first glance.
• Divisive clustering works by first assigning all the data points to one cluster.
• Then, it looks for ways to split this cluster into two or more smaller clusters.
• This process continues until each data point is in its own cluster.

Cont…
Steps of Divisive Hierarchical Clustering
The algorithm for divisive hierarchical clustering involves several steps:
Step 1: Consider all objects part of one big cluster.
Step 2: Split the big cluster into small clusters using any flat-clustering method, e.g. k-means.
Step 3: Select an object or subgroup to split into two smaller sub-clusters based on some distance metric, such as Euclidean distance or correlation coefficients.
Step 4: The process continues recursively until each object forms its own cluster.
Fig. 12: Concept of divisive hierarchical clustering

Cont…
Fig. 13: Differences between the agglomerative and divisive algorithms

Conti..
1. The k-NN algorithm does more computation at test time than at train time.
A) TRUE
B) FALSE
2. Which of the following distance metrics cannot be used in k-NN?
A) Manhattan
B) Minkowski
C) Tanimoto
D) Jaccard
E) Mahalanobis
F) All can be used
3) Which of the following options is true about the k-NN algorithm?
A) It can be used for classification
B) It can be used for regression
C) It can be used for both classification and regression
4) Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
A) K-NN
B) Linear Regression
C) Logistic Regression
5) Which of the following will be the Euclidean distance between the two data points A(1, 3) and B(2, 3)?
A) 1
B) 2
C) 4
D) 8

A. K-Means clustering comes under:
1. Supervised learning algorithms
2. Unsupervised learning algorithms
3. Reinforcement learning
4. None of the above
B. Which of the following is true for clustering?
1. Clustering is a technique used to group similar objects into clusters.
2. It partitions data into groups.
3. It divides the entire data based on patterns in the data.
4. All of the above
C. Which of the following is true for K-means clustering?
1. All data points in a cluster should be similar to each other.
2. The data points from different clusters should be as different as possible.
3. Both 1 and 2
4. Only 1
5. Only 2
D. Which of the following applications comes under clustering?
1. Customer segmentation
2. Targeted marketing
3. Recommendation engines
4. Predicting the temperature
5. Only 1, 2 and 3
6. All of the above
E. What is intra-cluster distance?
1. The distance between points in the cluster and its centroid
2. The distance between each point in the cluster
3. The sum of squares of distances between points
4. None of the above

Conti..
Q1. Movie recommendation systems are an example of:
1. Classification
2. Clustering
3. Reinforcement Learning
4. Regression
Options:
A. 2 only
B. 1 and 2
C. 1 and 3
D. 2 and 3
E. 1, 2 and 3
F. 1, 2, 3 and 4
Q2. Sentiment analysis is an example of:
1. Regression
2. Classification
3. Clustering
4. Reinforcement Learning
Options:
A. 1 only
B. 1 and 2
C. 1 and 3
D. 1, 2 and 3
E. 1, 2 and 4
F. 1, 2, 3 and 4

Conti..
Q3. Can decision trees be used for performing clustering?
A. True
B. False
Q4. What is the minimum number of variables/features required to perform clustering?
A. 0
B. 1
C. 2
D. 3
Q5. For two runs of K-means clustering, is it expected to get the same clustering results?
A. Yes
B. No
Q6. Which of the following clustering algorithms suffers from the problem of convergence at local optima?
1. K-means clustering algorithm
2. Agglomerative clustering algorithm
3. Expectation-Maximization clustering algorithm
4. Divisive clustering algorithm
Options:
A. 1 only
B. 2 and 3
C. 2 and 4
D. 1 and 3
E. 1, 2 and 4
F. All of the above