How to choose the value of k for KNN Algorithm?
• The value of k is crucial in the KNN algorithm because it defines the number of neighbors considered.
• The value of k in the k-nearest neighbors (k-NN) algorithm should be chosen based on the input data. If the input data has more outliers or noise, a higher value of k is usually better.
• It is recommended to choose an odd value of k to avoid ties in (binary) classification.
• Cross-validation methods can help in selecting the best k value for the given dataset.
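A minimal sketch of selecting k by cross-validation with scikit-learn's GridSearchCV (the Iris dataset and the candidate k values are illustrative choices only):

```python
# Selecting k by cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of k and let 5-fold cross-validation pick the best one.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("Best k:", search.best_params_["n_neighbors"])
print("Cross-validated accuracy:", search.best_score_)
```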
Applications of the KNN Algorithm
• Data Preprocessing – While dealing with any machine learning problem, we first perform EDA; if we find that the data contains missing values, multiple imputation methods are available. One such method is the KNN Imputer, which is quite effective and generally used for sophisticated imputation (see the sketch after this list).
• Pattern Recognition – KNN algorithms work very well here: if you train a KNN classifier on the MNIST dataset (Modified National Institute of Standards and Technology database, a collection of 70,000 28x28 images of handwritten digits from 0 to 9, already divided into training and testing sets) and then run the evaluation process, you will find that the accuracy is quite high.
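A minimal sketch of KNN-based imputation with scikit-learn's KNNImputer (the toy matrix and n_neighbors value are illustrative only):

```python
# KNN-based imputation of missing values.
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with missing entries (np.nan); the values are illustrative.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing value is replaced by the mean of that feature
# over the k nearest rows, measured on the observed features.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```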
Applications of the KNN Algorithm
• Recommendation Engines – The main task performed by a KNN algorithm is to assign a new query point to a pre-existing group that has been created using a huge corpus of data.
• This is exactly what is required in recommender systems: assign each user to a particular group and then provide recommendations based on that group's preferences.
Advantages of the KNN Algorithm
• Easy to implement, as the complexity of the algorithm is not that high.
• Adapts Easily – Because the KNN algorithm stores all the data in memory, whenever a new example or data point is added, the algorithm adjusts itself to that new example and lets it contribute to future predictions as well.
• Few Hyperparameters – The only parameters required when training a KNN algorithm are the value of k and the choice of the distance metric.
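These two hyperparameters map directly onto scikit-learn's estimator arguments; a tiny sketch (the values 7 and "manhattan" are just one illustrative choice):

```python
# The only hyperparameters to set are k and the distance metric.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7, metric="manhattan")
# knn.fit(X_train, y_train); knn.predict(X_test)  # usual fit/predict workflow
```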
Disadvantages of the KNN Algorithm
• Does not scale – As we have heard, the KNN algorithm is also considered a lazy algorithm.
• The main significance of this term is that KNN takes a lot of computing power as well as data storage, which makes the algorithm both time-consuming and resource-exhausting.
• Curse of Dimensionality – There is a term known as the peaking phenomenon: the KNN algorithm is affected by the curse of dimensionality, which means the algorithm has a hard time classifying the data points properly when the dimensionality is too high.
Disadvantages of the KNN Algorithm
• Prone to Overfitting – Since the algorithm is affected by the curse of dimensionality, it is prone to the problem of overfitting as well.
• Hence, feature selection as well as dimensionality reduction techniques are generally applied to deal with this problem, as sketched below.
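A minimal sketch of combining dimensionality reduction with KNN in one pipeline (the digits dataset and the number of PCA components are illustrative choices):

```python
# Dimensionality reduction before KNN to mitigate the curse of dimensionality.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images

# Project to a lower-dimensional space, then classify with KNN.
model = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```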
Introduction to
Support Vector Machines (SVM)
History of SVM
◼ SVM is related to statistical learning theory
◼ SVM was first introduced in 1992
◼ SVM became popular because of its success in handwritten digit recognition
◼ 1.1% test error rate for SVM. This is the same as the error rate of a carefully constructed neural network, LeNet-4.
Support Vector Machines (SVM)
Can be used for both kinds of supervised learning problems:
• Classification (SVM Classifier), for binary as well as multi-class classification
• Regression (SVM Regressor): works similarly to LR but with some additional features
• One of the most effective and popular algorithms in industry and academia.
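A minimal sketch of both flavours in scikit-learn (the toy data and parameter values are illustrative only):

```python
# SVM for classification (SVC) and regression (SVR) in scikit-learn.
import numpy as np
from sklearn.svm import SVC, SVR

# Classification: two classes in 2-D.
X_clf = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
y_clf = np.array([0, 0, 1, 1])
clf = SVC(kernel="linear", C=1.0).fit(X_clf, y_clf)
print(clf.predict([[2.5, 2.5]]))

# Regression: a simple 1-D target.
X_reg = np.array([[1.0], [2.0], [3.0], [4.0]])
y_reg = np.array([1.2, 1.9, 3.1, 3.9])
reg = SVR(kernel="linear", C=1.0).fit(X_reg, y_reg)
print(reg.predict([[2.5]]))
```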
Cost Function
Here, in this "cost function":
- "C" is the hyperparameter that controls how many misclassified points can be tolerated.
- "ξ" (pronounced "xi") is the summation of the distances of the wrongly placed points to the margin.
Together, these terms determine what the best-fit separating line should look like.
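The cost-function figure itself is not reproduced on the slide; for reference, the standard soft-margin SVM objective that C and ξ come from is:

\[
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \;\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad \text{subject to} \quad y_{i}(\mathbf{w}\cdot\mathbf{x}_{i}+b) \ge 1-\xi_{i},\;\; \xi_{i}\ge 0
\]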
Available Data Sets in Sklearn
Scikit-learn makes available a host of datasets for testing learning algorithms.
They come in three flavors:
Packaged Data: these small datasets are packaged with the scikit-learn installation, and can be loaded using the tools in sklearn.datasets.load_*
Downloadable Data: these larger datasets are available for download, and scikit-learn includes tools which streamline this process. These tools can be found in sklearn.datasets.fetch_*
Generated Data: there are several datasets which are generated from models based on a random seed. These are available in sklearn.datasets.make_*
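One example of each flavor (the particular datasets chosen here are illustrative; the fetch_* call needs a network connection and is left commented out):

```python
from sklearn import datasets

# Packaged data: shipped with the installation.
iris = datasets.load_iris()
print(iris.data.shape)          # (150, 4)

# Generated data: created from a model with a random seed.
X, y = datasets.make_classification(n_samples=200, n_features=5, random_state=0)
print(X.shape, y.shape)

# Downloadable data: fetched over the network and cached locally.
# faces = datasets.fetch_olivetti_faces()
```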
Available Data Sets in Sklearn
You can explore the available dataset loaders, fetchers, and generators
using IPython's tab-completion functionality. After importing the datasets
submodule from sklearn, type
datasets.load_<TAB>
or
datasets.fetch_<TAB>
or
datasets.make_<TAB>
to see a list of available functions.
Linear Classifiers
f(x, w, b) = sign(w·x + b)
[Figure: 2-D data points; one marker denotes +1, the other denotes −1]
How would you classify this data?
Estimation:
w: weight vector
x: data vector
Linear Classifiers
f(x, w, b) = sign(w·x + b)
[Figure: the same data with several different candidate separating lines]
Any of these would be fine...
...but which is best?
Classifier Margin
f(x, w, b) = sign(w·x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
[Figure: a separating line with its margin drawn as a band around it]
Maximum Margin
f(x, w, b) = sign(w·x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM).
[Figure: the maximum-margin separating line (Linear SVM)]
Maximum Margin
f(x, w, b) = sign(w·x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM).
Support vectors are those datapoints that the margin pushes up against.
[Figure: the maximum-margin separating line (Linear SVM) with the support vectors highlighted on the margin]
Why Maximum Margin?
f(x, w, b) = sign(w·x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM).
Support vectors are those datapoints that the margin pushes up against.
Intuitively, the boundary that sits farthest from both classes is the least sensitive to small perturbations or noise in the data, so it is the safest choice.
How to calculate the distance from a point to a line?
◼ http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
◼ In our case, w1*x1 + w2*x2 + b = 0,
◼ thus, w = (w1, w2), x = (x1, x2)
[Figure: a point x and the line w·x + b = 0, with W drawn as the normal vector]
X – Vector
W – Normal Vector
b – Scale Value
Estimate the Margin
• What is the distance expression for a point x to the line w·x + b = 0?

\[
d(\mathbf{x}) \;=\; \frac{\lvert \mathbf{x}\cdot\mathbf{w} + b \rvert}{\lVert \mathbf{w} \rVert_{2}} \;=\; \frac{\lvert \mathbf{x}\cdot\mathbf{w} + b \rvert}{\sqrt{\sum_{i=1}^{d} w_{i}^{2}}}
\]

[Figure: a point x and the line w·x + b = 0, with W drawn as the normal vector]
X – Vector
W – Normal Vector
b – Scale Value
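A quick numerical check of this distance formula (the weight vector, offset, and point below are arbitrary illustrative values):

```python
# Distance from a point x to the hyperplane w.x + b = 0.
import numpy as np

w = np.array([3.0, 4.0])   # normal vector of the line
b = -5.0                   # offset
x = np.array([2.0, 1.0])   # query point

distance = abs(np.dot(w, x) + b) / np.linalg.norm(w)
print(distance)            # |3*2 + 4*1 - 5| / 5 = 1.0
```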
Large-margin Decision Boundary
• The decision boundary should be as far away from the data of both classes as possible
• We should maximize the margin, m
• The distance between the origin and the line w^T x + b = 0 is |b| / ||w||
[Figure: Class 1 and Class 2 points separated by a boundary with margin m]
Finding the Decision Boundary
• Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, −1} be the class label of x_i
• The decision boundary should classify all points correctly
• To see this: when y_i = −1, we want w·x_i + b ≤ −1; when y_i = 1, we want w·x_i + b ≥ 1; that is, y_i(w·x_i + b) ≥ 1 for all i. For support vectors, y_i(w·x_i + b) = 1.
• The decision boundary can be found by solving the following constrained optimization problem
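The optimization problem itself is not reproduced on the slide; the standard hard-margin primal form it refers to is:

\[
\min_{\mathbf{w},\,b} \;\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad \text{subject to} \quad y_{i}(\mathbf{w}\cdot\mathbf{x}_{i}+b) \ge 1, \quad i = 1,\dots,n
\]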
Next step… Optional
•Converting SVM to a form we can solve
•Dual form
•Allowing a few errors
•Soft margin
•Allowing nonlinear boundary
•Kernel functions
The Dual Problem (we ignore the derivation)
• The new objective function is in terms of the α_i only
• It is known as the dual problem: if we know w, we know all α_i; if we know all α_i, we know w
• The original problem is known as the primal problem
• The objective function of the dual problem needs to be maximized!
• The dual problem is therefore as given below (the constraints α_i ≥ 0 are the properties of the α_i when we introduce the Lagrange multipliers; the constraint Σ_i α_i y_i = 0 is the result of differentiating the original Lagrangian w.r.t. b)
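Since the formula itself is not reproduced on the slide, for reference the standard dual problem is:

\[
\max_{\boldsymbol{\alpha}} \;\; \sum_{i=1}^{n} \alpha_{i} \;-\; \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_{i}\alpha_{j}\, y_{i}y_{j}\,(\mathbf{x}_{i}\cdot\mathbf{x}_{j})
\quad \text{subject to} \quad \alpha_{i} \ge 0, \;\; \sum_{i=1}^{n} \alpha_{i} y_{i} = 0
\]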
The Dual Problem
• This is a quadratic programming (QP) problem
• A global maximum of the α_i can always be found
• w can be recovered by w = Σ_i α_i y_i x_i
Characteristics of the Solution
• Many of the α_i are zero (see the next page for an example)
• w is a linear combination of a small number of data points
• This "sparse" representation can be viewed as data compression, as in the construction of the kNN classifier
• x_i with non-zero α_i are called support vectors (SV)
• The decision boundary is determined only by the SVs
• Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write w = Σ_j α_{t_j} y_{t_j} x_{t_j}
• For testing with a new data point z
• Compute w·z + b = Σ_j α_{t_j} y_{t_j} (x_{t_j}·z) + b, and classify z as class 1 if the sum is positive, and class 2 otherwise
• Note: w need not be formed explicitly
A Geometrical Interpretation
[Figure: Class 1 and Class 2 points with the decision boundary and margins; most α_i are zero (α_2 = α_3 = α_4 = α_5 = α_7 = α_9 = α_10 = 0), while the support vectors on the margins have α_1 = 0.8, α_6 = 1.4, α_8 = 0.6]
Allowing errors in our solutions
• We allow "errors" ξ_i in classification; they are based on the output of the discriminant function w^T x + b
• Σ_i ξ_i approximates the number of misclassified samples
[Figure: Class 1 and Class 2 points with a few points falling on the wrong side of the margin]
Soft Margin Hyperplane
• If we minimize Σ_i ξ_i, the ξ_i can be computed by ξ_i = max(0, 1 − y_i(w^T x_i + b))
• The ξ_i are "slack variables" in the optimization
• Note that ξ_i = 0 if there is no error for x_i
• Σ_i ξ_i is an upper bound on the number of errors
• We want to minimize (1/2)||w||^2 + C Σ_i ξ_i
• C: tradeoff parameter between error and margin
• The optimization problem becomes: minimize (1/2)||w||^2 + C Σ_i ξ_i subject to y_i(w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0
Extension to Non-linear Decision Boundary
• So far, we have only considered large-margin classifiers with a linear decision boundary
• How to generalize it to become nonlinear?
• Key idea: transform x_i to a higher-dimensional space to "make life easier"
• Input space: the space where the points x_i are located
• Feature space: the space of φ(x_i) after transformation
Transforming the Data
• Computation in the feature space can be costly because it is high dimensional
• The feature space is typically infinite-dimensional!
• The kernel trick comes to the rescue
[Figure: the mapping φ(.) takes points from the input space to the feature space]
Note: the feature space is of higher dimension than the input space in practice
The Kernel Trick
• Recall the SVM optimization problem
• The data points only appear as inner products
• As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly
• Many common geometric operations (angles, distances) can be expressed by inner products
• Define the kernel function K by K(x_i, x_j) = φ(x_i)·φ(x_j)
An Example for φ(.) and K(.,.)
• Suppose φ(.) maps a 2-D point x = (x_1, x_2) to φ(x) = (1, √2·x_1, √2·x_2, x_1², x_2², √2·x_1·x_2)
• An inner product in the feature space is then φ(x)·φ(y) = (1 + x_1y_1 + x_2y_2)²
• So, if we define the kernel function as K(x, y) = (1 + x·y)², there is no need to carry out φ(.) explicitly
• This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick
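A small numerical check of this identity (the two sample points are arbitrary):

```python
# Verify that the explicit feature map and the kernel give the same inner product.
import numpy as np

def phi(x):
    """Degree-2 polynomial feature map for a 2-D point."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def K(x, y):
    """Polynomial kernel of degree 2."""
    return (1.0 + np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(y)))   # explicit mapping: 4.0
print(K(x, y))                  # kernel trick:     4.0
```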
More on Kernel Functions
• Not all similarity measures can be used as kernel functions, however
• The kernel function needs to satisfy the Mercer condition, i.e., the function is "positive definite"
• This implies that the n by n kernel matrix, in which the (i,j)-th entry is K(x_i, x_j), is always positive semi-definite
• This also means that the optimization problem can be solved in polynomial time!
Examples of Kernel Functions
• Polynomial kernel with degree d: K(x, y) = (x·y + 1)^d
• Radial basis function (RBF) kernel with width σ: K(x, y) = exp(−||x − y||² / (2σ²))
• Closely related to radial basis function neural networks
• The corresponding feature space is infinite-dimensional
• Sigmoid kernel with parameters κ and θ: K(x, y) = tanh(κ·x·y + θ)
• It does not satisfy the Mercer condition for all κ and θ
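These kernels map directly onto scikit-learn's SVC options; a minimal sketch (the dataset and the gamma, degree, and coef0 values are illustrative choices):

```python
# The standard kernels as exposed by scikit-learn's SVC.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

models = {
    "polynomial (degree 3)": SVC(kernel="poly", degree=3, coef0=1.0),
    "RBF": SVC(kernel="rbf", gamma=1.0),
    "sigmoid": SVC(kernel="sigmoid", gamma=0.5, coef0=0.0),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "training accuracy:", model.score(X, y))
```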
Non-linear SVMs: Feature spaces
◼ General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
Example
• Suppose we have 5 one-dimensional data points
• x_1 = 1, x_2 = 2, x_3 = 4, x_4 = 5, x_5 = 6, with 1, 2, 6 as class 1 and 4, 5 as class 2, i.e. y_1 = 1, y_2 = 1, y_3 = −1, y_4 = −1, y_5 = 1
• We use the polynomial kernel of degree 2: K(x, y) = (xy + 1)²
• C is set to 100
• We first find the α_i (i = 1, ..., 5) by solving the dual QP problem with this kernel and the box constraint 0 ≤ α_i ≤ 100
Example
• By using a QP solver, we get α_1 = 0, α_2 = 2.5, α_3 = 0, α_4 = 7.333, α_5 = 4.833
• Note that the constraints are indeed satisfied
• The support vectors are {x_2 = 2, x_4 = 5, x_5 = 6}
• The discriminant function is f(x) = Σ_i α_i y_i (x_i·x + 1)² + b = 0.667x² − 5.333x + b, summed over the support vectors
• b is recovered by solving f(2) = 1, or by f(5) = −1, or by f(6) = 1, as x_2 and x_5 lie on the margin line f(x) = 1 and x_4 lies on the margin line f(x) = −1
• All three give b = 9
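This worked example can be reproduced with scikit-learn's SVC, since kernel='poly' with degree=2, gamma=1, coef0=1 is exactly K(x, z) = (xz + 1)². A sketch (sklearn reports the signed values y_i·α_i in dual_coef_, and small numerical differences from the slide's QP values are possible):

```python
# Reproduce the 1-D worked example with scikit-learn.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])

# kernel='poly', degree=2, gamma=1, coef0=1  =>  K(x, z) = (x*z + 1)^2
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100.0)
clf.fit(X, y)

print("Support vectors:", clf.support_vectors_.ravel())   # expected: 2, 5, 6
print("y_i * alpha_i:  ", clf.dual_coef_.ravel())          # signed alphas
print("Intercept b:    ", clf.intercept_)                   # expected: about 9
print("Predictions:    ", clf.predict(X))
```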
Example
[Figure: the value of the discriminant function plotted against x for the points 1, 2, 4, 5, 6; it is positive (class 1) at the outer points and negative (class 2) in the middle region around 4 and 5]
Degree of Polynomial Features
[Figure: polynomial feature expansions of degrees X^1 through X^6]
Choosing the Kernel Function
•Probably the most tricky part of using SVM.
Software
• A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
• Some implementations (such as LIBSVM) can handle multi-class classification
• SVMLight is among the earliest implementations of SVM
• Several MATLAB toolboxes for SVM are also available
Summary: Steps for Classification
• Prepare the pattern matrix
• Select the kernel function to use
• Select the parameters of the kernel function and the value of C
• You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters
• Execute the training algorithm and obtain the α_i
• Unseen data can be classified using the α_i and the support vectors
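A minimal end-to-end sketch of these steps in scikit-learn (the dataset, the RBF kernel, and the parameter grid are illustrative choices; the α_i and support vectors are found internally by the solver):

```python
# Summary workflow: prepare data, choose kernel, tune parameters, train, predict.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 1. Prepare the pattern matrix.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2-3. Select the kernel and tune its parameter(s) and C by cross-validation.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(pipe, grid, cv=5)

# 4. Execute the training algorithm (the alphas are obtained by the QP solver).
search.fit(X_train, y_train)

# 5. Classify unseen data using the learned support vectors.
print("Best parameters:", search.best_params_)
print("Test accuracy:  ", search.score(X_test, y_test))
```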