Machine Learning
Zahra Sadeghi, PhD
Linear models
•The main difference between regression and classification algorithms is that regression algorithms predict continuous values such as price, salary, or age, while classification algorithms predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam.
Regression Algorithm vs Classification Algorithm:
•In Regression, the output variable must be of continuous nature or real value; in Classification, the output variable must be a discrete value.
•In Regression, we try to find the best-fit line, which can predict the output more accurately; in Classification, we try to find the decision boundary, which can divide the dataset into different classes.
•Regression algorithms can be further divided into Linear and Non-linear Regression; classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.
Linear regression
•The objective of Linear Regression is to find a line that minimizes the prediction error over all the data points.
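As an illustration (not from the slides), a minimal NumPy sketch of fitting such a line by minimizing the sum of squared prediction errors; the toy data here is made up:

import numpy as np

# Toy data (hypothetical): hours studied vs. exam score
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# np.polyfit with deg=1 returns the slope and intercept that
# minimize the sum of squared residuals
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept
sse = np.sum((y - y_pred) ** 2)  # total squared prediction error
print(slope, intercept, sse)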
Linear separability
•Data points are considered linearly separable if they can be separated using a line, a linear function, or a flat hyperplane.
[Figure: linearly separable data]
Simple linear regression
•A model is said to be linear when it is linear in its parameters.
•The simple linear regression model is y = β0 + β1·x + ε.
•y is termed the dependent or study variable and x is termed the independent or explanatory variable.
•The terms β0 and β1 are the parameters of the model: β0 is termed the intercept term, and the parameter β1 is termed the slope parameter.
•These parameters are usually called regression coefficients.
•The unobservable error component ε accounts for the failure of the data to lie on a straight line and represents the difference between the true and observed realizations of y.
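For reference, the standard least-squares estimates of the two coefficients (a known result, not shown on the slide) are:

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²,   β̂0 = ȳ − β̂1·x̄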
Multiple linear regression
•In practice, it is unlikely that any response variable Y depends solely on one predictor x.
•We consider the problem of regression when the study variable depends on more than one explanatory or independent variable:
y = β0 + β1·x1 + β2·x2 + ... + βk·xk + ε
•Y depends on more than one explanatory variable.
•The relationship between Y and the predictors can have shapes other than a straight line, although the model does not allow for arbitrary shapes.
Matrix form
•With n observations and k independent variables, the model can be written in matrix form as y = Xβ + ε, where y is the n × 1 vector of observations on the study variable, X is the n × (k + 1) matrix of observations on the explanatory variables (with a leading column of ones for the intercept), β is the (k + 1) × 1 vector of regression coefficients, and ε is the n × 1 vector of errors.
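A minimal NumPy sketch of the matrix form, using made-up data (assumptions: two predictors, a column of ones prepended for the intercept):

import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 2
X_raw = rng.normal(size=(n, k))            # n observations, k predictors
X = np.hstack([np.ones((n, 1)), X_raw])    # leading column of ones for the intercept
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=n)  # y = X beta + noise

# Least-squares solution of y = X beta
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to [1.0, 2.0, -0.5]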
[Figure: nonlinearly separable data]
Polynomial Regression
•The simplest non-linear model we can consider, for a response Y and a predictor X, is a polynomial model of degree M:
y = β0 + β1·x + β2·x² + ... + βM·x^M + ε
•We treat each power x^m as a separate predictor.
•Although this model allows for a nonlinear relationship between Y and X, polynomial regression is still considered linear regression since it is linear in the regression coefficients β0, ..., βM.
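A sketch of this idea in NumPy (toy data assumed): each power of x becomes a column of the design matrix, and the coefficients are still found by ordinary least squares:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 50)
y = 1.0 - 0.5 * x + 0.8 * x**3 + rng.normal(scale=0.2, size=x.size)

M = 3  # polynomial degree
# Design matrix with columns [1, x, x^2, x^3]: each power is a separate predictor
X = np.vander(x, N=M + 1, increasing=True)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [1.0, -0.5, 0.0, 0.8]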
Linear Regression (recap)
•A linear regression model tells us to combine our features in a linear way in order to approximate the response.
•Linear regression is just a special case of nonlinear regression.
•Your choice of linear or nonlinear regression should be based on the
data you are fitting.
•Both linear and nonlinear regression find the values of the
parameters (slope and intercept for linear regression) that make the
line or curve come as close as possible to the data.
Regression vs correlation
•The most commonly used techniques for investigating the
relationship between two quantitative variables are correlation and
linear regression.
•Correlation quantifies the strength of the linear relationship between
a pair of variables,
•whereas regression expresses the relationship in the form of an
equation.
Correlation
•When investigating a relationship between two variables,
the first step is to show the data values graphically on a
scatter diagram.
•the closer the points lie to a straight line, the stronger the
linear relationship between two variables.
•if we have two variables x and y, and the data take the form of n pairs (i.e. [x1, y1], [x2, y2], [x3, y3] ... [xn, yn]), then the correlation coefficient is:
r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² · Σ(yi − ȳ)² ]
•equivalently, r = cov(x, y) / (sx · sy): the covariance of the two variables divided by the product of their standard deviations (the square roots of their variances).
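A quick NumPy check of this formula against np.corrcoef (toy data assumed):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Pearson r from the definition
xc, yc = x - x.mean(), y - y.mean()
r = np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

print(r, np.corrcoef(x, y)[0, 1])  # the two values agree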
•r is the product moment correlation
coefficient (or Pearson correlation
coefficient).
•The value of r always lies between -1
and +1.
•A value of the correlation coefficient
close to +1 indicates a strong positive
linear relationship (i.e. one variable
increases with the other).
•A value close to -1 indicates a strong
negative linear relationship (i.e. one
variable decreases as the other
increases).
•A value close to 0 indicates no linear
relationship
•however, there could be a nonlinear
relationship between the variables
[Figure: example scatter plots: r = +0.9 (positive linear relationship); r = -0.9 (negative linear relationship); r = 0.04 (no relationship); r = -0.03 (nonlinear relationship)]
Regression
•we are interested in the effect of the
predictor or x variable on the response or y
variable.
•We want to estimate the underlying linear
relationship so that we can predict y for a
given x.
•Regression can be used to find the equation
of this line.
•This line is usually referred to as the
regression line.
[Figure: regression line for ln urea and age: ln urea = 0.72 + (0.017 × age)]
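As a worked example of using this line for prediction (not on the slide): for age 60,
ln urea = 0.72 + (0.017 × 60) = 0.72 + 1.02 = 1.74, so the predicted urea is e^1.74 ≈ 5.7.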
Correlation vs covariance
•Correlation analysis is a method of statistical evaluation used to study
the strength of a relationship between two continuous variables.
•It not only shows the kind of relation (in terms of direction) but also how
strong the relationship is.
•The covariance value can range from -∞ to +∞, with a negative value
indicating a negative relationship and a positive value indicating a
positive relationship.
•Covariance only measures how two variables change together, not the dependency of one variable on another.
Covariance
•covariance is only useful to find the direction of the relationship
between two variables and not the magnitude.
•A positive covariance between two variables indicates that they are heading
in the same direction.
•A negative covariance between two variables indicates that higher values of one variable correspond to lower values of the other, and vice versa.
[Figure: scatter plots illustrating a direct relationship, an inverse relationship, and no relationship]
Covariance vs Correlation
•Correlation and Covariance both measure only the linear
relationships between two variables.
•When the correlation coefficient is zero, the covariance is also zero.
•To measure the relationship between variables, correlation is preferred over covariance because it is not affected by changes in scale.
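A small NumPy demonstration of the scale point (toy data; rescaling x by 100, e.g. metres to centimetres, changes the covariance but not the correlation):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

print(np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1])
# Rescale x by 100: covariance scales by 100, correlation is unchanged
print(np.cov(100 * x, y)[0, 1], np.corrcoef(100 * x, y)[0, 1])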
Autocovariance vs autocorrelation
•if Y is the same variable as X, the previous
expressions are called the autocovariance and
autocorrelation
•autocovariance is a function that gives the
covariance of the process with itself at pairs of time
points.
•Autocorrelation is the correlation of a signal with a delayed copy of itself, as a function of the delay.
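A minimal sketch of the lag-k sample autocorrelation (assuming a simple toy signal; the helper name is ours):

import numpy as np

def autocorr(x, k):
    """Sample autocorrelation of x at lag k: correlation of x[t] with x[t+k]."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-k], x[k:])[0, 1]

# Toy signal: a noisy sine wave is strongly correlated with itself one step later
t = np.linspace(0, 4 * np.pi, 200)
x = np.sin(t) + np.random.default_rng(2).normal(scale=0.1, size=t.size)
print(autocorr(x, 1), autocorr(x, 50))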
Linear classification
Linear functions
•Linear function of input vector x: f(x) = wᵀx + b, where w is a weight vector and b is a scalar-valued bias.
•Binary linear classifier: predict label 1 if wᵀx + b ≥ r and label 0 otherwise, for some threshold r.
•We can obtain an equivalent model by replacing the bias with b − r and setting r to 0.
Bias
•To eliminate the bias, we simply add another input dimension x0, called a dummy feature, which always takes the value 1.
•So w0 effectively plays the role of a bias.
•Eliminating the bias often simplifies the statements of algorithms, so we'll sometimes use it for notational convenience.
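A short NumPy illustration of the dummy-feature trick (data assumed):

import numpy as np

X = np.array([[2.0, 3.0], [1.0, -1.0]])           # two examples, two features
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # x0 = 1 for every example
# Now f(x) = w_aug . x_aug, with w_aug[0] playing the role of the bias
print(X_aug)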
The perceptron algorithm
•An example of a linear classifier model is the perceptron of Frank Rosenblatt (1950s).
Perceptron
•How do we update the weight vector? Let e be the error for a training example x: the target label minus the predicted label, so e ∈ {−1, 0, +1}.
If e = +1 then w' = w + x
If e = −1 then w' = w − x
If e = 0 then w' = w
•Compactly, w' = w + e·x.
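A minimal sketch of the perceptron training loop in NumPy (assumptions: labels in {0, 1}, a dummy feature x0 = 1 folded into X so there is no separate bias, and linearly separable toy data):

import numpy as np

def train_perceptron(X, y, epochs=20):
    """X: (n, d) with a leading column of ones; y: labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi >= 0 else 0
            e = yi - pred          # error in {-1, 0, +1}
            w = w + e * xi         # the update rule from the slide
    return w

# Toy linearly separable data: class 1 iff x1 + x2 > 0 (dummy feature prepended)
rng = np.random.default_rng(3)
X_raw = rng.normal(size=(50, 2))
y = (X_raw.sum(axis=1) > 0).astype(int)
X = np.hstack([np.ones((50, 1)), X_raw])
w = train_perceptron(X, y)
preds = (X @ w >= 0).astype(int)
print((preds == y).mean())  # training accuracy, typically 1.0 here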
Limits of perceptron
A single-layer perceptron cannot solve the XOR problem, because XOR is not linearly separable. The XOR problem can be solved by a Multi-Layer Perceptron, i.e. a neural network architecture with an input layer, a hidden layer, and an output layer.
These limitations were widely publicized by
Marvin Minsky and Seymour Papert.
It was not until the 1980s that these limitations
were overcome with improved (multilayer)
perceptron networks and associated learning
rules.
Perceptron (recap)
•Single-layer perceptrons can learn only linearly separable patterns.
•Multilayer Perceptrons, or feedforward neural networks with two or more layers, have greater processing power and can process non-linear patterns as well.
•Perceptrons can implement logic gates like AND and OR, but a single-layer perceptron cannot implement XOR.
Logistic regression
•Logistic regression is used for classification tasks: we can interpret y(x) as the probability that the label of x is 1.
•The hypothesis class associated with logistic regression is the composition of a sigmoid function σ(z) = 1 / (1 + e^(−z)) over the class of linear functions, i.e. y(x) = σ(wᵀx + b).
•The name “sigmoid” means “S-shaped”.
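A small sketch of the logistic model in NumPy (illustrative weights assumed, not a fitted model):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# y(x) = sigmoid(w . x + b): probability that the label of x is 1
w = np.array([2.0, -1.0])   # hypothetical weights
b = 0.5                     # hypothetical bias
x = np.array([1.0, 3.0])
p = sigmoid(w @ x + b)      # sigmoid(2 - 3 + 0.5) = sigmoid(-0.5), about 0.38
print(p)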
Multi-class classification
•The multiclass classification problem can be decomposed into several binary classification tasks that can be solved efficiently using binary classifiers.
•One-versus-one
•One-versus-all
One-versus-all (OVA)
•The simplest approach is to reduce the problem of classifying among K classes into K binary problems, where each problem discriminates a given class from the other K − 1 classes.
•We require N = K binary classifiers, where the k'th classifier is trained with positive examples belonging to class k and negative examples belonging to the other K − 1 classes.
•When testing an unknown example, the classifier producing the maximum output is considered the winner, and this class label is assigned to that example.
•Suppose you have classes A, B, and C. We will build one model for each class: A vs [B, C], B vs [A, C], and C vs [A, B].
•Another way to think about the models is each class vs everything else.
Example:
•It leads to ambiguous regions, i.e. parts of the input space that are claimed by more than one classifier, or by none.
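A hedged NumPy sketch of the OVA decision rule for classes A, B, and C (assumption: each binary classifier is a linear scorer w_k · x trained elsewhere; the weights here are made up for illustration):

import numpy as np

# One weight vector per class (hypothetical, as if trained one-vs-all)
W = np.array([
    [ 1.0,  0.5],   # classifier for class A vs {B, C}
    [-0.5,  1.0],   # classifier for class B vs {A, C}
    [-1.0, -1.0],   # classifier for class C vs {A, B}
])

x = np.array([0.2, 0.9])
scores = W @ x                   # output of each binary classifier
winner = int(np.argmax(scores))  # maximum output wins
print(scores, "predicted class index:", winner)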
One-versus-one
•Each class is compared to each other class.
•A binary classifier is built to discriminate between each pair of classes, while discarding the rest of the classes.
•This requires building K(K−1)/2 binary classifiers.
•When testing a new example, a vote is performed among the classifiers, and the class with the maximum number of votes wins.
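A sketch of the pairwise voting rule (hypothetical pairwise predictors; with K = 3 classes there are K(K−1)/2 = 3 classifiers):

from itertools import combinations
import numpy as np

K = 3
# Hypothetical pairwise decisions: pairwise[(i, j)](x) returns i or j
pairwise = {
    (0, 1): lambda x: 0 if x[0] > x[1] else 1,
    (0, 2): lambda x: 0 if x[0] > -x[1] else 2,
    (1, 2): lambda x: 1 if x[1] > 0 else 2,
}

x = np.array([0.3, 0.7])
votes = np.zeros(K, dtype=int)
for pair in combinations(range(K), 2):
    votes[pairwise[pair](x)] += 1   # each pairwise classifier casts one vote
print(votes, "predicted class:", int(np.argmax(votes)))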
•We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions of the form y_k(x) = w_kᵀx + w_k0, assigning a point x to class k if y_k(x) > y_j(x) for all j ≠ k.
•The decision regions of such a discriminant are always singly connected and convex:
if two points x_A and x_B both lie inside the same decision region R_k, then any point x̂ = λ·x_A + (1 − λ)·x_B (with 0 ≤ λ ≤ 1) on the line connecting these two points must also lie in R_k, and hence the decision region must be singly connected and convex.
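To fill in the one-step argument behind this claim (a standard derivation, not spelled out on the slide): by linearity of the discriminant functions,
y_k(x̂) = y_k(λ·x_A + (1 − λ)·x_B) = λ·y_k(x_A) + (1 − λ)·y_k(x_B),
and since y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B) for every j ≠ k, the same strict inequality holds for the convex combination, so y_k(x̂) > y_j(x̂) and x̂ lies in R_k.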