Machine Learning Lecture - optimization-part1.pdf

zahsa · 21 slides · Oct 12, 2025

Slide Content

Machine Learning
Zahra Sadeghi, PhD
Optimization and learning

Optimization
• Machine learning optimization is the process of iteratively improving the accuracy of a machine learning model by lowering its degree of error.
• Update the θ values to reach the values that minimize the error between the predicted value and the true value, as measured by a loss/cost function.
• The cost function, or loss function, is the function to be minimized (or maximized) by varying the decision variables (model parameters).

• The optimization problem to be solved is: θ* = argmin_θ J(θ).

Empirical Risk Minimization (ERM)
• A learning algorithm receives as input a training set S, sampled from an unknown distribution D and labeled by some target function f, and should output a predictor h_S : X → Y (the subscript S emphasizes the fact that the output predictor depends on S).
• The goal of the learning algorithm is to find h_S that minimizes the error with respect to the unknown D and f.
• The terms empirical error and empirical risk are often used interchangeably for the error on the training set: L_S(h) = |{i ∈ [m] : h(x_i) ≠ y_i}| / m, where m = |S|.
• This learning paradigm – coming up with a predictor h that minimizes L_S(h) – is called Empirical Risk Minimization, or ERM for short.
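The empirical risk under the 0–1 loss can be sketched in a few lines. The threshold predictor `h` and the toy sample `S` below are illustrative assumptions, not from the slides:

```python
# Hedged sketch: empirical risk L_S(h) under the 0-1 loss for a
# labeled sample S = [(x, y), ...]. The predictor h is hypothetical.

def empirical_risk(h, S):
    """Fraction of examples in S that h mislabels."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

h = lambda x: 1 if x >= 0 else 0                   # hypothetical threshold predictor
S = [(-2, 0), (-1, 0), (0.5, 1), (3, 1), (1, 0)]   # h mislabels only the last point
print(empirical_risk(h, S))  # 0.2
```

An ERM learner would search a hypothesis class for the h with the smallest such value on S.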

How to solve ERM?
• Minimize the cost function (e.g., the mean of squared errors, MSE) via:
• Ordinary Least Squares (OLS)
• Gradient Descent

Derivative and gradient
• The derivative of a function shows you how much its value changes when you modify its argument (or arguments).
• Derivatives are important for optimization because a zero derivative might indicate a critical point.
• The gradient of a cost function C of several independent variables v₁, …, vᵣ is denoted ∇C(v₁, …, vᵣ) and defined as the vector of the partial derivatives of C with respect to each independent variable:
• ∇C = (∂C/∂v₁, …, ∂C/∂vᵣ).
• A nonzero gradient of a function C at a given point defines the direction and rate of the fastest increase of C.
• The direction of the fastest decrease is given by the negative gradient, −∇C.
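The definition above can be checked numerically: each partial derivative is approximated by a central finite difference. The function f(v₁, v₂) = v₁² + 3v₂ is a made-up example with known gradient (2v₁, 3):

```python
# Sketch: the gradient as a vector of partial derivatives, verified
# with central finite differences on f(v1, v2) = v1**2 + 3*v2.

def f(v):
    return v[0] ** 2 + 3 * v[1]

def numerical_gradient(f, v, eps=1e-6):
    grad = []
    for i in range(len(v)):
        vp, vm = list(v), list(v)
        vp[i] += eps                 # nudge coordinate i up
        vm[i] -= eps                 # and down
        grad.append((f(vp) - f(vm)) / (2 * eps))
    return grad

g = numerical_gradient(f, [2.0, 5.0])
print(g)  # ≈ [4.0, 3.0], matching (∂f/∂v1, ∂f/∂v2) = (2*v1, 3)
```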

Ordinary least squares (OLS)
• Ordinary least squares (OLS) is a non-iterative method that fits a model such that the MSE is minimized.
• The least-squares approach gives an exact closed-form solution for the discriminant function parameters.
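For simple linear regression y ≈ θ₀ + θ₁x, the closed-form solution can be written out directly with the textbook formulas θ₁ = cov(x, y)/var(x) and θ₀ = ȳ − θ₁x̄. A minimal sketch (the data are fabricated to lie exactly on y = 1 + 2x):

```python
# Minimal sketch of the OLS closed-form solution for simple linear
# regression y ≈ theta0 + theta1 * x.

def ols_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # ~ cov(x, y)
    sxx = sum((x - mx) ** 2 for x in xs)                    # ~ var(x)
    theta1 = sxy / sxx
    theta0 = my - theta1 * mx
    return theta0, theta1

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]          # generated exactly from y = 1 + 2x
print(ols_fit(xs, ys))     # (1.0, 2.0), recovered in one step, no iteration
```

No initial guess, step size, or iteration is needed, which is exactly the contrast with gradient descent drawn later in the deck.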

Ordinary least squares (OLS)

Least squares algorithm

• The additional data points in the right-hand figure produce a significant change in the location of the decision boundary, even though these points would be correctly classified by the original decision boundary in the left-hand figure.
• The sum-of-squares error function penalizes predictions that are 'too correct', in that they lie a long way on the correct side of the decision boundary.
• Least squares is highly sensitive to outliers, unlike logistic regression.
The decision boundary found by least squares (magenta curve) and by the logistic regression model (green curve).

Gradient descent
• A general-purpose optimization technique that can be applied whenever the objective function is differentiable.
• Gradient descent finds the linear model parameters iteratively.
• Unlike OLS, gradient descent does not try to solve for any closed-form solution.
• It achieves the same objective in a cruder way: start with random values of the regression coefficients and iteratively change them to reach the lowest value of the cost function.

Gradient Descent
• The idea is to start with random θ₁ and θ₂ values and then iteratively update them, reaching the minimum cost.
• Find the partial derivative of the cost function J with respect to θ₁ and θ₂.

Gradient Descent
• Finding the coefficients of a linear equation that best fits the training data is the objective of linear regression.
• The respective weights and coefficients of X are updated as θ_j := θ_j − α · ∂J(θ)/∂θ_j,
• where α is the learning rate.

Gradient descent algorithm

Example
• f(x, y) = x² + 2y²
• ∇f(x, y) = 2x i + 4y j
• Initial, t = 0:
• x[0] = (4, 3)  # a randomly chosen starting point
• At t = 1:
• x[1] = x[0] − a·∇f(x[0])
• x[1] = (4, 3) − 0.1·(8, 12)
• x[1] = (3.2, 1.8)
• At t = 2:
• x[2] = x[1] − a·∇f(x[1])
• x[2] = (3.2, 1.8) − 0.1·(6.4, 7.2)
• x[2] = (2.56, 1.08)
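The two hand-computed iterations above can be reproduced in code:

```python
# Reproduces the slide's worked example: gradient descent on
# f(x, y) = x**2 + 2*y**2, gradient (2x, 4y), step size a = 0.1,
# starting from x[0] = (4, 3).

def grad(p):
    x, y = p
    return (2 * x, 4 * y)

def gd_step(p, a=0.1):
    gx, gy = grad(p)
    return (p[0] - a * gx, p[1] - a * gy)   # move against the gradient

p = (4.0, 3.0)        # x[0]
p = gd_step(p)        # x[1] ≈ (3.2, 1.8)
print(p)
p = gd_step(p)        # x[2] ≈ (2.56, 1.08)
print(p)
```

Continuing the loop drives p toward (0, 0), the minimizer of f.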

Least squares vs gradient descent

OLS | GD
Provides an exact solution | Provides an approximate solution
Non-iterative process | Iterative process
Only requires input data | Requires selection of step size and initial estimates
No risk of finding non-optimal solutions | Risk of finding local minima or saddle points
Easy to explain | Poor explainability
Suitable for smaller datasets | Suitable for large datasets

Full Gradient Descent (batch GD)
• We take the average of the gradients of all the training examples and then use that mean gradient to update our parameters.
• Calculates the gradient for each example,
• but only updates the model after all training examples have been evaluated.
• Time-consuming and computationally expensive;
• very slow on very large training data.

Stochastic Gradient Descent
• Each iteration of SGD computes the gradient on the basis of one randomly chosen example from the dataset,
• and repeats this process for all the examples in the dataset.
• This modification of GD can reduce the computational time significantly,
• making it much faster, as there is much less data to manipulate at a single time, unlike batch GD.
• This approach may lead to noisier results than GD, because it iterates one observation at a time.

Mini-batch gradient descent
• We neither use the whole dataset at once nor a single example at a time.
• Split the dataset into small subsets (mini-batches) and compute the gradients for each mini-batch.
• For each mini-batch, the mean gradient is computed and the weights are updated.
• Just like in SGD, the average cost over the epochs in mini-batch gradient descent fluctuates, because we are averaging a small number of examples at a time.
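The three variants differ only in how many examples feed each update. A sketch of mini-batch GD for a one-parameter model y ≈ θx under MSE (batch GD is the special case batch_size = n, SGD the case batch_size = 1); the learning rate, batch size, and toy data are illustrative choices, not from the slides:

```python
# Hedged sketch of mini-batch gradient descent for y ≈ theta * x
# under MSE. Per-example gradient of (theta*x - y)**2 is 2x(theta*x - y).

import random

def minibatch_gd(data, theta=0.0, lr=0.05, batch_size=2, epochs=100):
    for _ in range(epochs):
        random.shuffle(data)                     # new random order each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # mean gradient over the mini-batch
            g = sum(2 * x * (theta * x - y) for x, y in batch) / len(batch)
            theta -= lr * g                      # one update per mini-batch
    return theta

random.seed(0)
data = [(1, 2), (2, 4), (3, 6), (4, 8)]          # generated exactly from y = 2x
print(minibatch_gd(data))  # ≈ 2.0
```

Setting batch_size to len(data) removes the shuffling's effect entirely (one deterministic update per epoch), which is the batch-GD behavior contrasted in the table below.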

Comparison

Batch Gradient Descent | Stochastic Gradient Descent
Computes the gradient using the whole training set | Computes the gradient using a single training sample
Slow and computationally expensive | Faster and less computationally expensive than batch GD
Not suggested for huge training sets | Can be used for large training sets
Deterministic in nature | Stochastic in nature
Gives the optimal solution given sufficient time to converge | Gives a good solution, but not necessarily optimal
No random shuffling of points is required | The data should be in random order, which is why we shuffle the training set every epoch
Can't escape shallow local minima easily | Can escape shallow local minima more easily
Convergence is slow | Reaches convergence much faster
Updates the model parameters only after processing the entire training set | Updates the parameters after each individual data point
The learning rate is fixed during training | The learning rate can be adjusted dynamically
Typically converges to the global minimum for convex loss functions | May converge to a local minimum or saddle point
May suffer from overfitting if the model is too complex for the dataset | Can help reduce overfitting by updating the parameters more frequently